The correct answer is C because Amazon Bedrock's model evaluation feature lets users compare outputs from different foundation models using either automatic metrics or human evaluation. It supports structured evaluation jobs in which human reviewers (in this case, the scientists) assess model responses against custom criteria such as relevance, coherence, or accuracy.
From AWS documentation:
"Amazon Bedrock provides model evaluation capabilities that support both automatic and human evaluation. You can define custom evaluation prompts and collect assessments from reviewers to compare foundation model outputs for tasks such as summarization, text generation, and more."
This solution is ideal for research workflows requiring domain experts to provide feedback on LLM-generated content.
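For illustration, the sketch below shows how such a human-based evaluation job could be set up programmatically with boto3. The ARNs, S3 URIs, model identifiers, metric names, and rating methods are placeholders/assumptions, and the exact request structure should be confirmed against the current Amazon Bedrock API reference.

```python
# Minimal sketch: creating a human-based model evaluation job in Amazon Bedrock via boto3.
# All ARNs, S3 URIs, and model IDs below are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_evaluation_job(
    jobName="research-summary-human-eval",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",          # placeholder IAM role
    outputDataConfig={"s3Uri": "s3://example-bucket/eval-results/"},   # placeholder output location
    inferenceConfig={
        # Foundation models whose outputs the reviewers will compare
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-sonnet-20240229-v1:0"}},
            {"bedrockModel": {"modelIdentifier": "amazon.titan-text-premier-v1:0"}},
        ]
    },
    evaluationConfig={
        "human": {
            # Work team of scientist reviewers, defined via a SageMaker flow definition (placeholder ARN)
            "humanWorkflowConfig": {
                "flowDefinitionArn": "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/science-reviewers",
                "instructions": "Rate each response for scientific accuracy, relevance, and coherence.",
            },
            # Custom criteria the reviewers score
            "customMetrics": [
                {"name": "Accuracy", "description": "Factual and scientific correctness",
                 "ratingMethod": "IndividualLikertScale"},
                {"name": "Relevance", "description": "Relevance to the research prompt",
                 "ratingMethod": "IndividualLikertScale"},
            ],
            # Prompt dataset the models are evaluated against (placeholder S3 location)
            "datasetMetricConfigs": [
                {
                    "taskType": "Generation",
                    "dataset": {"name": "research-prompts",
                                "datasetLocation": {"s3Uri": "s3://example-bucket/prompts.jsonl"}},
                    "metricNames": ["Accuracy", "Relevance"],
                }
            ],
        }
    },
)
print(response["jobArn"])  # ARN of the created evaluation job
```

Once the job completes, the reviewers' ratings are written to the configured S3 location, where they can be aggregated to compare the candidate models.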
Explanation of other options:
A. Amazon Personalize is used for building recommendation systems, not for evaluating model output.
B. Amazon Rekognition is used for analyzing images and videos (e.g., moderation, facial recognition), not textual output.
D. Amazon Comprehend provides NLP services such as sentiment analysis and entity detection, but these signals are not sufficient for a full quality evaluation of LLM-generated research content.
Referenced AWS AI/ML Documents and Study Guides:
Amazon Bedrock Developer Guide – Model Evaluation Overview
AWS Generative AI Best Practices
AWS ML Specialty Study Guide – Evaluation and Feedback Loops in LLMs