The correct answer is C because BERTScore is a widely used metric for evaluating the semantic similarity between generated text (such as summaries) and reference text. It compares the two using contextual token embeddings from BERT rather than exact word overlap, which makes it well suited to generative tasks like summarization, where a good output can phrase the same meaning in different words.
From AWS documentation:
"Amazon Bedrock supports BERTScore for evaluating generative text tasks, such as summarization or translation, by comparing the semantic similarity between the output and a reference."
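To make the metric concrete, here is a minimal sketch of the greedy-matching idea behind BERTScore. Note the toy random vectors stand in for real BERT embeddings (an assumption for illustration only; the real metric runs tokens through a BERT model first):

```python
import numpy as np

def cosine_sim_matrix(cand: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between candidate and reference token vectors."""
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    return cand @ ref.T

def bertscore_f1(cand: np.ndarray, ref: np.ndarray) -> float:
    """BERTScore-style F1: each token greedily matches its most similar counterpart."""
    sim = cosine_sim_matrix(cand, ref)
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Toy sanity check: a "sentence" compared against itself scores a perfect 1.0.
rng = np.random.default_rng(0)
sent = rng.normal(size=(4, 8))  # 4 tokens, 8-dim toy embeddings (not real BERT output)
print(round(bertscore_f1(sent, sent), 4))  # identical inputs -> 1.0
```

Because matching happens in embedding space, paraphrases that share little surface vocabulary can still score highly, which is exactly what summarization evaluation needs.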
Why the other options are incorrect:
A. AUC (area under the ROC curve) measures how well a binary classifier separates two classes; it does not apply to generative text output.
B. F1 score balances precision and recall for classification tasks; it compares predicted labels against true labels, not the semantic similarity of free-form text.
D. Real World Knowledge (RWK) score assesses factual accuracy for general text generation tasks; it is not the metric for measuring semantic similarity between a summary and its reference.
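A short worked example of why F1 (option B) belongs to classification rather than generation: it is just the harmonic mean of precision and recall computed from label counts, with no notion of meaning. This is a generic illustration, not tied to any specific AWS API:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from a confusion matrix's counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    return 2 * precision * recall / (precision + recall)

# 8 true positives, 2 false positives, 4 false negatives:
print(round(f1_score(tp=8, fp=2, fn=4), 3))  # precision 0.8, recall ~0.667 -> 0.727
```

Nothing in this computation looks at the text itself, which is why F1 cannot judge whether a generated summary means the same thing as a reference.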
Referenced AWS AI/ML documents and study guides:
Amazon Bedrock Documentation – Model Evaluation Metrics
AWS ML Specialty Guide – Evaluating Generative Models
AWS Generative AI Developer Tools