Option B is the correct evaluation configuration because it enables end-to-end assessment of both retrieval and generation quality while supporting direct comparison of chunking strategies and foundation models. Amazon Bedrock evaluation jobs support RAG workflows by measuring how well retrieved context contributes to accurate, high-quality model outputs.
A retrieve-and-generate evaluation job assesses the complete RAG pipeline, not just the retrieval step. This is essential for medical information use cases, where both the relevance of retrieved content and the correctness of generated responses directly affect user safety and trust. Including multiple chunking strategies in the evaluation dataset allows side-by-side comparison under identical prompts and conditions.
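As a concrete illustration of holding prompts constant across strategies, the sketch below writes an identical prompt set for each chunking-strategy variant so that score differences can be attributed to chunking rather than to the questions asked. The record fields, file names, and strategy names are placeholders for illustration, not the exact Amazon Bedrock evaluation dataset schema.

```python
import json

# Shared prompts with vetted reference answers (placeholder content).
PROMPTS = [
    {
        "prompt": "What is the recommended adult dosage described in the formulary?",
        "referenceResponse": "State the approved adult dosage from the formulary entry.",
    },
    # ... additional medical-domain prompts with reviewed reference answers
]

# Each strategy corresponds to a knowledge base indexed with that chunking approach.
CHUNKING_STRATEGIES = ["fixed_512_tokens", "semantic", "hierarchical"]

for strategy in CHUNKING_STRATEGIES:
    # One JSONL file per strategy; every file contains the same prompts, so results
    # differ only because of how the source documents were chunked and indexed.
    with open(f"eval_dataset_{strategy}.jsonl", "w") as f:
        for record in PROMPTS:
            f.write(json.dumps(record) + "\n")
```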
Custom precision-at-k metrics measure how effectively the retrieval component surfaces relevant chunks, while an LLM-as-a-judge metric provides qualitative scoring of generated responses. Using a numeric scale enables consistent, repeatable evaluation and supports automated quality gates. Amazon Bedrock supports LLM-based evaluators to score dimensions such as accuracy, completeness, and relevance.
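For reference, precision-at-k itself reduces to a small computation over retrieved chunk identifiers and ground-truth relevance labels. A minimal sketch (the identifiers and labels are placeholders) follows:

```python
from typing import Sequence, Set

def precision_at_k(retrieved_chunk_ids: Sequence[str],
                   relevant_chunk_ids: Set[str],
                   k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant.

    Dividing by k (rather than by the number actually retrieved) penalizes
    retrievals that return fewer than k chunks.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_chunk_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_chunk_ids)
    return hits / k

# Example: 2 of the top 3 retrieved chunks are relevant -> precision@3 ~= 0.67
print(precision_at_k(["c1", "c7", "c9"], {"c1", "c9", "c4"}, k=3))
```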
Using the same evaluator model to assess outputs from both FMs keeps scoring consistent, so any evaluator bias applies equally to both models and does not skew the comparison. This configuration allows the company to define quantitative thresholds that must be met before deployment, enabling automated promotion through CI/CD pipelines.
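A minimal sketch of such a quality gate is shown below; the metric names, threshold values, and hard-coded scores are assumptions for illustration rather than Bedrock-defined artifacts. In a pipeline, the scores would typically be parsed from the evaluation job's results in Amazon S3.

```python
# Deployment thresholds the candidate configuration must meet (illustrative values).
THRESHOLDS = {
    "precision_at_5": 0.80,       # retrieval quality floor
    "judge_accuracy": 4.0,        # LLM-as-a-judge score on a 1-5 scale
    "judge_completeness": 3.5,
}

def passes_quality_gate(scores: dict[str, float]) -> bool:
    """Return True only if every gated metric meets or exceeds its threshold."""
    failures = {}
    for name, minimum in THRESHOLDS.items():
        observed = scores.get(name, 0.0)
        if observed < minimum:
            failures[name] = (observed, minimum)
    for name, (observed, minimum) in failures.items():
        print(f"FAIL {name}: observed {observed} < required {minimum}")
    return not failures

if __name__ == "__main__":
    # Hard-coded for illustration; a real pipeline would load these from the
    # evaluation job output before deciding whether to promote the build.
    candidate_scores = {
        "precision_at_5": 0.86,
        "judge_accuracy": 4.2,
        "judge_completeness": 3.9,
    }
    if not passes_quality_gate(candidate_scores):
        raise SystemExit(1)  # non-zero exit blocks promotion in CI/CD
```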
Option A evaluates retrieval only and cannot assess generation quality. Option C introduces manual review, which does not scale and delays deployment. Option D separates retrieval and generation evaluation, making it harder to correlate chunking strategies with final output quality.
Therefore, Option B best meets the requirements for systematic evaluation, comparison, and quality enforcement in an Amazon Bedrock–based RAG system.