The verified answer is A: Balance class representation in the dataset. The requirement is to train an AI model on data that reflects the diversity of a global customer base. In AWS responsible AI guidance, dataset quality is directly connected to fairness, bias reduction, and model behavior across different user groups. AWS identifies dataset characteristics such as inclusivity, diversity, curated data sources, and balanced datasets as part of responsible AI knowledge for AI practitioners. This directly supports the need for a dataset that represents diverse users instead of overrepresenting one group and underrepresenting another.
AWS also explains that bias can result from imbalances in data or from differences in model performance across groups. Amazon SageMaker Clarify is designed to help detect and mitigate potential bias during data preparation, after training, and in deployed models by examining specified attributes. This supports the principle that the training dataset should be examined and balanced before model training so that the model does not preferentially learn patterns from majority groups while performing poorly for minority groups.
Option B is incorrect because using a regional dataset, even if complete, does not satisfy a global diversity requirement. A complete regional dataset may still exclude languages, demographics, cultural contexts, behaviors, and usage patterns from other regions. Option C is incorrect because oversampling majority class data worsens imbalance. It increases the dominance of already overrepresented groups and can amplify biased model behavior. Option D is also incorrect because dropping minority class records removes exactly the user groups that must be represented for a global customer base.
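The contrast between balancing (option A) and the rejected options can be illustrated with a minimal sketch. It assumes records are plain dictionaries with a hypothetical `region` field, and it uses random oversampling of the smaller groups as one possible balancing technique; this is an illustration only, not a method prescribed by AWS, and collecting more real data from underrepresented groups is usually preferable to duplicating existing records.

```python
import random
from collections import Counter, defaultdict


def oversample_to_balance(records, group_key, seed=0):
    """Return a copy of records in which every group has as many rows
    as the largest group, by resampling smaller groups with replacement."""
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Bucket the records by the value of the grouping attribute.
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    target = max(len(members) for members in groups.values())

    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Top up smaller groups; k is 0 for the largest group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


# Hypothetical data: one region dominates the raw dataset.
raw = [{"region": "NA"}] * 80 + [{"region": "APAC"}] * 15 + [{"region": "EMEA"}] * 5
balanced = oversample_to_balance(raw, "region")
counts = Counter(r["region"] for r in balanced)
print(counts)
```

After balancing, each region contributes the same number of rows. Oversampling the majority group (option C) or dropping the minority rows (option D) would move the counts in the opposite direction.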
Amazon SageMaker Clarify documentation specifically describes class imbalance bias as a situation where one facet has fewer training samples than another. AWS notes that models may preferentially fit larger facets at the expense of smaller facets, leading to higher training error for underrepresented groups. Therefore, balancing class representation is the best action because it improves dataset representativeness and supports responsible AI fairness for a diverse global user base.
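The class imbalance idea described above can be made concrete with a small calculation. The sketch below assumes the normalized difference CI = (n_a - n_d) / (n_a + n_d), where n_a and n_d are the sample counts of the larger (advantaged) and smaller (disadvantaged) facets. The function name and the sample data are hypothetical, and the code does not call the SageMaker Clarify API.

```python
from collections import Counter


def class_imbalance(facet_values, disadvantaged):
    """Normalized facet imbalance in [-1, 1].

    Values near 0 indicate balanced facets; values near +1 mean the
    disadvantaged facet has very few samples compared with the rest.
    """
    counts = Counter(facet_values)
    n_d = counts[disadvantaged]        # samples in the facet of interest
    n_a = len(facet_values) - n_d      # all remaining samples
    if n_a + n_d == 0:
        raise ValueError("facet_values must not be empty")
    return (n_a - n_d) / (n_a + n_d)


# Hypothetical dataset: 80 records from one region, 20 from another.
regions = ["NA"] * 80 + ["APAC"] * 20
ci = class_imbalance(regions, "APAC")
print(f"class imbalance: {ci:.2f}")
```

With 80 and 20 samples the value is (80 - 20) / 100 = 0.60, which signals that the smaller facet is clearly underrepresented. A balanced dataset would yield a value close to 0.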