According to the Microsoft Azure AI Fundamentals (AI-900) official study guide and the Microsoft Learn module “Identify features of Azure Machine Learning”, the Split Data module in an Azure Machine Learning pipeline divides a dataset into two distinct subsets, typically a training dataset and a testing (or validation) dataset. This is a fundamental step in the supervised machine learning workflow because it lets you evaluate the model’s performance on data it has not seen during training.
In a typical workflow, the data flows as follows:
The dataset is first preprocessed (cleaned, normalized, or transformed).
The Split Data module divides this dataset into two parts: one for training the model and another held back for testing it.
The Train Model module uses the training data output from the Split Data module to learn patterns and build a predictive model.
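The Score Model module then takes the trained model and applies it to the test data output to generate predictions on unseen data. Comparing those predictions with the known labels (the job of the Evaluate Model module) shows how well the model performs.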
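The workflow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Azure Machine Learning API: the function names, the toy threshold "model", and the sample rows are all hypothetical and only mirror the roles of the modules described.

```python
import random

def preprocess(rows):
    # Stand-in for the cleaning step: drop rows whose feature is missing.
    return [r for r in rows if r["x"] is not None]

def split_data(rows, fraction=0.7, seed=0):
    # Shuffle a copy, then cut it into a training part and a test part.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

def train_model(train_rows):
    # Toy "model" (an assumption for illustration): a threshold placed
    # halfway between the mean feature value of each class.
    pos = [r["x"] for r in train_rows if r["label"] == 1]
    neg = [r["x"] for r in train_rows if r["label"] == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def score_model(threshold, rows):
    # Apply the trained model to rows it has not seen: one prediction per row.
    return [1 if r["x"] > threshold else 0 for r in rows]

def evaluate_model(predictions, rows):
    # Compare predictions with the known labels to get accuracy.
    correct = sum(p == r["label"] for p, r in zip(predictions, rows))
    return correct / len(rows)

# Hypothetical data: class 0 has x in 0..9, class 1 has x in 20..29,
# plus one row with a missing feature that preprocessing removes.
raw = [{"x": x, "label": 0} for x in range(10)]
raw += [{"x": x, "label": 1} for x in range(20, 30)]
raw.append({"x": None, "label": 1})

clean = preprocess(raw)
train_rows, test_rows = split_data(clean, fraction=0.7)
model = train_model(train_rows)
predictions = score_model(model, test_rows)
accuracy = evaluate_model(predictions, test_rows)
print(len(train_rows), len(test_rows), accuracy)
```

The point of the sketch is the data flow: only `train_rows` reaches `train_model`, and only `test_rows` reaches `score_model`, so the accuracy figure reflects rows the model never saw.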
The Split Data module typically uses a defined ratio, such as 0.7 (70% of rows for training and the remaining 30% for testing). Holding back the test rows does not by itself make the model generalize; it lets you check whether the trained model generalizes to new, real-world data rather than simply memorizing the training examples.
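As a rough illustration of the 70/30 ratio, the sketch below splits 100 rows by a fraction. The function `split_rows` and its parameters (`fraction_first`, `randomized`, `seed`) are hypothetical names chosen for this example; they loosely echo the kind of settings a split step exposes and are not the Azure Machine Learning interface.

```python
import random

def split_rows(rows, fraction_first=0.7, randomized=True, seed=42):
    # Copy the input so the caller's list is left untouched.
    rows = list(rows)
    if randomized:
        # A fixed seed makes the shuffle, and so the split, repeatable.
        random.Random(seed).shuffle(rows)
    # Number of rows that go to the first (training) output.
    cut = round(len(rows) * fraction_first)
    return rows[:cut], rows[cut:]

records = list(range(100))  # stand-in for 100 dataset rows
train, test = split_rows(records, fraction_first=0.7)
print(len(train), len(test))  # 70 30
```

Every row lands in exactly one of the two outputs, which is what keeps the test set truly unseen during training.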
Now, addressing the incorrect options:
A. Selecting columns that must be included in the model is done by the Select Columns in Dataset module.
C. Diverting records that have missing data is handled by the Clean Missing Data module.
D. Scaling numeric variables is done using the Normalize Data module. (Edit Metadata changes column metadata such as data types or names; it does not rescale values.)
Therefore, based on the AI-900 learning objectives, the correct answer is B. creating training and validation datasets.