The best option for reducing the sensitivity of the dataset before training the model is to use the Cloud Data Loss Prevention (DLP) API to scan for sensitive data, and then use Dataflow with the DLP API to encrypt sensitive values with Format Preserving Encryption. This option keeps every column in the dataset while protecting the sensitive data from unauthorized access or exposure. The Cloud DLP API can detect and classify many types of sensitive data, such as names, email addresses, phone numbers, and credit card numbers [1]. Dataflow can create scalable and reliable pipelines to process large volumes of data from BigQuery and other sources [2]. Format Preserving Encryption (FPE) encrypts sensitive data while preserving its original format and length, which helps maintain the utility and validity of the data [3]. By using Dataflow with the DLP API, you can apply FPE to the sensitive values in the dataset and store the encrypted data in BigQuery or another destination. You can also use the same pipeline to decrypt the data when needed, by using the same encryption key and method [4].
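As a minimal sketch of what such a de-identification step might look like, the dict below mirrors the shape of a Cloud DLP `DeidentifyConfig` that uses `CryptoReplaceFfxFpeConfig` (the FPE transformation). The project ID, key ring, and wrapped key shown here are placeholders, not real resources; in a real pipeline this config would be passed to the DLP client's `deidentify_content` call (and the same config with `reidentify_content` to reverse it).

```python
# Illustrative Cloud DLP de-identify configuration using Format Preserving
# Encryption. All resource names and key material below are placeholders.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [
                    {"name": "PHONE_NUMBER"},
                    {"name": "CREDIT_CARD_NUMBER"},
                ],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        "crypto_key": {
                            "kms_wrapped": {
                                # placeholder: a data key wrapped by Cloud KMS
                                "wrapped_key": "<base64-wrapped-key>",
                                "crypto_key_name": (
                                    "projects/my-project/locations/global/"
                                    "keyRings/my-ring/cryptoKeys/my-key"
                                ),
                            }
                        },
                        # NUMERIC keeps digits as digits, so a 10-digit phone
                        # number encrypts to another 10-digit string.
                        "common_alphabet": "NUMERIC",
                    }
                },
            }
        ]
    }
}

fpe = deidentify_config["info_type_transformations"]["transformations"][0][
    "primitive_transformation"]["crypto_replace_ffx_fpe_config"]
print(fpe["common_alphabet"])  # NUMERIC
```

Because the same key and config deterministically reverse the transformation, the pipeline can tokenize on ingest and detokenize for authorized consumers without changing the table schema.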
The other options are not as suitable as option B, for the following reasons:
Option A: Using Dataflow to ingest the columns with sensitive data from BigQuery and then randomizing the values in each sensitive column would reduce the sensitivity of the data, but also its utility and accuracy. Randomization replaces sensitive data with random values, which prevents re-identification but also distorts the distribution of the data and its relationships with other columns [3]. This can hurt the performance and quality of the ML model, especially if every column is critical to the model.
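The toy example below (not GCP-specific; the column names are made up) illustrates why randomization hurts a model: shuffling a sensitive column preserves its marginal distribution but destroys its correlation with the target the model is trying to learn.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, computed from scratch for the demo."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

rng = random.Random(0)
income = [20_000 + 100 * i for i in range(1_000)]          # sensitive column
label = [0.5 * x + rng.gauss(0, 1_000) for x in income]    # target depends on it

corr_before = pearson(income, label)

randomized = income[:]        # "randomization": shuffle the column's values
rng.shuffle(randomized)
corr_after = pearson(randomized, label)

# The column still has the same values overall, but its predictive
# relationship with the label is gone.
print(round(corr_before, 3), round(corr_after, 3))
```

FPE avoids this loss for use cases that only need consistent tokens, because equal inputs map to equal outputs, so joins and groupings on the column survive.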
Option C: Using the Cloud DLP API to scan for sensitive data, and using Dataflow to replace all sensitive data with the encryption algorithm AES-256 and a salt, would reduce the sensitivity of the data, but also its utility and validity. AES-256 is a symmetric encryption algorithm that uses a 256-bit key to encrypt and decrypt data. A salt is a random value added to the data before encryption to increase the randomness and security of the output. However, AES-256 does not preserve the format or length of the original data, which can cause problems when storing or processing it. For example, if the original data is a 10-digit phone number, AES-256 produces a much longer string with a different character set, which can break the schema or logic of the dataset [3].
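The format problem can be shown concretely. Python's standard library has no AES implementation, so the sketch below uses a salted SHA-256 digest purely as a stand-in: like AES-256 output, the token is neither 10 characters long nor numeric, so it no longer fits a phone-number column.

```python
import base64
import hashlib
import os

# Stand-in for AES-256 output: any non-format-preserving transformation
# (salted hash or block cipher alike) breaks the original value's shape.
phone = "5551234567"            # 10-digit value that fits the column schema
salt = os.urandom(16)           # random salt, as in option C
token = base64.b64encode(
    hashlib.sha256(salt + phone.encode()).digest()
).decode()

print(len(phone), len(token))   # 10 vs 44: the token overflows the schema
print(token.isdigit())          # False: it is not even numeric
```

A downstream consumer expecting `STRING(10)` of digits would reject or mishandle such tokens, which is exactly the breakage FPE is designed to avoid.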
Option D: Before training, using BigQuery to select only the columns that do not contain sensitive data, and creating an authorized view so that sensitive values cannot be accessed by unauthorized individuals, would reduce the exposure of the sensitive data, but also the completeness and relevance of the dataset. An authorized view is a BigQuery view that lets you share query results with particular users or groups without giving them access to the underlying tables. However, this option assumes that you can reliably identify which columns contain sensitive data, which may not be easy or accurate. Moreover, it removes columns from the dataset, which can hurt the performance and quality of the ML model, especially if every column is critical to the model.
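For concreteness, a hypothetical version of option D might look like the DDL below (held here as a Python string; every project, dataset, table, and column name is made up). The view's dataset would then be granted authorized access to the source dataset, so users querying the view never need permissions on the underlying table, but the omitted columns are simply unavailable to the model.

```python
# Hypothetical BigQuery DDL for option D: a view exposing only the columns
# believed to be non-sensitive. All names below are illustrative.
create_view_sql = """
CREATE OR REPLACE VIEW `my-project.shared_views.training_data` AS
SELECT
  age,          -- kept: not sensitive
  tenure_days,  -- kept: not sensitive
  plan_type     -- kept: not sensitive
  -- email, phone_number, and ssn are deliberately omitted
FROM `my-project.raw.customers`
"""

print("CREATE OR REPLACE VIEW" in create_view_sql)  # True
```

The trade-off is visible in the query itself: the dropped columns protect privacy but cannot contribute any signal to training, unlike FPE-tokenized columns, which remain in place.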
References:
- Preparing for Google Cloud Certification: Machine Learning Engineer, Course 5: Responsible AI, Week 2: Privacy
- Google Cloud Professional Machine Learning Engineer Exam Guide, Section 5: Developing responsible AI solutions, 5.2 Implementing privacy techniques
- Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 9: Responsible AI, Section 9.4: Privacy
- De-identification techniques
- Cloud Data Loss Prevention (DLP) API
- Dataflow
- Using Dataflow and Sensitive Data Protection to securely tokenize and import data from a relational database to BigQuery
- AES encryption
- Salt (cryptography)
- Authorized views