
Generative AI and Data Privacy: The Challenge of PII Use in Training Data Sets

by Bill Tolson


The rapid advancement of generative AI models, such as large language models and image generators, has ushered in a new era of technological capabilities. However, this progress also raises significant data privacy compliance concerns when personally identifiable information (PII) is included in the data used to train these powerful AI systems.

Generative AI and data privacy

At the core of this issue is the fundamental tension between the vast amounts of data needed to train generative AI models, the rights of individual data subjects (including the right to have their PII deleted), and the principles of data minimization and purpose limitation enshrined in data privacy regulations such as the EU's GDPR, California's CCPA/CPRA, and a growing number of other state data privacy laws.

Generative AI models are trained on massive datasets, often containing billions or trillions of data points, to learn the patterns and relationships needed to generate human-like text, images, or other outputs. While this training data can come from many places, including publicly available sources, it is not uncommon for organizations to incorporate PII, such as personal names, addresses, or other identifiable information, into their training sets.

Using PII in AI training sets raises several concerns from a data privacy perspective. First, it may violate the principle of purpose limitation if the personal data was originally collected for a purpose other than training AI models. Additionally, it could conflict with the data minimization principle, which requires organizations to limit the collection and processing of personal data to what is strictly necessary.

Data privacy law rights

Moreover, many data privacy laws grant individuals the right to access, rectify, or delete their personal data held by organizations. However, when PII is deeply embedded within the parameters of a trained generative AI model, it becomes incredibly challenging, if not impossible, to isolate and remove that specific data point without significantly degrading the model's performance.

This predicament poses a significant compliance challenge for organizations operating under data privacy regulations. Failing to effectively remove requested PII from trained AI models could be considered a violation, potentially leading to substantial fines, legal actions, and reputational damage.

To mitigate these risks, organizations should explore techniques such as synthetic data generation, differential privacy, and federated learning, which can help obfuscate or anonymize individual PII while preserving the overall patterns and distributions in the training data.
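To illustrate one of these techniques, the sketch below shows the core idea of differentially private training (in the spirit of DP-SGD): each individual's contribution to a model update is clipped and then masked with noise. It is a minimal NumPy example with toy data, an assumed linear model, and illustrative hyperparameters, not the approach of any particular vendor or library.

```python
# Minimal sketch of a differentially private training step (DP-SGD style).
# The linear model, clipping norm, and noise multiplier are illustrative
# assumptions, not a production recipe.
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(weights, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
    """One gradient step where each person's influence is bounded and noised."""
    per_example_grads = []
    for xi, yi in zip(X, y):
        residual = xi @ weights - yi              # squared-error loss gradient
        grad = residual * xi
        # Clip each individual's gradient so no single record dominates.
        norm = np.linalg.norm(grad)
        grad = grad * min(1.0, clip_norm / (norm + 1e-12))
        per_example_grads.append(grad)
    summed = np.sum(per_example_grads, axis=0)
    # Add Gaussian noise calibrated to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    noisy_mean = (summed + noise) / len(X)
    return weights - lr * noisy_mean

# Toy usage: 200 synthetic records with 3 features each.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=200)
w = np.zeros(3)
for _ in range(50):
    w = dp_sgd_step(w, X, y)
print("weights after DP training:", w)
```

The clipping step is what limits how much any one data subject can shape the model, and the added noise is what provides the formal privacy guarantee; together they reduce the degree to which a trained model memorizes any individual's PII.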

A best practice would include maintaining detailed data lineage and documentation of training data sources, preprocessing steps, and model versioning to better identify and manage PII within AI systems. Additionally, companies should consider updating their consent requests to explicitly cover the inclusion of an individual's PII in generative AI training data sets.
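As an example of what such lineage documentation might capture, the sketch below defines a simple per-dataset record. The field names and values are hypothetical; a real schema would align with an organization's own governance and consent-management tooling.

```python
# Minimal sketch of a data lineage record for a training dataset.
# Field names and example values are hypothetical.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingDataLineage:
    dataset_id: str
    source: str                       # where the data came from
    collection_purpose: str           # original purpose of collection
    consent_reference: str            # pointer to the consent language obtained
    contains_pii: bool
    pii_fields: list[str] = field(default_factory=list)
    preprocessing_steps: list[str] = field(default_factory=list)
    model_versions_trained: list[str] = field(default_factory=list)
    recorded_on: date = field(default_factory=date.today)

record = TrainingDataLineage(
    dataset_id="crm-export-2024-03",
    source="CRM export (customer support transcripts)",
    collection_purpose="customer support quality review",
    consent_reference="consent-form-v4, section 3 (AI training)",
    contains_pii=True,
    pii_fields=["name", "email"],
    preprocessing_steps=["email masking", "name pseudonymization"],
    model_versions_trained=["support-llm-v2.1"],
)
print(record)
```

Keeping records like this per dataset makes it far easier to answer the question "which models were trained on this person's data?" when a deletion request arrives.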

Furthermore, using PII in generative AI training raises broader ethical and legal concerns beyond compliance with data deletion requests. Organizations must carefully evaluate the potential risks, such as algorithmic bias, discrimination, and privacy violations, and implement appropriate safeguards and governance processes.

As generative AI continues to advance and permeate various industries, regulatory bodies and policymakers may need to provide additional guidance or legal frameworks explicitly addressing the challenges posed by AI systems and personal data. Striking the right balance between innovation and data privacy protection will be crucial for fostering public trust and enabling the responsible development of these transformative technologies.

Implications of generative AI models and the new data privacy laws

If a data subject exercises their right to have their PII deleted under data privacy laws like the GDPR, CCPA/CPRA, or the numerous other state data privacy laws, and that PII was used to train a generative AI model, there could be significant challenges in fully complying with the deletion request.

Key implications and considerations include:

Difficulty in Isolating and Removing Specific PII: Generative AI models, especially large language models or image generators, are trained on massive datasets containing vast numbers of data points. Identifying and removing a specific data subject's PII from an already trained model can be extremely difficult, if not impossible, because the influence of the training data is distributed across the model's parameters.

Potential Model Degradation: If a data subject's PII is successfully identified and removed from the training data, retraining the generative AI model without that data could degrade its performance and accuracy and potentially change the model's conclusions and recommendations. The extent of this degradation would depend on the size of the overall training dataset and the significance of the removed data points.

Compliance Challenges: Failing to effectively remove the requested PII from the trained AI model could be considered a violation of data privacy regulations, potentially leading to fines, legal actions, and damage to the organization's reputation.

Synthetic Data and Differential Privacy: One potential solution could be using synthetic data generation techniques or differential privacy methods during the initial training process. These approaches could help obscure or anonymize individual PII while preserving the overall patterns and distributions in the data, making it easier to comply with deletion requests without significantly impacting the AI model's performance (a simple synthetic data sketch follows this list).

Data Lineage and Documentation: Maintaining detailed data lineage and documentation of the training data sources, preprocessing steps, and model versioning could help organizations better identify and manage PII within their AI systems, facilitating compliance with data subject rights.

Ethical and Legal Considerations: Using PII in generative AI training raises ethical and legal concerns beyond compliance with data deletion requests.
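To make the synthetic data idea referenced in the list above concrete, here is a deliberately simplified sketch: it fits independent Gaussians to each numeric column of a hypothetical customer table and samples new rows, so no released row corresponds to a real individual. This toy approach preserves only marginal statistics; practical systems use richer generative models, and synthetic data alone does not guarantee privacy without further safeguards.

```python
# Minimal sketch of synthetic data generation for a numeric training table.
# The "real" table below is itself randomly generated for illustration.
import numpy as np

rng = np.random.default_rng(42)

def synthesize(real_data: np.ndarray, n_rows: int) -> np.ndarray:
    """Sample synthetic rows from independent Gaussians fitted per column."""
    means = real_data.mean(axis=0)
    stds = real_data.std(axis=0)
    return rng.normal(loc=means, scale=stds, size=(n_rows, real_data.shape[1]))

# Hypothetical "real" table: age, income, tenure for 1,000 customers.
real = np.column_stack([
    rng.normal(45, 12, 1000),        # age
    rng.normal(60000, 15000, 1000),  # income
    rng.normal(6, 3, 1000),          # tenure in years
])
synthetic = synthesize(real, n_rows=1000)
print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```

Because the model would then be trained on the synthetic rows rather than the original records, a later deletion request for one individual's PII is far less likely to require retraining the model at all.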

One last question:

If a specific data subject requests deletion of their PII, and that PII was used in generative AI content creation, would all of the content generated by that model also need to be deleted and the model retrained on the revised training set?

I have not yet reviewed every one of the new AI laws emerging worldwide, but so far, I have not seen these questions or concerns addressed.

Ultimately, addressing data subject deletion requests for generative AI models trained on PII may require a combination of technical solutions, robust data governance practices, and, potentially, new regulatory guidance or legal frameworks specifically addressing the challenges posed by AI systems and their use of PII.

