Is Your LLM Training Data Legal? The AB 2013 High-Level Summary Guide
Your training data is no longer a secret. Here's how to survive California's AB 2013. 📂
The New Transparency Mandate
AB 2013 forces developers of generative AI systems to pull back the curtain on their training data. Simply saying "internet data" is no longer enough: you must publish a high-level summary that identifies the sources and categories of data used.
What Needs to be Disclosed?
- Data Sources: Where did the data come from? (e.g., Common Crawl, licensed datasets, user-generated content).
- Data Categories: What kind of data is it? (e.g., text, images, code, medical records).
- Copyright Status: Does the data include material protected by copyright, trademark, or patent, or is it entirely in the public domain?
- Personal Information: Does it contain personally identifiable information (PII)?
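One way to keep these four questions answered per dataset is a small structured record. The sketch below is only an illustration: the class name `DatasetDisclosure`, its field names, and the sample entries are assumptions made up for this example, not terms or a format taken from the statute.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DatasetDisclosure:
    """One row of a disclosure summary. Field names are illustrative only."""
    source: str                   # where the data came from
    categories: List[str]         # what kind of data it is
    contains_copyrighted: bool    # does it include copyrighted material?
    contains_personal_info: bool  # does it contain PII?


def render_summary(entries: List[DatasetDisclosure]) -> str:
    """Serialize the disclosure rows as pretty-printed JSON."""
    return json.dumps([asdict(e) for e in entries], indent=2)


# Hypothetical entries; the values are placeholders, not real findings.
entries = [
    DatasetDisclosure("Common Crawl", ["text"], True, True),
    DatasetDisclosure("Licensed code corpus (hypothetical)", ["code"], True, False),
]

summary_json = render_summary(entries)
print(summary_json)
```

Keeping one record per dataset means the published summary can be regenerated whenever a source is added or dropped.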
Creating Your Summary
The key is to be descriptive without being exhaustive. You don't need to list every URL, but you must characterize the dataset accurately.
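To make "descriptive without being exhaustive" concrete, here is a minimal sketch that rolls a per-document manifest up into source-level and category-level counts instead of a URL list. The manifest layout (`url` and `category` keys), the `characterize` function, and the example URLs are all hypothetical; your pipeline's metadata will likely look different.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical per-document manifest; a real one would come from your pipeline.
manifest = [
    {"url": "https://example.com/a", "category": "text"},
    {"url": "https://example.com/b", "category": "text"},
    {"url": "https://example.org/repo/x.py", "category": "code"},
]


def characterize(records):
    """Aggregate individual documents into a high-level characterization."""
    by_domain = Counter(urlparse(r["url"]).netloc for r in records)
    by_category = Counter(r["category"] for r in records)
    total = len(records)
    return {
        "total_documents": total,
        "documents_by_source_domain": dict(by_domain),
        # Share of documents per category, rounded to three decimal places.
        "share_by_category": {c: round(n / total, 3) for c, n in by_category.items()},
    }


summary = characterize(manifest)
print(summary)
```

The output names where the data came from and in what proportions without exposing individual URLs, which is the level of detail a high-level summary aims for.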
Conclusion
Start auditing your data pipelines now. Retrospective documentation is painful and error-prone.