Training data is the large text corpus used to teach LLMs to understand and generate language. It includes web pages, books, and articles, and it shapes a model's knowledge and biases.
What is Training Data?
Training data is the collection of text used to teach LLMs to understand and generate language.
Major Sources
- Common Crawl (web archives)
- Books and literature
- Wikipedia
- GitHub (code)
- Scientific papers
Implications
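The content of the training data shapes what a model knows. Information that was outdated or inaccurate when the data was collected can persist in the model's output, so the quality and recency of the sources matter.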
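As a rough illustration of why quality and recency matter, the sketch below filters a toy corpus before it would be used for training. Everything in it is an assumption made for the example: the `Document` class, the `filter_corpus` function, the cutoff date, and the word-count threshold are hypothetical, and real pipelines use far more elaborate quality signals.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    source: str          # illustrative label, e.g. "wikipedia"
    text: str
    last_updated: date


def filter_corpus(docs, cutoff, min_words=5):
    """Keep documents updated on or after `cutoff` with at least `min_words` words.

    Word count is a stand-in for a quality check (assumption for this sketch).
    """
    return [
        d for d in docs
        if d.last_updated >= cutoff and len(d.text.split()) >= min_words
    ]


docs = [
    Document("web", "Short stub.", date(2023, 5, 1)),
    Document("wikipedia", "A reasonably complete article about a stable topic.", date(2022, 1, 15)),
    Document("web", "An old page describing a product that no longer exists.", date(2009, 3, 2)),
]

kept = filter_corpus(docs, cutoff=date(2015, 1, 1))
print([d.source for d in kept])
```

Here the stub is dropped for being too short and the 2009 page for being too old, so only the Wikipedia-style document remains. Whatever passes a filter like this is what the model can learn from.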