Training data is the massive dataset used to teach large language models (LLMs) language understanding and generation. It includes web pages, books, and articles, which shape an AI model's knowledge and biases.

What is Training Data?

Training data is the collection of text used to teach LLMs to understand and generate language.

Major Sources

  • Common Crawl (web archives)
  • Books and literature
  • Wikipedia
  • GitHub (code)
  • Scientific papers

Implications

The content of the training data shapes what a model knows. Outdated or inaccurate information in the data can persist in the model's answers, which makes the quality and recency of sources important.
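To make the point about quality and recency concrete, here is a minimal sketch of filtering a tiny corpus before training. Everything in it is an assumption made for illustration: the `Document` type, the `keep` function, the word-count threshold, and the cutoff date are hypothetical, and real data pipelines are far more elaborate than this.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    source: str      # hypothetical label, e.g. "web", "books", "wikipedia"
    text: str
    published: date


def keep(doc: Document, cutoff: date, min_words: int = 5) -> bool:
    """Toy filter (an assumption, not a real pipeline):
    drop very short documents and documents older than the cutoff."""
    long_enough = len(doc.text.split()) >= min_words
    recent_enough = doc.published >= cutoff
    return long_enough and recent_enough


# Made-up example documents, not real data.
corpus = [
    Document("web", "The current release of the library is version 1.0.", date(2015, 3, 1)),
    Document("web", "The current release of the library is version 4.2.", date(2024, 6, 1)),
    Document("web", "click here", date(2024, 6, 2)),
]

filtered = [d for d in corpus if keep(d, cutoff=date(2020, 1, 1))]

for d in filtered:
    print(d.published, d.text)
```

In this sketch the 2015 document is dropped as stale and the two-word page is dropped as low quality, so only the 2024 statement remains. Without such a filter, both the old and the new version claims would sit in the data side by side.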

