Technical

Training Data

Training data is the large body of text used to teach LLMs to understand and generate language. It includes web pages, books, and articles, and it shapes both what a model knows and the biases it carries.

What is Training Data?

Training data is the collection of text used to teach LLMs to understand and generate language.

Major Sources

  • Common Crawl (web archives)
  • Books and literature
  • Wikipedia
  • GitHub (code)
  • Scientific papers
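
As a rough illustration of how sources like these might be combined, here is a minimal sketch that draws documents from each source in proportion to a weight. The weights and the `sample_sources` helper are hypothetical, invented for this example, and do not describe any real model's data mix.

```python
import random

# Hypothetical mixture weights; illustrative only, not taken from any real model.
SOURCE_WEIGHTS = {
    "common_crawl": 0.60,
    "books": 0.15,
    "wikipedia": 0.05,
    "github": 0.10,
    "scientific_papers": 0.10,
}

def sample_sources(n, weights=SOURCE_WEIGHTS, seed=0):
    """Draw n source names at random, in proportion to their weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)

if __name__ == "__main__":
    picks = sample_sources(1000)
    for name in SOURCE_WEIGHTS:
        print(name, picks.count(name))
```

With weights like these, web-crawl text would dominate the sample, which is one way a single source can end up with an outsized influence on what a model learns.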

Implications

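The content of the training data determines what a model knows. Information that was already outdated when the data was collected can persist in the model's answers, so the quality and recency of source content both matter.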
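
To make the recency point concrete, here is a minimal sketch of a dataset snapshot taken at a cutoff date. The documents, dates, and the `snapshot` helper are all invented for illustration; real pipelines are far more involved.

```python
from datetime import date

# Hypothetical documents; the texts and dates are invented for illustration.
DOCS = [
    {"text": "Product X costs $10.", "published": date(2021, 3, 1)},
    {"text": "Product X now costs $15.", "published": date(2024, 6, 1)},
]

def snapshot(docs, cutoff):
    """Keep only documents published on or before the cutoff date."""
    return [d for d in docs if d["published"] <= cutoff]

# Only the older document falls inside the snapshot.
corpus = snapshot(DOCS, cutoff=date(2023, 1, 1))
print([d["text"] for d in corpus])  # ['Product X costs $10.']
```

A model trained on this snapshot would only ever see the old price, because the correction was published after the cutoff.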
