Whether copyrighted material is needed to train the best AI models has long been debated. In 2023, OpenAI told the UK Parliament that it was ‘impossible’ to train leading models without using copyrighted content, a claim that sits at the center of ongoing legal and ethical disputes. Recent developments challenge that position, offering evidence that large language models can be trained without copyrighted material.
The Common Corpus initiative has emerged as the largest public domain dataset for LLM training. This international collaboration, led by Pleias and involving researchers in LLM pretraining, AI ethics and cultural heritage, challenges the status quo. The multilingual, diverse dataset shows that LLMs can be trained without copyright concerns and marks a significant change in the AI landscape.
Fairly Trained, a nonprofit focused on fair AI training practices, has taken a decisive step in the same direction. It has awarded its first certification for an LLM built without copyright infringement, a model known as KL3M. KL3M was developed by 273 Ventures, a Chicago-based legal technology consulting startup. Ed Newton-Rex, CEO of Fairly Trained, who oversees the certification process, says “there is no fundamental reason why someone cannot fairly train an LLM.”
KL3M was trained on the Kelvin Legal DataPack, a dataset curated by 273 Ventures that contains thousands of legal documents reviewed for compliance with copyright law. At approximately 350 billion tokens, it is smaller than the datasets compiled by OpenAI and others that scrape the Internet, yet the model performs well. Company founder Jillian Bommarito attributes KL3M’s success to the rigorous screening applied to the data. Curated datasets like this one could allow AI models to be tailored precisely to a given task. 273 Ventures is now offering spots on its waiting list to clients who want access to the resource.
The researchers behind Common Corpus assembled a collection of texts roughly equal in size to the data used to train OpenAI’s GPT-3 model and made it available on Hugging Face, an open source AI platform. So far, 273 Ventures’ model is the only LLM that Fairly Trained has certified, but the emergence of Common Corpus and KL3M marks a shift in the AI environment. Advocates for fairer AI, especially those speaking for artists affected by data scraping, see these initiatives as pivotal in challenging the norm. Other recent Fairly Trained certifications, including the Spanish voice modulation startup VoiceMod and the heavy metal AI band Frostbite Orckings, extend beyond LLMs and hint at a broader scope for AI certification.
The Kelvin Legal DataPack, the training dataset created by 273 Ventures, has limitations as well as advantages. Its copyright-reviewed legal documents are a valuable resource, but much of the public domain data available is out of date. This is especially true in places like the United States, where copyright protection generally lasts 70 years after the creator’s death. As a result, such data may not be suitable for grounding AI models in current events.
Sajjad Ansari is a final year undergraduate student at IIT Kharagpur. A technology enthusiast, he explores the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to express complex AI concepts in a clear and accessible way.