In their desperate pursuit of AI training data, tech giants OpenAI, Google, and Meta have reportedly ignored corporate policies, altered their own rules, and discussed skirting copyright law.
A New York Times investigation shows just how far these companies have gone to gather online information to feed their data-hungry AI systems.
Facing a shortage of reputable English text data, OpenAI researchers in late 2021 developed a speech recognition tool called Whisper that could transcribe YouTube videos.
Despite internal discussions about possibly violating YouTube’s rules, which prohibit using its videos for “independent” applications, the NYT found that OpenAI ultimately transcribed over one million hours of YouTube content. Greg Brockman, president of OpenAI, personally helped collect the footage, and the transcribed text was fed into GPT-4.
Google has also reportedly copied YouTube videos to collect text for its AI models, potentially infringing on the copyrights of video creators.
This comes just days after YouTube’s CEO said such activity would violate the company’s terms of service and undermine creators.
In June 2023, Google’s legal department requested changes to the company’s privacy policy to allow publicly available content from Google Docs and other Google apps to be used for a broader range of AI products.
Faced with a shortage of its own data, Meta considered a variety of options to obtain more training data.
Executives discussed paying for book licenses, acquiring publisher Simon & Schuster, and even risking potential lawsuits by collecting copyrighted material from the Internet without permission.
Meta’s lawyers argued that using the data to train AI systems should fall under “fair use,” citing a 2015 court ruling involving Google’s book scanning project.
Ethical concerns and the future of AI training data
The collective actions of these technology companies highlight the importance of online data in the rapidly growing AI industry.
This practice has raised concerns about copyright infringement and fair compensation for creators.
Filmmaker and author Justine Bateman told the Copyright Office that AI models were taking her content, including her writing and films, without permission or payment.
“This is the biggest theft in America,” she said in an interview.
In the visual arts, Midjourney and other image models have been shown to generate copyrighted content, such as scenes from Marvel movies.
With some experts predicting that high-quality online data could run out by 2026, companies are exploring alternatives, such as using AI models to generate synthetic training data. However, synthetic data carries its own risks and challenges and can degrade model quality.
OpenAI CEO Sam Altman directly acknowledged the finite nature of online data in a May 2023 tech conference speech, saying, “It’s going to run out.”
Sy Damle, an attorney representing Silicon Valley venture capital firm Andreessen Horowitz, also discussed these challenges: “The data required is so massive that even group licensing doesn’t really work.”
The NYT and OpenAI are locked in a bitter copyright lawsuit, with the Times seeking billions of dollars in damages.
OpenAI hit back, accusing the Times of ‘hacking’ its model to detect instances of copyright infringement.
These revelations underscore how ethically and legally fraught Big Tech’s data collection practices have become.
As lawsuits mount, the legal environment surrounding the use of online data for AI training remains highly unstable.