In their desperate pursuit of AI training data, tech giants OpenAI, Google, and Meta have reportedly ignored corporate policies, altered their own rules, and discussed skirting copyright law.
A New York Times investigation shows just how far these companies have gone to gather online information to feed their data-hungry AI systems.
Facing a shortage of reputable English text data, OpenAI researchers in late 2021 developed a speech recognition tool called Whisper that could transcribe YouTube videos.
Despite internal discussions about possibly violating YouTube’s rules, which prohibit using its videos for “independent” applications, the NYT found that OpenAI ultimately transcribed over one million hours of YouTube content. Greg Brockman, president of OpenAI, personally helped collect the footage, and the transcribed text was fed into GPT-4.
Google has also reportedly copied YouTube videos to collect text for its AI models, potentially infringing on the copyrights of video creators.
This comes just days after YouTube’s CEO said such activity would violate the company’s terms of service and undermine creators.
In June 2023, Google’s legal department requested changes to the company’s privacy policy to allow publicly available content from Google Docs and other Google apps to be used for a broader range of AI products.
Faced with a shortage of its own data, Meta considered a variety of options to obtain more training data.
Executives discussed paying for book licenses, acquiring publisher Simon & Schuster, and even risking potential lawsuits by collecting copyrighted material from the Internet without permission.
Meta’s lawyers argued that using the data to train AI systems should fall under “fair use,” citing a 2015 court ruling involving Google’s book scanning project.
Ethical concerns and the future of AI training data
The collective actions of these technology companies highlight the importance of online data in the rapidly growing AI industry.
This practice has raised concerns about copyright infringement and fair compensation for creators.
Filmmaker and author Justine Bateman told the Copyright Office that AI models were taking her content, including her writing and films, without permission or payment.
“This is the biggest theft in America,” she said in an interview.
In the visual arts, Midjourney and other image models have been shown to generate copyrighted content, such as scenes from Marvel movies.
With some experts predicting that high-quality online data could run out by 2026, companies are exploring alternatives, such as using AI models to generate synthetic training data. However, synthetic data carries its own risks and challenges and can degrade model quality.
OpenAI CEO Sam Altman directly acknowledged the finite nature of online data in a May 2023 tech conference speech, saying, “It’s going to run out.”
Sy Damle, an attorney representing Silicon Valley venture capital firm Andreessen Horowitz, also discussed these challenges: “The data required is so massive that even group licensing doesn’t really work.”
The NYT and OpenAI are locked in a bitter copyright lawsuit, with the Times seeking billions of dollars in damages.
OpenAI hit back, accusing the Times of ‘hacking’ its model to detect instances of copyright infringement.
These revelations underscore how ethically and legally fraught Big Tech’s data collection practices have become.
As lawsuits mount, the legal environment surrounding the use of online data for AI training remains highly unstable.