
Google Proposes Controversial ‘Opt-out’ Model for AI Training Data

Google has sparked controversy with a proposal that publishers must ‘opt out’ if they don’t want their content scraped for AI training purposes.

The tech giant submitted this plan to the Australian government amid a regulatory review of ‘high-risk’ AI like deepfakes and disinformation.

Google argues that AI developers need broad access to data to advance systems like its Bard chatbot. However, critics say this opt-out approach upends copyright laws and forces the burden onto publishers.

Seeking Copyright Changes

In its submission, Google said, “Copyright law should enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data.”

The company points to the robots.txt protocol, which lets sites specify which sections crawlers may not access.
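To illustrate how a robots.txt-style opt-out works in practice, here is a minimal sketch using Python's standard-library parser. The user-agent name "ExampleAIBot" is a hypothetical stand-in for an AI training crawler, not a real bot token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules a publisher might serve: block the
# (illustrative) AI crawler "ExampleAIBot" from /articles/, allow others.
rules = """\
User-agent: ExampleAIBot
Disallow: /articles/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The AI crawler is barred from /articles/; a generic crawler is not.
print(parser.can_fetch("ExampleAIBot", "https://example.com/articles/story"))  # False
print(parser.can_fetch("OtherBot", "https://example.com/articles/story"))      # True
```

Note that the protocol is purely advisory: it depends on the crawler choosing to check and honour these rules, which is one reason critics question robots.txt as the basis for a copyright opt-out.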

Yet Google offered no details on how opting out would work in practice.

A recent blog post vaguely mentioned developing new “standards and protocols” so web creators can choose their AI participation level.

Since debuting Bard in Australia in May, Google has lobbied to relax copyright rules.

However, it’s not alone in seeking more data. AI leader OpenAI aims to expand its ChatGPT dataset using a new web crawler with the same opt-out approach.

Training Data Concerns

The clamour for content comes as AI popularity has exploded. Systems like ChatGPT and Bard rely on ingesting massive volumes of text, images and video to function.

According to OpenAI, “GPT-4 has learned from a variety of licensed, created, and publicly available data sources.”

Google’s proposal essentially tells publishers to “hand over your work for our AI or take action to opt out.” Experts argue this raises ethical issues and upends copyright law by putting the onus on rights holders rather than on those using the content.


Backlash From Publishers

Publishers are pushing back against big tech’s data ambitions. News Corp is already in talks seeking payment from AI firms using its content.

AFP released an open letter blasting the practice, stating:
“Generative AI and large language models are often trained using proprietary media content, which publishers invest large amounts of time and resources to produce. Such practices undermine the media industry’s core business models.”

The media agency said this “violates copyright law” and reduces media diversity by undermining investment in coverage.

Striking a Balance

The debate epitomises the tension between advancing AI through broad data access and respecting ownership rights.

More content consumed means more capable systems. But companies also profit from others’ work without sharing the gains.

In its national AI regulatory review, Australia is examining how to shape the technology’s future trajectory.

If the discourse gives way to data-hungry tech giants’ self-interest, it could entrench an AI ecosystem in which creative work is freely absorbed unless creators actively opt out.

Striking the appropriate balance won’t be straightforward. For smaller publishers with limited resources, opting out may prove difficult.

More debate is needed to develop fair copyright standards that help AI progress while upholding ownership rights.

Rebecca Taylor

Rebecca is our AI news writer. A graduate of Leeds University with an MA in International Journalism, she has a keen eye for the latest AI developments. Her passion for AI, combined with her journalistic expertise, brings insightful news stories to our readers.
