Reddit, Tumblr, X: The 2024 Race for Training Data Source

Published May 10, 2026 · By CampaignsLive · Industry

Through 2024, the major social and content platforms that had historically been informally scraped to train AI models signed a series of formal licensing deals with AI labs. Reddit signed with Google for a reported $60 million annually. Automattic (Tumblr, WordPress) signed with Midjourney and OpenAI. Stack Overflow signed with OpenAI. Photobucket reportedly entered into discussions with multiple labs. The Financial Times, Le Monde, Axel Springer, News Corp, and several major publishers signed similar deals on the text side.

The deals reshaped what “trained on the open internet” actually meant. By the end of 2024, the dominant AI models were no longer trained on what was technically scrape-able from public web pages; they were trained on what was contractually licensed from rights-holders who had the leverage to extract licensing fees. The shift mattered for brand creative production in specific ways that the trade-press coverage mostly missed.

What actually got licensed, and what did not

The publicly-announced deals through 2024 covered specific content corpora.

Reddit submissions and comments. Google’s deal with Reddit gave Google access to the platform’s content corpus for training Gemini and other models. The deal explicitly did not include user-private content (DMs, private subreddits) and included revenue-sharing arrangements with Reddit that the platform passed to users only in extremely limited form.

Tumblr and WordPress posts. Automattic’s deals gave AI labs access to publicly-posted content on Tumblr and WordPress.com properties. The deal included an opt-out mechanism for individual users, though the default was opt-in. The implementation of the opt-out was contested in the months following the deal.

Stack Overflow questions and answers. OpenAI’s deal gave access to Stack Overflow’s structured Q&A corpus for code-generation training. The deal triggered a noticeable user response on the platform; some users began deleting their contributions, others modified them to include explicit anti-training language.

Major publisher archives. The deals between OpenAI, Anthropic, Google, and the major publishers (Financial Times, News Corp, Axel Springer, Le Monde, Vox, The Atlantic, AP, Reuters, others) gave the AI labs access to text content for training and for retrieval-augmented generation. The terms varied; most included multi-year periods, revenue-sharing arrangements, and provisions for content updates.

What did not get licensed, in most cases, was image content. The image-corpus licensing market through 2024 remained more fragmented than the text side. Adobe Stock’s licensed-training arrangement remained the most notable case. Shutterstock’s continued licensing arrangements with OpenAI and others extended. Most of the rest of the image-training corpus remained either openly-scraped or in legal contestation.

Why the deals happened in 2024 specifically

Three pressures came together in the window.

The training-data scarcity story became real. Through 2023 and into 2024, AI labs were openly discussing the prospect that the supply of high-quality training data was becoming a binding constraint on model improvement. The narrative was probably overstated — there is more data available than the discourse implied — but the perception drove labs to lock down access to specific high-quality corpora through licensing rather than relying on scraping that might be cut off.

The legal exposure on scraping became harder to ignore. The series of lawsuits filed through 2023 and 2024 — Getty against Stability AI, the New York Times against OpenAI, Authors Guild and others, the various class actions on behalf of artists and writers — made the legal risk of training-corpus-by-scraping a real consideration on labs’ balance sheets. Licensing converted an uncertain legal exposure into a known operating cost.

The platforms realized they had a sellable asset. Reddit, Tumblr, Stack Overflow, and the major publishers had been giving away their content corpora for free for years. The realization that AI labs would pay material money for access — and that there was no obvious downside to being paid for what was already happening — moved quickly through the platform executive suites through 2023 and 2024.

What this changed for brand creative

The training-data licensing wave affected brand creative production in three indirect but consequential ways.

The “what is this model trained on?” question got better answers. Brand teams evaluating AI tools could increasingly point to specific named corpora — Adobe Stock, Reddit-via-Google, Stack-Overflow-via-OpenAI, the major publishers’ libraries via various labs — rather than the vague “trained on internet data” formulations that had been the default a year earlier. The provenance was, for the leading tools, more traceable.

The training-corpus question became a procurement question. Brand procurement and legal teams began including training-corpus provenance as a standard line item in AI tool evaluations. Tools whose training corpora were licensed rather than scraped became preferable on procurement grounds, particularly in regulated industries and in brands with active rights-management infrastructure.

The output character started to reflect the licensed corpora. Models trained heavily on Reddit content produced output with a specific Reddit-flavored cultural register. Models trained heavily on the major publishers’ archives produced output with a specific journalistic register. The character was visible in the work, and brand teams began choosing tools partly based on which corpus’s flavor was a closer fit for the brand voice.

What did not change

Two things the licensing wave did not address.

The first was image-side training corpus questions. The 2024 deals were heavily skewed toward text training data. The image-side licensing market remained fragmented, with Adobe Stock’s arrangement as the standout case and most of the major image models continuing to train on corpora whose provenance was contested. The brand-creative-specific case — where the training corpus character matters more than in text generation, because brand visual register is more specific — continued to depend on specialized tooling rather than on the broadly-licensed alternatives.

The second was the question of upstream creator compensation. The licensing deals were between platforms and AI labs. The individual contributors whose content sat inside those platforms — Reddit posters, Tumblr bloggers, Stack Overflow answerers, freelance writers contributing to publishers — generally received nothing or token compensation. The platforms captured the licensing value; the creators did not. This generated significant downstream user response on the affected platforms through 2024, including content deletion campaigns, modified-license attempts, and the broader conversation about who actually owns the value in user-contributed content corpora.

The longer trajectory

By the end of 2024, the question of what an AI model was trained on had become a contractual question rather than a technical one. The leading commercial models, for most categories of training data, could point to specific licensed corpora rather than to scraped open-web data. The model providers’ legal exposure had narrowed; the training data scarcity story had been at least partially defused; the platforms had monetized a new asset class.

The structural shift continues to play out. By 2025, the question of training-corpus provenance was a standard part of brand-side AI tool evaluation. The brands that have done well with AI creative production have been the ones that paid attention to the provenance question early and chose tools where the answers were defensible. The brands that did not are spending 2025 working through procurement reviews retroactively.

For the related discussion of how training data shapes output character, see AI Creative vs. AI Slop. For the IP-ownership side of the brand-AI question, see Why Your Brand Should Own Its AI Creative.

Tagged

training data licensing Reddit platforms

Reddit, Tumblr, X: The 2024 Race for Training Data Source

What actually got licensed, and what did not

Why the deals happened in 2024 specifically

What this changed for brand creative

What did not change

The longer trajectory

Tagged

Related reading

Digital Advertising

Print and Out-of-Home

The platform, end to end

Start building campaigns that matter.