I’ve been thinking a lot the past few weeks about a recent edition of Tomasz Tunguz’s newsletter - “What if LLMs Change the Business Model of the Internet?”
Tunguz notes that Reddit’s recent S-1 reveals that 10% of its revenue comes from selling data to train LLMs, and wonders whether this might spell a new business model for the internet. Instead of selling ads that users experience through their front-end interactions with the sites they visit, he thinks the model could evolve into one where sites focus primarily on the product interactions that create the data most valuable to LLM developers.
It’s an interesting take (and there’s good back and forth in the comments on whether it’s likely). To apply the marketing lens to that world, paid digital advertising becomes less about buying the top search result or a contextual display ad, and more about becoming the “sponsor” of the answers that ChatGPT and its competitors provide on topics related to your product. Or maybe even the sponsor of a set amount of usage of those products, akin to when Mercedes, Rolex, and company sponsor an hour of The Masters broadcast with only four minutes of advertising.
There will be a lot of regulatory focus on ensuring that AI owners are sufficiently transparent with the users of their products about which responses, experiences, etc. are sponsored by advertisers, as there should be. But what about the training data? In this new internet landscape, “SEO” becomes less about creating the best content on one’s own website, the right data hierarchy, and high-authority links, and a lot more about increasing mentions of one’s product or brand, in the right context, in the source data. If brands know that Google’s LLM, for instance, is heavily influenced by discussions on Reddit, would it be possible for them to hack the unsponsored, “organic” outputs of AI products?
In the early days of search, the algorithms were unsophisticated enough that one could gain top placements by simply stuffing many mentions of a keyword onto a page or paying “link farms” to give a page backlinks related to a target topic. After the search engine algos plugged those holes, SEOs tried every other trick in the book to stay ahead of them - creating networks of sites for the sole purpose of providing backlinks, using exact-match anchor text, and faking user engagement like clicks or time on site. And Google has made update after update to punish hacks and reward great content.
What does that battle look like in a world where the dominant digital librarians don’t provide pages of links but rather conversations? A race to seed the LLM’s source data with the “right answer.” You might think that’s no different from the current SEO strategy of seeding great content on public websites to earn the #1 ranking. But there is far more data in the private datasets of large companies than on the publicly available internet, and it seems likely that models will increasingly seek to train on that data.1 And while it’s difficult to imagine a successful campaign of Reddit, Quora, or Facebook comments to manipulate Gemini’s answer to “Who is the current US president?”, it seems less far-fetched on topics as niche as the “best yada yada software for b2b startups.”
1. https://kyleake.medium.com/data-behind-the-large-language-models-llm-gpt-and-beyond-8b34f508b5de