Zuckerberg’s Secret Weapon for AI Is Your Facebook Data

For many people, Facebook is the internet, and the number of its users is still growing, according to Meta Platforms Inc.’s latest financial results. But Mark Zuckerberg isn’t just celebrating that continuing growth. He wants to take advantage of it by using data from Facebook and Instagram to create powerful, general-purpose artificial intelligence. That sounds great, and Meta is well positioned to do it, but his billions of users may end up paying the price with their privacy and more.

Here’s how Zuckerberg teased his next move in AI on Thursday:

“The next key part of our playbook is learning from unique data and feedback loops in our products… On Facebook and Instagram, there are hundreds of billions of publicly shared images and tens of billions of public videos, which we estimate is greater than the Common Crawl dataset and people share large numbers of public text posts in comments across our services as well.”

The point that Zuck makes here about “Common Crawl” startled observers in the tech press, because that archive is already huge: 250 billion web pages spanning 17 years. It’s one of the biggest and most popular repositories of the public internet used for training AI systems today. When OpenAI launched its GPT-3 language model in 2020, close to 60% of the text used to train the system came from Common Crawl.

But Meta’s data mountain is even bigger, which means the company could theoretically build “smarter” AI. That’s because research has shown that training AI models on more data tends to make them more capable and accurate. That formula has worked wonders for OpenAI, which over the years has increased the amount of data used to create the models behind products like ChatGPT.

If Zuckerberg wants to make a more powerful chatbot, the pile of data he’s sitting on is especially valuable because so much of it comes from comment threads. Any text that represents human dialogue is critical for training so-called conversational agents, which is why OpenAI heavily mined the internet forum Reddit Inc. to build its own popular chatbot.