Secretive Chatbot Developers Are Making a Big Mistake

Tired of seeing their hard work pilfered by the tech sector’s artificial intelligence giants, the creative industry is starting to fight back. While on the surface its argument is about the principle of copyright, what the clash reveals is just how little we know about the data behind breakthrough tech like ChatGPT. The lack of transparency is getting worse, and it stands in the way of creatives being fairly paid and, ultimately, of AI being safe.

A trickle of legal challenges against AI companies could soon become a downpour. Media conglomerate IAC is reported to be teaming up with large publishers including the New York Times in a lawsuit alleging the improper use of their content to build AI-powered chatbots.

One reading of this is that publishers are running scared. The threat AI poses to their businesses is obvious: People who might have once read a newspaper’s restaurant reviews may now choose to ask an AI chatbot where to go to dinner, and so on.

But the bigger factor is that publishers are beginning to understand their value in the age of AI, albeit somewhat after the horse has bolted. AI models are only as good as the data fed into them. Text and images produced by leading media organizations should, in theory, be of high quality and help AI tools like ChatGPT generate better results. If AI companies want to use great articles and photography, created by real people, they should be paying for the privilege. So far, for the most part, they haven’t been.

Forcing them to change is going to prove difficult, thanks to some willful acts of obfuscation. As AI has grown more sophisticated, transparency has taken a back seat. In a distinct departure from the early days of machine-learning research, when teams of computer scientists, such as the so-called Transformer 8, the eight Google researchers whose 2017 paper introduced the architecture underpinning ChatGPT, went into intricate detail about their training data, leading AI developers now use vague language about their sources.

OpenAI’s GPT-4 is trained “using publicly available data [such as internet data] as well as data we’ve licensed,” the company explained in its release notes for the model, revealing little else. Meta’s equivalent, the newly released Llama 2, was similarly vague. The company said it had been trained on a “new mix of data from publicly available sources.”