Artificial intelligence isn’t magical – it’s in the name: “artificial.” The content these systems learn from has to come from somewhere. An investigation showed that some of the big names in tech, including Apple, trained their AI technology on transcripts from YouTube videos – all without permission.
Investigation Shows YouTube Transcripts Used
Proof News conducted an investigation that included a search tool for finding YouTube content in the dataset. The investigation determined that tech companies used the subtitles from nearly 175,000 YouTube videos across more than 48,000 channels.
The videos that were used included late-night TV episodes from The Late Show with Stephen Colbert and Jimmy Kimmel Live. Also showing up in the investigation were videos by MrBeast, PewDiePie, and Marques Brownlee.
The subtitles are part of “the Pile,” which was described in 2020 as a mix of 22 datasets from EleutherAI, a nonprofit.
A Google spokesperson told CNET in an email that the company stands by what it has said previously, pointing to a comment from April. At that time, YouTube CEO Neal Mohan said he didn’t know whether OpenAI had used YouTube videos, but that if it had, it would be a violation of YouTube’s terms of service.
Where Else Does the AI Content Come From?
Nearly every tech company has recently announced that it is developing or has developed an AI system. As stated initially, we know it’s not magical and that the content comes from somewhere. It just wasn’t expected that the training data was coming from YouTube transcripts.
OpenAI, the creator of ChatGPT, has said previously that it was getting more difficult to find datasets to train AI, which led it to make deals with Reddit and News Corp. for their content. Google has said it has an agreement with content creators that allows it to use YouTube content in its AI training. AI Overview was recently added to Google Search. Learn how to turn AI Overview off if it isn’t your cup of tea.
Meanwhile, an Anthropic spokesperson acknowledged to Proof News that the company used the Pile to train Claude, its AI assistant. The spokesperson also acknowledged that there are some YouTube subtitles in the Pile.
Whether you use Claude, ChatGPT, or another AI technology, it was trained on a dataset. The question is whether it was trained on content from willing providers, like Reddit, or whether the search for training data expanded to content that was used without the creators’ knowledge. It’s definitely something to consider the next time you use an AI chatbot.
Image credit: Unsplash