Harvard Releases Nearly a Million Public Domain Texts for AI Training
Data is being likened to the 'new oil' these days, and if that's the case, Harvard University could be the modern-day Exxon. On Thursday, the university announced the release of a dataset of almost a million public domain books suitable for training AI models. The effort, called the Institutional Data Initiative, has received funding from both Microsoft and OpenAI. The dataset comprises books scanned by Google Books whose copyrights have expired.
According to Wired, the collection is diverse, spanning titles from renowned writers like Shakespeare, Dickens, and Dante alongside obscure Czech math textbooks and Welsh pocket dictionaries. Copyright protection generally lasts for the author's lifetime plus 70 years.
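The life-plus-70 rule mentioned above can be sketched as a simple check. This is only an illustration: actual copyright terms vary by country, publication date, and whether the work was made for hire, and the function name here is hypothetical rather than anything from the Institutional Data Initiative.

```python
def is_public_domain(author_death_year: int, as_of_year: int) -> bool:
    """Rough life-plus-70 heuristic for when a work enters the
    public domain. Real copyright rules are far more nuanced;
    treat this as a sketch, not legal logic."""
    # The term runs for 70 years after the author's death, so the
    # work is out of copyright once more than 70 years have passed.
    return as_of_year - author_death_year > 70

# Dickens died in 1870, so his works have long been public domain.
print(is_public_domain(1870, 2024))  # True
# An author who died in 2000 would still be under copyright in 2024.
print(is_public_domain(2000, 2024))  # False
```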
Building language models like ChatGPT that convincingly mimic human interaction requires an immense quantity of high-quality text for training. Ingesting more data generally makes a model better at sounding human and dispensing knowledge. But this insatiable hunger for data has become a problem for AI companies such as OpenAI, which are hitting barriers in finding fresh material without plagiarizing it.
Publishers including the Wall Street Journal and the New York Times have sued OpenAI and its competitors for using their data without permission. Supporters of AI companies have offered various defenses. They often argue that humans create new works by referencing and synthesizing material from other sources, and that AI is no different: students, readers, and writers alike synthesize knowledge into new work, and remixing is legally considered 'fair use' when the result is sufficiently transformative. But that theory ignores the fact that no human can consume billions of pieces of text at the speed of a computer. In its lawsuit against Perplexity, the Wall Street Journal accused the startup of mass-copying content.
Others in the field argue that any content available on the open web is essentially fair game, and that it is the chatbot's user who accesses copyrighted content by requesting it through a prompt. In this view, a chatbot like Perplexity acts like a web browser, retrieving copyrighted content only at the user's request. It will be a while before these arguments are tested in the courts.
In response to the criticism, OpenAI has negotiated licensing deals with content providers, while Perplexity has partnered with publishers to launch an ad-supported partner program. It is clear, though, that these moves were made reluctantly.
Just as AI companies are running low on fresh content, commonly used sources are restricting access to their data. Companies like Reddit and X, recognizing the immense value of their data, particularly its real-time updates, have moved aggressively to limit its use. Reddit makes hundreds of millions of dollars licensing its corpus of subreddits and comments to Google for model training. Elon Musk's X has an exclusive arrangement with its sister company xAI, granting its models access to the social network's content for training and information retrieval. There is some irony in these companies fiercely protecting their own data while treating content from media publishers as worthless and free for the taking.
A million books won't satisfy an AI company's training requirements, especially since these books are old and lack modern information such as Gen Z slang. And to differentiate themselves, AI companies will continue to need other data, particularly exclusive data, to avoid producing models identical to their competitors'. Still, the Institutional Data Initiative's dataset can at least help AI companies train their initial foundation models without legal risk.
The future of artificial intelligence depends heavily on the availability of high-quality training data, and Harvard's dataset is a meaningful step toward meeting that need. But as the demand for diverse and up-to-date sources grows, AI companies will keep seeking exclusive data to set their models apart in an increasingly competitive landscape.