A high-quality dataset of nearly one million public domain books that anyone can use to train large language models and other AI tools was released by Harvard University on Thursday. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books that were scanned as part of the Google Books project and are no longer protected by copyright.
The Institutional Data Initiative’s collection is around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, and it spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including smaller players in the AI industry and individual researchers, access to the sort of highly refined and carefully curated content repositories that normally only established tech giants have the resources to build. “It’s gone through rigorous review,” he says.
Leppert believes the dataset could be combined with other licensed materials to build artificial intelligence models. He notes that companies would likely still need additional training data to differentiate their models from those of their competitors. “I think about it a little like how Linux has become a foundational operating system for so much of the world,” he says.
Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader belief in the value of “accessible data pools” for AI startups that are “managed in the public’s interest.” In other words, Microsoft isn’t necessarily planning to replace the AI training data it has used in its own models with books like the ones in the new Harvard database. “We use publicly available data to train our models,” Davis says.
The future of how artificial intelligence tools are built hangs in the balance as dozens of lawsuits over the use of copyrighted data for AI training wind their way through the courts. If AI companies win their cases, they won’t have to enter licensing agreements with copyright holders and can continue scraping the internet. If they lose, they may be forced to change how their models are built. A wave of projects like the Harvard database is moving forward on the premise that, regardless of what happens, there will be an appetite for public domain datasets.
In addition to the trove of books, the Institutional Data Initiative has a collaboration with the Boston Public Library in the works, and it says it’s open to forming similar partnerships in the future. Exactly how the books dataset will be released has yet to be decided. The Institutional Data Initiative has asked Google to work together on public distribution, and Harvard says it’s optimistic that the collaboration will happen. (Google did not respond to WIRED’s requests for comment.)
Other new public domain projects are also underway. Last spring, the French AI startup Pleias rolled out its own public domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, Common Corpus has been downloaded more than 60,000 times in a single month on the open source AI platform Hugging Face. Pleias recently announced that it is releasing its first set of large language models trained on this dataset, which Langlais told WIRED are “the first models ever trained exclusively on open data and in compliance with the [EU] AI Act.”
There are also efforts to create comparable image datasets. This summer, the AI startup Spawning released its own, Source.Plus, which includes public domain images from Wikimedia Commons as well as various museums and archives. Major cultural institutions like the Metropolitan Museum of Art have also made their own archives accessible to the general public as standalone projects.
According to Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, the rise of these datasets shows that there’s no need to steal copyrighted materials to build high-performing, high-quality AI models. OpenAI previously told British lawmakers that it would be “impossible” to create products like ChatGPT without using copyrighted works. Large public domain datasets like these, Newton-Rex says, further undermine the “necessity defense” that some AI companies use to justify scraping copyrighted work to train their models.
Still, he has doubts about whether the IDI and projects like it will actually change the training status quo. These datasets “will only have a positive impact if they’re used to replace scraped copyrighted work,” ideally in conjunction with licensing other data, he says. “If they’re just added to the mix, one part of a dataset that also includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies.”