Close Menu
Alan C. Moore
    What's Hot

    Mexico’s president defends first judicial election despite low turnout, democratic criticism

    June 2, 2025

    Remember When Rolling Stone Called Pro-Hamas Terrorists Entering The U.S. A ‘Conspiracy Theory?’

    June 2, 2025

    Antisemitism Is Here. So Is the Bear Jew Option.

    June 2, 2025
    Facebook X (Twitter) Instagram
    Trending
    • Mexico’s president defends first judicial election despite low turnout, democratic criticism
    • Remember When Rolling Stone Called Pro-Hamas Terrorists Entering The U.S. A ‘Conspiracy Theory?’
    • Antisemitism Is Here. So Is the Bear Jew Option.
    • Usha Vance Issues an Important Summer Challenge for Your Children
    • Penn Medicine won’t provide ‘gender-affirming care’ to kids anymore
    • Pro-Palestinian protesters rip off hundreds of flower heads at U. Michigan garden
    • WWDC 2025: Apple Employees Say Conference Will Be AI News ‘Letdown’
    • WWDC 2025: Apple Employees Say Conference Will Be AI News ‘Letdown’
    Alan C. MooreAlan C. Moore
    Subscribe
    Monday, June 2
    • Home
    • US News
    • Politics
    • Business & Economy
    • Video
    • About Alan
    • Newsletter Sign-up
    Alan C. Moore
    Home » Blog » Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal

    Meta Secretly Trained Its AI on a Notorious Piracy Database, Newly Unredacted Court Docs Reveal

    January 9, 2025Updated:January 9, 2025 Tech 0 Comments
    Share
    Facebook Twitter LinkedIn Pinterest Email

    Meta just lost a major fight in its ongoing legal battle with a group of authors suing the company for copyright infringement over how it trained its artificial intelligence models. Against the company’s wishes, a court unredacted information alleging that Meta used Library Genesis (LibGen), a notorious so-called shadow library of pirated books that originated in Russia, to help train its generative AI language models.

    The case, Kadrey et al v. Meta Platforms, was one of the earliest copyright lawsuits filed against a tech company over its AI training practices. Its outcome, along with those of dozens of similar cases working their way through courts in the United States alone, will determine whether technology companies can legally use creative works to train AI moving forward, and could either entrench AI’s most powerful players or derail them.

    Vince Chhabria, a judge for the United States District Court for the Northern District of California, ordered both Meta and the plaintiffs on Wednesday to file full versions of a batch of documents after calling Meta’s approach to redacting them “preposterous,” adding that, for the most part, “there is not a single thing in those briefs that should be sealed.” Chhabria ruled that Meta was not pushing to redact the materials in order to protect its business interests, but instead to “avoid negative publicity.” The documents were originally filed late last year, but remained publicly unavailable until now.

    In his order, Chhabria referenced an internal quote from a Meta employee included in the documents, in which they speculated that “If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.” Meta declined to comment.

    Novelists Richard Kadrey and Christopher Golden, along with comedian Sarah Silverman, first filed the class-action lawsuit against Meta in July 2023, alleging the tech giant trained its language models using their copyrighted work without permission. Meta has argued that using publicly available materials to train AI tools is shielded by the “fair use” doctrine, which holds that using copyrighted works without permission is legal in certain cases, one of which, the company argues, is “using text to statistically model language and generate original expression,” the company’s lawyers wrote in a motion to dismiss the authors’ lawsuit in November 2023. In this particular lawsuit, Meta has also argued that the plaintiffs’ claims are without merit.

    Before these documents were made public, Meta previously disclosed in a research paper that it had trained its Llama large language model on portions of Books3, a dataset of around 196,000 books scraped from the internet. It had not previously publicly indicated, however, that it had torrented data directly from LibGen.

    These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because “torrenting from a [Meta-owned] corporate laptop doesn’t feel right 😃”. They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as “MZ” in the memo handed over during discovery) and that Meta’s AI team was “approved to use” the pirated material.

    “Meta has treated the so-called ‘public availability’ of shadow datasets as a get out of jail free card, notwithstanding that internal Meta records show every relevant decision-maker at Meta, up to and including its CEO, Mark Zuckerberg, knew LibGen was ‘a dataset we know to be pirated,’” the plaintiffs allege in this motion. (Originally filed in late 2024, the motion is a request to file a third amended complaint.)

    In addition to the plaintiffs’ briefs, another filing was unredacted in response to Chhabria’s order—Meta’s opposition to the motion to file an amended complaint. It argues that the authors’ attempts to add additional claims to the case are an “eleventh-hour gambit based on a false and inflammatory premise,” and denies that Meta waited to reveal crucial information in discovery. Instead, Meta argues it first revealed to the plaintiffs that it used a LibGen dataset in July 2024. (As much of the discovery materials remain confidential, it is difficult for WIRED to confirm that claim.)

    Meta’s argument hinges on its claim that the plaintiffs already knew about the LibGen use and shouldn’t be granted additional time to file a third amended claim when they had ample time to do so before discovery ended in December 2024. “Plaintiffs knew of Meta’s downloading and use of LibGen and other alleged ‘shadow libraries’ since at least mid-July 2024,” the tech giant’s lawyers argue.

    In November 2023, Chhabria granted Meta’s motion to dismiss some of the lawsuit’s claims, including its claim Meta’s alleged use of the authors’ work to train AI violated the Digital Millennium Copyright Act, a US law introduced in 1998 to stop people from selling or duplicating copyrighted works on the internet. At the time, the judge agreed with Meta’s stance that the plaintiffs had not provided sufficient evidence to prove that the company had removed what’s known as “copyright management information” (CMI), like the author’s name and title of the work.

    The unredacted documents argue that the plaintiffs should be allowed to amend their complaint, alleging that the information Meta revealed is evidence that the DMCA claim was warranted. They also say the discovery process has unearthed reasons to add new allegations. “Meta, through a corporate representative who testified on November 20, 2024, has now admitted under oath to uploading (aka ‘seeding’) pirated files containing Plaintiffs’ works on ‘torrent’ sites,” the motion alleges. (Seeding is when torrented files are then shared with other peers after they have finished downloading.)

    “This torrenting activity turned Meta itself into a distributor of the very same pirated copyrighted material that it was also downloading for use in its commercially available AI models,” one of the newly unredacted documents claims, alleging that Meta, in other words, had not just used copyrighted material without permission but also disseminated it.

    LibGen, an archive of books uploaded to the internet that originated in Russia around 2008, is one of the largest and most controversial “shadow libraries” in the world. In 2015, a New York judge ordered a preliminary injunction against the site, a measure designed in theory to temporarily shut the archive down, but its anonymous administrators simply switched its domain. In September 2024, a different New York judge ordered LibGen to pay $30 million to the rightsholders for infringing on their copyrights, despite not knowing who actually operates the piracy hub.

    Meta’s discovery woes for this case aren’t over, either. In the same order, Chhabria warned the tech giant against any overly-sweeping redaction requests in the future: “If Meta again submits an unreasonably broad sealing request, all materials will simply be unsealed,” he wrote.

    Source credit

    Keep Reading

    WWDC 2025: Apple Employees Say Conference Will Be AI News ‘Letdown’

    WWDC 2025: Apple Employees Say Conference Will Be AI News ‘Letdown’

    What Is Google One? A Breakdown of Plans, Pricing, and Included Services

    What Is Google One? A Breakdown of Plans, Pricing, and Included Services

    Salesforce ‘Reduced Some Hiring Needs’ With AI: What It Means

    AI-Powered Video Insights Are Coming to Google Drive — Here’s When You’ll Get Them

    Editors Picks

    Mexico’s president defends first judicial election despite low turnout, democratic criticism

    June 2, 2025

    Remember When Rolling Stone Called Pro-Hamas Terrorists Entering The U.S. A ‘Conspiracy Theory?’

    June 2, 2025

    Antisemitism Is Here. So Is the Bear Jew Option.

    June 2, 2025

    Usha Vance Issues an Important Summer Challenge for Your Children

    June 2, 2025

    Penn Medicine won’t provide ‘gender-affirming care’ to kids anymore

    June 2, 2025

    Pro-Palestinian protesters rip off hundreds of flower heads at U. Michigan garden

    June 2, 2025

    WWDC 2025: Apple Employees Say Conference Will Be AI News ‘Letdown’

    June 2, 2025

    WWDC 2025: Apple Employees Say Conference Will Be AI News ‘Letdown’

    June 2, 2025

    ‘You’re killing my people so I kill you’: Witness recounts Colorado attacker Mohamed Sabry Soliman shouted before setting fire

    June 2, 2025

    Mali army base and airport in Timbuktu targeted in attack

    June 2, 2025
    • Home
    • US News
    • Politics
    • Business & Economy
    • About Alan
    • Contact

    Sign up for the Conservative Insider Newsletter.

    Get the latest conservative news from alancmoore.com [aweber listid="5891409" formid="902172699" formtype="webform"]
    Facebook X (Twitter) YouTube Instagram TikTok
    © 2025 alancmoore.com
    • Privacy Policy
    • Terms
    • Accessibility

    Type above and press Enter to search. Press Esc to cancel.