Court docs allege Meta trained LLMs using pirated book trove
This is pretty massive:
The [court] document claims that Meta decided to download documents from Library Genesis — aka “LibGen” — to train its models. LibGen is the subject of a lawsuit brought by textbook publishers who believe it happily hosts and distributes [pirated] works […].
The filing from plaintiffs in the Kadrey case claims that documents produced by Meta […] describe internal debate about accessing LibGen, a little squeamishness about using BitTorrent in the office to do so, and eventual escalation to “MZ” [Mark Zuckerberg himself], who approved use of the contentious resource. […]
Another filing claims that a Meta document describes how it removed copyright notifications from material downloaded from LibGen, and suggests the company did so because it realized including such text could mean a model’s output would reveal it was trained on copyrighted material.
US District Court Judge Vince Chhabria also noted that in one of the documents Meta wants to seal, an employee wrote the following:
“If there is media coverage suggesting we have used a dataset we know to be pirated, such as LibGen, this may undermine our negotiating position with regulators on these issues.”
No shit.
Tags: piracy meta copyright mark-zuckerberg law llama training libgen books