Skip to content

Archives

PleIAs/common_corpus

Published February 11, 2025

PleIAs/common_corpus

This is great to see:

Common Corpus is the largest open and permissible licensed text dataset, comprising 2 trillion tokens (1,998,647,168,282 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. Common Corpus has been created by Pleias in association with several partners and contributed in-kind to Current AI initiative.

The dataset in its entirety meets the requirements of the Code of Conduct of the AI Act and goes further than the current requirements for data transparency. It aims to set a new standard of openness in AI, showing that detailed provenance at a granular document level is a realistic objective, even at the scale of 2 trillion tokens.

Tags: ai llms open-data open-source pleias common-corpus corpora training ai-act