OK, this is quite cool: “the first ever [language] models trained exclusively on open data, meaning data that are either non-copyrighted or are published under a permissible license. These are the first fully EU AI Act compliant models. In fact, Pleias sets a new standard for safety and openness.”
Training large language models required copyrighted data until it did not. Today we release Pleias 1.0 models, a family of fully open small language models. Pleias 1.0 models include three base models: 350M, 1.2B, and 3B parameters. They feature two specialized models for knowledge retrieval with unprecedented performance for their size on multilingual Retrieval-Augmented Generation, Pleias-Pico (350M parameters) and Pleias-Nano (1.2B parameters). […] Our models are:

* multilingual, offering strong support for multiple European languages;
* safe, showing the lowest results on the toxicity benchmark;
* performant for key tasks, such as knowledge retrieval;
* able to run efficiently on consumer-grade hardware locally (CPU-only, without quantisation).

The Pleias 1.0 family embodies a new approach to specialized small language models for end applications: wound-up models. We have implemented a set of ideas and solutions during pretraining that produce a frugal yet powerful language model specifically optimized for further RAG implementations. We release two wound-up models further trained for Retrieval Augmented Generation (RAG): Pleias-pico-350m-RAG and Pleias-nano-1B-RAG. These models are designed to be implemented locally, so we prioritized frugal implementation. As our models are small, they can run smoothly, even on devices with limited RAM.
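For a sense of what "runs locally on CPU" looks like in practice, here's a rough sketch using the standard transformers text-generation API. The repo id and the plain source-plus-question prompt are my assumptions based on the names in the announcement; check the Pleias pages on Hugging Face for the exact model ids and the prompt format the RAG models expect.

```python
# Sketch only: assumes the repo id matches the announced model name and that a
# plain "source + question" prompt works; verify both against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-pico-350m-RAG"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # plain CPU inference, no quantisation

# Minimal RAG-style prompt: a retrieved source passage followed by a question.
prompt = (
    "Source: Common Corpus is a fully open multilingual dataset of "
    "non-copyrighted and permissively licensed documents.\n"
    "Question: What kinds of documents does Common Corpus contain?\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```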
And here’s their fully open training set: https://huggingface.co/datasets/PleIAs/common_corpus
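The corpus is far too big to download casually, so streaming a few records with the datasets library is the practical way to poke at it. A sketch, assuming the default "train" split and a "text" field; the actual schema is on the dataset card.

```python
# Sketch only: stream a few documents from the open training set.
from datasets import load_dataset

ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record.get("text", "")[:200])  # first 200 characters of each document
    if i >= 2:
        break
```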
(tags: llms models huggingface ai pleias rag ai-act open-data)