Mozilla, EleutherAI launch toolkits to help AI builders create open datasets

Easy-to-follow guides show how to transcribe audio files into text using privacy-friendly tools and how to convert different document types into a single format. Watch the live demo here.
As concerns around AI transparency grow, datasets remain one of the least visible and least standardized parts of the pipeline. Many are assembled behind closed doors, with little documentation or clarity around sourcing. Independent developers are often left without the infrastructure or tools needed to do things differently.
Mozilla and EleutherAI’s year-long collaboration aims to change that. They’re releasing two toolkits that help developers build large-scale datasets from scratch—whether that means extracting content from PDFs, structuring web archives, or simply documenting what they’re using in a clear and reusable way.
These toolkits help developers get started with creating open datasets. The code and demos will be available on the Mozilla.ai Blueprints hub, a platform that helps developers prototype with open-source AI using out-of-the-box workflows.
Toolkit 1: Transcribing Audio Files with Open-Source Whisper Models
This Blueprint guides developers through transcribing audio using open-source Whisper models via Speaches, a self-hosted server similar to the OpenAI Whisper API. Designed for local use, this privacy-focused setup offers a secure alternative to commercial APIs, making it ideal for handling sensitive or private audio data. Inspired by real-world use cases, the toolkit features an easy-to-follow setup using either Docker or the CLI.
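As a rough illustration of what talking to such a self-hosted server can look like, here is a minimal Python sketch that uploads an audio file as a multipart form. It is not taken from the Blueprint itself. It assumes the server listens on http://localhost:8000, exposes an OpenAI-style /v1/audio/transcriptions route that returns JSON with a "text" field, and has a model named Systran/faster-whisper-small available; all three are assumptions to check against your own setup. The helper names build_multipart and transcribe are hypothetical.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The audio goes in a part named "file", as OpenAI-style endpoints expect.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
        + audio
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(
    path: str,
    base_url: str = "http://localhost:8000",       # assumed local server address
    model: str = "Systran/faster-whisper-small",   # assumed model name
) -> str:
    """Send one audio file to the local server and return the transcript text."""
    audio_path = Path(path)
    body, content_type = build_multipart(
        {"model": model}, audio_path.name, audio_path.read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",      # assumed OpenAI-style route
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

With a server running, calling transcribe("interview.wav") would return the transcript as a string. Because the request goes to localhost, the audio does not leave the machine, which is the privacy property the toolkit emphasizes.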
Toolkit 2: Converting Unstructured Documents into Markdown Format
This toolkit helps developers convert diverse document formats (PDFs, DOCX, HTML, etc.) into Markdown using Docling, a command-line tool with powerful Optical Character Recognition and image-handling capabilities. Ideal for building open-text datasets for use in downstream applications, this toolkit emphasizes accessibility and versatility, including batch-processing capabilities.
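Docling can also be called from Python, and the batch-processing idea can be sketched in a few lines. The example below is an illustrative sketch, not the Blueprint's own code. It assumes Docling is installed and uses its DocumentConverter class with export_to_markdown; the folder layout, the SUPPORTED suffix list, and the function names docling_to_markdown and convert_folder are hypothetical choices made for this example.

```python
from pathlib import Path
from typing import Callable

# Assumed set of input types for this sketch; adjust to your own corpus.
SUPPORTED = {".pdf", ".docx", ".html"}


def docling_to_markdown(path: Path) -> str:
    """Convert one document to Markdown using Docling's Python API."""
    # Imported lazily so the rest of the sketch runs without Docling installed.
    from docling.document_converter import DocumentConverter

    # A new converter per call keeps the sketch short; reuse one for large batches.
    result = DocumentConverter().convert(str(path))
    return result.document.export_to_markdown()


def convert_folder(
    src: Path,
    dst: Path,
    to_markdown: Callable[[Path], str] = docling_to_markdown,
) -> list[Path]:
    """Convert every supported file in src and write one .md file per input."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if path.suffix.lower() not in SUPPORTED:
            continue  # skip files this sketch does not handle
        # Keep the original suffix in the name so a.pdf and a.docx do not collide.
        out = dst / (path.name + ".md")
        out.write_text(to_markdown(path), encoding="utf-8")
        written.append(out)
    return written
```

Calling convert_folder(Path("raw_docs"), Path("markdown")) would write one Markdown file per supported input. Passing the converter in as a function makes it easy to swap in a different backend or to test the batch logic without running the conversion itself.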
Mozilla and EleutherAI’s partnership included an AI dataset convening, which brought together 30 leading scholars and practitioners from prominent open-source AI startups, nonprofit AI labs, and civil society organizations to discuss emerging practices for a new focus within the open LLM community. The convening culminated in the publication of the research paper “Towards Best Practices for Open Datasets for LLM Training.” The new toolkits are a final milestone in this partnership and a resource to help builders put the previously shared best practices into action.
“As AI development continues to move at warp speed, we must ask ourselves, ‘How can we responsibly curate and govern data so that the AI ecosystem becomes more equitable and transparent?’” says Ayah Bdeir, Mozilla Foundation Senior Advisor, AI Strategy. “Today’s open data ecosystem depends on the community sharing its expertise, and our partnership with EleutherAI is part of our commitment to support incredible builders who are iterating and experimenting on the front lines of open source AI.”
Currently, the threat of litigation is often cited as a reason for minimizing dataset transparency, which in turn hinders innovation. Building open-access data is the antidote. Building a future of responsibly curated, openly licensed datasets requires collaboration across legal, technical, and policy fields, along with investments in standards and digitization. In short, open-access data can address many AI challenges, but creating it is difficult. The toolkits from EleutherAI and Mozilla are a crucial step in making this process easier.
“Creating high-quality, large-scale datasets is one of the biggest bottlenecks in AI development,” says Stella Biderman, Executive Director, EleutherAI. “Developers, especially those outside of major tech firms, often resort to whatever data is easiest to access, even when more valuable sources are trapped in PDFs or audio. These tools make it easier for open-source developers to unlock that data and build stronger, more diverse datasets.”
Update: On April 28, EleutherAI and Mozilla hosted an event to demo the two blueprints. Watch the demo here.