{"id":75322,"date":"2025-01-21T08:55:36","date_gmt":"2025-01-21T16:55:36","guid":{"rendered":"https:\/\/blog.mozilla.org\/?p=75322"},"modified":"2025-01-21T10:54:58","modified_gmt":"2025-01-21T18:54:58","slug":"dataset-convening","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/","title":{"rendered":"Mozilla, EleutherAI publish research on open datasets for LLM training"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"683\" src=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-1024x683.jpg\" alt=\"A group photo of 27 people standing together in a room with a colorful cityscape mural on the wall behind them.\" class=\"wp-image-75323\" srcset=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-1024x683.jpg 1024w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-300x200.jpg 300w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-768x512.jpg 768w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-1536x1024.jpg 1536w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-2048x1365.jpg 2048w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-1000x667.jpg 1000w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/20240611_MOZFEST24_DSC01469_HQ-1280x853.jpg 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Participants of the Dataset Convening in Amsterdam.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\"><em>Update: Following the 2024 Mozilla AI Dataset Convening, 
AI builders and researchers publish best practices for creating open datasets for LLM training.&nbsp;<\/em><\/h3>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>Training datasets behind large language models (LLMs) often lack transparency. A research paper published by Mozilla and EleutherAI explores how openly licensed datasets that are responsibly curated and governed can make the AI ecosystem more equitable. The study is co-authored with thirty leading scholars and practitioners from prominent open source AI startups, nonprofit AI labs, and civil society organizations who attended the Dataset Convening on open AI datasets in June 2024.<br><br>Many AI companies rely on data crawled from the web, frequently without the explicit permission of copyright holders. While some jurisdictions like the EU and Japan permit this under specific conditions, the legal landscape in the United States remains murky. This lack of clarity has led to lawsuits and a trend toward secrecy in dataset practices\u2014stifling transparency and accountability, and limiting innovation to those who can afford it.<\/p>\n\n\n\n<p>For AI to truly benefit society, it must be built on foundations of transparency, fairness, and accountability\u2014starting with the most foundational building block that powers it: data.&nbsp;<\/p>\n\n\n\n<p>The research, <strong>\u201cTowards Best Practices for Open Datasets for LLM Training,\u201d<\/strong> outlines possible tiers of openness, normative principles, and technical best practices for sourcing, processing, governing, and releasing open datasets for LLM training, as well as opportunities for policy and technical investments to help the emerging community overcome its challenges.&nbsp;<\/p>\n\n\n\n<p><a href=\"https:\/\/foundation.mozilla.org\/en\/research\/library\/towards-best-practices-for-open-datasets-for-llm-training\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>READ THE RESEARCH 
HERE<\/strong><\/a><\/p>\n\n\n\n<p>Building toward a responsible AI future requires collaboration across legal, technical, and policy domains, along with investments in metadata standards and digitization and efforts to foster a culture of openness.&nbsp;<\/p>\n\n\n\n<p>To help advance the field, the paper compiles best practices for LLM builders, including guidance on <strong>encoding preferences in metadata<\/strong>, <strong>data sourcing, data processing, data governance\/release, and terms of use<\/strong>.<\/p>\n\n\n\n<p>To explore the recommendations, check the <a href=\"https:\/\/foundation.mozilla.org\/towards-best-practices-for-open-datasets-for-llm-training\/\">full paper<\/a> (also available on <a href=\"https:\/\/arxiv.org\/abs\/2501.08365\">arXiv<\/a>).&nbsp;<\/p>\n\n\n\n<p>We are grateful to our collaborators \u2013 273 Ventures, Ada Lovelace Institute, Alan Turing Institute, Cohere For AI, Common Voice, Creative Commons, Data Nutrition Project, Data Provenance Initiative, First Languages AI Reality (Mila), Gretel, HuggingFace, LLM360, Library Innovation Lab (Harvard), Open Future, Pleias, Spawning, The Distributed AI Research Institute, Together AI, and Ushahidi \u2013 for their leadership in this work, as well as Computer Says Maybe for their facilitation support.&nbsp;<\/p>\n\n\n\n<p>We look forward to the conversations it will spark.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>Previous post published on July 2, 2024:<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-align-center\"><em>Mozilla and EleutherAI brought together experts to discuss a critical question: How do we create openly licensed and open-access LLM training datasets, and how do we tackle the challenges faced by their builders?<\/em><\/h3>\n\n\n\n<p>On June 11, on the eve of&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/www.mozillafestival.org\/en\/highlights\/mozfest-house-amsterdam\/\" target=\"_blank\" 
rel=\"noreferrer noopener\">MozFest House in Amsterdam<\/a>,&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/blog.mozilla.org\/en\/mozilla\/ai\/next-steps-for-mozilla-and-trustworthy-ai\/\">Mozilla<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/www.eleuther.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">EleutherAI<\/a>&nbsp;convened an exclusive group of 30 leading scholars and practitioners from prominent open-source AI startups, nonprofit AI labs and civil society organizations to discuss emerging practices for a new focus within the open LLM community: creating open-access and openly licensed LLM training datasets.&nbsp;<\/p>\n\n\n\n<p>This work is timely. Although sharing training datasets was once common practice among many AI actors, increased competitive pressures and legal risks have made it almost unheard of nowadays for pre-training datasets to be shared or even described by their developers. However, just as open-source software has made the internet safer and more robust, we at Mozilla and EleutherAI believe open-access data is a public good that can empower developers worldwide to build upon each other\u2019s work. It fosters competition, innovation and transparency, providing clarity around legal standing and an ability to stand up to scrutiny.<\/p>\n\n\n\n<p>Leading AI companies&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/www.wired.com\/story\/proof-you-can-train-ai-without-slurping-copyrighted-content\/\" target=\"_blank\" rel=\"noreferrer noopener\">want us to believe<\/a>&nbsp;that training performant LLMs without copyrighted material is impossible. We refuse to believe this. 
An emerging ecosystem of open LLM developers has created LLM training datasets\u2014such as&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/huggingface.co\/collections\/PleIAs\/common-corpus-65d46e3ea3980fdcd66a5613\" target=\"_blank\" rel=\"noreferrer noopener\">Common Corpus<\/a>,&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/huggingface.co\/datasets\/PleIAs\/YouTube-Commons\" target=\"_blank\" rel=\"noreferrer noopener\">YouTube-Commons<\/a>,&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/huggingface.co\/datasets\/HuggingFaceFW\/fineweb\" target=\"_blank\" rel=\"noreferrer noopener\">FineWeb<\/a>,&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/allenai.github.io\/dolma\/\" target=\"_blank\" rel=\"noreferrer noopener\">Dolma<\/a>,&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/cohere.com\/research\/aya\" target=\"_blank\" rel=\"noreferrer noopener\">Aya<\/a>,&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/www.together.ai\/blog\/redpajama-data-v2\" target=\"_blank\" rel=\"noreferrer noopener\">RedPajama<\/a>&nbsp;and many more\u2014that could provide blueprints for more transparent and responsible AI progress. 
We were excited to invite many of them to join us in Amsterdam for a series of discussions about the challenges and opportunities of building an alternative to the current status quo that is open, legally compliant and just.&nbsp;<br>During the event, we drew on the learnings from assembling \u201cCommon Pile\u201d (the soon-to-be-released dataset by EleutherAI composed only of openly licensed and public domain data) which incorporates many learnings from its hugely successful predecessor, \u201c<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/huggingface.co\/datasets\/EleutherAI\/pile\">The Pile<\/a>.\u201d At the event, EleutherAI released a technical briefing and an invitation to public consultation on Common Pile.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"768\" src=\"https:\/\/web.archive.org\/web\/20241126202554im_\/https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-1024x768.jpg\" alt=\"A speaker holding a microphone gestures while speaking, with a screen displaying &quot;The Dataset Convening&quot; in the background.\" class=\"wp-image-75335\" srcset=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-1024x768.jpg 1024w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-300x225.jpg 300w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-768x576.jpg 768w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-1536x1152.jpg 1536w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-2048x1536.jpg 2048w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_8199-1000x750.jpg 1000w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Participants engaged in a discussion at \u201cThe Dataset Convening,\u201d hosted by Mozilla and EleutherAI 
on June 11, 2024, to explore creating open-access and openly licensed LLM training datasets.<\/figcaption><\/figure>\n\n\n\n<p>Our goal with the convening was to bring in the experiences of open dataset builders to develop normative and technical recommendations and best practices around openly licensed and open-access datasets. Below are some highlights of our discussion:<\/p>\n\n\n\n<ul>\n<li>Because openness alone does not guarantee legal compliance or ethical outcomes, we asked which decision points can contribute to datasets being more just and sustainable in terms of public good and data rights.&nbsp;<\/li>\n\n\n\n<li>We discussed what \u201cgood\u201d looks like, what we want to avoid, what is realistic and what is already being implemented in the realm of sourcing, curating, governing and releasing open training datasets.&nbsp;<\/li>\n\n\n\n<li>Issues such as the cumbersome nature of sourcing public domain and openly licensed data (e.g. extracting text from PDFs), manual verification of metadata, legal status of data across jurisdictions, retractability of consent, preference signaling, reproducibility, and data curation and filtering were recurring themes in almost every discussion.<\/li>\n\n\n\n<li>To enable more builders to develop open datasets and unblock the ecosystem, we need financial sustainability and smart infrastructural investments.<\/li>\n\n\n\n<li>The challenges faced by open datasets today bear a resemblance to those encountered in the early days of open source software (data quality, standardization and sustainability). Back then, it was the common artifacts that united the community and provided some shared understanding and language. 
We saw the Dataset Convening as an opportunity to start exactly there and create shared reference points that, even if not perfect, will guide us in a common direction.<\/li>\n\n\n\n<li>The final insight round underscored that we have much to learn from each other: we are still in the early days of solving this immense challenge, and this nascent community needs to collaborate and think in radical and bold ways.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" loading=\"lazy\" width=\"1920\" height=\"2560\" src=\"https:\/\/web.archive.org\/web\/20241126202554im_\/https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-scaled.jpg\" alt=\"A group of four people sitting around a table with laptops and documents, engaged in a discussion. One person types on a laptop, while others look at papers and a phone. A colorful graffiti mural is on the wall behind them.\" class=\"wp-image-75355\" srcset=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-scaled.jpg 1920w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-225x300.jpg 225w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-768x1024.jpg 768w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-1152x1536.jpg 1152w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-1536x2048.jpg 1536w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/IMG_0181-1000x1333.jpg 1000w\" sizes=\"(max-width: 1920px) 100vw, 1920px\" \/><figcaption class=\"wp-element-caption\">Participants at the Mozilla and EleutherAI event collaborating on best practices for creating open-access and openly licensed LLM training datasets.<\/figcaption><\/figure>\n\n\n\n<p>We are immensely grateful to the participants in the Dataset Convening (including some remote contributors):<\/p>\n\n\n\n<ul>\n<li>Stefan Baack \u2014 
Researcher and Data Analyst, Insights, Mozilla<\/li>\n\n\n\n<li>Mitchell Baker \u2014 Chairwoman, Mozilla Foundation<\/li>\n\n\n\n<li>Ayah Bdeir \u2014 Senior Advisor, Mozilla<\/li>\n\n\n\n<li>Julie Beli\u00e3o \u2014 Senior Director of Product Innovation, Mozilla.ai<\/li>\n\n\n\n<li>Jillian Bommarito \u2014 Chief Risk Officer, 273 Ventures<\/li>\n\n\n\n<li>Kasia Chmielinski \u2014 Project Lead, Data Nutrition Project<\/li>\n\n\n\n<li>Jennifer Ding \u2014 Senior Researcher, Alan Turing Institute<\/li>\n\n\n\n<li>Alix Dunn \u2014 CEO, Computer Says Maybe<\/li>\n\n\n\n<li>Marzieh Fadaee \u2014 Senior Research Scientist, Cohere For AI<\/li>\n\n\n\n<li>Maximilian Gahntz \u2014 AI Policy Lead, Mozilla<\/li>\n\n\n\n<li>Paul Keller \u2014 Director of Policy and Co-Founder, Open Future<\/li>\n\n\n\n<li>Hynek Kydl\u00ed\u010dek \u2014 Machine Learning Engineer, HuggingFace<\/li>\n\n\n\n<li>Pierre-Carl Langlais \u2014 Co-Founder, Pleias<\/li>\n\n\n\n<li>Greg Leppert \u2014 Director of Product and Research, the Library Innovation Lab, Harvard<\/li>\n\n\n\n<li>EM Lewis-Jong \u2014 Director, Common Voice, Mozilla<\/li>\n\n\n\n<li>Shayne Longpre \u2014 Project Lead, Data Provenance Initiative<\/li>\n\n\n\n<li>Angela Lungati \u2014 Executive Director, Ushahidi<\/li>\n\n\n\n<li>Sebastian Majstorovic \u2014 Open Data Specialist, EleutherAI<\/li>\n\n\n\n<li>Cullen Miller \u2014 Vice President of Policy, Spawning<\/li>\n\n\n\n<li>Victor Miller \u2014 Senior Product Manager, LLM360<\/li>\n\n\n\n<li>Kasia Odrozek \u2014 Director, Insights, Mozilla<\/li>\n\n\n\n<li>Guilherme Penedo \u2014 Machine Learning Research Engineer, HuggingFace<\/li>\n\n\n\n<li>Neha Ravella \u2014 Research Project Manager, Insights, Mozilla<\/li>\n\n\n\n<li>Michael Running Wolf \u2014 Co-Founder and Lead Architect, First Languages AI Reality, Mila<\/li>\n\n\n\n<li>Max Ryabinin \u2014 Distinguished Research Scientist, Together AI<\/li>\n\n\n\n<li>Kat Siminyu \u2014 Researcher, The Distributed AI Research 
Institute<\/li>\n\n\n\n<li>Aviya Skowron \u2014 Head of Policy and Ethics, EleutherAI<\/li>\n\n\n\n<li>Andrew Strait \u2014 Associate Director, Ada Lovelace Institute<\/li>\n\n\n\n<li>Mark Surman \u2014 President, Mozilla Foundation<\/li>\n\n\n\n<li>Anna Tumad\u00f3ttir \u2014 CEO, Creative Commons<\/li>\n\n\n\n<li>Marteen Van Segbroeck \u2014 Head of Applied Science, Gretel<\/li>\n\n\n\n<li>Leandro von Werra \u2014 Chief Loss Officer, HuggingFace<\/li>\n\n\n\n<li>Maurice Weber \u2014 AI Researcher, Together AI<\/li>\n\n\n\n<li>Lee White \u2014 Senior Full Stack Developer, Ushahidi<\/li>\n\n\n\n<li>Thomas Wolf \u2014 Chief Science Officer and Co-Founder, HuggingFace<\/li>\n<\/ul>\n\n\n\n<p>In the coming weeks, we will be working with the participants to develop common artifacts that will be released to the community, along with an accompanying paper. These resources will help researchers and practitioners navigate the definitional and executional complexities of advancing open-access and openly licensed datasets and strengthen the sense of community.&nbsp;<\/p>\n\n\n\n<p>The event was part of the Mozilla Convening Series, where we bring together leading innovators in open source AI to tackle thorny issues and help move the community and movement forward. Our first convening was the&nbsp;<a href=\"https:\/\/web.archive.org\/web\/20241126202554\/https:\/\/blog.mozilla.org\/en\/mozilla\/ai\/new-framework-for-ai-openness-and-innovation\/\">Columbia Convening<\/a>&nbsp;where we invited 40 leading scholars and practitioners to develop a framework for defining what openness means in AI. 
We are committed to continuing the efforts to support communities invested in openness around AI and look forward to helping grow and strengthen this movement.&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Update: Following the 2024 Mozilla AI Dataset Convening, AI builders and researchers publish best practices for creating open datasets for LLM training.&nbsp; Training datasets behind large language models (LLMs) often lack transparency, a research paper published by Mozilla and EleutherAI explores how openly licensed datasets that are responsibly curated and governed can make the AI [&hellip;]<\/p>\n","protected":false},"author":1889,"featured_media":75376,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[464197,5],"tags":[317823,4708],"coauthors":[464242,464243],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Mozilla, EleutherAI publish research on open datasets for LLM training<\/title>\n<meta name=\"description\" content=\"Mozilla and EleutherAI brought together experts to discuss creating openly licensed and open-access LLM training datasets.\u00a0\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/\",\"url\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/\",\"name\":\"Mozilla, EleutherAI publish research on open datasets for LLM 
training\",\"isPartOf\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/Screenshot-2024-06-26-at-09-37-38-dataset-convening-1.pdf.png\",\"datePublished\":\"2025-01-21T16:55:36+00:00\",\"dateModified\":\"2025-01-21T18:54:58+00:00\",\"author\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/ff2a2684ab8dcbe5372151857748455d\"},\"description\":\"Mozilla and EleutherAI brought together experts to discuss creating openly licensed and open-access LLM training datasets.\u00a0\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#primaryimage\",\"url\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/Screenshot-2024-06-26-at-09-37-38-dataset-convening-1.pdf.png\",\"contentUrl\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/Screenshot-2024-06-26-at-09-37-38-dataset-convening-1.pdf.png\",\"width\":2026,\"height\":1216,\"caption\":\"The image features a large purple semi-circle on the left, intersecting with concentric purple arcs. On the right is a white semi-circle with a purple center, emitting white and light gray rays. 
The background is light lavender.\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.mozilla.org\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Mozilla, EleutherAI publish research on open datasets for LLM training\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#website\",\"url\":\"https:\/\/blog.mozilla.org\/en\/\",\"name\":\"The Mozilla Blog\",\"description\":\"News and Updates about Mozilla\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.mozilla.org\/en\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/ff2a2684ab8dcbe5372151857748455d\",\"name\":\"Kristina Bravo\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/image\/cd320165a9224f3c60c912bf4086a89f\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/22fa545a3c48bc13cc1d84d5e09ffbff?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/22fa545a3c48bc13cc1d84d5e09ffbff?s=96&d=mm&r=g\",\"caption\":\"Kristina Bravo\"},\"url\":\"https:\/\/blog.mozilla.org\/en\/author\/kbravo\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Mozilla, EleutherAI publish research on open datasets for LLM training","description":"Mozilla and EleutherAI brought together experts to discuss creating openly licensed and open-access LLM training datasets.\u00a0","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/","url":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/","name":"Mozilla, EleutherAI publish research on open datasets for LLM training","isPartOf":{"@id":"https:\/\/blog.mozilla.org\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#primaryimage"},"image":{"@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/Screenshot-2024-06-26-at-09-37-38-dataset-convening-1.pdf.png","datePublished":"2025-01-21T16:55:36+00:00","dateModified":"2025-01-21T18:54:58+00:00","author":{"@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/ff2a2684ab8dcbe5372151857748455d"},"description":"Mozilla and EleutherAI brought together experts to discuss creating openly licensed and open-access LLM training 
datasets.\u00a0","breadcrumb":{"@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#primaryimage","url":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/Screenshot-2024-06-26-at-09-37-38-dataset-convening-1.pdf.png","contentUrl":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2024\/06\/Screenshot-2024-06-26-at-09-37-38-dataset-convening-1.pdf.png","width":2026,"height":1216,"caption":"The image features a large purple semi-circle on the left, intersecting with concentric purple arcs. On the right is a white semi-circle with a purple center, emitting white and light gray rays. The background is light lavender."},{"@type":"BreadcrumbList","@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/dataset-convening\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.mozilla.org\/en\/"},{"@type":"ListItem","position":2,"name":"Mozilla, EleutherAI publish research on open datasets for LLM training"}]},{"@type":"WebSite","@id":"https:\/\/blog.mozilla.org\/en\/#website","url":"https:\/\/blog.mozilla.org\/en\/","name":"The Mozilla Blog","description":"News and Updates about Mozilla","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.mozilla.org\/en\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/ff2a2684ab8dcbe5372151857748455d","name":"Kristina 
Bravo","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/image\/cd320165a9224f3c60c912bf4086a89f","url":"https:\/\/secure.gravatar.com\/avatar\/22fa545a3c48bc13cc1d84d5e09ffbff?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/22fa545a3c48bc13cc1d84d5e09ffbff?s=96&d=mm&r=g","caption":"Kristina Bravo"},"url":"https:\/\/blog.mozilla.org\/en\/author\/kbravo\/"}]}},"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts\/75322"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/users\/1889"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/comments?post=75322"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts\/75322\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/media\/75376"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/media?parent=75322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/categories?post=75322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/tags?post=75322"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/coauthors?post=75322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}