Evaluating LLMs at scale

(To read the complete Mozilla.ai learnings on LLM evaluation, please visit the Mozilla.ai blog)

Large language models (LLMs) have advanced rapidly, but determining their real-world performance remains a complex challenge in AI. Mozilla.ai participated in NeurIPS 2023, one of the most prominent machine learning conferences, by co-sponsoring a challenge focused on efficient fine-tuning of LLMs and the development of robust evaluation techniques.

The competition emphasized fine-tuning LLMs under strict hardware constraints. Fine-tuning involves updating specific parts of an existing LLM with curated datasets to specialize its behavior. The goal was to fine-tune models within 24 hours on a single GPU, making the process accessible to teams without high-performance compute clusters.
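
The mechanics of parameter-efficient fine-tuning can be sketched with the Hugging Face transformers and peft libraries. The model name and LoRA hyperparameters below are illustrative assumptions, not the competition's actual configuration; the point is that only a small set of adapter weights is trained, which is what makes a single-GPU, 24-hour budget plausible.

```python
# Minimal sketch of parameter-efficient fine-tuning (LoRA) with Hugging Face
# transformers + peft. Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "facebook/opt-350m"  # assumption: any causal LM small enough for one GPU
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters; only the adapter weights are trained,
# which keeps memory and compute within a single-GPU budget.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor for the adapters
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports how few parameters are actually updated
```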

Mozilla.ai played a key role in evaluating the results of these fine-tuning experiments. We used tools like HELM, a framework developed at Stanford for running a broad set of tasks to assess LLM performance. However, evaluating LLMs is hard because transformer models respond stochastically: a model can give a different answer each time it is shown the same prompt, and there are many ways to measure those responses. This complexity makes it challenging to compare models objectively and decide which models are truly “best”.
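
A small sketch illustrates why this is hard: with sampling enabled, the same prompt can yield a different completion on every run, so any single score is really one sample from a distribution. The model name and prompt below are placeholder assumptions; the structure, repeated generation plus an aggregate over runs, is the point.

```python
# Minimal sketch of stochastic generation: the same prompt, sampled several times,
# can produce different answers, so scores should be aggregated over repeated runs.
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-350m"  # assumption: any small causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: What is the capital of Australia?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

answers = []
for _ in range(5):
    output = model.generate(
        **inputs,
        do_sample=True,        # sampling makes each run potentially different
        temperature=0.8,
        max_new_tokens=10,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    answers.append(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())

# Aggregating over runs gives a more honest picture than any single generation.
print(Counter(answers))
```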

The competition highlighted the rapidly evolving nature of LLMs. New models, fine-tuning techniques, and evaluation methods are constantly being introduced, so reliable and standardized evaluation of LLMs will be crucial for understanding their capabilities and ensuring they are trustworthy.

Open source plays a big role in this area because evaluation is such a multifaceted problem. Working collaboratively and with open-source systems is crucial for moving toward a better framework that many people could eventually use in the field.

At Mozilla.ai, we believe in the importance of establishing robust and transparent foundations for the entire evaluation landscape, and we are pursuing several tracks of work to support this. On the experimentation side, we focus on research approaches that allow for clearly defined metrics, transparency, and repeatable evaluations. On the infrastructure side, we are developing reliable and replicable infrastructure to evaluate models and to store and introspect model results.
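
As an illustration of what “store and introspect model results” can look like at its simplest, the sketch below keeps per-run scores in a small SQLite table keyed by model, task, and metric. The schema and names are assumptions for illustration, not Mozilla.ai’s actual infrastructure.

```python
# Minimal sketch of an evaluation-results store: one row per (model, task, metric, run),
# so results stay replicable and can be introspected later. Schema is illustrative only.
import sqlite3

conn = sqlite3.connect("eval_results.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS results (
           model TEXT, task TEXT, metric TEXT, run_id INTEGER, score REAL
       )"""
)

# Record a few hypothetical runs of the same task and metric.
rows = [
    ("model-a", "mmlu", "accuracy", 1, 0.61),
    ("model-a", "mmlu", "accuracy", 2, 0.58),
    ("model-b", "mmlu", "accuracy", 1, 0.64),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Introspection: average each model's score over repeated runs.
for model, avg in conn.execute(
    "SELECT model, AVG(score) FROM results GROUP BY model ORDER BY model"
):
    print(model, round(avg, 3))
```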


Read the whole set of learnings in the Mozilla.ai blog.