The cost of cutting-edge: Scaling compute and limiting access

(To read the complete publication on LLM evaluation, please visit the blog)

In a year marked by extraordinary advancements in artificial intelligence, largely driven by the evolution of large language models (LLMs), one factor stands out as a universal accelerator: the exponential growth in computational power.

Over the last few years, researchers have continuously pushed the boundaries of what a ‘large’ language model means, increasing both the size of these models and the volume of data they were trained on. This exploration has revealed a consistent trend: as the models have grown, so have their capabilities.

Fast forward to today, and the landscape has radically transformed. Training state-of-the-art LLMs like ChatGPT has become an immensely costly endeavor. The bulk of this expense stems from the staggering amount of computational resources required. To train an LLM, researchers process enormous datasets using the latest, most advanced GPUs (graphics processing units). The cost of acquiring just one of these GPUs, such as the H100, can reach upwards of $30,000. Moreover, these units are power-hungry, contributing to significant electricity usage. 

While there have been substantial efforts by researchers and organizations towards openness in the development of large language models, in addition to the compute challenges, three major hurdles have also hampered efforts to level the playing field:

  • Limited Transparency: The specifics of model architecture, data sources, and tuning parameters are often closely guarded.
  • Challenging Reproducibility: Due to the swift pace of innovation, independently replicating these models on your own infrastructure can be very difficult.
  • Disjointed Evaluation: The absence of a universally accepted benchmark for LLM evaluation complicates direct comparisons. The multitude of available tasks and frameworks, each assessing different capabilities, means comparisons are often inconclusive.

The high compute requirements, coupled with the opaqueness and the complexities of developing and evaluating LLMs, are hampering progress for researchers and smaller organizations striving for openness in the field. This not only threatens to reduce the diversity of innovation but also risks centralizing control of powerful models in the hands of a few large entities. The critical question arises: are these entities prepared to shoulder the ethical and moral responsibilities this control entails? Moreover, what steps can we take to bridge the divide between open innovators and those who hold the keys to the leading LLM technology?

Why LLMs make model evaluation harder than ever

Evaluation of LLMs involves assessing their performance and capabilities across various tasks and benchmarks and provides a measure of progress and highlights areas where models excel or need improvement. 

‘Traditional’ machine learning evaluation is quite straightforward: if we develop a model to predict lung cancer from X-ray images, we can test its accuracy by using a collection of X-rays that doctors have already diagnosed as either having cancer (YES) or not (NO). By comparing the model’s predictions with the doctor-diagnosed cases, we can assess how well it matches the expert classifications.

In contrast, LLMs can complete an almost endless number of tasks: summarization, autocompletion, reasoning, generating recommendations for movies and recipes, writing essays, telling stories, generating good code, and so on. Evaluation of performance therefore becomes much, much harder.

At, we believe that making open-source model fine-tuning and evaluation as accessible as possible is an important step to helping people own and trust the technology they use. Currently, this still requires expertise and the ability to navigate a rapidly evolving ecosystem of techniques, frameworks, and infrastructure requirements. We want to make this less overwhelming for organizations and developers, which is why we’re building tools that:

  • Help them find the right open-source LLM for their needs
  • Make it simple to fine-tune an open-source LLM on their own data
  • Make language model evaluation and validation more accessible

We think there will be a significant opportunity for many organizations to use these tools to develop their own small and specialized language models that address a range of meaningful use cases in a cost-effective way. We strive to empower developers and organizations to engage with and trust the open-source ecosystem and minimize their dependence on using closed-source models over which they have no ownership or control.

Read the whole publication and subscribe to future ones in the blog.

Share on Twitter