
It’s getting tougher to tell which company is winning the AI race, Hugging Face co-founder says



  • Hugging Face’s Thomas Wolf says that it’s getting harder to tell which AI model is the best as traditional AI benchmarks become saturated. Going forward, Wolf said the AI industry could rely on two new benchmarking approaches—agency‑based and use‑case‑specific.

Thomas Wolf, co‑founder and chief scientist at Hugging Face, thinks we may need new ways to measure AI models.

Wolf told the audience at Brainstorm AI in London that as AI models get more advanced, it’s becoming increasingly difficult to tell which one is performing the best.

“It’s getting hard to tell what the best model is,” he said, pointing to the nominal differences between recent releases from OpenAI and Google. “They all seem to be, actually, very close.”

“The world of benchmarks has evolved a lot. We used to have this very academic benchmark that we mostly measured the knowledge of the model on—I think the most famous was MMLU (Massive Multitask Language Understanding), which was basically a set of graduate‑level or PhD‑level questions that the model had to answer,” he said. “These benchmarks are mostly all saturated right now.”

Over the past year, there has been a growing chorus of voices from academia, industry, and policy claiming that common AI benchmarks, such as MMLU, GLUE, and HellaSwag, have reached saturation, can be gamed, and no longer reflect real‑world utility.

In February, researchers at the European Commission’s Joint Research Centre published a paper titled “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation” that found “systemic flaws in current benchmarking practices”—including misaligned incentives, construct‑validity failures, gaming of results, and data contamination.

Going forward, Wolf said the AI industry should rely on two main types of benchmarks in 2025: one assessing the agency of models, where LLMs are expected to complete tasks, and another tailored to each specific use case.

Hugging Face is already working on the latter.

The company’s new program, “Your Bench,” aims to help users determine which model to use for a specific task. Users feed a few documents into the program, which automatically generates a benchmark specific to that type of work; users can then run that benchmark against different models to see which one performs best for their use case.
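For readers curious what that document-to-benchmark workflow looks like in practice, here is a minimal sketch in Python. It is not the actual Your Bench API; the generator model, candidate model IDs, and helper functions are illustrative assumptions, and it simply uses an LLM to turn the user’s documents into questions and then collects each candidate model’s answers.

```python
# Hypothetical sketch of the "feed documents in, get a per-use-case benchmark out"
# workflow described above. This is NOT the actual Your Bench API; the model names
# and helper functions below are illustrative assumptions.
from huggingface_hub import InferenceClient

GENERATOR = "meta-llama/Llama-3.1-70B-Instruct"   # assumed question-writer model
CANDIDATES = ["model-a", "model-b"]               # placeholder model IDs to compare


def ask(model_id: str, prompt: str) -> str:
    """Send a single prompt to a hosted model and return its text reply."""
    client = InferenceClient(model=model_id)
    reply = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=256
    )
    return reply.choices[0].message.content


def build_benchmark(documents: list[str], n_questions: int = 5) -> list[str]:
    """Turn the user's own documents into use-case-specific test questions."""
    questions: list[str] = []
    for doc in documents:
        prompt = (
            f"Write {n_questions} short exam questions that can only be answered "
            f"using the following document:\n\n{doc}"
        )
        questions.extend(ask(GENERATOR, prompt).splitlines())
    return [q for q in questions if q.strip()]


def compare_models(documents: list[str]) -> dict[str, list[str]]:
    """Run every candidate model over the generated benchmark and collect answers."""
    benchmark = build_benchmark(documents)
    return {model: [ask(model, q) for q in benchmark] for model in CANDIDATES}
```

In a real evaluation the collected answers would still need to be scored, for example by comparing them against reference answers drawn from the same documents.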

“Just because these models are all working the same on this academic benchmark doesn’t really mean that they’re all exactly the same,” Wolf said.

Open‑source’s ‘ChatGPT moment’

Founded by Wolf, Clément Delangue, and Julien Chaumond in 2016, Hugging Face has long been a champion of open‑source AI.

Often referred to as the GitHub of machine learning, the company provides an open‑source platform that enables developers, researchers, and enterprises to build, share, and deploy machine‑learning models, datasets, and applications at scale. Users can also browse models and datasets that others have uploaded.

Wolf told the Brainstorm AI audience that Hugging Face’s “business model is really aligned with open source” and the company’s “goal is to have the maximum number of people participating in this kind of open community and sharing models.”

Wolf predicted that open‑source AI would continue to thrive, especially after the success of DeepSeek earlier this year.

After its launch earlier this year, the Chinese‑made AI model DeepSeek R1 sent shockwaves through the AI world when testers found that it matched or even outperformed American closed‑source AI models.

Wolf said DeepSeek was a “ChatGPT moment” for open‑source AI.

“Just like ChatGPT was the moment the whole world discovered AI, DeepSeek was the moment the whole world discovered there was kind of this open society,” he said.

This story was originally featured on Fortune.com
