Better AI Benchmarks

AI Agents for Shopping

Posted on December 6th, 2024

Summary

On the research front, a paper from Stanford University is highly critical of the benchmarks currently used to evaluate the performance of large language models. There are issues such as outdated code, an inability to distinguish signal from noise, quick saturation (where advancements in AI mean that tests of the benchmark become easily solvable), and contamination (where benchmark data gets merged into training data). Elsewhere, the test-time compute technique, applied during training and the model inference phase, could make it possible to build more efficient models with significantly reduced size and training times.

On the application side, there is significant research on using AI agents for shopping. An AI shopping agent is given a pot of money by its owner, along with a description of items to buy, and then searches the Web for the best offer.

Among the technical systems making the news this week, PydanticAI is a new framework for deploying agents that use large language models. The framework uses Pydantic to enforce type safety on data exchanges with models. AWS announced a tool that checks for hallucinations when distilling a model with AWS Bedrock. Distillation is the process of transferring knowledge from a bigger (teacher) model to a smaller (learner) model. Mistral AI has released two small models, collectively named les Ministraux, whose performance compares favorably to that of the Llama and Gemma models, notably on the HumanEval coding benchmark.

AI is having an impact on political and international affairs. The CEO of Hugging Face has expressed his concern about the increasing prevalence of Chinese AI models. Several Chinese models, for instance, refuse to answer questions about the 1989 Tiananmen Square massacre. Meta has named Russia as the top source of disinformation campaigns on Meta platforms in 2024, and says that it took down 20 so-called covert influence operations during the year. Overall, Meta believes that the use of AI has not yet had an impact on influencing the outcome of elections, but this could change in the future.

OpenAI is now working on a defense project with Anduril, a company that makes AI-based drones, missiles and radar systems for the US military and its allies. The move is seen as a turnaround since the company’s original charter prohibited the use of AI for “weapons development”.

MIT Technology Review has an interview with Arati Prabhakar, Chief Tech Advisor to President Biden. Since the signing of President Biden’s 2023 executive order on AI, she notes the evolution of deepfakes, the creation of images for sexual abuse, and the inappropriate use of facial recognition as the main risks to have emerged.

1. What the departing White House chief tech advisor has to say on AI

Arati Prabhakar, Chief Tech Advisor to President Biden, has given an interview to MIT Technology Review. Prabhakar was a key actor in President Biden’s 2023 executive order on AI, which set guidelines for technology companies to make AI safer and more transparent. However, the order is in danger with the return of Trump to the presidency, as the Republican party’s election manifesto stated that the order “hinders AI innovation and imposes radical leftwing ideas on the development of this technology”. Regarding the risks that have emerged since 2023, Prabhakar cites the evolution of deepfakes, the creation of images for sexual abuse, and the inappropriate use of facial recognition (citing the case of a black man wrongly identified and arrested in a store for a crime he did not commit). On the veto of the AI Safety Bill in California, Prabhakar expresses little surprise. Fundamentally, the problem is that there are no firm guidelines for defining safety: while we “want to know how safe, effective and trustworthy a model is, we actually have very limited capacity to answer those questions”. Finally, Prabhakar is concerned about the increase in skepticism around science. For instance, one danger is the loss of “herd immunity for some of the diseases for which we right now have fairly high levels of vaccination”.

2. BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

New research from Stanford University is highly critical of the benchmarks currently used to evaluate the performance of large language models. Benchmarks compare model performance, and are also used by safety organizations which seek to evaluate models from a regulatory perspective – as is the case for the EU’s AI Act. Poor benchmark quality can therefore skew regulatory evaluations. Overall, there are no generally accepted criteria for evaluating benchmarks themselves. The authors evaluated 24 well-known AI benchmarks and found issues such as outdated code, an inability to distinguish signal from noise (14 out of 24 benchmarks did not do multiple evaluations of the same model for statistical significance), a lack of benchmark reproducibility and scrutiny (17 out of 24 benchmarks do not provide easy-to-run scripts to replicate results), as well as non-respect of the FAIR principles for research data (Findability, Accessibility, Interoperability, and Reuse). Other challenges for benchmarks are quick saturation, where advancements in AI mean that tests of the benchmark become easily solvable, and contamination, where benchmark data gets merged into training data.
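To illustrate the signal-versus-noise point, a benchmark harness can repeat an evaluation with different random seeds and report a mean and spread rather than a single score. The sketch below shows this in Python; it is an illustration of the practice the paper recommends, not code from the paper, and evaluate_once is a hypothetical stand-in for a single benchmark run.

```python
import statistics
from typing import Callable, Dict, List

def evaluate_with_significance(evaluate_once: Callable[[int], float],
                               n_runs: int = 5) -> Dict[str, object]:
    """Repeat a benchmark evaluation with different random seeds so that
    score differences between models can be tested for statistical significance."""
    scores: List[float] = [evaluate_once(seed) for seed in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,  # spread across runs
        "scores": scores,
    }
```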

The authors provide an extensive list of criteria for benchmarks. The design criteria include points such as involving domain experts, addressing input sensitivity, describing use cases and personas, and explaining how the benchmark score should, and should not, be used. Implementation criteria include providing scripts to replicate benchmark results, making evaluation code available, supporting model evaluation via APIs and adding warnings for sensitive and harmful content. Maintenance criteria include listing a contact person and keeping a feedback channel open. Finally, documentation criteria include having an accompanying paper accepted at a peer-reviewed venue, documenting evaluation metrics, and describing data sources and the pre-processing and annotation processes.

3. The race is on to make AI agents do your online shopping for you

The run-up to Christmas generally sees a large increase in online shopping, and AI companies are keen to develop AI agents for this. Perplexity has released an AI shopping agent for paying customers. An AI shopping agent is given a pot of money by its owner, along with a description of items to buy, and then searches the Web for the best offer. The agent can even execute the purchase (Perplexity is collaborating with Stripe for payments). The pot of money is limited in case the agent hallucinates and buys the wrong thing. One issue for the buyer is that the credit card statement will indicate a purchase from the AI company rather than from the shop where the item is bought, and this can affect refund and return policies. Shops will also have issues with AI shopping agents because the human client never visits the shop’s website – and this harms the shop’s ability to offer promotions or encourage impulse purchases. Also, the AI agent works by scraping the Web for data, which means that it can be working with out-of-date information on promotions and stock levels. The article suggests that shopping websites are already blocking AI company web crawlers.
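As a purely illustrative sketch of the loop such an agent runs (not Perplexity’s implementation), the code below searches for offers, filters them by the owner’s budget, and only then triggers a purchase; search_offers and purchase are hypothetical functions standing in for web search/scraping and a payment integration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Offer:
    shop: str
    item: str
    price: float

def shopping_agent(description: str,
                   budget: float,
                   search_offers: Callable[[str], List[Offer]],
                   purchase: Callable[[Offer], None]) -> Optional[Offer]:
    """Find the cheapest matching offer within budget and buy it.
    The budget cap limits the damage if the agent picks the wrong item."""
    offers = search_offers(description)           # e.g. scraped web results
    affordable = [o for o in offers if o.price <= budget]
    if not affordable:
        return None                               # nothing within budget: buy nothing
    best = min(affordable, key=lambda o: o.price)
    purchase(best)                                # payment handled by a provider such as Stripe
    return best
```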

4. Mistral AI Releases Two Small Language Models: Les Ministraux

Mistral AI has released two small models – Ministral 3B and Ministral 8B – that are collectively named les Ministraux. As small models, they are designed for execution within the enterprise. According to Mistral, the goal is to “provide a compute-efficient and low-latency solution for [local, on-device and privacy-sensitive] scenarios”. Both models require a commercial license, but the 8B model can be used for research purposes, and its model weights can be downloaded from Hugging Face. According to the Artificial Analysis benchmarking site, the models’ performance compares favorably to that of the Llama and Gemma models, notably on the HumanEval coding benchmark. The Ministraux models are also available via APIs, which facilitates the development of agent applications. For Mistral, “les Ministraux are also efficient intermediaries for function-calling in multi-step agentic workflows. They can be tuned to handle input parsing, task routing, and calling APIs based on user intent across multiple contexts at extremely low latency and cost.”
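As a rough sketch of what function-calling in such a workflow looks like (a generic pattern, not Mistral’s API), the model is prompted to emit a structured call naming a tool and its arguments, which the application then executes. The JSON format and the TOOLS registry below are illustrative assumptions.

```python
import json
from typing import Any, Callable, Dict

# Tool registry: the model's structured output selects one of these by name.
TOOLS: Dict[str, Callable[..., Any]] = {
    "get_weather": lambda city: f"Sunny in {city}",
    "track_order": lambda order_id: f"Order {order_id} is in transit",
}

def dispatch(model_output: str) -> Any:
    """Parse a function call assumed to be returned by the model as JSON,
    e.g. {"name": "get_weather", "arguments": {"city": "Paris"}}, and run it."""
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])
```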

5. Hugging Face CEO has concerns about Chinese open source AI models

The CEO of Hugging Face, Clement Delangue, has expressed his concern about the increasing prevalence of Chinese AI models. He believes that China will start to dominate the global AI race in 2025. The problem for Delangue is that the Chinese government forces its companies to produce models that “embody core socialist values” and comply with an extensive censorship system. For instance, two Chinese models – DeepSeek and a model from Alibaba’s Qwen family – refuse to answer questions about the 1989 Tiananmen Square massacre. The Alibaba model is available on Hugging Face. That said, another Alibaba model on Hugging Face – Qwen2.5-72B-Instruct – does reply to questions about Tiananmen Square. Fundamentally, the issue is about preventing a single country from dominating AI, thereby avoiding a monopoly on the spread of “cultural aspects”. Employees at OpenAI have expressed concerns about Chinese models similar to those of Hugging Face’s CEO.

6. Meta says it has taken down about 20 covert influence operations in 2024

Nick Clegg, the president of global affairs at Meta, has named Russia as the top source of disinformation campaigns on Meta platforms. Meta has taken down 20 so-called covert influence operations in 2024. This year was an important test of the impact of covert influence operations, with many elections being held worldwide. The article cites two examples of Russian-based operations. One operation used AI to create fake websites that looked very similar to those of Fox News and the Telegraph in order to propagate disinformation about the war in Ukraine. A second operation propagated pro-Russian and anti-French sentiment in Africa. In the context of the recent US elections, Clegg said that Meta blocked over 500’000 requests to create AI-generated images of the presidential candidates. Meta also noted the frequent creation of inauthentic accounts whose purpose was to publish content to influence opinion. Overall, Meta believes that the use of AI has not yet had an impact on influencing the outcome of elections, but this could change in the future with the increased use of deepfakes and fake content, along with the covert manipulation of social media accounts. A spokesman for the UK-based Alan Turing Institute says that AI-generated content is already influencing the political debate, citing the case of a TV report on a Kamala Harris rally which her opponents falsely claimed was made with AI.

7. OpenAI’s new defense contract completes its military pivot

OpenAI is partnering with the defense-tech company Anduril, which makes AI-based drones, missiles and radar systems for the US military and its allies. OpenAI will work on tools for real-time data analysis that help human operators defending against drone attacks make better decisions and gain better situational awareness. The move by OpenAI is seen as a complete turnaround in company policy. Only one year ago, the company’s charter prohibited the use of its models for “weapons development”. Today, a spokesman for the company stated that “democracies should continue to take the lead in AI development, guided by values like freedom, fairness, and respect for human rights.” Some observers comment that Russia’s attack on Ukraine is one reason for a general softening of attitudes in Silicon Valley towards military cooperation. Another reason for increased cooperation is that defense contracts are quite lucrative. The article reports that venture capital firms have more than doubled their investment in defense technology since 2021, to 40 billion USD today. The move by OpenAI also comes at a time when the company is expecting an annual loss of 5 billion USD and is being forced to explore new revenue streams such as advertising.

8. PydanticAI

Pydantic is a Python-based framework for defining data models and validating data against them. For type safety, a model can specify that an ID field must be an integer, a name field a string, and a nationality field a valid 3-character ISO code. Such models are needed because large volumes of data are processed from sources where the data is not guaranteed to be clean, so automated support for checking data quality is required. Pydantic is used in combination with external APIs such as those provided by AI companies (e.g., the OpenAI SDK, the Anthropic SDK, LangChain, etc.). PydanticAI is a new framework for deploying agents that use large language models. The framework is model agnostic, and uses Pydantic to enforce type safety on data exchanges with models. The framework currently supports OpenAI, Gemini, and Groq; Anthropic will soon be supported. The developers assert that the framework is stable enough for production-grade applications.
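As a rough illustration, the sketch below defines a Pydantic model for a structured answer and asks a PydanticAI agent to return it. The model identifier, field names and prompt are illustrative assumptions, and the PydanticAI calls shown (Agent, result_type, run_sync) reflect the library’s early releases and may differ in later versions.

```python
from pydantic import BaseModel, Field
from pydantic_ai import Agent

# Pydantic model describing the structured output expected from the LLM.
class CityInfo(BaseModel):
    name: str
    country_code: str = Field(min_length=3, max_length=3,
                              description="3-character ISO country code")
    population: int

# Model-agnostic agent; 'openai:gpt-4o' is an illustrative model identifier.
agent = Agent(
    "openai:gpt-4o",
    result_type=CityInfo,   # responses are validated against the Pydantic model
    system_prompt="Answer with facts about the requested city.",
)

if __name__ == "__main__":
    result = agent.run_sync("Tell me about Geneva.")
    print(result.data)      # a validated CityInfo instance, not raw text
```

If the model’s reply does not validate against the CityInfo schema, the framework can ask the model to retry rather than silently passing malformed data to the application.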

9. AWS says new Bedrock Automated Reasoning catches 100% of AI hallucinations

In the context of language models, distillation is the process of transferring knowledge from a bigger (teacher) model to a smaller (learner) model. The Llama, Nvidia and Claude model families all have distilled smaller variants. Bedrock, from AWS, is a framework designed for distillation and works with models from Anthropic, Amazon and Meta. Bedrock does this by generating sample data for the teacher model, and using the teacher’s responses to fine-tune the learner model. Distillation is important because, even though the smaller learner model has limited knowledge compared to the larger teacher model, it exhibits faster response times, which matters for applications like customer chatbots. At the re:Invent 2024 conference, AWS announced an Automated Reasoning tool for Bedrock to help detect hallucinations using “verifiable reasoning”. The tool attempts to show that an answer given by a distilled model is correct, based on facts that can be furnished as proof.
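Bedrock’s internals are not public, but the teacher/learner loop described above corresponds to standard response-based distillation. The sketch below shows that pattern in generic Python; teacher_generate and finetune_student are hypothetical stand-ins for a large-model call and a supervised fine-tuning step, not AWS APIs.

```python
from typing import Callable, Dict, List

def build_distillation_set(prompts: List[str],
                           teacher_generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Run sample prompts through the (large) teacher model and keep its answers."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

def distill(prompts: List[str],
            teacher_generate: Callable[[str], str],
            finetune_student: Callable[[List[Dict[str, str]]], None]) -> None:
    """Fine-tune the (small) learner model on the teacher's responses."""
    dataset = build_distillation_set(prompts, teacher_generate)
    finetune_student(dataset)  # supervised fine-tuning on prompt/completion pairs
```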

10. Ransomware hackers target NHS hospitals with new cyberattacks

Ransomware attacks on the UK’s National Health Service (NHS) have intensified recently, with a number of leaks of patient and donor data going back to 2018, as well as disruption of emergency services. The Russian ransomware group Inc Ransom claimed responsibility for several attacks. One hospital stated that “emergency treatment is being prioritized but there are still likely to be longer than usual waiting times in our Emergency Department and assessment areas”, urging people to attend the emergency department only in cases of acute necessity. In an earlier attack this year on Synnovis, a pathology services provider, 400 gigabytes of personal data were leaked, including sensitive patient medical data. The Qilin ransomware group claimed responsibility for this attack. The UK government plans to make the reporting of ransomware attacks obligatory in 2025.

11. New AI training techniques aim to overcome current challenges

It is generally accepted that current training techniques for large language models are reaching certain limits. Training can cost tens of millions of US dollars, requires huge quantities of data and puts a lot of stress on electricity grids. Based on insights gained developing OpenAI’s o1 model, researchers are looking at the test-time compute technique for training and the model inference phase. Rather than using pre-trained patterns to recognize the input and generate the most likely output based on probabilities, the test-time compute technique gets the model to generate several possible answers, and then uses reasoning to determine the best answer. An OpenAI researcher mentions that having a bot think for 20 seconds in a hand of poker achieves the same boost in performance as scaling up the model by 100’000 times and training it for 100’000 times longer. This allows for more efficient models with reduced size and training times, and with a reduced need for specialized chips (which is not necessarily good news for Nvidia). The test-time compute technique is reportedly now being used by xAI, Google DeepMind, and Anthropic.
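A minimal sketch of the generate-then-select idea behind test-time compute (not OpenAI’s actual o1 method): sample several candidate answers and keep the one a scoring function prefers. Here generate and score are hypothetical stand-ins for a model call and a verifier or reward model.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Spend extra inference-time compute: sample n candidate answers and
    return the one the scorer (e.g. a verifier model) rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```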