Summary
A key event of the past week was the release of the R1 model family by the Chinese company DeepSeek. The model is open-source and performs better than OpenAI’s o1 model on benchmarks. Perhaps the most significant feature of DeepSeek’s R1 model is that it was trained on lower-grade GPUs, a consequence of the US export ban on high-grade chips to China. The news led to Nvidia’s market cap losing 600 billion USD in two trading days (a 16.9% fall), as it suggests that Nvidia is no longer a necessary actor for training high-performing models.
The other big news of the week was Donald Trump’s return to the US presidency and the change in policy on AI. He has thrown out President Biden’s executive order on AI safety standards and announced a 500 billion USD joint venture called Stargate with OpenAI, Oracle and SoftBank to build “the largest AI infrastructure project in history”. The move marks a shift in Big Tech thinking, from the goal of safe AI two years ago to building major AI infrastructure today with the aim of staying ahead of China.
Inside Big Tech, Meta spent 24 million USD on lobbying in Washington last year, while OpenAI spent 1.76 million USD. The split between OpenAI and Microsoft is reportedly widening, as OpenAI claims that it requires more compute time in data centers than Microsoft is able or willing to offer. OpenAI is under criticism for hiding the fact that it financed the development of the benchmark used to evaluate its new o3 model, calling into question the validity of the score the model obtained.
On the technical side, there have been articles on retrieval-augmented generation (RAG). One argues that, in the choice between fine-tuning a model on recent or corporate documents and using a RAG infrastructure to gain access to this external knowledge corpus, the RAG and prompt engineering approach is usually the best option, mainly because of the large and difficult-to-predict costs of fine-tuning a model. It also makes it easier to upgrade the model when required. An article on cache-augmented generation (CAG) describes how the documents of a knowledge corpus can be loaded into the model’s context once and the corresponding attention cache precomputed, so that individual requests no longer pay the cost of re-processing the documents at inference time.
An article on the new generation of coding agents argues that performance is improved by including in the training data the thought process of developers, e.g., why scrolling was needed, which sections of code required knowledge of other parts of the repository, and what documentation was consulted. Finally, an opinion article in MIT Technology Review argues that, despite notable advances by AI in the sciences, only quantum computing will be able to bring real breakthroughs in several disciplines. Classical computers are still too restricted to model the quantum-mechanical properties found in nature, and AI models will always be limited by their training data, which is too small in scope to model all possible properties in materials science, chemistry or medicine.
Table of Contents
1. Some lessons from the OpenAI-FrontierMath debacle
2. The bitter lesson for generative AI adoption
3. OpenAI has upped its lobbying efforts nearly sevenfold
4. The second wave of AI coding is here
5. Microsoft’s relationship with OpenAI cracked when it hired Mustafa Suleyman, rival Marc Benioff says
6. Beyond RAG: How cache-augmented generation reduces latency, complexity for smaller workloads
7. Trump unveils 500 billion USD Stargate AI project between OpenAI, Oracle and SoftBank
8. Useful quantum computing is inevitable – and increasingly imminent
9. How a top Chinese AI model overcame US sanctions
1. Some lessons from the OpenAI-FrontierMath debacle
This blog post by AI scientist Satvik Golechha is extremely critical of OpenAI’s claim that its new o3 model scored 25% on the FrontierMath benchmark, compared to a score of only 2% by previous models. The fundamental problem is that OpenAI funded the FrontierMath benchmark, developed at Epoch AI, and asked Epoch AI to hide this fact until after OpenAI announced its new o3 model. Epoch AI paid outside mathematicians up to 1’000 USD to contribute to the benchmark, but these mathematicians were also not informed that the funding came from OpenAI. The author voices two specific concerns about this situation. First, the FrontierMath benchmark has math problems at three levels of difficulty (25% olympiad-level problems, 50% mid-difficulty problems, and 25% expert-level problems requiring several weeks for an expert mathematician to solve). The concern here is that OpenAI did not reveal from which difficulty tiers the 25% benchmark score came: a 25% overall score is arithmetically consistent with solving every olympiad-level problem and none of the harder ones. The second concern is linked to the “thinking” method of the o3 model, where intermediate results are generated and a reinforcement-learning approach selects the best answer. If the benchmark was used in more than a single evaluation run, it could have influenced this reinforcement learning, meaning that the benchmark is no longer fair as an evaluation measure. Finally, the author proposes that, in future, benchmark metadata clearly explain how development of the benchmark was funded, and that people contributing to the benchmark receive a written explanation of how it will be used.
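The following toy calculation makes the ambiguity concrete. The tier shares are FrontierMath’s published difficulty mix; the per-tier solve rates are assumptions chosen purely for illustration:

```python
# Hypothetical per-tier results: (share of benchmark, fraction solved).
tiers = {
    "olympiad (easiest)": (0.25, 1.00),  # every easiest-tier problem solved
    "mid-difficulty":     (0.50, 0.00),  # none solved
    "expert":             (0.25, 0.00),  # none solved
}

overall = sum(share * solved for share, solved in tiers.values())
print(f"overall score: {overall:.0%}")  # -> 25%, the same headline number
```

A headline figure of 25% is therefore compatible with very different capability profiles, which is why the author asks for a per-tier breakdown.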
2. The bitter lesson for generative AI adoption
An enterprise AI project requires enterprise data. There are two ways of incorporating this data into a large language model’s knowledge: 1) fine-tune the base model on the company data, or 2) use retrieval-augmented generation (RAG) and/or prompt engineering to furnish the model with the data. This article argues that RAG and prompt engineering are usually the best option, mainly because of the large and difficult-to-predict costs of fine-tuning a model. The article cites Bloomberg’s approach of including a margin of error of 30% when estimating the cost of model training or fine-tuning. Further, fine-tuning locks the enterprise into one model, whereas the RAG approach allows the model to be replaced by a higher-performing one at a later stage (see the sketch below). One use case for fine-tuning is when the enterprise operates in a highly regulated environment and the model has been validated through a formal process.
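As a rough illustration of why the RAG route avoids lock-in, here is a minimal sketch of a retrieval step. The documents are hypothetical, and a TF-IDF retriever stands in for a production vector database; the point is that enterprise knowledge lives outside the model, so swapping models only changes the final generation call:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical enterprise documents; a real system would use a vector store.
docs = [
    "Refunds are possible within 30 days of purchase with a receipt.",
    "Q3 revenue grew 12% year-on-year, driven by the EMEA region.",
    "All production deployments require sign-off from the security team.",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_matrix = vectorizer.transform(docs)

def build_prompt(question: str, k: int = 2) -> str:
    """Retrieve the k most relevant documents and inline them in the prompt."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
    context = "\n".join(docs[i] for i in scores.argsort()[::-1][:k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What is the refund window?"))
# The resulting prompt can be sent to whichever model is current;
# no model weights are changed.
```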
3. OpenAI has upped its lobbying efforts nearly sevenfold
OpenAI is taking a more active part in Washington politics, spending 1.76 million USD on lobbying in 2024 (compared to just 260’000 USD in 2023). Company CEO Sam Altman also donated 1 million USD to Donald Trump’s election campaign. The article underlines a shift in thinking since 2022, when “responsible AI” was a buzzword, Big Tech was promising to fight deepfakes and other AI risks, and President Biden issued an executive order on the safe use of AI in administrations. Today, the Biden executive order has been thrown out by Trump, and OpenAI is positioning itself for military contracts, access to cheaper subsidized energy to train its models, and fewer regulatory controls over safety. Altman has already called for the building of five-gigawatt data centers, each of which would consume as much energy as New York City. The policy in Washington is to accelerate AI development and to remain ahead of China. Meta remains the top spender on lobbying, having spent 24 million USD last year. One area of discord between Big Tech and the new Trump administration relates to visas: Big Tech is in favor of expanding the H-1B visa program to attract more talent, whereas Trump supporters are against it.
4. The second wave of AI coding is here
Coding agents are perhaps the biggest success story of generative AI. GitHub’s Copilot, based on OpenAI’s language models, is very popular with developers and, at Google, 25% of new code is created by AI. The article cites new companies in this domain such as Zencoder, Merly, Cosine, Tessl, and Poolside. Tessl was valued at 750 million USD within months of its creation, and Poolside was valued at 3 billion USD before it had even released a product. The next generation of coding agents will not just generate code: they will develop prototypes, write tests for them, and fix bugs from descriptions of the errors. They will even be able to develop several prototypes in parallel and suggest the best design, with human programmers moving to a more managerial role. For coding agents to achieve this increased level of intelligence, they will need to be trained differently. Instead of training them solely on code samples, the thought process of developers will be included in the training data. At Cosine, for instance, developers are asked to leave “breadcrumbs” of their process, e.g., why scrolling was needed, which sections of code required knowledge of other parts of the repository, and what documentation was consulted (a hypothetical example follows below). Breadcrumb trails may help create software that is more repository-aware and make clear the steps required to create a particular program. New coding agents will significantly reduce the number of required programmers. For one expert, “it will be like how ATMs transformed banking … Anything you want to do will be determined by compute and not head count.”
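The article does not describe Cosine’s actual data format; purely as a hypothetical illustration, a breadcrumb-annotated training example might pair a code change with the developer’s recorded navigation steps, along these lines:

```python
# Hypothetical schema for one training example: the final code diff plus
# the developer's "breadcrumbs" explaining how they navigated the repo.
training_example = {
    "task": "Fix off-by-one error in pagination",
    "final_diff": "--- a/pager.py\n+++ b/pager.py\n@@ ...",
    "breadcrumbs": [
        {"action": "scrolled", "why": "locate where page offsets are computed"},
        {"action": "opened_file", "file": "api/routes.py",
         "why": "pager.py's offset is consumed here; needed the call site"},
        {"action": "read_docs", "what": "framework pagination docs",
         "why": "confirm whether offsets are zero- or one-based"},
    ],
}
```

Training on records like this, rather than on the diff alone, is what the article means by teaching agents the process behind the code, not just its end result.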
5. Microsoft’s relationship with OpenAI cracked when it hired Mustafa Suleyman, rival Marc Benioff says
Salesforce CEO Marc Benioff claims that the rift between Microsoft and OpenAI is widening. Microsoft is one of OpenAI’s earliest and largest investors, having contributed 1 billion USD in 2019. Since then, Microsoft has been developing its own AI and hired DeepMind and Inflection co-founder Mustafa Suleyman to lead Microsoft AI, a move interpreted as a snub of OpenAI by Microsoft CEO Satya Nadella. For its part, OpenAI purportedly claims that it requires more compute time in data centers than Microsoft is able or willing to offer. OpenAI’s agreement to share research results with Microsoft is set to end when Artificial General Intelligence (AGI) is achieved. OpenAI has recently been claiming that AGI is close, which many interpret as a desire to end the collaboration with Microsoft as soon as possible. Employees from both companies recently admitted to Business Insider that relations between the companies are poor.
6. Beyond RAG: How cache-augmented generation reduces latency, complexity for smaller workloads
Retrieval-augmented generation (RAG) is a means for a language model to access information that is more recent than its training data. It also permits access to information that was not included in the model’s training, such as proprietary company information. On the flip side, implementing a RAG infrastructure can be expensive and RAG operations are slow. An alternative to RAG is to include the documents of the external knowledge corpus directly in the prompt sent to the model. This alternative is increasingly attractive as the context length of models (the number of tokens that can be included in a prompt) increases. For instance, Claude 3.5 Sonnet supports up to 200’000 tokens (which Anthropic says corresponds to 500 pages of text or 100 images), GPT-4o supports 128’000 tokens, and Gemini supports up to 2 million tokens.
However, simply adding all documents to every prompt is not efficient, because the model must re-process the entire corpus at each request. In an approach named cache-augmented generation (CAG), developed at National Chengchi University in Taiwan, the documents are passed through the model once beforehand and the model’s attention (key-value) cache is precomputed and stored. Each incoming request then reuses this cache instead of re-encoding the documents at inference time, which reduces overall inference costs and request latency. Experiments with Anthropic’s prompt caching yielded up to a 90% reduction in costs and an 85% reduction in latency in some cases. The researchers note that CAG is not an appropriate solution when the documents of the external knowledge corpus change frequently.
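A minimal sketch of the idea, assuming a Hugging Face causal language model (the model name and prompt handling are placeholders, not the researchers’ implementation): the corpus is run through the model once to fill the key-value cache, and each question reuses a copy of that cache instead of re-processing the documents.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

# Placeholder model name: any Hugging Face causal LM should work.
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

corpus = "<all reference documents concatenated here>"
corpus_ids = tok(corpus, return_tensors="pt").input_ids

# One-time pass: run the documents through the model to fill the
# attention key-value cache.
kv_cache = DynamicCache()
with torch.no_grad():
    model(corpus_ids, past_key_values=kv_cache, use_cache=True)

def answer(question: str) -> str:
    q_ids = tok(question, return_tensors="pt").input_ids
    input_ids = torch.cat([corpus_ids, q_ids], dim=-1)
    # Reuse a copy of the precomputed cache; generate() only has to
    # process the question tokens, not the whole corpus.
    out = model.generate(
        input_ids,
        past_key_values=copy.deepcopy(kv_cache),
        max_new_tokens=128,
    )
    return tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
```

Because the cache is built once, per-request work scales with the question length rather than the corpus length, which is where the latency savings come from; the trade-off, as noted above, is that any change to the corpus forces a full cache rebuild.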
7. Trump unveils 500 billion USD Stargate AI project between OpenAI, Oracle and SoftBank
In his first days as President, Donald Trump threw out President Biden’s executive order on AI safety standards and announced a 500 billion USD joint venture with OpenAI, Oracle and SoftBank to build “the largest AI infrastructure project in history”. The project is named Stargate and aims to ensure that the US maintains its lead in AI over China. Stargate will finance the development of data centers across the US, together with the power stations to supply them, which Trump hopes will create 100’000 jobs in the short term. The financial firm Blackstone has already projected that 1 trillion USD will be spent on data centers in the next five years. The support of Big Tech marks a clear shift from two years ago, when responsible AI was touted as the overriding goal for AI.
8. Useful quantum computing is inevitable – and increasingly imminent
This opinion article by start-up investor Peter Barrett argues that, despite notable advances by AI in the sciences, only quantum computing will be able to bring real breakthroughs in several disciplines. The heart of the problem is that we still lack a lot of scientific knowledge. For instance, “we still don’t understand why the painkiller acetaminophen works, how type-II superconductors function, or why a simple crystal of iron and nitrogen can produce a magnet with such incredible field strength. We search for compounds in Amazonian tree bark to cure cancer and other maladies, manually rummaging through a pitifully small subset of a design space encompassing 10^60 small molecules”. Nature operates on the principles of quantum mechanics, and classical computing approaches are unable to model these principles appropriately. Density functional theory (DFT) is the main approach to modeling quantum-mechanical features, but it only works well when correlations between elements are weak, making the approach fail for a broad class of problems from nature. Admittedly, AI has produced interesting results. The author cites the example of Google DeepMind’s Graph Networks for Materials Exploration (GNoME), which found 380’000 new potentially stable materials using DFT. However, GNoME is an AI model, so it is fundamentally limited by its training data, which covers only a fraction of the vast design space in the search for new materials.
The author cites Google’s Willow chip as a major breakthrough in the quest for quantum supremacy (the ability to execute, in a reasonable time, a task that is effectively impossible for a classical computer). Willow ran a benchmark in under five minutes that would have taken the world’s most powerful supercomputer 10 septillion years. Willow has also made significant progress on handling errors, an issue that has plagued the design of earlier quantum machines. The start-up PsiQuantum hopes to commercialize, by the end of the decade, a quantum computer 10’000 times more powerful than Willow, making it possible to look for breakthroughs in materials, drugs and medicine. The author admits that building such a quantum computer remains a huge engineering and financial challenge, requiring a new generation of silicon photonics components, error-correction hardware fast enough to keep pace with photons, and single-photon detectors of unprecedented sensitivity.
9. How a top Chinese AI model overcame US sanctions
The Chinese firm DeepSeek has released a new open-source model called R1, which matches and even exceeds OpenAI’s o1 model on several key benchmarks while operating at a much lower cost. The model uses a “chain-of-thought” reasoning approach, similar to that of o1, and has been cited for its good performance on coding and complex reasoning tasks. Six smaller versions of R1 were also released that can run on standard laptops and that outperform o1-mini on some benchmarks. According to TechCrunch, DeepSeek R1 has already surpassed ChatGPT in downloads from app stores in the US and 51 other countries. By last Monday morning, the DeepSeek R1 app had nearly 3 million downloads from the App Store and Google Play, with the Chinese market accounting for only 23% of them. Meta has reportedly created a “war room” to analyze the R1 model’s workings.
Perhaps the most significant feature of DeepSeek’s R1 model is that it was trained on lower-grade GPUs. The US had placed an export ban on high-grade chips to China; for instance, Nvidia could only export GPUs that did not exceed 50% of the speed of its highest-quality chips. This restriction has pushed Chinese companies to look for more creative ways of collaborating and pooling resources. For one AI researcher, the “US export control has essentially backed Chinese companies into a corner where they have to be far more efficient with their limited computing resources”. Larger AI companies in China may have stockpiled up to 50’000 Nvidia A100 chips in anticipation of the US ban, but DeepSeek would have had to use lower-grade hardware. One impact of the announcement of R1 is that Nvidia’s market cap lost 600 billion USD in two trading days (a 16.9% fall), because the company is no longer seen as a necessary actor for training high-performing models.