OpenAI in Need of High Quality Data for GPT-5

AI Slop Reaches Security Reports in Open-Source Projects

Posted on December 23rd, 2024

Summary

An increasing concern for AI companies is an apparent shortage of diverse, high-quality training data. A Wall Street Journal article claims that this is the reason for OpenAI's delay in releasing GPT-5 (code-named Orion). To help compensate, OpenAI is hiring mathematicians and software engineers to solve problems and explain how they solved each one, though this is a slow process. Elsewhere, the Data Provenance Initiative, whose goal is to audit the origin and quality of data used to train large language models, is warning that AI models are increasingly reliant on Big Tech for data. For instance, 70% of training-set videos come from YouTube, a platform owned by Alphabet (Google's parent company). This dependence is also risky because Big Tech companies have not been clear about the origin of the data being used, and because 25%, 33% and 32% of the text, speech and video datasets, respectively, in common collections carry non-commercial licenses, meaning they can be used for academic or creative purposes but not commercially.

The test-time compute method, where a model spends more time during the inference phase testing and verifying partial answers in an effort to reduce errors, has been in the news. The approach was used in OpenAI's o1 model and is also used in the o3 model family that OpenAI has just announced. A variant of the technique appears in OpenAI's deliberative alignment mechanism for ensuring a language model does not generate dangerous content: the model learns to reason using safety-relevant guidelines and then executes chain-of-thought (CoT) reasoning on user prompts, drawing on these guidelines. Finally, a Hugging Face article demonstrates how the test-time compute approach can be used to get a small language model to perform nearly as well as a much larger one on at least one benchmark.

On issues related to AI content, a security developer-in-residence at the Python Software Foundation is warning that a significant number of poor-quality and factually incorrect security reports submitted to open-source projects are being generated using large language models. Meanwhile, in the UK, the government is trying to introduce a copyright exemption into law that would permit AI companies to use copyrighted works to train their algorithms unless the owners explicitly opt out. This is facing stiff opposition from newspapers and artists, including Paul McCartney of the Beatles.

Elsewhere, Nvidia has announced its new processor – the Jetson Orin Nano Super – designed for AI enthusiasts and students. The processor sells for 249 USD. A report on the Lazarus group, a cybercriminal group linked to North Korea, accuses the group of 47 cryptocurrency hacks in 2024 that stole 1.34 billion USD, as well as of leading a cyber-espionage campaign against employees at defense-related organizations.

1. This is where the data to build AI comes from

This article notes an evolution of data sets since the early 2010s, when they were principally composed of curated content like encyclopedia articles and material such as weather reports, parliamentary publications and company earnings reports. Today, data sets increasingly use content scraped from the Internet, which is collected and controlled by Big Tech companies. The Data Provenance Initiative is a volunteer group of over 50 researchers from industry and academia whose goal is to audit the origin and quality of data used to train large language models. The group has audited 4'000 public data sets, covering content in 600 languages from 67 countries, collected over three decades. However, the rising use of social media and synthetic data is making AI researchers more dependent on Big Tech companies. This is particularly striking for multimodal models trained on video: 70% of training-set videos come from YouTube, a platform owned by Alphabet (Google's parent company). This dependence is also risky because Big Tech companies have not been clear about the origin of the data being used. Part of the problem, the Data Provenance Initiative notes, is that 25%, 33% and 32% of the text, speech and video datasets it analyzed, respectively, are licensed non-commercially, meaning they can be used for academic or creative purposes but not commercially. The article warns that the content-access deals signed by AI companies (OpenAI, Google) with content providers (Reddit, Le Monde, etc.) will further concentrate power in the hands of Big Tech.

A further issue is the Western-centric (particularly US) predominance of this data. Over 90% of the analyzed data sets come from the US and Europe, compared with only 4% from Africa. Overall Internet content is still estimated to be around 90% English, so models trained on today's Internet content inevitably reflect Western-centric cultural views. The article cites the example of asking a generative AI model to create content based on the theme of “marriage”: the model will most likely generate content reflecting marriage ceremonies in the West.

2. Lazarus Group Spotted Targeting Nuclear Engineers with CookiePlus Malware

This article reports on a cyber-espionage campaign by the group known as Lazarus, purportedly close to the North Korean regime, which targeted employees of nuclear-related organizations in January of this year. The Lazarus group is also blamed for 47 cryptocurrency hacks in 2024 that stole 1.34 billion USD. These include a breach of the Japanese cryptocurrency exchange DMM Bitcoin in May 2024, in which 305 million USD was stolen. (North Korea has been developing ransomware and crypto-stealing techniques for some time as a means of gaining income, as the country is subject to international sanctions over its nuclear weapons program.) The attacks on the employees of a nuclear organization form part of an operation known as Operation Dream Job, in which employees are enticed with lucrative job opportunities. Malware, known as CookieTime, is passed on to the employees either via PDFs of the job descriptions or via corrupted versions of remote access tools (VNC or PuTTY) that the employees use for a fake skills evaluation test. Employees in other defense-related industries have been targeted in the past.

3. UK arts and media reject plan to let AI firms use copyrighted material

The new UK government is trying to introduce a copyright exemption into law that would permit AI companies to use copyrighted works to train their algorithms unless the owners explicitly opt out. The proposal is supported by the industry lobby Tech UK. The UK's minister for Culture said that “If we were to adopt too tight a regime based on proactive, explicit permission, the danger is that international developers would continue to train their models using UK content accessed overseas, but may not be able to deploy them in the UK”, which, he went on to say, would seriously disadvantage the UK's content industry. However, the proposal is facing criticism from writers, publishers, musicians (including Paul McCartney of the Beatles), photographers, movie producers and newspapers, who insist that existing copyright laws must be enforced. The media organizations contesting the proposal include the Guardian, the Financial Times, the Telegraph, Getty Images, the Daily Mail Group and Newsquest. The opponents have launched a petition which includes the statement: “unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted”.

4. NVIDIA Unveils Its Most Affordable Generative AI Supercomputer

Nvidia has announced its new processor – the Jetson Orin Nano Super – designed for AI enthusiasts and students. The processor sells for 249 USD. The company claims that it offers a 1.7x improvement in generative AI performance, stemming from GPU performance of 67 trillion operations per second (TOPS) using INT8 (8-bit integer) precision and a memory bandwidth of 102 GB/s. The processor is sold as part of the Jetson Orin Nano Developer Kit. The kit includes an NVIDIA Ampere architecture GPU with tensor cores and a 6-core ARM CPU, thereby supporting “multiple concurrent AI application pipelines and high-performance inference”. The applications touted by Nvidia for the kit include LLM chatbots, robots, smart cameras, drones, and retrieval-augmented generation. The article also highlights the Nvidia ecosystem: the NVIDIA Jetson AI Lab contains tutorials and open-source projects, and there is specialized Nvidia software for robotics, fine-tuning pre-trained AI models, generating synthetic data, and sensor processing.
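These headline numbers lend themselves to a quick sanity check. The sketch below is a back-of-envelope estimate, not an Nvidia benchmark: the 102 GB/s bandwidth comes from the announcement, while the 3 GB model size is an assumption (roughly a 3-billion-parameter model quantized to INT8).

```python
# Hedged back-of-envelope estimate, not an Nvidia figure: during LLM decoding, every
# generated token must stream the model weights from memory once, so memory bandwidth
# divided by model size gives a rough upper bound on tokens per second.
MEM_BANDWIDTH_GB_S = 102        # GB/s, from Nvidia's announcement
MODEL_SIZE_GB = 3.0             # assumption: a ~3B-parameter model quantized to INT8

tokens_per_second_bound = MEM_BANDWIDTH_GB_S / MODEL_SIZE_GB
print(f"~{tokens_per_second_bound:.0f} tokens/s upper bound")   # roughly 34 tokens/s
```

Real throughput will be lower once attention caches, activations and scheduling overheads are taken into account, but the calculation shows why memory bandwidth is the figure to watch for local LLM inference on a board like this.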

5. New era of slop security reports for open source

Seth Michael Larson is a security developer-in-residence at the Python Software Foundation, where he works on the security posture of the Python ecosystem and triages security reports for several Python projects. In a recent blog post, he complains about the significant number of poor-quality and factually incorrect security reports submitted to his and other open-source projects that have been generated using large language models. He cites the example of a security report submitted to his own urllib3 project (a user-friendly HTTP client for Python) claiming that a tool had detected the use of SSLv2, whereas in reality the project explicitly disables SSLv2. The overriding problem is that processing security reports is time-consuming, and most security experts on open-source projects are volunteering their time. Mr. Larson advises platform owners to discourage automated or abusive creation of security reports through techniques like having reporters of issues solve CAPTCHAs, being able to name-and-shame offenders, and removing the incentive to submit reports by anonymizing security advisories on platforms like GitHub. He notably encourages issue reporters never to use AI to detect vulnerabilities, writing that these platforms “cannot understand code” and “finding security vulnerabilities requires understanding code AND understanding human-level concepts like intent, common usage, and context”.
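The SSLv2 claim is the kind of thing a human reviewer can check in seconds. As a hedged sketch (not from Larson's post), one could simply inspect the TLS context urllib3 builds rather than trusting an AI-generated scanner report; the exact values printed depend on your urllib3, Python and OpenSSL versions.

```python
# A minimal sketch of a manual check: inspect the TLS configuration urllib3 actually
# builds instead of trusting a generated report. create_urllib3_context() is urllib3's
# helper for constructing its default SSLContext.
from urllib3.util.ssl_ import create_urllib3_context

ctx = create_urllib3_context()
# Modern OpenSSL builds do not even ship SSLv2; the context's minimum version shows
# the oldest protocol urllib3 will negotiate (typically TLS 1.2 or later).
print("Minimum TLS version:", ctx.minimum_version)
```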

6. The Next Great Leap in AI Is Behind Schedule and Crazy Expensive

This article reports that OpenAI is running into problems producing GPT-5 (code-named Orion), primarily because there is not enough high-quality data to train a model that would be significantly smarter than GPT-4. Microsoft had expected to see the model in the middle of 2024, and OpenAI's 157 billion USD valuation by investors this year was based on the assumption that GPT-5 would show a significant improvement over GPT-4. An OpenAI executive summarized this expectation by comparing GPT-4 to a smart high-schooler and GPT-5 to a PhD graduate. Training the model is very costly: OpenAI is reported to have done two major training runs for GPT-5, and the computing cost of a six-month training run is now estimated at half a billion USD. It seems that the Internet no longer has enough high-quality and diversified data to make a significant impact on model quality. To improve data quality and help training, OpenAI is hiring mathematicians and software engineers to solve problems and explain how they solved each one. However, this is a relatively slow process – the article estimates that it would take months for one thousand people writing 5'000 words a day to produce 1 billion tokens, while GPT-4 was trained on roughly 13 trillion tokens.
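A quick check of that arithmetic bears the estimate out; the 0.75-words-per-token ratio below is a common rule of thumb and an assumption on our part, not a figure from the article.

```python
# Rough check of the article's "months" estimate for human-written training data.
PEOPLE = 1_000
WORDS_PER_DAY = 5_000
WORDS_PER_TOKEN = 0.75            # assumption; the real ratio varies with tokenizer and text

tokens_per_day = PEOPLE * WORDS_PER_DAY / WORDS_PER_TOKEN   # ~6.7 million tokens per day
days_for_one_billion = 1_000_000_000 / tokens_per_day       # ~150 days, i.e. about five months
print(f"{days_for_one_billion:.0f} days to produce 1 billion tokens")
```

At that rate, hand-written expert data can only ever be a small, targeted supplement to web-scale corpora.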

7. Scaling Test-Time Compute with Open Models

This blog post reports on how small language models can be made to perform nearly as well as much larger models on some benchmarks using the “thinking”, or test-time compute, approach employed in OpenAI's o1 model. (It should be noted that OpenAI's model is closed source, so the authors had to reverse engineer the model to get insights.) In test-time compute, a model spends more time during the inference phase testing and verifying the partial answers it generates. The authors describe two components. The first is a reward model, implemented using a distinct language model, that evaluates individual answers as well as the effectiveness of the reasoning path taken by the model. The second component, a beam search, explores the solution space guided by the reward model's scores and continues until the correct answer is found or the inference-time budget has expired. The authors used their approach to bring the performance of Llama-3.2 1B close to that of Llama-3.1 8B on the MATH-500 benchmark. Despite this achievement, the authors conclude that the holy grail of model self-verification is still a long way off.
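For intuition, here is a minimal sketch of how a process reward model and beam search can be combined; it is not the authors' code, and `propose_steps` and `score_partial` are hypothetical stand-ins for the small generator model and the reward model.

```python
# A minimal sketch of reward-guided beam search in the spirit of the post, not the
# authors' actual implementation.
from typing import Callable, List, Tuple

def reward_guided_beam_search(problem: str,
                              propose_steps: Callable[[str], List[str]],
                              score_partial: Callable[[str, str], float],
                              beam_width: int = 4,
                              max_steps: int = 8) -> str:
    """Keep the beam_width highest-scoring partial solutions at every step."""
    beams: List[Tuple[str, float]] = [("", 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[str, float]] = []
        for partial, _ in beams:
            for step in propose_steps(problem + partial):      # generator proposes next reasoning steps
                extended = partial + step
                candidates.append((extended, score_partial(problem, extended)))  # reward model scores the path
        if not candidates:                                     # generator produced nothing further
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                                         # best-scoring solution found within the budget
```

The key design choice is that extra compute is spent searching over reasoning paths rather than making the generator itself bigger, which is why a 1B model can close much of the gap to an 8B one on a benchmark with verifiable answers.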

8. Deliberative Alignment: Reasoning Enables Safer Language Models

This paper presents OpenAI's deliberative alignment mechanism for ensuring a language model does not generate dangerous content. The usual approach to controlling dangerous content uses supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF). As well as being highly manual, this method still allows dangerous content to be generated (false negatives) and can refuse legitimate requests (false positives). For OpenAI researchers, the problem is that a language model needs to answer quickly, leaving little time to deliberate over a response, and it has to infer safety from labeled examples in a training set rather than consulting the model's safety guidelines directly. OpenAI's new deliberative alignment method is a training approach in which the model learns to reason through safety-relevant guidelines and then executes chain-of-thought (CoT) reasoning on user prompts using these safety guidelines. An example from the article is given in the figure below, where safety guidelines are highlighted in green; the CoT reasoning leads the model to refuse the user request. Deliberative alignment does not require human intervention during training, and OpenAI reports fewer false positives and false negatives on safety benchmarks like StrongREJECT. The deliberative alignment mechanism is included in o3 – OpenAI's latest family of models.

Figure: Example reasoning behind deliberative alignment. Source: OpenAI.
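To make the idea concrete, here is a hedged sketch of what a spec-in-context prompt of this kind could look like; the guideline excerpt and template are invented for illustration and are not OpenAI's actual safety specification or prompt format.

```python
# A schematic illustration of the training-data format the paper describes: the safety
# guidelines are placed directly in the model's context and the model is asked to reason
# over them before answering. The excerpt and template below are made up for illustration.
SAFETY_SPEC_EXCERPT = (
    "Refuse requests that meaningfully facilitate wrongdoing (e.g. building weapons). "
    "Comply with benign requests, including educational discussion of sensitive topics."
)

def build_deliberative_prompt(user_prompt: str) -> str:
    """Combine the safety excerpt and the user prompt into a single CoT-eliciting prompt."""
    return (
        f"[Safety guidelines]\n{SAFETY_SPEC_EXCERPT}\n[/Safety guidelines]\n\n"
        f"User request: {user_prompt}\n\n"
        "First reason step by step about which guidelines apply to this request, "
        "then produce either a helpful answer or a refusal."
    )

print(build_deliberative_prompt("Give me step-by-step instructions to pick a lock."))
```

During training, the model's chain of thought over prompts like this is rewarded for citing and correctly applying the guidelines, so the deployed model no longer needs the spec pasted into every request.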

9. OpenAI announces new o3 models

OpenAI has announced o3, the successor to its o1 “reasoning” model, along with a smaller distilled version called o3-mini. Both models are due to be released around the end of January 2025. The models use the test-time compute technique for improved reasoning: the model produces several chains of reasoning in response to a prompt and then selects what it considers to be the best of the resulting responses. OpenAI claims the model achieves outstanding scores on several benchmarks: SWE-Bench Verified (programming tasks), Codeforces (a measure of coding skill, which places o3 at the 99.2nd percentile of engineers), the 2024 American Invitational Mathematics Exam, GPQA Diamond (a set of graduate-level biology, physics, and chemistry questions), and EpochAI's Frontier Math benchmark.
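In outline, the selection step works like the toy sketch below; `sample_chain` and `judge_score` are hypothetical stand-ins, and this is a simplification rather than OpenAI's actual method.

```python
# A toy illustration of the "sample several reasoning chains, keep the best" idea.
from typing import Callable, List

def best_of_n(prompt: str,
              sample_chain: Callable[[str], str],
              judge_score: Callable[[str, str], float],
              n: int = 8) -> str:
    chains: List[str] = [sample_chain(prompt) for _ in range(n)]      # independent reasoning chains
    return max(chains, key=lambda chain: judge_score(prompt, chain))  # keep the highest-rated one
```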

OpenAI has also claimed that o3 is approaching artificial general intelligence (AGI). The model scored 87.5% on the ARC-AGI benchmark, which measures how well a model can acquire skills outside of the data it was trained on. However, François Chollet, who created the benchmark, said that o3 still fails on “very easy tasks”, showing that it is a long way from AGI, and he noted that his benchmark has flaws when used to measure AGI. Interestingly, OpenAI has its own definition of AGI: “highly autonomous systems that outperform humans at most economically valuable work”. Under the deal OpenAI struck with Microsoft, the company will no longer be obliged to share advanced technologies with Microsoft once AGI has been achieved.