AI Models Still Not Expert Coders

AI Slop generating high revenue

Posted on November 7th, 2024

Summary

On the technical side this week, OpenAI has been in the news with the announcement of ChatGPT Search, which searches the Web to resolve queries that require information post-dating GPT-4o’s training cutoff (October 2023). An InfoQ podcast with a Small Language Model (SLM) expert argues that a workflow environment composed of several SLMs is more cost-effective than using a single LLM. An SLM has fewer than eight billion quantized parameters, making it possible to run on standard PCs.

An article from 404 Media reports that 1 million businesses are creating more than 15 million ads per month using generative AI on Meta platforms, and that Meta’s recommendation algorithm promotes this content to maximize revenue. Elsewhere, OpenAI has announced an updated Realtime API platform that enables developers to build voice features into their applications. This is seen as another step towards support for AI agents, though challenges remain around reasoning and integration with existing apps.

An article from Purdue University argues that language models still fall short at effective code generation. The authors contend that current coding benchmarks are too simplistic: they lack real-world code completion tasks, realistic task complexity, and reliable correctness evaluation metrics. The authors propose a new benchmark and show that state-of-the-art language models generate correct code only around 27% of the time.

In cybersecurity, a WIRED article describes the current criminal ecosystem. First come the malware providers, with infostealer malware that lifts passwords and cookies from user browsers being especially popular. Then come the “traffers”, who are responsible for finding criminals to use the malware; they typically share the profits of the malware they sell with the hackers who use it. Also, 2024 is set to be another record-breaking year for ransomware. Among the current trends, criminals are less inclined to encrypt stolen data, believing that theft alone is sufficient leverage to demand a ransom payment.

Australia will propose a law banning children under 16 from using the social media platforms Facebook, Instagram, TikTok, X and YouTube. Meanwhile, LinkedIn has been fined 310 million EUR by the Irish Data Protection Commission for failing to obtain clear consent for processing user data for targeted advertising purposes.

1. OpenAI brings a new web search tool to ChatGPT

OpenAI has announced ChatGPT Search – a search engine extension to ChatGPT. The chatbot automatically searches the Web for multimedia results to any query that requires information more recent than the chatbot’s training cutoff (October 2023 for GPT-4o). OpenAI’s search engine relies on data from OpenAI’s official content partners (e.g., Reuters, The Atlantic, Le Monde, the Financial Times, Axel Springer, Condé Nast, Time) as well as any site that does not block OpenAI’s web crawler. ChatGPT Search will challenge similar tools from Google, Microsoft, Perplexity and, soon, Meta. One feature of ChatGPT Search is that searches initiated by the chatbot can be complemented with information from a user’s chat history, giving the search engine better context. For OpenAI, the search-extended chatbot is a step towards “agentic” behavior, where the AI can take actions on behalf of the user, such as reserving a hotel room.

In tests conducted by MIT Technology Review, the authors found that ChatGPT Search could still hallucinate, especially when presented with incorrect premises, as the chatbot tries to produce plausible results. The tool has a long way to go to challenge Google’s market dominance in search, currently estimated at 90%. One challenge it faces is the disruption it might cause to the existing search engine optimization market: chatbots fundamentally draw people away from websites, which until now have earned money from user visits. ChatGPT Search is now available to paying ChatGPT customers.
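
For site owners, opting out of the crawl is a robots.txt matter. A minimal example blocking the crawler user-agents OpenAI documents (GPTBot for model training, OAI-SearchBot for search results) might look like the sketch below; verify the current user-agent names against OpenAI’s documentation.

```
# robots.txt -- opt out of OpenAI's crawlers. OpenAI documents separate
# user-agents for training (GPTBot) and for search (OAI-SearchBot).
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```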

2. Namee Oberst on Small Language Models and How They are Enabling AI-Powered PCs

This InfoQ podcast interviews Namee Oberst, the founder of AI Bloks, on the emergence of Small Language Models (SLMs). An SLM is a model small enough to run on a standard PC or edge computing device. A key advantage of such a model is that, by running within the confines of an organization, there is less risk of organizational data leaking. Practically speaking, an SLM has fewer than eight billion quantized parameters – though this number may increase as PCs get more powerful. Apple Intelligence, for instance, uses a 3 billion parameter model on devices. One application of SLMs is the so-called AI-powered PC, where tasks like searching for documents, transcribing audio, and generating SQL queries can be handled locally. Oberst argues that SLMs are better for auditing, since the smaller model size makes the model’s decision process in response to a request easier to follow. Further, since an SLM is designed for a specific task, the scope for prompt injection attacks is smaller. Finally, the argument is made that a workflow environment composed of several SLMs is both more cost-effective and better performing than a single LLM. Oberst recommends the Microsoft Phi series as a good starting point for organizations wishing to experiment with SLMs.
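
As a concrete illustration of the AI-powered PC use cases described above, the sketch below runs a small instruct model locally with the Hugging Face transformers library. The Phi-3-mini checkpoint name is a real published model in the Phi series Oberst mentions; the prompt, generation settings, and loading flags are illustrative assumptions, not the podcast’s setup.

```python
# Minimal sketch: running a small language model locally on a standard PC.
# Assumes the Hugging Face "transformers" library (pip install transformers
# accelerate) and the ~3.8B-parameter Phi-3-mini checkpoint; any similarly
# sized instruct model can be swapped in.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",       # uses a GPU if present, otherwise the CPU
    trust_remote_code=True,  # some Phi-3 checkpoints ship custom model code
)

# One of the example tasks from the podcast: generating a SQL query.
prompt = "Write a SQL query that returns the ten largest orders by total value."
result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```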

3. Inside the Massive Crime Industry That’s Hacking Billion-Dollar Companies

This WIRED article looks at the structure of the current cybercriminal ecosystem. The starting point for criminals is the theft of access credentials. The primary technique for this is infostealer malware, which steals passwords and cookies from user browsers. Infostealers are generally distributed by hiding them in free software, but also in content on social media sites like YouTube, TikTok and even GitHub. The article cites the case of a hacker, known as Dark X, who exfiltrated emails, addresses, phone numbers, and partial credit card numbers of 350 million Hot Topic customers after stealing the access credentials of one developer’s Snowflake account. Infostealer malware creators are engaged in a cat-and-mouse game with browser developers. One malware developer is quoted as saying, “We are professionals in our field and will continue to work on bypassing future Google updates”. One technique Google uses in response is to deliver software updates piece by piece, so that hackers have less time to develop bypasses for each released update. The most popular infostealers today are RedLine, Nexus, Aurora, META, and Raccoon. The original goal of infostealers was to steal credentials for cryptocurrency wallets, but the market for general credentials has since evolved. The security firm Recorded Future says it sees 250,000 new infostealer infections every day.

After the malware developers, another group of actors are the “traffers”. These people are responsible for bringing criminals to the tools, and also for hiring new malware developers. They operate on underground forums like Lolz and use names like “Billionaire Boys Club” to attract attention. Their business model is generally to share the profits of the malware they sell with the hackers who use it. Stolen credentials are then sold on underground marketplaces like Genesis Market (now shut down) and Russian Market. Many credentials are also sold on Telegram channels.

4. 2024 looks set to be another record-breaking year for ransomware — and it’s likely going to get worse

This article reports that 2024 will be another record-breaking year for ransomware attacks, despite some successes by police authorities. One recent case saw the US-based Change Healthcare pay 22 million USD to the Russian cybercriminal group ALPHV after the theft of personal data belonging to more than 100 million people. The data included names and addresses, dates of birth, phone numbers and email addresses, as well as government identity documents including Social Security, driver’s license and passport numbers. Another trend is the growing number of young hackers attempting ransomware scams, which results in a high volume of uncoordinated attacks. Among the evolutions observed, ransomware criminals are less inclined to encrypt data, believing that the theft of the data is sufficient for the victim to pay a ransom. Another trend is the increased use of physical violence against victims who refuse to pay ransoms. Among the progress against ransomware noted by analysts is the takedown of the Russian criminal gang LockBit led by UK police authorities, as well as the intelligence-sharing task force set up by President Biden. That said, one analyst is quoted as saying that the only measure likely to be effective at this point is a ban on ransom payments.

5. Zuckerberg: The AI Slop Will Continue Until Morale Improves

In this 404 Media article, the author uses the term AI Slop to refer to the generative AI content produced on Facebook and other Meta platforms, content that has become a genuine source of revenue. The author outlines three phases of content recommendation on Facebook. In the initial phase, the content seen by a user consisted of posts from friends in their network. A second phase introduced content from paying content providers, which meant that much of what Facebook and Instagram users saw was no longer from friends but was content that Meta’s recommendation algorithm calculated as most likely to capture a user’s interest and time. This second phase has led to a third, in which generative AI is used to create content that gets promoted by Meta’s recommendation algorithm. Meta admits that 1 million businesses are creating more than 15 million ads per month using generative AI. The author uses the terms AI Spam and AI Influencers to refer to this content and its creators.

6. How ChatGPT search paves the way for AI agents

Agents are featuring prominently in the final months of 2024. A software agent is the next generation of today’s chatbot assistants: on a person’s behalf, an agent can organize a vacation where the hotel reservations match the person’s taste and budget, and where the itinerary and suggested clothing suit the expected weather. Voice recognition is seen as a key part of the agent experience, and OpenAI has just announced an updated Realtime API platform that enables developers to build voice features into their applications. Google also has an AI agent called Astra, and Anthropic’s Claude 3.5 has agent features. A second type of agent mentioned is the embodied agent: an agent within a robot that does house-cleaning, for instance, or within a video game.
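
As a rough illustration of the voice-feature workflow, the sketch below opens a WebSocket session against OpenAI’s Realtime API and requests a response. The endpoint URL, beta header, and JSON event names are taken from the October 2024 announcement and should be treated as assumptions; verify them against the current API reference before use.

```python
# Hedged sketch: connecting to OpenAI's Realtime API over WebSocket.
# URL, header, and event shapes follow the launch announcement and may
# have changed -- check the current documentation.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def main():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta header per the announcement
    }
    # Note: newer releases of the websockets library rename this keyword
    # to additional_headers.
    async with websockets.connect(URL, extra_headers=headers) as ws:
        # Ask the model for a response (text here; audio works similarly
        # by requesting the "audio" modality and streaming audio deltas).
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"],
                         "instructions": "Say hello to the user."},
        }))
        # Stream server events until the response is complete.
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```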

The article notes several hurdles to agent adoption. The first is reasoning – the ability to understand complex tasks. OpenAI claims progress in this area with its recent o1 model, which uses reinforcement learning for “chain of thought” reasoning; nonetheless, models are still very limited at reasoning. A second hurdle is integrating the agent with desktop and Web tools. A final hurdle is the model’s context window – the amount of data a model can process at once. Context windows are limited in size, and this effectively limits the number of tasks that an autonomous agent can carry out.
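
To make the context-window constraint concrete, the snippet below counts the tokens a prompt consumes using OpenAI’s tiktoken library; if the accumulated prompts and outputs of a long-running agent approach the window, state must be truncated or summarized. The window size shown is illustrative.

```python
# Counting prompt tokens with tiktoken (pip install tiktoken) to check
# against a model's context window. Requires a recent tiktoken release
# that knows the gpt-4o encoding.
import tiktoken

CONTEXT_WINDOW = 128_000  # illustrative: GPT-4o's advertised window

enc = tiktoken.encoding_for_model("gpt-4o")
prompt = "Plan a three-day trip to Lisbon within a 600 EUR budget."
n_tokens = len(enc.encode(prompt))

print(f"{n_tokens} tokens used, {CONTEXT_WINDOW - n_tokens} remaining")
```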

7. Can Language Models Replace Programmers? REPOCOD Says ‘Not Yet’

Software code generation is one of the domains where language models are seen as effective, and several published works give models over 90% pass@1 scores for Python coding problems (pass@1 scores indicate how often a model generates correct code on the first attempt). This article from Purdue University seriously questions these existing measures, finding that current models achieve a maximum pass@1 score of 27.35% on more realistic tasks. Fundamentally, the authors argue that current coding benchmarks are too simplistic: they lack real-world code completion tasks, realistic task complexity, and reliable correctness evaluation metrics. Many benchmarks are restricted to single-line code or short functions. In particular, realistic code completion tasks require generating code with dependencies on other functions, files, and classes in the project’s repository, and many existing benchmarks have no repository-level context. The authors propose a new benchmark called REPOCOD to evaluate complex code generation. The benchmark includes 980 code generation problems collected from 11 popular projects and contains developer-written test cases to validate the correctness of LLM-generated functions. The target functions average 331.6 tokens in length, making for more complex code samples than in earlier benchmarks, and one quarter of the benchmark cases require repository-level context. While REPOCOD gives a maximum pass@1 score of 27.35%, performance is lower still for cases requiring repository context. The models evaluated include GPT-4o, GPT-4o-mini, DeepSeek-V2.5, Claude 3.5 Sonnet, CodeLlama, and DeepSeek-Coder.
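
For reference, pass@k is usually computed with the unbiased estimator introduced in the original HumanEval paper (Chen et al., 2021); a minimal version is sketched below, where n generations are sampled per problem and c of them pass the tests. The example numbers are illustrative, not REPOCOD’s.

```python
# Unbiased pass@k estimator: the probability that at least one of k
# samples drawn from n generations (c of which are correct) passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every possible size-k draw contains a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 10 generations per problem, 3 correct -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))
```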

8. Australia proposes 'world-leading' ban on social media for children under 16

The Australian government is proposing a law to ban social media access for children under 16 years of age. Australia would be the first country to introduce such a ban. The legislation is expected to pass through Parliament within the next year, and it has already received the support of the opposition Liberal party. Proponents of the ban cite the excess of harmful body-image content aimed at girls and misogynist content aimed at boys. The ban applies even where parents consent, and the social media platforms will have to demonstrate that they are taking steps to enforce it. The platforms concerned include Facebook, Instagram, TikTok, X and YouTube. Opponents of the ban say that children will look for content in alternative, and possibly “darker”, corners of the Internet. They say a more appropriate solution is to cultivate digital literacy and to encourage social media platforms to develop digital content spaces better suited to children.

9. Microsoft-owned LinkedIn fined €310m by Irish Data Protection Commission

LinkedIn has been fined 310 million EUR by the Irish Data Protection Commission for failure to comply with the GDPR. It is the fifth-largest fine the Irish commission has issued under the GDPR and the sixth-largest by any EU data protection authority since the regulation came into effect in 2018. The GDPR stipulates that consent to process personal data must be freely given, specific, informed and unambiguous, and the Irish commission found that the consent LinkedIn obtained from users did not meet these criteria in relation to the processing of user data for targeted advertising. Revenue for the Irish unit of LinkedIn was around 5 billion EUR in 2022, with pre-tax profits estimated at 93 million EUR. In any case, The Irish Times reports that Microsoft, LinkedIn’s parent company, has already indemnified LinkedIn for the fine, limiting its financial impact on the company.