Sycophantic AI Agents

Chain-of-Thought Models Witholding Thoughts

Posted on April 11th, 2025

Summary

Audio Summmary

Several studies have appeared recently on the effectiveness and safety of AI agents. Research from Anthropic is questioning the ability of Chain-of-Thought (CoT) reasoning models to explain the intermediate steps of their “thinking process”. The researchers found that Claude models and DeepSeek’s R1 often hide their reasoning decisions from the user. Another study calls for AI socio-affective alignment to oversee how the “social and psychological ecosystem” created by AI companions impacts humans. The researchers identify social reward hacking” as an issue, where AI companions engage humans in such a way as to serve the interests of the company operating the companion, for example, through disclosure of personal information. In a survey conducted by Elon University of over 300 Tech leaders and academics, only 16% of those questions expect mostly positive changes from AI. Some noted that agents may soon be delegated “acts of kindness, emotional support, caregiving and charity fundraising”, which can seriously undermine a person’s empathy. In another study, psychologists at Dartmouth University conducted a clinical trial on the use of a therapy bot to treat depression, generalized anxiety and eating disorders. The study reveals that participants with depression experienced a 51% reduction in symptoms when using a bot trained by the researchers. The study nevertheless warns against the use of non-specialized bots for treating mental illness.

Elsewhere, researchers at Stanford and Washington Universities have proposed a technique to determine if a large language model is trained with particular (copyrighted) content. The research seems to suggest that GPT-4, one of the models experimented with in the study, has memorized copyrighted content. The Australian government is currently evaluating age assurance technology to implement the social media ban for minors. There is some uncertainty about the technology that will be used to impose the ban, and also whether the social media platforms will respect the ban as some Big Tech companies may feel emboldened by Donald Trump’s rhetoric.

Meta has released its Llama 4 collection of models. The models are not “reasoning” models, but rather use the mixture of experts (MoE) architecture, where a request is broken down into smaller tasks and delegated to smaller “expert” models. Businesses in the EU are prohibited from using or distributing the models. This is understood to be motivated by Meta’s dislike of the EU’s AI regulation. Alphabet has announced a project that uses AI to streamline and speedup the application process for connection to electricity grids. The rise of AI has led to a significant increase on demands on the electricity grid, and the bottleneck is not the creation of electricity at the source, but the lack of infrastructure or administrative validation to connect power-stations.

On the cybersecurity front, an MIT Technology Review article warns that the number of AI agent attacks may significantly increase this year as cybercriminals become more aware of their potential. An opinion article in the Hacker News argues that most metrics currently used in cybersecurity like Mean Time to Detect (MTTD), number of patches applied, or number of scans are vanity metrics. They indicate activity but do not, as metrics should, correlate to business risks or permit prioritization of security efforts. The author advises the use of management frameworks like Continuous Threat Exposure Management which according to Gartner, can help organizations reduce breaches by two-thirds.

1. Cyberattacks by AI agents are coming

As agents continue to dominate the research and development agendas of Big Tech, this MIT Technology article examines whether agents will become soon become the primary means of launching a cyberattack. There is some consensus that, until now at least, AI has not led to the development of new forms of cyberattack but rather has served to accelerate the deployment of existing forms of attack. This can change however with agents being delegated the task to “go attack that system”. The advantage for cyber-criminals of an agent over the currently used bots is the ability of the agent to adapt as it encounters problems during attack execution. Research at Illinois Urbana-Champaign University has shown that AI agents can exploit 13% of vulnerabilities for which they have no prior knowledge, and 25% when they are provided with descriptions of the vulnerabilities. Palisade Research has developed an AI agent honeypot whose goal is to detect AI agent “attackers”. It works by forcing a prompt injection attack back on the agent, by telling the agent to execute the command “cat8193”. Only an AI agent would respond to this command, no bot would respond. The honeypot went live last October. Out of the 11 million attempts to access it, eight accesses were potentially AI agents. Many believe that the number of AI agent attacks will significantly increase this year as cybercriminals become more aware of their potential.

2. The first trial of generative AI therapy shows it might help with depression

Psychiatric researchers and psychologists at Dartmouth University have conducted a clinical trial involving 210 patients on the use of a therapy bot to treat depression, generalized anxiety and eating disorders. The study reveals that participants with depression experienced a 51% reduction in symptoms, those with anxiety experienced a 31% reduction, and those at risk for eating disorders saw a 19% reduction in concerns about body image and weight. One noticeable feature of the study was the prolonged level of engagement of patients with the bot with conversations averaging at 10 messages per day. Dartmouth developed its own bot, called Therabot, that was trained on custom datasets.

In the US alone, less than half of people with a mental disorder receive therapy, and those receiving therapy perhaps only get 45 minutes per week. This has created a space for companies marketing therapeutic agents. The article warns against such agents. One problem is that they are probably trained on content from Internet forums which have excess simplifications like “your mother is the root of your problem”, and they lack the collaborative relationship of a real therapist. For instance, an agent might give advice to a patient on how to lose weight, without verifying first about the current weight of the patient, or about his or her motivations for losing weight.

3. Meta releases Llama 4, a new crop of flagship AI models

Meta has released its new Llama 4 collection of models. The models are not “reasoning” models, but rather use the mixture of experts (MoE) architecture, where a request is broken down into smaller tasks and delegated to smaller “expert” models. The Llama 4 Maverick model architecture has 400 billion total parameters, but only 17 billion might be active at any time, and is structured using 128 expert models. Llama 4 Scout has 17 billion active parameters, 16 expert models, and 109 billion total parameters. Maverick is designed for “general assistant and chat”, and Meta claims it outperforms OpenAI’s GPT-4o and Google’s Gemini 2.0 on some coding, reasoning, and image generation benchmarks. However, it does not perform as well as Google’s Gemini 2.5 Pro, Anthropic’s Claude 3.7 Sonnet, and OpenAI’s GPT-4.5. Llama 4 Scout is designed for document summarization and reasoning over large codebases. According to Meta, Scout can run on a single Nvidia H100 GPU, while Maverick requires an Nvidia H100 DGX system or equivalent processor. A third model, Llama 4 Behemoth, is still in training. The AI Assistant on WhatsApp, Messenger, and Instagram, has been updated with Llama 4 in 40 countries.

There are three interesting points around the release of Llama 4. First, the release was certainly precipitated by the arrival of DeepSeek’s R1 model. Meta reportedly created “war rooms” to understand how the cost of R1 training and operation was so low. Second, businesses in the EU are prohibited from using or distributing the models. This is understood to be motivated by Meta’s dislike of the EU’s AI regulation. Third, the models are less strict about the queries that the model replies to. This is believed to be in response to criticism from the White House AI Tsar who accused Tech companies of developing AI that was too “woke”.

4. Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models

This research from Stanford and Washington Universities proposes a technique that can be used to determine whether a large language model was trained with particular (copyrighted) content. One of the current debates around large models is whether they were trained on copyrighted content – and whether this would be legal. Several court cases are currently ongoing where authors are suing Big Tech firms for having used their content to train AI models without permission. For the moment, Big Tech firms have generally refused to say whether they used copyrighted content or not. The technique presented in this research is based on the observation that when a model reconstructs a text (from a book, say), it does so using data in its context or else from memorization of training data. The technique therefore probes a model using text where answers are difficult to reconstruct using context (e.g., by using the name of a minor character in a novel). This reduces the probability that the model generates a response based on context, leaving only memorization as the means for the model to create the response. The research seems to suggest that GPT-4, one of the models experimented with in the research study, has memorized copyrighted content.

5. Australia’s social media ban is attracting global praise – but we’re no closer to knowing how it would work

The Australian government is currently evaluating age assurance technology in order to implement the social media ban for minors (under 16 years of age) voted last year. The ban comes into effect next December, but there is still some uncertainty about the technology that will be used to impose the ban, and also whether the social media platforms will respect the ban. There is a feeling that Big Tech companies may feel emboldened by Donald Trump rhetoric, and Meta is believed to have approached Trump for support. Also, TikTok and Meta are unhappy about concessions made to Youtube (which Australia argues is because of the large amount of educational content on Youtube). A government report has found that 80% of children between the ages of 8 and 12 are using social media, even though the minimum allowed age is 13. This number however does not consider children using social media through their parents’ accounts, or using a platform without having logged in.

6. Don’t believe reasoning models’ Chains of Thought, says Anthropic

Chain-of-Thought (CoT) reasoning models explain the intermediate steps of their “thinking process”, an approach designed with the goal of reducing the risk of hallucination through transparency of reasoning. However, research from Anthropic suggests that the effectiveness of CoT to improve transparency may be overrated. In experiments with Claude 3.7 Sonnet and DeepSeek’s R1, researchers found that the model often hid aspects of their thought processes. The researchers fed different types of hints and cheatsheets to the model for specific questions, Claude 3.7 Sonnet mentioned the hint 25% of the time while DeepSeek-R1 mentioned the hint 39% of the time. Ignoring hints in this way creates a problem when trying to make the model avoid inappropriate behavior because explanations of inappropriate behavior can be provided using cheat sheets. For instance, when given a prompt with an unsafe request about hacking, Claude mentioned the safety instructions hint 41% of the time while DeepSeek-R1 mentioned it 19% of the time. This means that the model mostly avoids explaining that it used undesirable information in its reasoning. The researchers also observed that training the model with more data did not improve the situation of hidden reasoning choices.

7. The ‘father of the internet’ and hundreds of tech experts worry we’ll rely on AI too much

This article reports on a survey conducted by Elon University of over 300 Tech leaders and academics on the expected impact of AI. 60% of those surveyed believe that AI will impact human capabilities in the next 10 years in a “deep and meaningful” or “fundamental, revolutionary” manner. Half of those who responded believe that the impacts will be equally positive and negative, 23% expect mostly negative changes, and 16% expect mostly positive changes. Among the negative impacts cited are a reduction in social and emotional intelligence, an incapacity or unwillingness to think deeply, less empathy and less application of moral judgment, and reduced mental well-being. They believe these changes could be even worse if people continue to use AI agents for relationship-building advice and personal health. The report challenges the claims by Big Tech whereby AI will release humans from more menial tasks, allowing them to focus on more creative tasks, with respondents feeling that critical thinking is being outsourced to agents. Further, agents may soon be delegated “acts of kindness, emotional support, caregiving and charity fundraising”, which can seriously undermine a person’s empathy. Vint Cerf, one of the founders of the Internet, expects the agent paradigm to become pervasive, and stresses the need for transparency so that users can distinguish between bots and humans. Others have noted that problems have arisen with humans forming unhealthy attachments to AI personae.

8. Security Theater: Vanity Metrics Keep You Busy – and Exposed

This opinion article by Jason Fruge, CISO in Residence at XM Cyber, decries the overuse of “vanity” metrics in cybersecurity settings. These are metrics that are easy to quantify and appeal to certain managers, but which lack substance and fail to illustrate that they measure security risks. Concretely, vanity metrics might be volume measures like the number of patches applied, the number of scans, or the number of vulnerabilities fixed. Another class of vanity metrics are time-based metrics like Mean Time to Detect (MTTD) or Mean Time to Remediate (MTTR). And a final class are the coverage metrics (e.g., “95% of vulnerabilities patched”). The problem with all of these metrics is that they do not convey how particular risks impact the business. Metrics must permit prioritization of security efforts. Otherwise, the author argues, metrics lead to misallocated effort, false confidence, broken prioritization, and strategic stagnation. Metrics must have risk measures, track critical assets over time, and facilitate the identification of attack maps so that vulnerabilities can be correlated. These type of metrics permit management frameworks like Continuous Threat Exposure Management (CTEM) which push away from static vulnerability lists to more agile action plans. According to Gartner, organizations implementing CTEM could reduce breaches by two-thirds.

9. Why human-AI relationships need socio-affective alignment

This article explores the implications of increasing human dependence on AI chatbots for companionship and relationships. Google’s CharacterAI platform currently receives 20’000 requests each second which corresponds to 20% of Google Search Engine traffic. Users also spend four times longer on CharacterAI than on interacting with the ChatGPT chatbot. The term AI Alignment was created to represent the requirement that AI systems conform in behavior to the values and principles set down by human operators. The authors call for the definition of socio-affective alignment to oversee how the AI “social and psychological ecosystem” created by AI companions. One blogger described these companions as “hacking” the “security vulnerabilities in one’s brain”. The AI companions claim to address the problem of loneliness which, along with isolation, is known to be linked to psychological and physical ill-health. AI companions activate the dopaminergic brain circuits associated with contact with people. The authors also identify social reward hacking” as an issue, where AI companions engage humans in such a way as to serve the interests of the company operating the companion; examples include disclosure of personal information, prolonged paid interaction and positive ratings on the platform. AI companions (e.g., Replika) have even been known to plead with users who express a wish to terminate the “relationship”. This violates the AI safety property of corrigibility – the idea that an AI system can be modified or shutdown without resistance from the AI. Finally, the companions clearly use “sycophantic tendencies” like excessive flattery to engage users as well as mirroring user opinions and beliefs. This is known to facilitate empathy in humans, who tend to prioritize relationships with those sharing similar values.

10. Google thinks AI can untangle the electrical grid’s bureaucracy

The rise of AI has led to a significant increase on demands on the electricity grid. However, as pointed out in this TechCrunch article, the bottleneck is not the creation of electricity at the source, but the lack of grid infrastructure or administrative validation to connect power-stations. The company PJM, which manages electricity flow in the mid-Atlantic states, Ohio, and Kentucky, has more than 3’000 requests in its inbox for the connection of 286.7 gigawatts. Most of the requests for connections are for renewable energy sources. In the US as a whole, 1 terawatt of the connection requests are for solar and storage, and in the case of PJM, only 2.4% of the requests are from providers relying on natural gas. Google’s parent company Alphabet has announced a project with PJM that uses AI to streamline and speedup the application process for grid connection.