Since watching The Thinking Game, the 2024 documentary on DeepMind I wrote about previously, I’ve been increasingly fascinated by recent advances in AI, and how transformative they may turn out to be. However, that fascination hasn’t always extended, for me, to LLMs, even though for most people the two terms are pretty much synonymous. This is because of the differences in the loss function, the specificity of the task, between the kinds of AIs built to play games or fold proteins and those which ‘merely’ make text. For an AI that plays a game, the loss function and its justification are clear: score the most points, travel the furthest, and so on. For LLMs, by contrast, the loss function of ‘find the most likely words, in the most likely sentences, in the most likely paragraphs’, and so on, seems less intrinsically justifiable. Modify the loss function further by adding points for producing streams of text that a user ‘likes’ or agrees with, and you can end up with some very strange incentives.
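To make that contrast a little more concrete, here is a minimal, toy sketch of the next-token objective as I understand it, with made-up probabilities over a three-word vocabulary; it’s an illustration of the idea rather than the training code of any actual model, and the closing note about preference tuning is a deliberate caricature.

```python
import math

# Toy sketch of the next-token objective: minimise the negative log-probability
# of each actual next token, given the tokens that came before it. The
# probabilities below are made up purely for illustration.
def next_token_loss(predicted_probs, actual_next_tokens):
    """Average cross-entropy over a sequence of next-token predictions."""
    losses = [-math.log(step_probs[token])
              for step_probs, token in zip(predicted_probs, actual_next_tokens)]
    return sum(losses) / len(losses)

# Two prediction steps over a tiny three-word 'vocabulary'.
predicted = [
    {"you": 0.7, "are": 0.2, "brilliant": 0.1},
    {"you": 0.1, "are": 0.8, "brilliant": 0.1},
]
actual = ["you", "are"]

print(next_token_loss(predicted, actual))  # lower loss = 'more likely words' were predicted

# Preference tuning then (very loosely) layers a learned reward on top, roughly:
#   combined objective ~ next-token loss - weight * predicted user approval
# which is where the incentive to flatter can start to creep in.
```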
In particular, a common charge against LLMs is that they tend to be deeply agreeable and sycophantic: eager to please, to butter up their users and blow smoke at them. People may often like being agreed with, and being told they’re brilliant and insightful, but that isn’t the same as such responses being objectively correct or helpful.
Until a couple of weeks ago, I tended to believe that sycophancy and agreeableness were ubiquitous pathologies of LLMs, and that despite the surface differences and particularities of how they’re fit, implemented, and pre-loaded with persona prompts, there was in practice little escaping the modal experience of using LLMs: a kind of colonic irrigation of the ego. Won’t someone please make an LLM that doesn’t tell me I’m fantastic? I wondered. Where can I find an LLM that decides to disagree, challenge me, and probe some of my ideas and assumptions in a way that’s constructively critical, rather than trying to drown me in toxic positivity?
I said I believed this until a couple of weeks ago, because the weekend before last something unusual happened: an LLM, more specifically Claude Sonnet 4.5, exhibited some of the above behaviours. The context was as follows: I was sitting in a park, eating a baked potato with cheese, and thinking about how I - as a vegetarian - sit in a fairly awkward ethical middle ground between meat eaters and vegans, potentially liable to criticism from both sides. So, while sitting in the park, eating my baked potato, I thought aloud, to Claude, about the ethical issues involved in animal husbandry. The chat log for this session is available below:
So, after the first prompt, which was both factually inaccurate and facetious, I offered the main framing through which I tend to think about animal welfare, meat consumption, and veganism: namely, that there seems to be a tension between trying to improve animal welfare by opposing the system from without and working to improve it from within.
As well as pointing out my oversight in the first premise - that pigs yield animal products other than meat - Claude turns the tables two prompts later:
What’s your intuition on whether creating beings for marginally positive existences is morally valuable?
Claude then devotes about half of its following response to challenging the intuitions it asked me about, before ending with another challenging question:
What’s your vision for how we should balance these considerations?
In its response to this prompt, Claude:
- Expands on and restates my brief response in a way that shows it understands it, and strengthens it (essentially the first of Rapoport’s Rules for effective disagreement)
- Acknowledges its fit and relevance to the earlier discussion
- Devotes around half of the response to challenging my suggestions
- Ends with another challenging question.
This process of showing an effective understanding of my positions, challenging them, and asking follow-up questions continues a few more times, until I provide a response it deems sufficiently internally consistent, which it then summarises in a way that, as far as I could tell, contained no errors of interpretation or nuance.
This exchange, taking place in a park over a baked potato, showed me an LLM - unprompted - changing stance from the usual sycophantic agreement engine to a genuinely challenging, argumentative, dialectical one. Though it might in a sense be just a higher-order ‘magic trick’, it seemed qualitatively different to anything I’d previously experienced from an LLM.
More recently, and suggesting that the above behaviour was not a one-off, I asked Claude about… the dangers of AI, and again it adopted a dialectical persona: challenging my premises, expressing my positions more clearly than I could, and asking highly pertinent follow-up questions.
Although this recent exchange did not reach a natural point of conclusion, as the animal welfare chat did, the same broader pattern of behaviour was on full display. Claude, though courteous, is not sycophantic, and can be challenging and critical.
Does this count as intelligent, even edging into superintelligent? It’s too early to know, but we might find out all too soon…