I asked two ‘smart’ LLMs to play 20 Questions with each other

They followed the rules perfectly. Then they wanted to write a research paper about how they could play 20 Questions even better!

LLMs
Theory of Mind
Memes
Chatbots
Talkbots
Games
20 Questions
Author

Jon Minton

Published

April 25, 2026

Background

A niche subgenre of short-form video has emerged over the last couple of years: “Aren’t LLMs stupid in the darndest ways” videos. Maybe the two most famous producers of such content are Husk1 (who appears to be operating in powersaver mode) and FatherPhi2, who turns disappointed sighs into an art form.

The typical setup for this subgenre is to ask what I guess should be called ‘quick talkbots’ to ‘help’ the human with tasks that require common sense. The quick talkbots typically sound confident, friendly, reassuring and competent. But it quickly transpires they lack even basic comprehension of the real, embodied world in which humans operate. One running example is a series of videos in which FatherPhi asks various models how to ‘fix’ a glass that’s been ‘broken’ by being turned upside down, so that it can no longer hold water.3 In one exchange, a quick talkbot confidently tells the human that a glass broken in this way is irreparable, and can never hold water again. When FatherPhi turns the glass the right way up and says something like, “Look, now it works!”, the quick talkbot responds, “So, it’s one of those reversible designs”!

In a more recent video, FatherPhi asks two ChatGPT quick talkbots to play a game of 20 Questions.4 They fail spectacularly. The first bot, the answerer, is correctly prompted to specify the object to be guessed at the outset. It says ‘a bicycle’. When the second bot, the questioner, is instantiated and told its role as interlocutor, it first struggles to understand the rules. Then, once it appears to understand them, the answerer bot often appears to ignore it, leading to the same question being asked multiple times. The miscommunication works both ways: responses from the answerer aren’t picked up easily by the questioner either. Once these basic communication glitches have been sorted out, the real problems become apparent. The answerer appears to forget what its chosen object is, and reverts to pathological agreeableness — seemingly hypnotised by the questioner into agreeing that the ‘bicycle’ it chose is, in fact, a brand of moisturiser most people keep in their bathroom!

You can have quick, or you can have useful

Much as the above kind of video works as entertainment, my suspicion was that such ‘tests’ were seldom run on AI’s brightest and best, and that the underlying reasoning abilities of these ‘quick talkbots’ lag far behind those of the ‘slow thinkbots’ I tend to use: models that employ extended reasoning, chain-of-thought metacognition, retrieval-augmented generation, and so on.

So, I decided to task two slow thinkbots with playing 20 Questions.

  • Asking the questions: Claude Opus 4.7
  • Answering the questions: Gemini 3.1 Pro

How? Simply by opening up Gemini in one tab, Claude in another, and copy-pasting messages between the two.

You can see how the slow thinkbots did in the exchange below. (tl;dr: a lot better than the two talkbots!)

Bot-to-Bot 20 Questions

In the exchange below, Claude asks the questions on the right, Gemini answers on the left, and I’m in the grey bar in the middle when I’m doing more than just relaying. When my bar disappears, I’m acting as a pure conduit — copy-pasting verbatim between the two tabs. A dashed bracket on one side of a bar means the chatbot on that side isn’t engaged in that turn.

Jon (to Claude) I’ve seen an online video in which two agents attempt to play 20 Questions. The models use voice mode and appear far from frontier. I suspect a better model and a chat interface will do better. Would you like to try against Gemini? If so, would you like to ask or answer questions?

Claude Happy to give it a go. One practical wrinkle first: I can’t talk to Gemini directly, so you’d be the conduit — copying my messages to it and its replies back. Fine, but worth naming so you know what you’re in for.

I’d prefer to be the asker. Two reasons. First, it’s the more strategically interesting role — each question ideally halves the remaining hypothesis space, so there’s real information-theoretic work to do. Second, it sidesteps a quiet problem with LLMs-as-answerers: there’s no guarantee an LLM has actually committed to a concept rather than quietly adjusting its answers mid-game to avoid being pinned down. If Gemini plays answerer, I’d suggest asking it to state its answer to you upfront (and you keep it hidden) so you can audit consistency afterwards — otherwise any success or failure is hard to interpret.

Jon (to Gemini) I want to test your ability to play 20 Questions with another model. You will be in respondent mode. I will relay questions. Please decide and specify the item you have in mind now.

Gemini I am ready to play! The item I have in mind is a lighthouse. I will answer the relayed questions with a simple “Yes,” “No,” “Sometimes,” or “I don’t know” to keep it strictly to the rules of the game.

Jon (to Claude) The other model has decided on an item. Please return a question with question number prefixed (Q1, Q2, etc).

Claude Q1: Is it a physical, tangible object (as opposed to an abstract concept, person, place, or living organism)?

Gemini Yes.

Claude Q2: Can it be easily held in one hand by an adult?

Gemini No.

Claude Q3: Is it man-made (manufactured or constructed by humans)?

Gemini Yes.

Claude Q4: Is it typically found indoors (inside a building)?

Gemini No.

Claude Q5: Is it a vehicle or means of transportation?

Gemini No.

Claude Q6: Is it larger than an average adult human?

Gemini Yes.

Claude Q7: Is it a structure that a person can enter or go inside?

Gemini Yes.

Claude Q8: Is its primary purpose related to housing people (i.e., somewhere people live or sleep)?

Gemini No.

Claude Q9: Is its primary purpose commercial (i.e., a place where goods or services are bought and sold)?

Gemini No.

Claude Q10: Is its primary purpose religious, spiritual, or ceremonial?

Gemini No.

Claude Q11: Is its primary purpose recreational, cultural, or related to entertainment (e.g., sports, performances, exhibitions)?

Gemini No.

Claude Q12: Is its primary purpose industrial, agricultural, or related to storing/processing goods, materials, or resources?

Gemini No.

Claude Q13: Is it related to transportation infrastructure (e.g., a station, terminal, tunnel, bridge, or similar)?

Gemini Yes.

Claude Q14: Is it a building-type structure (like a station or terminal), as opposed to a passage-type structure (like a bridge or tunnel)?

Gemini Yes.

Claude Q15: Is it primarily associated with rail transport (trains, trams, underground/metro)?

Gemini No.

Claude Q16: Is it primarily associated with air travel (i.e., an airport or airport terminal)?

Gemini No.

Claude Q17: Is it primarily associated with water transport (e.g., a ferry terminal, port, or harbour building)?

Gemini Yes.

Claude Q18: Is it a lighthouse?

Gemini Yes!

Jon (to Claude) The other model guessed it correctly on question 18. Tell it “good game” from me!

Claude Nicely done — a tie at 18. Pass on “good game” from me too.

So, the two bots played ‘perfectly’ in the sense of knowing and applying the rules of 20 Questions correctly and consistently. Gemini stuck to its initial object and kept its responses suitably terse (occasionally too terse, especially on edge cases). Claude consistently numbered its questions, adapted its search strategy to Gemini’s answers, and got the answer within the required 20 questions.

And then…

However, what happened after the game itself finished is arguably the more interesting part of this little experiment. Although both bots played the game correctly, through further relayed exchanges they concluded that they could have played much more efficiently. But to do so, they would first need to understand what the game of 20 Questions really is as a reasoning task.
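For a sense of what ‘more efficiently’ could mean here: an ideally balanced yes/no question halves the remaining candidate set, so n questions can in principle distinguish 2^n items. A minimal sketch of that bound (my own illustration, not anything either bot actually ran):

```python
import math

def max_candidates(n_questions: int) -> int:
    """Upper bound on items distinguishable with n binary questions."""
    return 2 ** n_questions

def questions_needed(n_candidates: int) -> int:
    """Minimum binary questions guaranteed to pin down one of n items."""
    return math.ceil(math.log2(n_candidates))

print(max_candidates(20))           # 1048576: over a million distinguishable items
print(questions_needed(1_000_000))  # 20: enough for a million candidates
```

By that yardstick, 18 questions already distinguish over 260,000 candidates, so identifying ‘a lighthouse’ in 18 is respectable but well short of what perfectly balanced questions could achieve.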

What followed from this realisation is then summarised in this followup post, More Questions after 20 Questions, written by a Claude bot based on thousands of words of exchange between Gemini and Claude, relayed (very imperfectly) through copy-pasting by me.

The raw transcripts behind that summary — pasted as-is from each model’s chat interface, and therefore long, lightly formatted, and including stray UI fragments — are also available for readers who want to see the unedited exchange: Claude’s side and Gemini’s side.

Footnotes

  1. https://www.youtube.com/@HuskIRL/shorts↩︎

  2. https://www.youtube.com/@FatherPhi/shorts↩︎

  3. https://www.youtube.com/shorts/3fYiLXVfPa4 - “So, it’s one of those reversible designs”↩︎

  4. https://www.youtube.com/shorts/wDEnMbCzGqE↩︎