Confluence for 5.24.2026
Why OpenAI solving an Erdos problem is a big deal. Watch what your AI is doing. AI as a peer reviewer. Start with the work.
Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
Why OpenAI Solving an Erdos Problem Is a Big Deal
Watch What Your AI Is Doing
AI as a Peer Reviewer
Start With the Work
Why OpenAI Solving an Erdos Problem Is a Big Deal
Especially in the context of Mythos.
This week OpenAI announced that one of its unreleased general models had solved the “planar unit distance” problem, an Erdos problem in discrete geometry that has lacked an optimal solution since the 1940s. Erdos problem? Discrete geometry? Most readers might run from such things. We’re not mathematicians, and the whole thing reads like a foreign language to us, but we think it’s a big deal and one you should notice.
The fact that many mathematicians who post online greeted the announcement with statements like, “Are you sitting down?” and “We’re cooked” only makes it more so.
There are two important things to take away.
The first is what this result says about the rate of progress in large language model (LLM) capability. A model at OpenAI solved this problem (well, technically, it disproved the central conjecture of this specific problem) just 24 months after many people were ridiculing LLMs for not being able to do simple math. The rate of improvement is hard to grasp, and if anything, it seems to be accelerating. LLMs are far more capable today than they were 24 or even 12 months ago.
The second is what the result says about the next wave of models. A general OpenAI model solved this problem, not a model trained for discrete geometry. Anthropic’s Mythos model, which identified cybersecurity vulnerabilities in pretty much all major operating systems and platforms, is also a general model, not a model trained in cybersecurity. There is every indication that the next major release of large language models will be a step change in ability from what we’ve seen in the past. The current models, like Opus 4.7 and ChatGPT 5.5, are amazing, but they aren’t big leaps from their immediate predecessors. The models that OpenAI and Anthropic have behind their own walls, and are using to do these things, show every sign of being a big leap when released into the wild.
We can probably expect that release to happen in the next six months. With Mythos, Anthropic first has to ensure it doesn’t create massive cyber risk, and that work is underway. We’d presume OpenAI (and probably Google) are taking similar measures with their own next-generation models. But when these models do reach general users, we expect them to be a pretty big step forward.
And of course, by the time they are released, OpenAI, Google, and Anthropic will likely already be testing, if not using, the models to follow.
Watch What Your AI Is Doing
A recent experiment reveals what can happen when you don’t change the default.
An experiment that made the rounds on X this week serves as a caution to anyone working with generative AI, and especially to anyone using Copilot on its default settings. Adam Kucharski, writing in his Substack Understanding the Unseen, created two datasets. For the first, he took 2,000 free-text responses and duplicated them, labeling the original set “UK” and the duplicates “US,” for a total of 4,000 responses split between the two countries, with each country’s set identical to the other’s. The second dataset followed a similar pattern: 200 statements about career aspirations, duplicated five times and labeled with five different countries (UK, US, France, Germany, Italy).
When Kucharski fed these datasets into Copilot, it managed to find differences that didn’t exist. US and UK responses, Copilot concluded, “differ mainly in tone, intensity, and wording style.” When asked to dig deeper into the five-country dataset, Copilot quantified the made-up differences: Italians were three times more likely to aspire to a career in the arts than those in the UK, and Americans were 1.5 times more business-focused than the French. All from identical underlying data.
If you’re following along at home, this likely violates your expectations. The latest models are remarkably capable, and we don’t see obvious failures like this in our day-to-day work. So what happened?
The model and the harness matter a great deal. Kucharski used the free version of Copilot with the default “Auto” router, which gives Copilot the ability to choose the model for the task (though in our experience, it rarely does so effectively). He did this intentionally. The vast majority of Copilot users depend on the free version, and the vast majority of those users never change the model away from “Auto.”
This matters because of what lighter-weight models tend to do when asked to analyze a dataset. Rather than writing and executing code to compute real differences, they predict what the differences should look like based on patterns in their training data. That training corpus encodes cultural priors alongside everything else, so when the prompt is “compare these countries,” the model falls back on what it has absorbed about how Italians or Germans or Americans tend to talk. It produces a fluent, plausible-sounding analysis that has nothing to do with the data in front of it.
Fortunately, Kucharski made his prompts and datasets available, so we ran our own tests. Copilot Premium (even on Auto), ChatGPT, and Claude all nailed it, recognizing right away that the responses were duplicates. The leading models did real analysis. They wrote and executed Python rather than predicting what the answer should be.
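To make “real analysis” concrete, here is a minimal sketch of the kind of check a model can run in code before it says anything about differences between groups. It is our own illustration, not the code any of these models produced, and it does not use Kucharski’s data: the function names and the toy responses are hypothetical. The idea is simply to count the responses under each country label and test whether the groups are identical.

```python
from collections import Counter

def count_by_country(rows):
    """Map each country label to a Counter of its (normalized) responses.

    `rows` is an iterable of (country, response) pairs. The names here are
    illustrative assumptions, not taken from the original experiment.
    """
    by_country = {}
    for country, response in rows:
        key = response.strip().lower()  # ignore case and stray whitespace
        by_country.setdefault(country, Counter())[key] += 1
    return by_country

def groups_are_identical(by_country):
    """Return True when every country has exactly the same set of responses."""
    counters = list(by_country.values())
    return all(c == counters[0] for c in counters[1:])

# Toy stand-in for the experiment: the same responses under two labels.
responses = [
    "I want to be a painter",
    "I hope to run my own business",
    "I would like to teach",
]
rows = [(country, text) for country in ("UK", "US") for text in responses]

print(groups_are_identical(count_by_country(rows)))  # prints True
```

If this check comes back true, any reported difference between the countries is invented, no matter how plausible it sounds.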
That doesn’t mean we should be complacent. Even the most capable models won’t get everything right every time, and the more complex the task and the higher the stakes, the harder a mistake will be to catch. The answer is to build friction into the work. Whenever we use models for qualitative analysis, we ask them to show their work — produce a codebook, surface the specific responses that drove each theme, and make the reasoning auditable. Friction in the workflow keeps us in a position to verify rather than trust by default.
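One way to put that friction into practice is to check the model’s citations against the source data. The sketch below is a rough illustration under our own assumptions: it supposes the model returns a codebook that maps each theme to (response ID, quoted excerpt) pairs, and it flags any excerpt that does not appear in the response it cites. The data structures and the `audit_codebook` name are hypothetical, not part of any tool.

```python
def audit_codebook(codebook, responses):
    """Return the citations that cannot be verified against the raw data.

    `codebook` maps a theme to a list of (response_id, quote) pairs.
    `responses` maps a response_id to the full response text.
    Both shapes are assumptions made for this illustration.
    """
    problems = []
    for theme, citations in codebook.items():
        for response_id, quote in citations:
            text = responses.get(response_id)
            # Flag citations to missing responses or quotes not found in them.
            if text is None or quote.lower() not in text.lower():
                problems.append((theme, response_id, quote))
    return problems

responses = {
    1: "I want more time for strategy work.",
    2: "Too many status meetings.",
}
codebook = {
    "strategy": [(1, "more time for strategy")],
    "meetings": [(2, "status meetings"), (3, "endless meetings")],  # ID 3 does not exist
}

print(audit_codebook(codebook, responses))
# prints [('meetings', 3, 'endless meetings')]
```

An empty list does not prove the themes are right, but a non-empty one tells you exactly where to look first.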
There’s always a risk of falling asleep at the wheel when working with generative AI, and even the most capable models can make mistakes that leave us scratching our heads. We put ourselves in a much better position by selecting the right model for the task up front and building verification into the workflow rather than depending on the default.
AI as a Peer Reviewer
A new study compares AI and human peer reviewers and finds that neither replaces the other.
AI is reshaping systems and processes that have long relied on the friction of human effort to keep submission volume manageable. Job applications are the example we’re hearing about most, but academic peer review faces the same problem, and major conferences are beginning to use AI reviewers to help human ones keep up. A new study from Carnegie Mellon examines how those AI reviewers measure up.
The study took 82 academic papers and compared peer reviews from human reviewers with reviews from GPT-5.2, Claude Opus 4.5, and Gemini 3.0 Pro. Each review was rated on its correctness, significance, and sufficiency of evidence. GPT-5.2 outperformed the top-rated human reviewer on overall quality, and all three AI models beat the lowest-rated human.
The AI reviewers surfaced distinct issues that no human raised, and their critiques were more often judged as significant and well-supported, particularly on detailed methodological and code-level problems that take time-consuming review to catch (and that human reviewers therefore often missed). But the AI reviewers had their own weaknesses. They made more errors overall, often failed to grasp field-specific writing conventions, critiqued minor issues too harshly, and lost track of details in long papers. They struggled, in short, when the review required cultural context or exceeded the limits of their context windows.
The AI reviewers also overlapped with one another far more than human reviewers did (21% for AI pairs versus 3% for human pairs). Given all our recent discussion about AI “tells,” this should not be all that surprising: across use cases, we’re finding AI models have their own norms, stylistic markers, and blind spots. Their intelligence, whatever it is, is not the same as human intelligence. The AI reviews added to but did not duplicate human review, and the study concludes that combined human and AI review is probably the best path forward.
For leaders and communicators, this is further evidence that AI can offer real value as a feedback and thought partner, provided you understand what it’s good at and what it isn’t. Expect it to surface problems and nuances you would never have considered, and to miss the context-specific insights a seasoned colleague would catch instantly. In our view, that’s this study’s most important insight: an AI’s output does not have to look or sound like a human’s to be useful, and treating that difference as an automatic deficiency means missing much of its value.
Start With the Work
In team conversations, focusing on AI too early can narrow the possibilities.
In the generative AI sessions and engagements we run with leadership teams, the conversation inevitably turns to use cases. “What are the best use cases for our team, and how do we identify them?” We’ve written before about how many of the most interesting AI ideas come from subject matter experts closest to the work. Many of those ideas surface in isolation, as individuals work through challenges, apply AI to them, and discover novel approaches. In exploratory conversations among an entire team, we see a different approach that can produce real progress: perhaps counterintuitively, some of the most productive exploratory AI conversations do not start out being about AI at all.
These conversations begin instead with a discussion of the team’s work. This can include questions on what the team wants to do more of or less of, where they want to spend more time or less time, and how the team could add the most value in an unconstrained “blue sky” scenario. Importantly, the discussion at this stage does not consider AI at all. It’s entirely about the work. AI does not enter the conversation until these questions have been sufficiently discussed.
This works because leading with AI can constrain the conversation in subtle ways. People come in with different levels of understanding of what AI can do and different conceptions of its strengths and limitations. All of these preconceptions shape — and often narrow — the possibilities the team will then surface in conversation. The conversation becomes about the tool rather than about the work.
A conversation that starts with the work is broader and more open, which is actually better suited to the flexible and open-ended nature of generative AI’s capabilities. It puts the team’s ambitions and frustrations front and center and then treats AI as one of several means of addressing them. When AI is explicitly framed as a means to an end, AI considerations come in naturally. Given our goals, where can generative AI help? What’s amenable to assistance, augmentation, or full automation? The “blue sky” question — what have we always wanted to do but never been able to? — is a particularly powerful one, and especially so for leaner and more resource-constrained teams.
For leaders, our advice is this: before your team discusses what you want AI to do, talk about what the team wants to do. Everything else should follow from that. The more capable these systems become, the more they will be able to suit themselves to your needs, rather than the other way around.
We’ll leave you with something cool: Google released its latest model, Gemini Omni, which allows you to create high-quality videos (and soon, more) from nearly any input.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
