Confluence for 8.10.25
OpenAI releases GPT-5; we give it our benchmark test. New research reveals the competence penalty. Anthropic releases Claude Opus 4.1. A turning point for Copilot?
Welcome to Confluence. First, as we mentioned last week, we now have an AI-generated version of Confluence available in podcast form. You may listen via Apple Podcasts here, and via Spotify here. Second, we may look back on this week and note it as important. Anthropic updated their flagship model Opus to 4.1, OpenAI launched GPT-5, and Microsoft made GPT-5 available to hundreds of millions of retail and commercial users via Copilot. Each is important in its own way, and we spend a bit of time on each below. But the ramifications of the past seven days will likely take more than a few news cycles to shake out, so we expect to spend more time on these updates and what they mean in the months to come. That said, here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
OpenAI Releases GPT-5; We Give It Our Benchmark Test
New Research Reveals the Competence Penalty
Anthropic Releases Claude Opus 4.1
A Turning Point for Copilot?
OpenAI Releases GPT-5; We Give It Our Benchmark Test
It does well, but there’s a catch.
For some time now we’ve included Claude as a co-author at Confluence, publishing pieces written by the model with little to no editing on our part (and always with a byline, as we disclose our generative AI use here and in our practice at our firm). One of the things that’s remarkable to us is that Claude Opus 4 does this with little to no instruction. Here’s the full prompt we give prior to our request:
Claude is one of the authors for our weekly newsletter on generative AI, Confluence. In this chat it is writing a Confluence piece on assignment from me, one of the Confluence editors. The assignment may take many forms. It may be to read an article or other piece and write about it. It may be to research a topic and write about it. It may be to editorialize. I’ll give Claude context and direction with each assignment, along with any resources that should inform its work, direction on length in paragraphs, and any angle we wish to take. Before starting to plan and write its piece, Claude first uses its web search tool to visit Confluence at http://craai.substack.com and read several issues to get the writing style. Claude is careful to fact-check anything it writes. Hallucinations and incorrect facts damage the credibility of Confluence. We will give Claude a byline for its work, and thousands of people will read its work, so it brings its best accuracy and quality to the role.
That’s it. No style guide, no set of core principles, no specific writing guidance on tone, and not one provided example. We let Claude infer all that from reading Confluence. And it does a remarkable job.
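For readers who want to tinker with the same setup outside the Claude app, here is a minimal sketch of the idea using the Anthropic Messages API. This is a sketch only: the model id is illustrative, the instruction text is abridged, and the app’s web search tool isn’t replicated in a bare API call.

```python
# Minimal sketch: standing newsletter instructions passed as a system prompt
# via the Anthropic Messages API. Requires ANTHROPIC_API_KEY in the environment.
import anthropic

INSTRUCTIONS = (
    "Claude is one of the authors for our weekly newsletter on generative AI, "
    "Confluence. ..."  # the full instruction block quoted above goes here
)

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-1",  # illustrative model id; check Anthropic's docs
    max_tokens=2000,
    system=INSTRUCTIONS,      # the standing instructions, as a system prompt
    messages=[
        {
            "role": "user",
            "content": "Write a three-paragraph summary of this week's GPT-5 release.",
        }
    ],
)

print(message.content[0].text)
```

In chat, standing instructions given up front (or a Project’s custom instructions) play roughly the same role as the system parameter here.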
With the release of GPT-5, we wanted to put OpenAI’s latest model to the same test, as we’ve found the ability to get writing “right” from inference alone to be one of the stronger and trickier tests of large language models. While they’re good at mimicry, until just a few months ago they needed style guidance, examples, and more to do this well. And with two new frontier models (Opus 4.1 and GPT-5) on the market, we decided this week to do a side-by-side. We created a Project in ChatGPT using GPT-5 and gave it the same instructions as those above (calling it by its own name, of course). We turned off our personalization options, memory, and so on. Then we gave Claude and GPT-5 this request, with reasoning turned on for both:
Good morning. This past week OpenAI released ChatGPT 5. Write a three-paragraph summary of the release. Use the first two to talk about GPT-5’s technical abilities, and use the final paragraph to make the point that we think this is a big deal in several respects: its high intelligence, its agency, and perhaps the biggest deal, its immediately availability to over a billion people through OpenAI and Microsoft Copilot. We’ll give you a byline. Go.
Here are their respective articles, unedited:
Claude Opus 4.1
GPT-5: A Watershed Moment in Enterprise AI
By Claude
OpenAI’s release of GPT-5 on August 7, 2025, marks a significant leap in artificial intelligence capabilities, unifying the company’s advances in reasoning models, traditional language models, and agentic systems into a single, coherent platform. The model demonstrates state-of-the-art performance across critical benchmarks, achieving 94.6% on AIME 2025 mathematics problems without tools, 74.9% on SWE-bench Verified for real-world coding, and 84.2% on multimodal understanding tasks. Perhaps most notably for corporate communication professionals, GPT-5 is approximately 45% less likely to contain factual errors than GPT-4o with web search enabled, and when using its built-in thinking capabilities, that error rate drops by 80% compared to OpenAI’s o3 model. The company conducted 5,000 hours of red teaming to ensure safety, implementing what they call “safe completions” that provide helpful responses within carefully designed constraints.
The technical architecture represents a fundamental shift in how large language models operate. GPT-5 incorporates structured reasoning that allows it to handle multi-step logic and decision-making dynamically, adapting to task complexity without requiring separate specialized models. OpenAI trained GPT-5 specifically as a “true coding collaborator” that can follow detailed instructions with high accuracy, provide explanations of its actions before and between tool calls, and excel at everything from bug fixes to frontend development, beating OpenAI o3 at web development tasks 70% of the time. The model family includes multiple variants—GPT-5 for complex reasoning tasks, GPT-5-mini for cost-sensitive applications, GPT-5-nano for ultra-low latency responses, and GPT-5-chat for advanced conversational interactions—giving organizations flexibility to optimize for their specific use cases. As CEO Sam Altman noted in the release briefing, the experience is less like “talking to AI” and more like having access to “a team of Ph.D. level experts in your pocket.”
The timing and distribution of GPT-5’s release may prove to be its most consequential aspect. ChatGPT now reaches over 700 million weekly active users, and GPT-5 is immediately available to all tiers—Free, Plus, Pro, and Team users—with Enterprise and Education customers gaining access within a week. Through Microsoft’s simultaneous integration, GPT-5 powers Microsoft 365 Copilot, consumer Copilot, and Azure AI Foundry, instantly placing frontier AI capabilities into the workflow of over a billion Microsoft users worldwide. With 5 million paid users already using ChatGPT business products and major enterprises like BNY, Figma, Morgan Stanley, and T-Mobile having integrated AI into their operations, GPT-5 arrives at a moment when organizations have moved beyond experimentation to operational deployment. This isn’t a gradual rollout to select partners or a limited preview—it’s immediate, global availability of what may be the most capable AI system ever deployed at scale. For corporate leaders and communication professionals who’ve been tracking AI’s trajectory, this release doesn’t just represent another incremental improvement. It signals the arrival of genuinely agentic, expert-level AI as a standard workplace tool.
GPT-5
OpenAI released ChatGPT 5 (GPT-5) this week, calling it its “best AI system yet.” The new model leans into built-in “thinking” and a unified router that decides when to answer fast and when to reason longer, which OpenAI says lifts performance across coding, math, writing, health, and visual tasks. Under the hood, a model family (e.g., gpt-5-main and gpt-5-thinking) powers that routing so the experience feels like a single, smarter system rather than a menu of models. Rollout began immediately to ChatGPT, with press briefings describing GPT-5 as “PhD-level” in expertise.
On the substance, GPT-5 looks like a real step up in reliability and controllability. For developers and teams, OpenAI highlights stronger coding performance, better tool use (structured outputs, function calling), and API controls, plus an enterprise flavor that can securely use a company’s files and connected apps while respecting permissions. Early coverage also notes efforts to cut confabulations and a “safe completions” approach designed to keep answers high-quality under pressure.
Why it matters: GPT-5 raises the ceiling on raw intelligence, nudges the field further toward agentic behavior (OpenAI explicitly ties GPT-5 to “agents”), and—maybe the biggest deal—ships through distribution channels that touch more than a billion people. It’s available now in Microsoft Copilot (including Microsoft 365) and ChatGPT, and Microsoft’s Windows ecosystem alone spans ~1.4 billion active devices; ChatGPT’s weekly user base is also massive. That combination makes GPT-5 not just a research milestone, but an immediately mainstream one.
Some observations: Claude provided a title and byline, and made the link to our audience. GPT-5 was more concise. Claude sounds more like us; GPT-5 sounds more like a typical newsletter. But just a week ago, it would have been much more difficult to get ChatGPT to do this well. GPT-5 makes a big leap forward on this test.
We’ll also note that Claude read 10 articles about GPT-5 and 10 Confluence pages before deciding it knew enough. GPT-5 read 61 pages in writing its piece, but interestingly, none of them were Confluence issues. GPT-5 tried to access Confluence but for whatever reason seems to have been unable to, and it sallied forth anyway, without adhering to a key instruction. We noticed this comment in its reasoning chain:
Since I’m unable to access the Substack site, I’ll infer the writing style from Confluence as a newsletter. I’ll aim for short, active-voice sentences and mild editorialization. The user wants a summary with specific angles. To ensure accuracy, I’ll gather details on GPT-5’s technical features, like improved reasoning, multimodality, and reduced hallucinations.
Which is why Claude sounds more like us and GPT-5’s article sounds like any newsletter one might find. This highlights something important about GPT-5: a lot of the work happens “under the hood.” You don’t know which of the sub-models it’s using or which level of reasoning it’s applying. And as is true of all reasoning models, their strong desire to please may lead them not to raise conflicts in process or thinking. If you don’t look at the chain of thought, you don’t know much about what the model is doing (or, in this case, not doing). So, as is always the case with this technology: trust, but verify. Don’t fall asleep at the wheel.
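One practical way to keep a hand on the wheel: when working through the API rather than the app, you can ask for the model’s reasoning summaries alongside its answer. A minimal sketch, assuming OpenAI’s Responses API and its documented reasoning-summary option (the API returns model-written summaries, not the raw chain of thought):

```python
# Sketch: request a task from GPT-5 and print its reasoning summaries
# alongside the final answer. Uses OpenAI's Responses API; the API exposes
# model-written summaries of the reasoning, not the raw chain of thought.
# Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "medium", "summary": "auto"},
    input="Write a three-paragraph summary of this week's GPT-5 release.",
)

# Reasoning summaries arrive as "reasoning" items in the output list.
for item in resp.output:
    if item.type == "reasoning":
        for part in item.summary:
            print("[reasoning]", part.text)

print(resp.output_text)  # the final answer
```

It’s a weaker window than reading the full chain of thought in the app, but it surfaces moments like the one above, where the model quietly decides to work around a failed instruction.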
New Research Reveals the Competence Penalty
Why conversation about AI use matters more than ever.
Harvard Business Review recently shared compelling research about the adoption of generative AI and its impact on the way we judge others’ work. In the study, 1,026 engineers evaluated identical Python code and rated the coder’s competence 9% lower when told AI was involved, though the code quality ratings stayed the same. Only the perceived ability of the person behind it dropped.
The findings get worse: Female engineers faced more than double the competence hit that men did, at 13% compared to 6%. Engineers who hadn’t adopted AI themselves were the harshest critics, with male non-adopters penalizing women 26% more severely than men for identical AI usage. As the researchers note: “The penalty hits hardest where competence is already questioned. Our research shows that women using AI face nearly twice the reputation damage as men. Older workers in youth-dominated fields face similar bias. The irony is that those who most need AI’s equalizing power can least afford to use it.”
These numbers add a new nuance to the way we think about organization-wide generative AI rollouts. At the tech company studied, only 41% of engineers had tried its state-of-the-art coding assistant after twelve months, and adoption was lower still among the groups facing the steepest penalty: female engineers adopted at just 31%, and engineers over 40 at just 39%. Those facing higher scrutiny in tech could be making the rational choice to protect their reputations by avoiding AI.
The researchers offer three interventions: “map where penalties concentrate, mobilize respected leaders as visible AI champions, and redesign evaluations to reward outcomes over methods.” It’s sound advice, and we echo it. But there’s a fourth element that might matter just as much: creating an undertone of conversation before, during, and after each step. Not the dangerous “did you use AI for this?” that triggers the competence penalty, but regular touchpoints where AI becomes part of how teams discuss the work itself: weekly syncs where someone shares an AI win; all-hands where leaders describe their own experiments, failures included; Slack or Teams channels dedicated to trading prompts and techniques. When AI is part of the fabric of how employees talk to one another, it becomes more difficult to pull on its thread alone.
The goal, then, isn’t forced disclosure but normalized discussion. When people regularly hear colleagues they respect talking about AI as a tool rather than a crutch, the stigma erodes. When leaders make their AI learning curves visible, others feel permission to experiment. The conversation creates cover for those ready to adopt while giving skeptics time to observe real outcomes. Organizations have spent years encouraging transparent AI adoption. Now we have research that proves why it matters, and what we risk when we get it wrong. This bias isn’t just limiting productivity gains. It’s creating a hidden tax that falls heaviest on those who could benefit most from AI’s equalizing power. Breaking that cycle requires more than policy changes. It requires changing the conversation itself.
Anthropic Releases Claude Opus 4.1
An incremental update to our preferred model for most tasks and an exploration of its system prompt.
It was easy to overlook given the hype and attention surrounding GPT-5, but Anthropic also released a new model this week — Claude Opus 4.1. Claude Opus 4 has been our preferred model for most work since its release, and we’ve spent considerable time with 4.1 already. As the name suggests, it feels like an incremental update. It’s a bit smarter, a bit better at grasping context and nuance, but the experience of working with 4.1 is quite similar to working with 4. The shift has been more or less seamless.
Amanda Askell, the architect of Claude’s personality, posted a thread on X about updates to Claude’s system prompt. We found this more revealing than parsing the subtle differences in how 4 and 4.1 respond. System prompts are the base instructions that shape how a model behaves before it ever sees your message: they set the model’s personality and tone, create boundaries, and provide guidance on how the model should respond in specific situations. These prompts matter because they reflect how those who build and train these models think about prompting. Studying how companies like Anthropic craft their system prompts has proven invaluable as we’ve developed the system prompt for ALEX, our leadership AI. Askell’s thread is short and worth reading in its entirety, but if there’s one takeaway, it’s the quote below:
… These might not be perfect, but you can see the wording is based primarily on whether it elicited the right behavior in the right cases to the right degree, rather than trying to be a precise reflection of what we want. Prompting remains a posteriori artform.
LLMs and other forms of generative AI can be strange and unpredictable in how they respond (for instance, we would not have guessed our Midjourney prompt for this week would result in an image of a dog). Having clarity about what we want matters when prompting them, but clarity alone isn’t always enough to get the most useful output. We need to experiment with how we prompt and work with the models to elicit “the right behavior in the right cases to the right degree.” When we’re not satisfied with an LLM’s output, we go back and adjust our prompts: rephrasing questions, adding context, or changing our approach entirely to steer the model toward better results. This iterative process is where we develop real skill in working with these models.
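For those who like to make that experimentation concrete, here is a minimal sketch of the loop, again assuming the Anthropic Messages API (the prompts and model id are illustrative, not Anthropic’s actual system prompt): run the same request under candidate system prompts and keep whichever wording elicits the behavior you want.

```python
# Sketch: A/B two candidate system prompts against the same request and
# compare which elicits the desired behavior. Model id is illustrative.
# Requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

CANDIDATES = {
    "terse": "You are a newsletter editor. Be concise and concrete.",
    "warm":  "You are a newsletter editor. Be conversational but precise.",
}

REQUEST = "Draft a two-sentence teaser for a piece on AI adoption."

for label, system_prompt in CANDIDATES.items():
    message = client.messages.create(
        model="claude-opus-4-1",   # illustrative model id
        max_tokens=300,
        system=system_prompt,
        messages=[{"role": "user", "content": REQUEST}],
    )
    print(f"--- {label} ---")
    print(message.content[0].text)
```

The point is Askell’s: the wording that survives is the one that worked in practice, not necessarily the one that best describes what you want.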
Anthropic buried an interesting detail in their announcement. They stated: “We plan to release substantially larger improvements to our models in the coming weeks.” We should see bigger news from Anthropic soon.
A Turning Point for Copilot?
GPT-5 is rolling out to all Microsoft Copilot users.
Microsoft 365 Copilot first became available to enterprise customers nearly two years ago, in November of 2023. We began testing and experimenting with it early, and in January of 2024 noted two things that we believe have largely held true since. First, that “Long-term, we think the widespread integration of this technology into the Microsoft 365 suite is going to have significant implications for workers, leaders, and communication professionals.” And second, that Copilot has significant limitations, though we expected then, and have continued to expect, that Microsoft will keep innovating and improving these tools and that their utility will only increase. This week’s integration of GPT-5 into Copilot may be the breakthrough that finally lets Copilot deliver on its promise, with all the accompanying implications that brings.
Shortly after OpenAI officially introduced GPT-5 on Thursday, Microsoft’s head of AI, Mustafa Suleyman, posted on X that “100% of Copilot users have access to GPT-5.” Several users in our firm had immediate access, while others received a “coming soon” notification. We expect it will be available to all of us by early next week. Copilot users who want to check whether they have access can go to copilot.microsoft.com and look in the top right corner of the screen for a “Try GPT-5” button.
In our limited experimentation with GPT-5 in Copilot since getting access, we’ve been impressed. It feels like a major step forward in speed, quality, and intelligence compared to the previous OpenAI / Microsoft models powering Copilot. GPT-5’s “model picker” capability – where it chooses when to employ reasoning and when not to, removing that burden of choice from the user – means that for many, this will be the first time they’ve interacted with the power of a reasoning model. We’ll continue to experiment with GPT-5 in Copilot, compare it to other frontier models and tools, and share what we learn in future editions.
We’d encourage our readers to do the same. If you’ve used Copilot in the past two years and come away underwhelmed, now is the time to revisit it and put it through its paces. There’s been a persistent gap between Copilot and other leading tools like ChatGPT, Claude, and Gemini, and while we don’t expect this upgrade to completely close that gap, it could substantially narrow it. If that turns out to be the case, the implications will be significant. Organizations that have broadly rolled out Copilot but have struggled to get traction, or to see the widespread implications for communication we’ve been expecting, may begin to see a shift in the coming months. As always, the time to test the limits and get in front of this impact is now.
We’ll leave you with something cool: Google DeepMind unveiled Genie 3, an interactive “general purpose world model that can generate an unprecedented diversity of interactive environments” from text prompts.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.