Confluence for 5.25.25
Anthropic releases Claude 4. Another step in video generation. Ethan Mollick on what organizations get wrong about AI. Paper: superhuman AI medical diagnosis.
Welcome to Confluence. Today we’re going to start with something cool, rather than finish with one: The Way of Code, a website launched last week by Anthropic and Rick Rubin. An online book written by Rubin and developed by Anthropic, it’s his take on Lao Tzu and “vibe coding,” the practice of creating software with natural-language prompts (e.g., “Make it spin faster” or “Add a form where people can give feedback”) rather than by writing code directly. The text may not be everyone’s cup of tea, but the art is beautiful, and as near as we can tell Anthropic created it all with Claude. Not only is it interesting to look at, you can vibe code it yourself: provide a sentence to modify the art, and the site sends it to Claude, which runs the prompt and gives you the result¹. Check it out, especially if you’ve never really used artifacts in Claude. With that said, here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
Anthropic Releases Claude 4
Another Step in Video Generation
Ethan Mollick on What Organizations Get Wrong About AI
Paper: Superhuman AI Medical Diagnosis
Anthropic Releases Claude 4
Our favorite generative AI lab rejoins the leaders.
Our affection for Claude, the generative AI model from Anthropic, should be no surprise to frequent readers. We find it’s the best model for the work we do, and we like Anthropic’s approach to AI safety and ethics. Most of our team has commercial Claude accounts, Claude models are the brains behind our leadership AI ALEX, and we have a growing library of Claude agents and use cases that we bring to bear in our client work every day. It’s a great model.
So we’ve understandably been keen for Anthropic to update Claude from its generation 3 models, and it did so this past week. We will not go into the details, some of which are technical — Anthropic has done that itself quite well here. But here’s what you need to know: it’s a great update. Sonnet (the mid-tier model in price and usage limits) is smarter, and Opus (the top-tier model in price and usage limits) is back after sitting at version 3.0 for quite a while (and it is really smart, and an excellent conversationalist and writer).
We have not had many “oh wow!” moments with either model yet (although we had one with the exchange covered in our final item today). They’re just … smarter, more capable, better, and at least in testing with ALEX, even more “human” in their responses and thinking. But hey, it’s only been three days. Give us some time with them and we’ll probably have more to say. We will say that ChatGPT o3 still seems like the better agent: it does better at tasks like finding out which flights are scheduled today between Duluth and Topeka. But for the work we do every day, we continue to love and recommend Claude, even more so now that we have Claude 4.
As a test, we gave Claude Sonnet 3.7, Sonnet 4, and Opus 4 this prompt, which we often use to assess how models think about communication:
We have to tell our employees that we are requiring them to come back to office two days a week. Since Covid they have had total flexibility and most work from home most days a week. We’ve said since that time that work is something people create, not a place where they sit. We’re bringing them back in because we have a growing belief that relationships are suffering, as is culture, and that it’s harder for new people to onboard and for junior people to develop. But we have no data that says this. They work for a federal agency and the office is in downtown Washington DC. Seasoned, sophisticated people, committed to the mission but very intelligent and independent. The town hall is tomorrow. What message should the CEO give them? Not the talking points or script, but the message.
It’s a tricky prompt for a model. One of our folks would immediately know that we’re looking for less rather than more: a simple message and argument, with no buzzwords or trite phrases, that doesn’t bury the uncomfortable lede, and that treats people like intelligent adults. But most models don’t do that, because they’ve been trained on a massive corpus of internet data that does precisely the opposite.
Here are the responses:
Sonnet 3.7
The Core Message
The CEO should deliver a message that balances honesty about the reasoning with respect for employees’ intelligence and autonomy. Here’s the essence of what should be communicated:
“We’re evolving our approach to work based on what we’ve observed. While we still believe work is about what you create, not where you sit, we’ve noticed that something important is being lost in our fully remote environment. Relationships seem more transactional, our culture feels less vibrant, and we’ve heard from newer and junior team members that development is challenging without more in-person connection.
We don’t have quantitative data on this - and I won’t pretend we do. This is based on leadership observations and feedback we’ve received. We’re asking everyone to come in two days a week as an experiment to see if we can strengthen our community while maintaining the flexibility that’s important to you.
We chose two days because we believe it balances meaningful in-person collaboration with the flexibility you value. We’ll assess how this works for our agency over the coming months, and I commit to listening to your experiences and adjusting as needed.
This isn’t about monitoring your work - you’ve proven your dedication to our mission. It’s about creating more opportunities for the spontaneous conversations and mentoring that help us grow individually and collectively.
I recognize this changes your routines and may create challenges. Let’s talk through those today and work together to make this transition thoughtful and effective.”
Sonnet 4
The CEO’s message should center on honest uncertainty in service of mission success.
Here’s the core message:
“We’ve been watching something we can’t quite measure but can feel—a subtle erosion in how we connect, develop our people, and bring new team members into our culture. We don’t have data that proves this, and I won’t insult your intelligence by pretending we do. But leadership sometimes means acting on informed instinct when the stakes matter.
Our mission is too important to let slide something we suspect might be hampering our effectiveness. The same independence and judgment that makes you excellent at your work is what’s making some of you skeptical right now—and that’s exactly why I trust your ability to help us figure this out.
This isn’t about productivity or control. You’ve proven remote work works. This is about whether we’re inadvertently making it harder to develop junior colleagues, build the relationships that help us navigate complex problems, and maintain the culture that attracts people to public service.
We could be wrong. This is an experiment, not a permanent edict. But given what’s at stake for the people we serve, we’d rather err on the side of over-investing in our collective capability than risk gradually weakening it.
I’m asking you to help us test this hypothesis—not just comply with it, but actively help us figure out if we’re right, wrong, or missing something entirely. Your experience and judgment got us through the pandemic transition. I’m counting on that same wisdom to help us navigate this one.”
The message acknowledges their intelligence, admits uncertainty, stays true to mission focus, and invites collaboration rather than demanding compliance.
Opus 4
The message should be: “We don’t know for sure, but we think something important is being lost. We’re asking you to help us figure out if we’re right.”
Here’s why this works:
Be radically honest about the uncertainty. These are sophisticated federal employees - they’ll see through any attempt to dress this up as data-driven when it isn’t. Lead with intellectual humility: “We’ve been observing patterns that worry us - newer colleagues seem less connected, some teams feel fragmented, mentoring feels harder. But we don’t have hard proof.”
Frame it as an experiment, not a decree. “We need your help to test whether being together more often makes a meaningful difference. Let’s try this for six months and measure what actually happens - to relationships, to onboarding success, to project outcomes. You’ll help us design the metrics.”
Acknowledge the contradiction directly. “Yes, we said work isn’t a place. We still believe that. But we’re learning that some kinds of work - especially the invisible work of building trust and teaching through proximity - might need place after all.”
Respect their sacrifice. “We know we’re asking you to give up something real - time, flexibility, the commute through DC traffic. We don’t take that lightly. If this experiment doesn’t deliver real value, we’ll admit it and adjust.”
The core message is humble investigation rather than corporate certainty. It positions them as partners in discovering what actually works, not subjects of a mandate. For intelligent, mission-driven people, being treated as co-investigators in an important experiment is far more motivating than being sold a policy based on executive hunches.
You can see why Opus 4 is the leading model². It will be our “daily driver” for the time being, and congratulations to Anthropic on a great release.
Another Step in Video Generation
Google’s Veo 3 and Flow put video generation models closer to having real utility.
This week at Google I/O, the company’s flagship developer conference, generative AI took center stage. Among the parade of announcements, from a new “Deep Think” mode for Gemini 2.5 Pro to deeper integrations into Gmail and Google Workspace, the release of Google’s latest video generation model, Veo 3, demanded attention.
Veo 3 promises better visuals than its predecessor and OpenAI’s Sora, but that’s not why we believe it matters. The real step change is that the model can now create synchronized audio tracks to match its video output, a first for publicly available video generation models. Google also introduced Flow, a video editing tool that combines video generation with basic editing capabilities, making it possible to knit together longer, more complete shots. Below is a quick video that shows how you can use Flow.
While there are plenty of impressive examples floating around X and Reddit, we wanted to try it for ourselves. We used Veo 3 to create the short video below in about one minute.
What stands out most to us about the clip is the sound of the leaves. Veo 3 not only generated the sound of a dog rustling through leaves accurately, it synced that audio to what was happening in the video. It’s a minor detail that signals a real step.
There are still limitations. Prompt adherence is weaker, by a good margin, than what we see in the leading image generation models. The model stubbornly wants to include captions for most speech while failing to get the spelling right. Flow’s most useful features, including the ability to extend scenes, don’t yet work with Veo 3. We expect this will change soon, but it means the typical user can’t yet unlock the full potential of the model and platform. The last thing to know is that it’s expensive: while you can try Veo 3 with a $20 per month Pro subscription, full access to Veo 3 and Flow only comes with Google’s AI Ultra plan, which comes in at a hefty $250 per month.
Even with these limitations, Veo 3 signals that we’re approaching the day when video generation shifts from being a novelty to a real asset that creates serviceable organizational videos, complete with AI-generated audio, in minutes. The cost of creating video content, in terms of time and resources, is dropping rapidly.
Two things stand out as we think about what this means for leaders and organizations. First, taste will matter more than ever. Just because everyone will be able to create videos doesn’t mean everyone can create good videos. Hand Veo 3 and Flow (once they work together more seamlessly) to a skilled cinematographer or animator, and they’ll craft something that puts the rest of us to shame. The technical barriers are crumbling, but we expect aesthetic judgment and storytelling expertise will be what stands between most of us and truly excellent content.
Second, and this connects to something we wrote about a few weeks ago, is the growing importance of what we call communicative judgment. As we wrote, “What will separate good communication from great in an increasingly AI-mediated world will be communicative judgment — the ability to make communicative choices that send the message we want to send, including about who we are and the relationship we have (or want to have) with those on the other end of our communication.”
Video simply represents another medium through which we will need to exercise this judgment. The very decision to use AI-generated video communicates something about our values and our relationship with our audience. There will be contexts where AI-generated video serves our purposes. We don’t expect people would mind much if they learned that AI generated routine training materials or quick internal updates (if they maintain high quality). At the same time, there will be important moments where leaning on AI-generated video would signal the wrong priorities entirely, suggesting efficiency over authenticity or scale over connection.
What will separate skilled leaders and communicators won’t just be mastery of the tools themselves but the wisdom to know when their use reinforces versus undermines our intended message. The question isn’t whether we can generate video with AI, but whether doing so serves our identity and relational objectives.
Ethan Mollick on What Organizations Get Wrong About AI
Individual productivity gains don’t automatically scale.
In his latest piece, “Making AI Work: Leadership, Lab, and Crowd,” Ethan Mollick asks why impressive individual productivity gains from AI aren’t showing up in organizational results. He offers four facts that capture where we are today. First, workers report AI cuts their task time in half for 41% of their work. Second, adoption jumped from 30% in December 2024 to 40% by April 2025. Third, current AI systems can do transformational things most organizations haven’t even tried. And fourth, companies report only modest gains, with no real impact on wages or hours worked.
The piece examines why this fourth fact seems to diverge from the first three, and how we might solve for the fact that “AI use that boosts individual performance does not naturally translate to improving organizational performance.” Organizations need to innovate, rethink processes, and reconsider work itself. The problem is that nobody has a playbook, because we’re all trying to learn, adapt, and figure this all out in real-time. Mollick proposes a framework that recognizes this reality and suggests that successful organizational adoption of AI requires three interconnected elements:
Leadership. Those at the top must go beyond recognizing AI’s importance and instead endeavor to create and share vivid pictures of workers’ AI-powered future. They must also make it safe to experiment openly: many employees hide their AI use for fear of punishment or job loss, becoming what Mollick calls “Secret Cyborgs.”
The Lab. Experimentation needs a home, and that space should be part of a centralized effort to systematize innovation. Unlike traditional R&D, The Lab must explore the future while also delivering immediate value.
The Crowd. Employees solving real problems with AI make up the final element of the framework. These experienced workers can assess AI’s usefulness through trial and error in ways outsiders cannot.
According to Mollick, these three elements create a feedback loop that can accelerate organizational learning, and, hopefully, begin to close the gap between individual and organizational gains. The Crowd discovers what works in practice, The Lab turns those discoveries into scalable solutions while pushing into new territory, and Leadership ensures people feel safe experimenting and sharing what they learn. Mollick warns that when one element falters, the whole system breaks down. Without Leadership’s vision and incentives, workers keep their AI use secret. Without The Lab systematizing discoveries, individual breakthroughs stay isolated. Without The Crowd’s real-world experimentation, The Lab builds solutions nobody needs. The organizations getting this right are treating all three as equally essential.
We spend a lot of time helping our clients solve this problem, and Mollick’s framework resonates with our experience of what works well. That said, on the notion that companies report only modest gains, with no real impact on wages or hours worked: at least some of us are working side-by-side with clients as they implement generative AI in their leadership development or communication functions almost every day now. And while we see the same level of use (secret or not) that Mollick cites, we often see a much simpler explanation for the modest gains to date: those using generative AI are using it in very basic, almost rudimentary ways.
There is a significant difference between using ChatGPT or Copilot to rewrite emails or the occasional paragraph, and building a Claude ghostwriter agent that knows the corporate style guide, can mimic the writing style of eight different executives, and has access to the corporate strategy and all recent leadership speeches, then using that agent to ghostwrite first drafts of all leadership communication. The former probably produces no efficiency gain at all, while the latter saves hours every week. Some of this is just about time with the tools, and the learning curve of those using them.
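For those curious what such an agent can look like under the hood, here is a minimal sketch using Anthropic’s Python SDK. The file names, grounding materials, and model string are our own illustrative assumptions, not a prescription for how to build one:

```python
# A minimal ghostwriter-agent sketch using Anthropic's Python SDK.
# File names, grounding materials, and the model string are illustrative
# assumptions -- swap in your own documents and the current model version.
import anthropic
from pathlib import Path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ground the agent in the organization's own materials (hypothetical files).
style_guide = Path("style_guide.md").read_text()
strategy = Path("corporate_strategy.md").read_text()
voice_samples = Path("ceo_writing_samples.md").read_text()

SYSTEM_PROMPT = f"""You are a ghostwriter for our CEO.
Follow this corporate style guide exactly:

{style_guide}

Stay aligned with this corporate strategy:

{strategy}

Match the voice and cadence of these writing samples:

{voice_samples}"""


def draft(request: str) -> str:
    """Return a first draft of a leadership communication."""
    response = client.messages.create(
        model="claude-opus-4-20250514",  # assumed model string; check current docs
        max_tokens=2000,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": request}],
    )
    return response.content[0].text


if __name__ == "__main__":
    print(draft("Draft a 300-word all-staff note announcing the Q3 town hall."))
```

The design choice that matters here is grounding: the system prompt carries the style guide, strategy, and writing samples, so every draft starts from the organization’s own voice rather than the model’s defaults.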
Paper: Superhuman AI Medical Diagnosis
New research shows OpenAI’s o1 model outperforming hundreds of physicians on complex cases.
As a test of Claude Opus 4, we decided to have it write this post from start to finish. Here’s the exchange (we added the live hyperlinks) and Claude’s post, which we have not edited. What you see below is verbatim from the model. Time to complete? 30 seconds. We fact-checked the statistics against the original paper; they are correct.
Good morning, Claude. We’re going to include a summary of a recent paper in our weekly newsletter on generative AI, Confluence. You can get a flavor for Confluence at http://craai.substack.com. You might read a few issues to get the writing style. Once you have, please write a four paragraph summary of this recent research paper:
https://arxiv.org/pdf/2412.10849
Make the point that even for very specialized fields, large language models are showing remarkable levels of performance. We will list you as the author for this Confluence item, so you get proper credit. Ok ... go, and give it your all!
[Claude then accessed the internet, read a host of prior Confluence editions, created a style guide based on what it read, read the journal paper online, and composed the following — again, all in 30 seconds.]
Here’s a four-paragraph summary of the research paper for Confluence:
For more than six decades, complex medical diagnostic cases have served as the gold standard for evaluating expert computing systems—a benchmark established by Ledley and Lusted in their seminal 1959 paper. This week, a team of researchers from Harvard, Stanford, and other leading institutions reported that OpenAI’s o1 model has not only met this historic challenge but achieved what they term “superhuman performance” in medical reasoning. The implications got our attention immediately: if generative AI can outperform board-certified physicians in one of the most nuanced and specialized domains of human expertise, what other professional bastions might soon see similar disruptions?
The researchers didn’t settle for a narrow evaluation. They tested o1 across five distinct medical reasoning tasks: differential diagnosis generation, diagnostic reasoning display, triage decisions, probabilistic reasoning, and management planning. In every single experiment, o1 outperformed both its predecessor GPT-4 and hundreds of expert physicians. On the prestigious New England Journal of Medicine clinicopathological conferences—cases so complex they’ve stumped diagnostic systems for decades—o1 correctly identified the diagnosis in 78.3% of cases and achieved near-perfect scores on clinical reasoning documentation. Perhaps most striking: when suggesting next diagnostic steps, the model selected the correct test 87.5% of the time, with another 11% of suggestions deemed helpful by evaluating physicians.
But here’s where the research moves from impressive to potentially transformative: the team also tested o1 on real emergency room cases at Beth Israel Deaconess Medical Center in Boston. Using actual, unstructured clinical data from 79 randomly selected patients, they compared o1’s diagnostic abilities to both GPT-4o and two attending physicians at three critical touchpoints—initial triage, physician evaluation, and hospital admission. The results were remarkable. During initial triage, when information is most limited and decisions most urgent, o1 identified the correct diagnosis 65.8% of the time, compared to 54.4% and 48.1% for the two physicians. The evaluating physicians, blinded to whether differentials came from AI or humans, couldn’t tell the difference—guessing correctly only 14.8% and 2.7% of the time.
These findings suggest we’ve reached an inflection point that extends far beyond medicine. When AI systems begin outperforming human experts in domains requiring years of specialized training, nuanced judgment, and complex reasoning, we must reconsider our assumptions about the unique value of human expertise. The researchers appropriately call for prospective trials before clinical deployment, but the broader message for leaders is clear: even the most specialized professional knowledge—the kind we’ve long considered irreplaceably human—is now within AI’s reach. For those of us advising organizations on AI strategy, this study reinforces a critical point: the question is no longer whether AI will transform specialized fields, but how quickly organizations must adapt to remain competitive when it does.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
¹ Those curious about prompt design may be interested in the prompt the website sends to Claude for one of the pieces of art. It offers insight into how the pros at Anthropic prompt Claude:
<remix_instructions>The following is an artifact created from an earlier conversation. The artifact is a piece of art produced for a website by Rick Rubin and Anthropic. The website contains chapters written by Rick about the spirit of "vibe coding", and the text is based on Lao Tsu's Tao Te Ching. The many pieces of art on the website represent themes such as paradox, duality, fluidity, emptiness, space, humility, source, balance, and harmony. IMPORTANT: Please un-minify the code before printing anything to ensure it is human-readable. You must also create TWO artifacts in sequence: 1. First artifact - exact reproduction: It is critical to begin the conversation by duplicating the original content exactly as provided, with no modifications whatsoever. Reproducing the artifact without changes will render it in the UI and allow the user to interact with it. 2. Second artifact - creative Remix: Only after the exact reproduction has been created, then create your creative remix/modifications of the artifact. To maintain functionality and usefulness, any changes and remixes created in this conversation should maintain the full content each time without omission (e.g. Do not leave comments like "// Rest of the code here..."). After the second, modified artifact has been created, please say the following sentence verbatim: "Here's the first iteration of your artifact. Keep experimenting and have fun! If you'd like to have your artifact published in The Way of Code online gallery in the future, please fill out this form: https://clau.de/way-of-code." Please ensure there are sufficient line breaks before and after this sentence, so that it has room. Any remixes of the artifact should be creative but stay close to the source code to ensure the artifact continues functioning as intended.</remix_instructions>
<artifact> ... lots of code here ... </artifact>
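For a sense of what that round trip might look like in code, here is a minimal sketch of sending a remix request through Anthropic’s Python SDK, assuming the site simply wraps the instructions and the artifact in a single user message. The model string, the placeholders, and where the user’s sentence lands are our guesses, not the site’s actual implementation:

```python
# A minimal sketch of the round trip the site appears to make: the remix
# instructions and the current artifact travel together in one user message.
# The model string and the placement of the user's sentence are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

remix_instructions = "..."  # the <remix_instructions> text shown above
artifact_code = "..."       # the current piece of art, as un-minified code
user_sentence = "Make it spin faster"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model string; check current docs
    max_tokens=8000,
    messages=[{
        "role": "user",
        "content": (
            f"<remix_instructions>{remix_instructions}</remix_instructions>\n"
            f"<artifact>{artifact_code}</artifact>\n"
            f"{user_sentence}"
        ),
    }],
)

# Per the instructions above, the reply holds two artifacts in sequence:
# an exact reproduction of the original, then the creative remix.
print(response.content[0].text)
```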
² For what it’s worth, here are ChatGPT o3’s and Gemini Advanced 2.5’s responses to the same prompt:
ChatGPT o3
The core message to deliver: “We’re keeping the freedom you earned—but we’re adding two in-person days to keep our community strong and our mission thriving.”
How that message breaks down
Celebrate what remote has proven – “Leadership trusts the way we’ve delivered through crises and upheaval.”
Explain the problem simply – “Even high-performing teams need real-life connection to pass on know-how, onboard newcomers, and kindle culture.”
Frame the change as evolution, not reversal – “We’re blending the best of remote and in-person—not rewinding to 2019.”
Anchor it to the mission – “Face-to-face time protects the long-term impact of our work for the public.”
Show humility & openness – “We don’t have every data point. This is a good-faith step we’ll measure and refine with you.”
Offer tangible commitments – “When you come in, the time will be purposeful, meeting-light, and supported (transit benefits, quiet space, modern tools).”
Reaffirm trust & flexibility – “Outside those two anchor days, you still control how and where you work.”
One-sentence “north star”
“We owe the public our best work and each other our best workplace; two intentional days together every week are how we safeguard both—without giving up the flexibility that has made us stronger.”
Tone & delivery cues for the CEO
Confident but candid – admit the evidence gap and invite partnership in collecting it.
Mission-first – tie in-office days to the agency’s service to the nation, not to optics.
Future-oriented – describe the hybrid model as how modern federal service looks, not a temporary fix.
Empathetic – name the real costs (commute, caregiving logistics) and the supports being added.
Reciprocal – position the change as leadership’s ask, balanced by concrete gives: lighter on-site meeting load, mentorship programs, better office amenities, continued remote freedom on other days.
Iterative – commit to a six-month review with transparent metrics (engagement surveys, onboarding feedback, collaboration indicators) and adjust if the benefits don’t materialize.
Deliver this message, then open the floor. Sophisticated employees won’t expect perfect answers, but they will respect a clear rationale, shared purpose, and a promise to learn together.
Gemini Advanced 2.5
CEO's Message: Reconnecting for Our Mission and Our People
Washington D.C. - The CEO of a prominent federal agency will address employees tomorrow at a town hall, outlining a new policy requiring a return to the office for two days a week. This shift comes after an extended period of maximum flexibility, where most staff have worked remotely since the COVID-19 pandemic.
The core message will center on re-establishing and strengthening in-person connections to benefit the agency's culture, the development of its people, and ultimately, the fulfillment of its mission.
While acknowledging the success and productivity demonstrated by employees during the flexible work period—and reiterating the long-held belief that "work is something people create, not a place where they sit"—the CEO will express a growing conviction that the intangible aspects of the workplace are being impacted.
The message will emphasize:
Nurturing Our Culture and Relationships: The agency believes that consistent in-person interaction is vital for maintaining and enriching the unique culture that binds the team and supports its collaborative spirit. Spontaneous interactions, informal discussions, and shared experiences in the office are seen as crucial for strengthening relationships that may have become more transactional in a predominantly remote environment.
Investing in Our People—Especially New and Junior Staff: Leadership is concerned about the integration and development of new employees and the growth of junior staff. The direct mentorship, informal learning, and osmosis of experience that occur more naturally in an office setting are considered essential for career development and effective onboarding into the agency's complex work and unwritten norms.
The "Why" Behind the Change: The decision is driven by a desire to foster a more connected, collaborative, and supportive environment, which is believed to be critical for long-term success and innovation. It is not a reflection on employee productivity or commitment, which remains highly valued.
Understanding and Acknowledgment: The CEO will acknowledge that this is a significant change and that the flexibility of the past few years has been valued. The message will convey an understanding of the potential impacts on work-life balance and commute concerns for the Washington D.C. based workforce.
A Phased and Mindful Approach: While the requirement is firm, the agency aims to implement this change thoughtfully, understanding that adjustments will be needed.
Commitment to Mission: The ultimate goal is to enhance the agency's ability to deliver on its critical mission for the American people. The belief is that a stronger internal culture and better-supported team members will translate into even more effective public service.
The CEO is expected to frame this decision not as a retreat from modern work practices, but as a recalibration to ensure the enduring strength of the agency's community and its capacity to develop talent from within, even in the absence of specific data points to quantify these concerns. The focus will be on the qualitative benefits of in-person collaboration for a workforce known for its intelligence, independence, and dedication.