Confluence for 6.7.2026
The gap between writing and shipping. The latest from Microsoft. Navigating without a map. AI's judgment is improving.
Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
The Gap Between Writing and Shipping
The Latest from Microsoft
Navigating Without a Map
AI’s Judgment Is Improving
The Gap Between Writing and Shipping
Finding where the real bottlenecks are and deciding what to do about them.
If you pay attention to the AI conversation, you’ve likely heard a new term in the last few months: token-maxxing. Software engineers have started to use as many tokens as possible to write as much code as possible. It’s a badge of honor. Some companies, including Meta and OpenAI, post leaderboards highlighting the engineers using generative AI the most.
Uber is a telling case. The company pushed its engineers to use AI as much as possible, ranked their usage on internal leaderboards, and watched adoption of Claude Code climb. The result: CTO Praveen Neppalli Naga recently revealed that Uber had burned through its entire 2026 AI budget in just four months and is now capping engineers’ token spend. You’d imagine that with that kind of usage and spend, there would be plenty to show for it.
Not quite.
A working paper from the National Bureau of Economic Research helps explain why. The authors studied more than 100,000 developers on GitHub, looking at their work from the earliest days of AI coding tools to understand how much generative AI speeds up specific tasks and how much of that gain reaches final output. The most striking finding? The tools produce dramatically more code while barely changing how much finished software actually ships. Developers using sync agents, the tools that draft and revise code alongside a person in real time while the human supervises and approves, wrote 741% more lines of code. Yet the software they shipped rose by only about 20%. And the same gap holds across every way of working the authors studied, down to the simple autocomplete that suggests the next line as you type.
Higher productivity on a task doesn’t necessarily mean more finished output. Just because AI writes your company’s earnings report in five minutes doesn’t mean it speeds up the review, the legal sign-off, and the iteration that gets it out the door. Likewise, being able to produce fifty newsletters in the time it used to take to write one doesn’t mean you’ll send fifty. The path toward any product runs through many discrete tasks, some of which generative AI can speed up and some of which it can’t. Trace any process and you’ll find the bottlenecks that determine how fast work can move through it.
It’s worth sharing this to manage expectations. Producing written work faster won’t bring a matching jump in final output. But it raises another question. What do we do about the bottlenecks?
We see two options. The first is to re-evaluate the bottlenecks we already have. Does this one serve a real purpose, or is it a holdover from a constraint we should try to remove? Our presumption isn’t that bottlenecks are bad. Many of them likely prevent mistakes we’d otherwise have to clean up. But it’s worth understanding what each one is and why it exists, and generative AI gives us the reason to look.
The second option is harder, and it may be where teams and leaders increasingly look to go. Rather than speeding up the steps inside a process, we redesign the process itself to account for what generative AI can now do. That takes real work. Leaders will need to make choices about tradeoffs and where to keep or even increase friction. We expect more organizations and leaders to take this path as more capable generative AI spreads inside their walls. But making smart choices about how to redesign our work means understanding both the technology and the human factors of working alongside it.
The Latest from Microsoft
Microsoft goes independent, and agents go mainstream.
At last week’s Build conference, Microsoft made a slew of announcements. As expected, most focused on the company’s AI efforts. Given Microsoft’s role in the industry, the conference serves as a useful proxy for where enterprise AI is headed. Two announcements stood out to us: the company’s release of its own generative AI models, and its continued push into autonomous AI agents.
Microsoft released seven of its own in-house models, headlined by MAI-Thinking-1, Microsoft’s first dedicated reasoning model. Until now, the models available within Microsoft’s tools, including Copilot, were developed by other companies, most notably OpenAI. In the past few months, Microsoft has signaled a shift away from its exclusive partnership with OpenAI, including through an expanded partnership with Anthropic to bring Claude models into Copilot and develop Cowork for Copilot. Last week’s announcement signals that Microsoft is now aspiring to join the ranks of the frontier developers on its own. As for the models, Microsoft claims that they reach rough parity with Anthropic’s previous-generation models like Claude Sonnet 4.6, though those benchmarks come largely from Microsoft’s own evaluations and have not been independently verified. How well Microsoft’s models compete is worth watching, but the bigger takeaway is that Microsoft’s era of single-vendor dependence is officially over, and organizations can expect a growing menu of models in Microsoft’s tools. Some of those models will be Microsoft’s own.
Our second big takeaway from the Build announcements is Microsoft’s continued push into AI agents. Much of Build was organized around an “agent-first” concept of software that does real work rather than waiting to be operated by a user. For most of us up until now, running a capable autonomous agent has meant using something like OpenClaw, which places a significant burden for configuration, maintenance, and risk assumption on the individual user. This has largely kept agents in the hands of the early adopters who are willing to put in the time and take on the risk to make the leap. Microsoft is working to build similar capabilities into its standard AI infrastructure, with the governance and security controls that enterprises expect. The biggest announcement on this front was Scout, a personal work agent that lives inside Teams and Outlook. But Microsoft’s materials also note that the company is “invested in continuing to make OpenClaw run securely on Windows.”
It’s still early for both of these developments. We expect that Microsoft’s in-house models will trail the true frontier for some time, and its agent offerings likely will not match the most capable tools available right now. But the direction they both signal is more important. The days of Microsoft’s AI offering being essentially a chatbot powered by OpenAI’s models are over. We can expect a wider range of models, more of them Microsoft’s own, and a steady expansion of agentic capabilities woven directly into the tools that people already use.
That last shift deserves particular attention. Most of the training and workflow design organizations have done to date is based on the chatbot paradigm of the past three years: a person prompting a model, reviewing its response, and doing something with it. Agents that take action on their own will demand a broader set of skills, expectations, and guardrails, and that adjustment will have its own learning curve. The scale of the shift will match the scale of Microsoft’s footprint in organizations, and it’s a shift to start getting in front of now.
Navigating Without a Map
How should we make decisions when we don’t know where we’re going?
Beneath every AI decision a leader makes lies an uncomfortable truth: no one knows the right way to do this. The labs are releasing models faster than anyone can absorb them, the researchers are scrambling to study the effects in real time, the executives are selling tools they can’t fully vouch for, and the institutions are buying them on faith. We are all figuring it out in real time, revising our views as the ground shifts, trying to solve problems we don’t quite understand. So the practical question is not “what is the correct AI strategy?” There isn’t one yet. The real question is how to make consequential decisions, and smart investments, about where to go when the destination is unknown.
The California State University system offers an example of one way to get this wrong. As Linda Kinstler reports in The New York Times Magazine, CSU spent $16.9 million to make itself “the nation’s first and largest A.I.-powered public university system,” releasing 500,000 ChatGPT licenses across its campuses (in the middle of a $2.3 billion deficit and alongside faculty layoffs and a tuition increase). CSU navigated by its destination — a confident vision of the AI-ready university and the AI-ready graduate — and the destination turned out to be a mirage. Only about half of the licenses were ever activated; some professors banned the tools outright while others required them, leaving students unsure of the expectation.
The easy reading of this story is “be cautious,” but that’s an oversimplification. In moments of uncertainty, action and inaction are both bets. A massive rollout with uneven uptake is a visible way to fail. Sitting out a technology shift this consequential is a quieter way to fail, and probably the costlier one. It makes sense that CSU bet on the rollout, but its fatal mistake was navigating by destination.
Instead of being driven only by the destination, organizations should navigate by what they should already hold fixed: their principles and culture. You don’t need an accurate forecast of the 2030 labor market to judge whether a decision fits what your organization stands for. CSU’s deeper struggle, in Kinstler’s telling, was that the technology exposed an unanswered question about the institution’s own purpose, and so “first and largest” became the goal by default. A leader grounded in mission can ask a more focused question than “where is all of this going?” They can ask “how does this serve who we are?” That kind of decision rests on a few things: clarity about what your organization stands for; input from the people doing the work, who usually know best where these tools earn their keep; and a commitment to protecting your people’s learning and judgment instead of quietly outsourcing it. None of these depend on knowing the future.
Principles tell you what to decide from. Acting wisely means deciding in a way that stays adaptable, which is exactly where CSU stumbled. A $16.9 million systemwide purchase is a hard bet to walk back, and half those licenses went unused. The lesson is to match the size of a commitment to your actual confidence, favor decisions you can reverse, and pilot before you scale. It also means revisiting your decisions on a regular cadence, since the ground that makes a decision wise this quarter may shift by the next. CSU changed its commitment going into the second year: a shorter horizon, annual review, and openness to more than one vendor. The right instincts, a year late. CSU tells a cautionary tale of the cost of navigating by a destination instead of a principled foundation. The practical test is a short one: Does this decision fit who we are? Have the people doing the work weighed in? Can we walk it back if we’re wrong? The destination will keep changing, but how you make decisions doesn’t have to.
AI’s Judgment Is Improving
A new study on AI’s legal reasoning skills shows it can hold its own on ambiguous, judgment-dependent problems.
A recent Stanford-led study tested whether AI can answer complex legal-reasoning questions common to first-year law courses. The findings suggest AI is getting better and better at tackling abstract problems that require deep professional expertise and judgment.
The study’s authors argue that law school offers a uniquely demanding test for AI as an educational tool because the legal profession is so judgment-dependent. Lawyers need to know the objective details of the law, but they also need to interpret them against a professional standard that is often ambiguous. A strong legal education will teach and model that judgment, and any worthwhile AI tutor or legal assistant would have to do the same.
To see how AI measures up, the researchers recruited sixteen contract law professors from fourteen US law schools to write questions representative of what they hear from students in office hours. The professors wrote their own short answers, and Gemini 2.5 Pro and NotebookLM answered the same questions. The professors then judged their colleagues’ answers against the models’ in a blind comparison. The LLMs won decisively. The professors preferred Gemini’s answer 76% of the time and NotebookLM’s 75% of the time across all categories of questions, from basic fact-recall to open-ended hypotheticals and policy situations with no clear right answer. The models’ answers were also flagged as “harmful” 3.5% of the time versus 12% for the professors. The pattern held across every judge: each professor, on average, preferred the AI answers, suggesting the models were tracking a real, if unstated, professional standard rather than the taste of a few evaluators. These results came from AI models that are already outdated and that had no specialized guidance or standards of excellence built in (of the kind you would get from a well-made Claude Skill). When the authors simulated the study with newer models like Claude Opus 4.7 and ChatGPT 5.4, the gap between human and LLM responses was even greater.
This is a small study focused on a limited area of expertise, but it does show clear evidence for AI’s ability to reason through problems we tend to label as requiring human taste and judgment. Since this newsletter’s earliest days, we have argued that expertise-specific judgment is what will matter most for professionals in the age of AI. This study does not overturn that, nor does it prove that future models will match or exceed human professional judgment in all situations all the time. But it is an early signal that AI may get better at these higher-level skills faster than we might like to believe.
What that looks like, and what it means, will play out over time. For now, the study is a good example of what we might call AI’s double-sidedness—the fact that its ability to help or hinder often depends on the role we ask it to play. On the one hand, a law student could use AI to answer questions for them, limiting their own development and replacing the live, relational debate they might have with a professor in office hours. Or they could use it to do the opposite, to act as a supplemental tutor when no professor is around. But that requires the student to use AI in a fundamentally different way. Rather than simply asking for an answer, they’d need to ask it to review their work, push their thinking, build a quiz, grade their responses. AI is strange that way: it almost always offers a medicine for its own poison, at least where learning and skill development are concerned. What matters is how you use it, and whether you choose to add the friction back in. More often than not, that is the mark of an advanced user. And for those users, AI’s growing ability to take on judgment-rich tasks makes it, for now at least, a tool that will continue to put them ahead.
We’ll leave you with something cool: Anthropic is doubling Claude Cowork usage limits on paid accounts this month. If you haven’t yet tried it, now’s an excellent time.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
