Confluence for 10.26.25
The shift from doing to reviewing. AI models are approaching expert-level work; now what? Generative AI and entry-level work. An updated guide to choosing AI tools.

Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
The Shift from Doing to Reviewing
AI Models Are Approaching Expert-Level Work. Now What?
Generative AI and Entry-Level Work
An Updated Guide to Choosing AI Tools
The Shift from Doing to Reviewing
One consequence of creating (near) instant content.
Among the most obvious advantages of working with generative AI is that it produces content quickly. For short pieces, delivery is nearly instant. With growing intelligence and capability, reasoning models, and tools such as Claude Code, AI can create massive, complex pieces of content in a fraction of the time humans need. See what we wrote about our experience with Claude Code for evidence.
Using generative AI to produce larger bodies of work introduces a dilemma for those depending on the output. Creating a 65-page strategy document with AI agents feels like magic, especially when the quality appears high at first glance. But there’s a catch: if you’re going to use it for real work, someone needs to take the time to review it in detail. We’ve written about these risks before in discussing the wizard problem, and a nuance of that problem has come up in recent client conversations.
When we write or produce a piece of content without generative AI, the production itself serves as part of the review process. We make decision after decision, forming judgments and evaluating the quality of ideas as we go. We choose which data to pull in and which to exclude, and where to source it. When we use generative AI, we forgo those choices. When we review the output, we’re seeing and evaluating those choices for the first time. Evaluating a brand-new decision, sentence, or fact takes more time, energy, and attention than checking something you’ve already considered. The consequence is that the review and quality check not only become more important, they also take more time and effort than reviewing content produced by you or another person whose judgment you trust.
The world we need to ready ourselves for is one where the balance of labor shifts meaningfully toward review over production. Consider that 65-page strategy document again. A person or team might need something like 100 hours to produce it. But during those hours, they would perform hundreds of smaller checks, evaluating sources, testing arguments, and refining logic, so the work of final review would be straightforward. Perhaps another five hours to polish and validate, bringing the total to around 105 hours.
When AI produces the same document in 45 minutes, the labor calculus changes. The reviewers now encounter each choice for the first time. They must verify claims, evaluate arguments, and check sources they haven’t necessarily seen before. What would have been quick gut checks during production become deliberate evaluations during review. That 45-minute draft could easily demand 15 to 20 hours of rigorous review to reach the same level of confidence.
The time savings are still real: a rough reduction from 105 hours to under 21 represents meaningful efficiency. But we can’t treat AI-produced content as production time saved without review time added. The work doesn’t disappear. It shifts. And understanding where it shifts, and what that shift demands of us, determines whether we fall asleep at the wheel. It also raises questions about which tasks we should continue doing ourselves, preserving the capabilities we’ll need when reviewing AI output.
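To make the arithmetic concrete, here is a minimal sketch using only the illustrative figures from this section (the hours are scenario assumptions, not measurements):

```python
# Back-of-the-envelope arithmetic for the scenario above. All figures
# are the illustrative assumptions from this section, not measured data.

human_production_hours = 100   # team drafts the 65-page document itself
human_review_hours = 5         # light polish: most checks happened during production

ai_production_hours = 0.75     # ~45-minute AI draft
ai_review_hours = 20           # upper end of the 15-20 hour rigorous review

human_total = human_production_hours + human_review_hours   # 105 hours
ai_total = ai_production_hours + ai_review_hours            # 20.75 hours

print(f"Human workflow: {human_total} hours total")
print(f"AI workflow:    {ai_total} hours total")
print(f"Net savings:    {human_total / ai_total:.1f}x")

# The share of total effort spent on review flips dramatically.
print(f"Review share, human workflow: {human_review_hours / human_total:.0%}")
print(f"Review share, AI workflow:    {ai_review_hours / ai_total:.0%}")
```

The savings are real (roughly 5x in this scenario), but review goes from about 5% of the effort to about 96% of it. That inversion is the shift from doing to reviewing.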
AI Models Are Approaching Expert-Level Work. Now What?
New research shows frontier AI matching human professionals on nearly half of real-world tasks.
Anthropic’s Claude generative AI researched, wrote, edited, and fact-checked this article with no human intervention or editing.
OpenAI released a new benchmark last month called GDPval, and while benchmarks rarely warrant attention from business leaders, this one does. Rather than measuring AI on academic puzzles or narrow technical tasks, GDPval evaluates frontier models on actual work from experienced professionals across 44 occupations in the nine largest sectors of the US economy. The researchers recruited industry experts with an average of 14 years of experience to create tasks based on their real work—deliverables that take an average of seven hours to complete and carry specific dollar values based on market wages. When human experts compared AI outputs to human work in blind tests, Claude Opus 4.1 performed best, with nearly 48% of its deliverables rated as good as or better than expert work. This isn’t narrow capability in coding or customer service. This is approaching parity with experts across legal briefs, financial analyses, engineering designs, film edits, and healthcare management plans.
The speed and cost implications are tangible. The researchers modeled several scenarios for incorporating AI into expert workflows and found that even conservative approaches show meaningful savings. In a “try the model, review it, fix it yourself if needed” workflow, models delivered time savings of 1.08x to 1.39x and cost savings of 1.13x to 1.63x compared to unaided experts. These aren’t dramatic multiples, but they represent real efficiency on high-value knowledge work. What’s more revealing is how the models failed. Only 3% of failures were catastrophic errors—the kind that would damage client relationships or create legal exposure. Instead, 44% were rated “acceptable but subpar,” work that could be used but lacked the polish or depth of expert output. This matters because it suggests that with appropriate oversight, a substantial portion of knowledge work has entered the realm of “good enough,” and the threshold for “good enough” keeps rising.
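Those multiples are ratios of unaided time (or cost) to AI-assisted time: a 1.39x time saving means work that took an expert seven hours now takes about five. A quick hedged illustration, using GDPval’s reported seven-hour average task length as the baseline and assumed AI-assisted times chosen to fall inside the reported range:

```python
# Interpreting "time savings of 1.08x to 1.39x" as a ratio of unaided
# time to AI-assisted time. The 7-hour baseline is GDPval's reported
# average task length; the assisted times below are assumptions chosen
# to land within the reported range.

unaided_hours = 7.0

for assisted_hours in (6.5, 6.0, 5.5):
    speedup = unaided_hours / assisted_hours
    print(f"{assisted_hours:.1f} assisted hours -> {speedup:.2f}x time savings")
```

Modest per task, but on seven-hour deliverables even the low end of that range compounds quickly across a team’s workload.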
The GDPval results strengthen the case that large language models are general purpose technologies—rare innovations that reshape economies across sectors. General purpose technologies share three traits: continuous improvement over time, applicability across diverse domains, and the ability to spawn complementary innovations. GDPval demonstrates the first two clearly. Performance on these real-world tasks has improved roughly linearly over time, and the occupational breadth—from mechanical engineering to film editing to legal analysis—shows genuine cross-domain capability, not narrow specialization. The third trait, spawning complementary innovations, is already visible in the proliferation of AI-powered tools and entirely new workflows that didn’t exist two years ago.
If AI models are general purpose technologies, leaders should prepare for the pattern that played out with electricity, computers, and the internet. The gap between technical capability and economic impact spans years, sometimes decades—not because the technology isn’t ready, but because organizations need time to develop new processes, restructure workflows, and reimagine roles. GDPval suggests we’ve reached the point where technical capability is ready for broad deployment. The question isn’t whether AI can perform knowledge work at scale—it increasingly can—but whether organizations will integrate this capability thoughtfully. The risk isn’t just failed implementations; it’s that leaders treating this as a simple cost-cutting exercise will miss the larger opportunity to fundamentally rethink how knowledge work gets done. That’s the real challenge, and it’s one no benchmark can measure.
Generative AI and Entry-Level Work
OpenAI’s efforts on entry-level finance work are a harbinger of what’s to come.
As we note (with Claude’s assistance) in the piece above, new benchmarks suggest that today’s leading generative AI models are increasingly capable of producing expert-level work across domains. This week, there was also news on the entry-level side of the spectrum, with Fortune reporting that “OpenAI is planning to automate away entry-level tasks in finance” and has “enlisted more than 100 ex-investment bankers to help train its AI models on how to build financial models that could automate hours of entry-level tasks.” Such is the unique nature of generative AI: both experts and entry-level employees stand to gain and, potentially, to lose. How organizations decide to handle this technology will determine how that plays out. What looks increasingly likely at this point, though, is that the nature of knowledge work will change at nearly every level of the organization.
Despite the headline (“OpenAI is coming for Wall Street’s grunt workers”), the article paints a nuanced picture. It quotes Shawn DuBravac, economist and CEO of Avrio Institute, extensively, and DuBravac argues that task automation does not necessarily equate to role replacement:
“I’m not convinced that we get rid of entry-level workers anytime soon, but I could imagine a world where the skill set we need those entry-level workers to have is different,” Shawn DuBravac, an economist and CEO of Avrio Institute, a research and consulting firm, told Fortune.
…
“Within the next year, I’d expect firms will move quickly to try to automate 60% to 70% of the time analysts currently spend on these lower-level tasks.”
As routine, mundane tasks are automated and AI is more embedded into junior analysts’ workflows, senior analysts will find more “sophisticated” tasks for them to execute—like building financial models with greater complexity, or performing more quantitative analysis, which are skills they might usually learn further along in their careers—he added.
The key phrase above is “the time analysts currently spend on these lower-level tasks.” Automating certain tasks to reduce the time spent on them by 60-70% is not the same thing as automating 60-70% of roles. We agree with DuBravac’s view that this can create as much opportunity as it removes for entry-level employees, allowing them to take on more sophisticated, complex work more quickly while working with AI to master the fundamentals.
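A small hypothetical makes the distinction plain. Suppose (our assumption for illustration, not a figure from the article) that lower-level tasks occupy 40% of an analyst’s week:

```python
# Task-time automation vs. role automation, with hypothetical numbers.
# The 40% share of the week spent on lower-level tasks is our assumption
# for illustration; the 60-70% figures are DuBravac's projection quoted above.

lower_level_share = 0.40  # hypothetical share of an analyst's week

for automated_fraction in (0.60, 0.70):
    week_freed = lower_level_share * automated_fraction
    print(f"Automating {automated_fraction:.0%} of lower-level task time "
          f"frees {week_freed:.0%} of the week")

# Output: 24% and 28% of the week is freed for higher-value work.
# That is a reallocation of time, not the elimination of 60-70% of roles.
```

The freed time is where the more sophisticated work DuBravac describes comes in.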
On the topic of enhancing junior employee capabilities, the article quotes Ram Srinivasan, managing director of consulting at JLL: “AI will give every analyst superpowers and allow banks to compound human insight.” And then, in a point that echoes the one we make in this edition’s lead item, “Analysts become reviewers and customizers rather than builders from scratch, allowing each person to support more deals simultaneously.” We agree. Generative AI can empower junior employees not just to become proficient more quickly, but simply to do more.
OpenAI’s work right now is focused on finance, but we have little doubt that similar efforts will come for every domain. Generative AI will change entry-level work in every function. There is no right answer for how to manage this, but there is likely a wrong one: ignoring that generative AI will reshape work at every level. For entry-level work, the most important question is “What is entry-level work in this organization for?” The answer to that question should inform what that work should look like now and in the future, including how generative AI is integrated into it.
An Updated Guide to Choosing AI Tools
Ethan Mollick’s latest is a good overview of what to consider across the shifting landscape.
This week, Ethan Mollick published another iteration of his regular “Opinionated Guide to Using AI Right Now.” Given the pace of development we’ve been tracking in Confluence, the timing is useful. The landscape keeps shifting, and the question of which tool to use remains worth revisiting. Mollick’s latest core guidance is straightforward: For casual users, the free versions of ChatGPT, Gemini, or Claude will do the job. For regular users, spend the $20 monthly to access the paid version of your choice.
What we find most valuable in the guide is the breakdown of model types within each platform. Chat models, the default interface, answer quickly and work for conversation. Agent models take longer but can handle multiple steps autonomously, executing web searches, creating code, or producing documents. Wizard models (GPT-5 Pro, Gemini Deep Think) tackle complex academic tasks but require significant time. Mollick’s recommendation: use agent models for work that matters. They’re more consistent and less prone to errors.
This distinction matters particularly for organizations deploying in-house tools with multiple models from different labs. As we wrote two weeks ago, when faced with options like “Llama-3.3-70B-Instruct” or “Mistral-medium-2508,” it’s easy to feel overwhelmed. Mollick’s guidance provides a clearer way to think about model selection: understand what type of work you’re doing, then choose accordingly. For most professional tasks that require quality and consistency, agent models are the right starting point.
Two features stand out for improving results. First, Deep Research mode conducts 10-15 minutes of web research across hundreds of sources before answering, producing reports that seem to hold up under scrutiny. Second, data connectors integrate with your Gmail, SharePoint, and other tools, enabling queries like “give me a detailed briefing for my day.” Claude recently rolled out its Microsoft 365 Connector, for example, and we’ve been impressed by its performance so far.
Mollick’s guide also addresses recurring concerns. Hallucinations persist across all models, though they’re less frequent in advanced versions. When you need honest feedback, tell the AI to act as a critic to avoid sophisticated sycophancy. And, lastly, elaborate prompting techniques don’t seem to help much anymore. These models are good at figuring out what you want, so spend less time crafting your initial prompt and more time in the back-and-forth.
We use Claude Sonnet 4.5 with Extended Thinking for most tasks. It’s fast, capable, and the usage limits work for us. But your specific choice matters less than Mollick’s broader point: pick a system and use it for work that actually matters. The goal isn’t expertise in every tool’s capabilities. It’s building intuition about what’s possible so you’re prepared for a multi-model future.
We’ll leave you with something cool: Anthropic named the winners of a contest to see the coolest things users could build with Sonnet 4.5 in one week.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.