Confluence for 3.9.2025
Another surprising moment. Generative AI's effect on writing skills. New research on the impact of generative AI's advancing capabilities. A brief Copilot explainer.
Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI and corporate communication:
Another Surprising Moment
Generative AI’s Effect on Writing Skills
New Research on the Impact of Generative AI’s Advancing Capabilities
A Brief Copilot Explainer
Another Surprising Moment
A simple question-and-answer with GPT-4.5 gets our attention.
We have written several times about specific moments with generative AI that got our attention, surprising us in ways that left us thinking that in some way the technology (and us with it) had crossed a threshold.
One occurred when we first used ChatGPT 3.5. We weren’t writing Confluence then, but the ability of ChatGPT to create novel content really struck us (and millions of others). The second happened when we first used Anthropic’s Claude 3 Opus model, which prompted us to write our Now Is the Time to Start Paying Attention post. A third came just a few weeks ago, when we first used OpenAI’s Deep Research capability.
And another occurred last night, with a deceptively mundane use of OpenAI’s new GPT-4.5 model.
First, some context on this model. OpenAI released it with a bit of fanfare, and the community of people who follow generative AI had high expectations for what it could do, based on the significant data and compute OpenAI invested in its development. And against those expectations, for many, the model fell flat. Initial reactions were muted, with many not seeing much of an improvement over prior models. But after a week or so, users started noticing differences in what it could do that were difficult to explain. People agreed with OpenAI’s claims that working with GPT-4.5 was more like talking to a person, that the conversations were more natural, that it had a real sense of humor — all of it feedback that is a bit hard to pin down. And then earlier this week, GPT-4.5 rose to the top of the Chatbot Arena leaderboard, a site that compares model outputs in blind head-to-head tests.
Our experience with GPT-4.5 was consistent with this larger arc. Upon first use, we really weren’t that impressed. It didn’t seem to give significantly better answers than other models, and for many things we thought Claude was better. But with more use, we started to think something interesting was going on. It’s intangible, but GPT-4.5 does feel different. It is more like talking with a person, but there’s something else to it. And then yesterday, we had a chat that brought things into focus for us. And the topic is almost ridiculously mundane.
Some of those who write for Confluence are — ahem — of an age where recalling obscure facts is more challenging than it has perhaps been in the past. Yesterday, one of those facts was a favorite actress from several decades past. As author-and-spouse sat on the couch struggling to recall her name, they asked Google:
“sitcom and film actress from the 1990s and 2000s with unusual name, striking”
We got lists of web pages that were lists of actresses, none of which had the person we were trying to remember. Generative AI was our next choice, so we asked Grok 3:
I’m trying to remember the name of an actress from the 1990s and 2000s. She did a sitcom / romcom on TV that got some attention but didn’t last long. She had an unusual name. Striking beauty. She was selective in what she did and wouldn’t just act in anything. I cannot remember her name. I seem to remember she also had an interesting voice.
Grok 3 gave us:
Based on your description, one actress who might fit the bill is Calista Flockhart. She starred in the TV sitcom/romcom Ally McBeal, which aired from 1997 to 2002 and garnered significant attention during its run, though it didn’t have a long lifespan compared to some other shows. Her name, Calista, is fairly unusual, and she was widely recognized for her striking, delicate beauty. Flockhart was also known for being selective with her roles, focusing on projects that suited her rather than taking on everything offered to her. Additionally, her voice has a distinctive, soft, and slightly quirky quality that could be described as interesting. Does that sound like the actress you’re thinking of? If not, feel free to provide more details, and I’ll refine the guess!
Nope. Not her. (And for what it’s worth, in other queries Grok 3 with equal confidence gave us Portia de Rossi and Zooey Deschanel.) So we asked ChatGPT 4o:
That sounds a lot like Paula Marshall or possibly Saffron Burrows—both had striking looks, a unique screen presence, and were selective about their roles.
Paula Marshall was in several short-lived sitcoms and romcom-style TV shows, including Cupid (1998) and Out of Practice (2005). She also had recurring roles in Spin City and Nip/Tuck.
Saffron Burrows had a distinctive voice and an elegant presence, though she was more known for films (Deep Blue Sea, Troy) and later TV (Mozart in the Jungle).
Does either of those names sound right, or should we dig deeper?
Nope. Claude Sonnet 3.7 said:
I might be able to help narrow this down. There were several actresses from that period with unusual names and distinctive voices who were selective about their roles.
Some possibilities that come to mind:
Phoebe Cates (though she was more active in the 80s and early 90s)
Janeane Garofalo (known for her unique voice and selective roles)
Téa Leoni (starred in "The Naked Truth" sitcom)
Calista Flockhart (from "Ally McBeal")
Neve Campbell (more known for films but did TV work)
Poppy Montgomery (had several TV roles)
Eliza Dushku (distinctive voice and selective about roles)
Do any of these sound like who you're thinking of? Or perhaps there are other details about the show or her career that might help jog your memory?
Turns out it’s Téa Leoni (and kudos to you if you guessed it, which you very likely did not). Still curious about how other models would handle the query, we asked GPT-4.5:
You’re probably thinking of Téa Leoni. She had an unusual name, distinctive voice, and striking appearance. Her sitcom The Naked Truth (mid-90s) got attention but didn’t last long. She was known for being selective, eventually starring in films like Deep Impact, Family Man, and Fun with Dick and Jane. Could that be her?
First shot, no alternates1.
So why did this simple, mundane exchange strike us so? The speed, directness, and accuracy of the GPT-4.5 response. When you think about it, the query we provided — I’m trying to remember the name of an actress from the 1990s and 2000s. She did a sitcom / romcom on TV that got some attention but didn’t last long. She had an unusual name. Striking beauty. She was selective in what she did and wouldn’t just act in anything. I cannot remember her name. I seem to remember she also had an interesting voice. — is incredibly broad. Walk up to almost anyone on the street and ask them that question, and they will do just what we did (and what the other models did): struggle, come up with a bunch of names, most or all of which are incorrect.
But not GPT-4.5. One shot, little context, right answer. Now let’s be clear, there is much that GPT-4.5 is not great at. Like all the LLMs, if you ask it to design a leadership message without significant prompting on voice and context, it still produces a lot of corporate pablum. But for us, this insignificant exchange was the first conversation with an LLM where we walked away thinking, “Now that feels like general intelligence.”
We are not suggesting the Téa Leoni Test as the new threshold test for AGI. But earlier this week, when talking with a colleague about the muted reception to GPT-4.5 and its intangible qualities, we noted that as humans our ability to understand is limited by our own powers of perception. It’s why we can’t visualize multiple dimensions, and why we consider mysteries mystical. There are certain realities that we can only understand at the edges, because we simply lack the ability to understand them directly. Perhaps part of why we struggle to see the advancements in GPT-4.5 is that they are now of a character we can’t easily understand or perceive. Like dark matter, we can tell they are there, but we cannot observe them directly. If something unique and new were emergent, could we tell? The ability of GPT-4.5 to so easily, quickly, and accurately answer a question that was befuddling human and machine alike had that feel to us. And it still has us wondering what that might mean.
Generative AI’s Effect on Writing Skills
New research shows how we can use generative AI to become better writers with less effort.
While we remain mindful of how AI can diminish skills and create dependency, a new working paper titled “Learning Not Cheating: AI Assistance Can Enhance Rather Than Hinder Skills Development”2 flips this concern. The research team — which includes Angela Duckworth of Grit fame — from the University of Pennsylvania, Harvard, and Microsoft found something that caught our attention: practicing with AI, or even just viewing AI-produced examples, led to writing that scored significantly higher in quality.
The researchers conducted a randomized controlled experiment where participants wrote short pieces under different conditions. Those who worked with AI produced higher-quality writing, even when later asked to write without assistance. They also expended less effort — measured by time spent, keystrokes, and self-reported difficulty — and the performance gap persisted a full day after the initial interaction.
What makes this finding striking is the contrast with expectations. Before the study, 64.7% of participants believed practicing alone would be more effective than practicing with AI. Once again, our intuitions about technology prove to be poor predictors of reality.
This doesn’t surprise us. Our own experience mirrors these findings. Working with generative AI has made us better writers, not worse. The AI serves as a thought partner that challenges our framing, and the quality of our thinking has improved even as the effort required has decreased.
The study also highlights the value of examples in writing. Seeing a range of approaches expands our conception of what’s possible and provides models that we can learn from, adapt, and integrate into our own work. Examples help us break out of our familiar patterns and demonstrate techniques we might never discover through trial and error alone.
Generative AI dramatically improves our access to these examples through its tirelessness and versatility. We can ask an AI to generate multiple versions of a sentence, paragraph, or entire piece in diverse styles, each representing different approaches or voices. We would never ask a person to generate multiple examples like this — it would be impractical and time-consuming. And searching for existing examples often isn’t an efficient use of time, with no guarantee they’ll be relevant to our specific needs. But AI can accomplish this task in seconds, producing examples precisely aligned with our requirements.
This study serves as a powerful reminder: we often underestimate AI’s ability to make us better. Our mental models often remain stuck in a zero-sum framework where either AI does the work (making us dependent) or we do it ourselves (preserving our skills). The reality is more nuanced — AI can serve as coach and thought partner in ways that enhance rather than replace human capability. The main constraint isn’t technological — it’s our willingness to work with these tools creatively.
New Research on the Impact of Generative AI’s Advancing Capabilities
A new study finds that reasoning models and retrieval augmented generation can improve productivity and quality in legal work.
We first wrote about the “jagged frontier” of generative AI and its implications for professional work in September of 2023. That post came after the publication of “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality,” a working paper that immediately captured our attention and in our view remains the most important study to date on the impact of generative AI on knowledge work. Among other important findings, the study found that when subjects (in this case, a subset of consultants at Boston Consulting Group) used generative AI on tasks for which it was well-suited, they completed 12% more tasks, worked 25% faster, and produced work that was rated by humans as 40% higher quality. This was two years ago, in the spring of 2023. The subjects used GPT-4, which by now has been surpassed in capabilities by dozens of models (and is listed last in ChatGPT’s menu of models, as the “Legacy” model). We’ve been watching for new, similar studies that use more contemporary models. Last week we came across one from the University of Minnesota Law School, which builds on the original “jagged frontier” research in important ways3.
The study, “AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice,” explores the impact of two areas of advancing generative AI capabilities: retrieval augmented generation and reasoning models. Retrieval augmented generation, or RAG, is a technique that equips large language models with the ability to access and retrieve relevant information from external knowledge sources before generating responses. Reasoning models (about which we’ve written several times, most recently two weeks ago) devote more computation at inference time to reasoning systematically before generating a response. To test the impact of both RAG and reasoning models, the researchers assigned law students to complete six realistic legal tasks using either Vincent AI (a legal tool powered by RAG, integrating specific legal databases to ground its output in authoritative sources), o1-preview (a reasoning model from OpenAI), or no AI assistance at all.
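For readers who want to see the RAG pattern in miniature, here is a minimal sketch of the retrieve-then-generate flow. The retriever here is a toy keyword-overlap ranker and the prompt assembly is simplified; real systems like the Vincent AI tool in the study use vector embeddings, legal databases, and an actual language model, and every name below is illustrative rather than drawn from any product.

```python
# Toy retrieval augmented generation (RAG) sketch: retrieve the most
# relevant sources, then build a grounded prompt for the model.
# The keyword-overlap retriever and prompt format are illustrative only.

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query; return the top k."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Ground the model's answer by prepending the retrieved sources."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

sources = [
    "Rule 12(b)(6) governs motions to dismiss for failure to state a claim.",
    "Rule 56 governs motions for summary judgment.",
    "Rule 11 governs sanctions for frivolous filings.",
]
prompt = build_prompt("Which rule governs motions to dismiss?", sources)
```

The point of the pattern is visible even at this scale: the model is asked to answer from retrieved, authoritative text rather than from whatever its training data happens to contain, which is why RAG tools tend to hallucinate less.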
The findings were significant, if not surprising. The most striking involve quality:
Our most significant finding is that access to both o1-preview and Vincent AI led to statistically significant and meaningful improvements in overall quality of work across four of the six assignments tested—with o1-preview producing larger and more statistically significant gains than Vincent AI. For both AI tools, these improvements were primarily reflected in enhanced clarity, organization, and professionalism of submitted work. Notably, o1-preview also significantly improved the logic and nuance of the legal argumentation in three of the six assignments.
The researchers also note significant productivity gains:
Shifting from quality to speed and productivity, we found that participants generally completed legal tasks more quickly when using both o1-preview and Vincent AI than when working without AI. However, the magnitude and variability of these speed gains were comparable to those observed with GPT-4 alone in a prior study. By contrast, productivity gains—measured by changes in overall quality score points per minute of work—appeared to be greater for both o1-preview and Vincent AI than for GPT-4 alone. We find that Vincent yields statistically significant productivity boosts of approximately 38% to 115% and o1-preview increased productivity by between roughly 34% and 140%, with particularly strong effects in complex tasks like drafting persuasive letters and analyzing complaints.
Both tools led to significant productivity gains, with productivity defined as quality score points per minute of work. For complex tasks in particular, access to RAG-powered tools or reasoning models makes a big difference.
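To make that metric concrete, here is the arithmetic in miniature, with invented numbers (they are not from the paper):

```python
# Illustrative arithmetic for the study's productivity metric: quality
# score points earned per minute of work. All numbers below are made up
# for illustration and do not come from the paper.

def productivity(quality_points, minutes):
    return quality_points / minutes

baseline = productivity(quality_points=60, minutes=40)  # 1.5 points/min without AI
with_ai = productivity(quality_points=66, minutes=22)   # 3.0 points/min with AI

gain = (with_ai - baseline) / baseline  # 1.0, i.e., a 100% productivity gain
```

On these invented numbers, a modest quality improvement combined with a large time savings compounds into a 100% productivity gain, which sits within the roughly 34% to 140% range the study reports for o1-preview.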
While this study examines models and tools significantly more contemporary than the model in the original “jagged frontier” study (GPT-4), the pace of progress is so swift that even the models in this “new” study are already somewhat dated. o1-preview has been surpassed by OpenAI’s o3 models. That fact only strengthens the importance of these findings, however. Just as the findings of the “jagged frontier” study became all the more striking for being based on a now-obsolete model, these new findings on quality and productivity gains from RAG tools and reasoning models likely understate the gains achievable with the latest models and tools — and suggest even greater potential impact as these capabilities continue to advance.
A Brief Copilot Explainer
Knowing which Copilot you have is essential for making the most of Microsoft’s AI.
When someone says “We use Copilot,” they could be referring to any of nearly a dozen different Microsoft AI tools. What began as Bing Chat has rapidly evolved into a sprawling family of AI assistants, all sharing the Copilot name but offering significantly different capabilities, access levels, and integration points. A recent TechCrunch article (published in August 2024 and updated last month) catalogs around a dozen Copilot-branded products, each designed for specific environments and use cases. Some are free, while others require significant investment — and many professionals using these tools don’t realize they’re not all created equal.
We see this confusion regularly with clients (and experience it ourselves). When someone mentions they “have Copilot” or “use Copilot daily,” we’ve learned to pause and probe further. The version they’re using changes our advice on how to use it — and what to use it for. There are many differences between the widely available consumer version and Microsoft 365 Copilot, beginning with price but also including access to organizational data, integration into Office tools (Word, Excel, etc.), and capabilities that can affect how teams collaborate. Microsoft’s own helpful guide offers clarity on the major Copilots:
Microsoft Copilot: The free consumer version that accesses only web-based information. Works well for personal tasks but limited for business use beyond basic research. Anyone can use this Copilot.
Microsoft 365 Copilot Chat: Free with Microsoft 365 licenses but restricted to web-based information without access to work data. Available at www.m365.cloud.microsoft/chat. If you have Microsoft 365, you have this Copilot.
Microsoft 365 Copilot: The premium work-focused version ($30 per user per month) that can access both web data and organizational information from emails, documents, and meetings. Deeply integrated with the Microsoft 365 suite, including in Word, PowerPoint, Excel, Outlook, and more. This is the Copilot many use to summarize Microsoft Teams meetings.
Microsoft Security Copilot and GitHub Copilot: Specifically for security professionals managing incident response and threats, and for developers writing code, respectively. These Copilots are more niche.
Microsoft Copilot Studio: A tool for creating custom AI agents and connecting to specialized data sources (Copilot’s equivalent of ChatGPT’s Custom GPTs and Claude’s Projects).
Copilot is one of the most widely accessible AI tools across professional environments today, but its potential value depends entirely on understanding which version you’re working with. Are you limited to web data, or can your Copilot tap into your team’s documents? Can your Copilot analyze spreadsheets or just summarize web articles? Does it understand your organization’s context? Each variant has its strengths and limitations—whether you’re using the free consumer version for research or the full Microsoft 365 Copilot for deep organizational integration. Identifying your specific Copilot is the first step. Once you understand your tool’s capabilities and blind spots, you can design workflows that leverage its strengths while supplementing its weaknesses, ensuring you get the most from Microsoft’s expansive AI ecosystem.
We’ll leave you with something cool: A crowd-sourced collection of Claude Artifacts at claudeartifacts.com. (If you’re not familiar with Artifacts, we wrote about them here.)
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
When we went back and turned “thinking” on for Sonnet 3.7 so it could reason, Ms. Leoni was its first option of several:
Based on your description, I'm thinking of a few possibilities from that era who had unusual names, striking beauty, and were known for being selective about their roles:
Téa Leoni (starred in "The Naked Truth" which ran for 3 seasons)
Calista Flockhart (though "Ally McBeal" was quite successful)
Phoebe Cates (who stepped away from acting for family)
Janeane Garofalo (known for her distinctive voice)
Sherilyn Fenn (from Twin Peaks)
Parker Posey (often in independent films)
Does any of these sound like who you're thinking of? If not, do you remember anything about the show's premise or her appearance (hair color, etc.) that might help narrow it down?
But on the whole, very similar to Claude’s first response.
Here is GPT-4.5’s summary of the research:
The paper examines the common concern that using AI tools like generative artificial intelligence (AI) for cognitive tasks might harm long-term learning, despite boosting immediate productivity. Researchers from the University of Pennsylvania, Harvard University, and Microsoft Research explored whether AI tools could instead support skill development.
The research involved a series of experiments where participants practiced writing cover letters with and without an AI assistance tool. Contrary to expectations from lay forecasters who believed AI would negatively affect learning, the experiments showed the opposite:
Participants who practiced writing with AI assistance improved their writing skills more significantly than those who practiced without AI, both immediately after practice and a day later.
Notably, participants using the AI tool achieved better results with less effort—measured through time spent, keystrokes, and subjective effort ratings—than those who did not use AI.
A third experiment revealed that simply viewing a high-quality AI-generated example without active practice was as effective as actively using the AI tool, suggesting that the benefit came primarily from exposure to personalized, high-quality examples.
The findings indicate that AI tools, when used effectively, do not necessarily lead to “skill atrophy.” Instead, these tools can facilitate learning by providing tailored demonstrations of strong performance, which learners can emulate. This research challenges common fears regarding AI use in education and work, proposing that rather than replacing human skills, AI can accelerate skill development by offering high-quality, personalized examples.
For those interested, as we were, we asked GPT-4.5 to expound upon how this study builds on the original “jagged frontier” study. After we uploaded both papers, GPT-4.5 provided the following:
After reviewing the original "Jagged Frontier" paper ("Navigating the Jagged Technological Frontier") in detail, here are the top five ways the Vincent AI and o1-preview AI study (the "Legal AI" study) advances and extends the findings from the original Jagged Frontier research:
1. Specificity of AI Models (Reasoning vs. Retrieval-Augmented Generation):
The original Jagged Frontier study broadly explored GPT-4 and its general-purpose capabilities in realistic consulting tasks. It highlighted variability in AI effectiveness but didn't deeply distinguish between different model architectures.
Advancement: The new study explicitly compares two AI types—retrieval-augmented generation (RAG) and a reasoning-focused model (o1-preview)—clarifying the specific strengths and weaknesses each architecture brings to complex legal tasks. This is a major extension, showing that AI productivity benefits depend critically on matching task types to specific AI architectures.
2. Detailed Exploration of Hallucinations and Accuracy Trade-offs:
The Jagged Frontier study highlighted general accuracy issues and how AI can negatively impact accuracy outside the frontier, but it focused primarily on tasks intentionally designed to fail AI.
Advancement: The Legal AI study provides nuanced empirical data on hallucination frequencies for both AI types (RAG vs. reasoning). It showed a clear trade-off: RAG models greatly reduce hallucinations, whereas advanced reasoning models, while increasing analytic depth, may introduce more hallucinations. This provides critical practical guidance for knowledge-intensive fields like law and strategic corporate communication.
3. Insight into Analytical Depth and Complexity Handling:
The original study found that GPT-4 was broadly effective for tasks inside the frontier but didn't deeply segment effectiveness by task complexity or analytical depth.
Advancement: The Legal AI study explicitly examines AI effectiveness across tasks of varying analytical complexity, demonstrating that reasoning-based AI significantly improves depth of analysis in litigation tasks—precisely the complex knowledge work most relevant to strategic internal communications and corporate knowledge tasks.
4. Exploration of AI-Enabled Human-AI Integration Models (Cyborgs vs. Centaurs):
The original study identified "centaur" and "cyborg" behaviors conceptually—highlighting differences between delegating work to AI (centaur) and deeply integrating work with AI (cyborg).
Advancement: While the legal AI study didn't explicitly differentiate centaurs vs. cyborgs, it empirically supports the underlying logic by demonstrating specific task conditions under which close human-AI integration (o1-preview) outperforms mere delegation or reliance on authoritative retrieval (Vincent AI). It thus adds depth to the concept by empirically grounding it in detailed task-by-task performance data.
5. Granular, Domain-Specific Methodology and Robustness of Evidence:
The original Jagged Frontier study was broad, involving many knowledge tasks from product ideation to strategic analysis, but its evidence, while strong, remained high-level.
Advancement: The Legal AI study employed a rigorous randomized controlled trial focused exclusively on realistic legal tasks commonly undertaken by junior associates, ensuring higher domain-specific precision. This approach provides strong causal evidence about how AI specifically impacts productivity and quality in complex knowledge domains, offering practical, precise insights for corporations and consulting firms in related fields.