Confluence for 2.9.2025
OpenAI Deep Research stuns us. Evolving perspectives on model hallucination. Generative AI's impact on workplace critical thinking. Brief thoughts on OpenAI's Operator.
Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI and corporate communication:
OpenAI Deep Research Stuns Us
Evolving Perspectives on Model Hallucination
AI’s Impact on Workplace Critical Thinking
Brief Thoughts on OpenAI’s Operator
OpenAI Deep Research Stuns Us
OpenAI releases a tool that amazes us … and leaves us with deeper questions.
Readers know we’ve spent the past 24 months helping our firm prepare for and begin to integrate generative AI, both internally and in our client work. We’ve followed the technology closely, brought hundreds of clients up to speed on generative AI, helped build our leadership AI, ALEX, and had our hands deep in the creation of our own AI consulting tools and our advisory practice’s strategy for AI adoption.
Along the way, three moments really got our attention. The first was our initial use of ChatGPT 3.5 nearly two years ago. The second was when we engaged Anthropic’s Opus 3 model in real conversation for the first time last spring (and wrote the Confluence post Now Is the Time to Start Paying Attention). The third was last Monday morning, when we first used OpenAI’s new Deep Research capability, which combines access to the internet with a version of OpenAI’s upcoming o3 model to answer research questions. From OpenAI:
Deep research is OpenAI’s next agent that can do work for you independently—you give it a prompt, and ChatGPT will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst. Powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis, it leverages reasoning to search, interpret, and analyze massive amounts of text, images, and PDFs on the internet, pivoting as needed in reaction to information it encounters.
Our test query asked Deep Research to take a communication theory our advisors know well and apply it to a sales context. We picked this query because we know the literature on the theory very well and have advised sales leaders for years, so we’d be well positioned to judge the output. Here’s our query to Deep Research:
Us: I want to craft a briefing document for sales people on the logic of message design and the implications of that theory in their work. Research the theory, note and summarize it and the seminal work on it, describe recent advances if any in the research on message design logic, and conclude with any findings that are material to sales leaders.
ChatGPT: To craft a strong briefing document, I’d like to clarify a few points:
Scope of Message Design Logic: Are you specifically referring to Barbara O’Keefe’s Message Design Logics (expressive, conventional, rhetorical), or are you open to broader theories of persuasive message design in communication, sales, and marketing?
Depth of Research: Do you want a high-level overview, or a more detailed academic synthesis with citations?
Application to Sales: Should the document focus on direct sales interactions, sales leadership communication, or broader sales messaging (e.g., marketing, positioning, customer engagement)?
Format: Do you envision this as a structured report with sections like “Theory,” “Key Research,” “Implications for Sales,” etc., or something more concise?
Let me know your preferences, and I’ll get started!
Us: Yes on O'Keefe. Detailed academic synthesis. Sales philosophy. Sections.
ChatGPT: Got it. I’ll conduct a deep dive into Barbara O’Keefe’s Message Design Logics, summarizing its theoretical foundation, key academic contributions, and recent research developments. The document will be structured into sections, covering the core theory, seminal work, advancements in research, and material takeaways for sales leaders, particularly in shaping a sales philosophy.
The output Deep Research authored is excellent. It’s accurate, well-organized, and insightful. It’s also nearly 8,000 words, 12 pages long, and took us some 20 minutes to read. Deep Research includes hyperlinked citations in the body of its text so you can verify sources, and it cited over a dozen pieces. We think the report is worth reading in full so you can start to sense where the future is headed, and you may read it here (we’ve taken the hyperlinked citations out to help the formatting):
Deep Research is only available to Pro users (who pay $200 a month), though OpenAI says it’s coming soon to Plus and Teams users. It also has some limitations, as noted by OpenAI:
Deep research unlocks significant new capabilities, but it’s still early and has limitations. It can sometimes hallucinate facts in responses or make incorrect inferences, though at a notably lower rate than existing ChatGPT models, according to internal evaluations. It may struggle with distinguishing authoritative information from rumors, and currently shows weakness in confidence calibration, often failing to convey uncertainty accurately. At launch, there may be minor formatting errors in reports and citations, and tasks may take longer to kick off. We expect all these issues to quickly improve with more usage and time.
We’ll also note that it can’t search behind paywalls, so we presume it can’t authoritatively search the academic literature. That said, we also presume OpenAI can eventually strike licensing agreements with academic publishers to solve that problem.
All that said, the output Deep Research created stunned us, and our lasting feeling is that we’ve passed yet another threshold in generative AI capabilities: we now have an agent you can assign to research and report on a topic, leave for a while, and come back to a piece of work that meets or exceeds most human standards. With something like Deep Research at their fingertips, billions will have the ability to quickly educate themselves about virtually any topic, with depth. And as Ethan Mollick noted:
If your organization is in a field that derives value from research & analysis of ambiguous data (law, consulting, many aspects of finance), you need to be benchmarking Deep Research & future AI agents.
They may or may not be good enough today, but you need to map the trajectory as advances continue.
Benchmark data gathering ability separately from analysis & insights. Assume that the lack of access to paywalled data is temporary. Start building prompting strategies for steering these agents. Assign folks to the tedious job of assessing hallucination rates. Watch how this changes over time.
The implications for how professionals do work, and lead others who do work, will be real.
We will leave you with this: several of us have been researching and writing about topics for clients for 30 years, are experienced in doing secondary research and literature reviews, and write daily in one way or another. We consider ourselves pros in that space. Our sense is that it would take one of us two to three days to produce the work we’ve attached above — not full-time days, but days in which this was our main focus among other work.
It took OpenAI Deep Research seven minutes.
Evolving Perspectives on Model Hallucination
The risk of hallucination may never go away entirely, but it should not hold you back from working with these tools.
By now, most users of generative AI are aware of the risk of hallucination — the phenomenon of large language models producing information that is false, fabricated, or otherwise inaccurate. These models are probabilistic, not deterministic, and generally speaking do not “know” when they are wrong. It’s a phenomenon that has gotten many unwitting users in trouble, occasionally in ways that have made headlines. The possibility of model hallucination is now so well-known that it has become one of the biggest factors holding people back from engaging with these tools at all. But this fear, while understandable, often overshadows the remarkable progress being made in reducing hallucination rates — and the many ways you can mitigate and work around the risk.
The newest models are significantly more reliable than their predecessors, and more importantly, they’re getting better at acknowledging their own limitations. For instance, in drafting this post, this writer asked Claude some questions about the academic research on the human factors of automation, specifically about the work of Lisanne Bainbridge and Raja Parasuraman. In both cases, Claude provided a high-level answer but noted that the information may not be perfect and encouraged checking the sources. This kind of transparency — AI systems actively flagging their own uncertainty — represents a significant step forward. Here’s another example of Claude articulating its limitations and uncertainty, in a way that simply did not happen with older models:
With older models and application versions, we’re fairly confident that Claude — or ChatGPT or any other tool — would have tried to give the chapter-by-chapter summary, whether or not it was accurate (which it likely would not have been). The model’s articulation of its own limitations is a subtle but significant step forward, one that should go a long way toward helping people avoid becoming “confidently wrong” with inaccurate information provided by the models. And it’s not the only step being taken to reduce the risks of hallucinations. We’re also seeing more and more tools — including Deep Research, as we note earlier in this edition — incorporate direct citations when pulling information from the internet or from documentation the user provides, making verification increasingly straightforward.
That said, while these developments are encouraging, we need to remain as mindful of this risk as ever, especially as output chains grow longer. As we discussed above with Deep Research, we’re entering an era where generative AI systems will work increasingly independently, analyzing large numbers of sources to create comprehensive reports (and, at some point in the not-too-distant future, do much more than create reports). The longer these output chains become, the more we need to guard against two well-documented and related human tendencies: automation bias (our inclination to over-rely on automated systems and accept their outputs or recommendations without questioning them) and automation-induced complacency (the tendency for humans to be less vigilant and pay less attention when working with automated systems over longer periods of time).
The challenge is real but manageable. Hallucination risk matters most in situations where there’s “one right answer.” For many corporate communication use cases, the risk of hallucination is not particularly relevant. If we’re using an AI tool to review or critique something we’ve drafted, adapt language to be suitable to another audience or context, or brainstorm ideas for the structure of an introductory paragraph — to name just a few examples — there simply is no “right answer” for the AI to be wrong about.
For those cases where accuracy is crucial, the answer isn’t to avoid using these tools. It’s to implement appropriate quality assurance processes, just as we would with humans. This might include human review, peer validation, or fact-checking protocols. The verification effort should match the stakes. The specific approach will vary by organization and use case, but the principle is simple: verify.
Someday, the risk of model hallucination may disappear entirely. Or, some argue, hallucination may be so inherent to these models’ architectures that it will never go away. What we do know is that in the past two years, the labs have made significant progress on reducing hallucination rates, a trend which will likely continue. We also know that even with the risk of hallucinations, we can still get enormous value out of these tools. All it takes is awareness of the risk and appropriate safeguards where the stakes warrant them.
AI's Impact on Workplace Critical Thinking
Recent research finds AI boosting productivity but potentially eroding independent problem-solving skills.
A new study from Microsoft Research and Carnegie Mellon University examines how generative AI affects critical thinking in the workplace. The research, based on detailed surveys of 319 knowledge workers, confirms what many have suspected: while AI can dramatically boost productivity, it may also reduce independent problem-solving skills if not used thoughtfully.
The findings paint a clear picture of how confidence, both in the AI and in one’s own abilities, shapes these interactions. When workers trust AI’s abilities deeply, they often skip critical evaluation of its outputs and reduce verification efforts. One participant captured this mindset perfectly, noting they stopped carefully reviewing AI-generated content because “With straightforward factual information, ChatGPT usually gives good answers.” The study found a different pattern among workers with high confidence in their own abilities — these professionals engage more critically with AI outputs, frequently improving them by adding context and nuance based on their expertise.
The research spotlights fundamental shifts in work patterns. Instead of creating content from scratch, professionals now focus on verifying and customizing AI-generated content. Rather than solving problems independently, they refine AI-suggested solutions. The role has shifted from creator to curator and quality controller.
Matt Beane and other experts have written at length about “skill erosion” — the gradual loss of capabilities when we over-rely on generative AI. This study adds empirical weight to those warnings, revealing that people most often skip critical evaluation under time pressure, when viewing tasks as “routine,” or when lacking the expertise to properly verify AI outputs.
The researchers outline several practical approaches — some of which will sound familiar. Teams should maintain core professional skills even as AI makes certain tasks easier. They need to build verification into their workflow, even for seemingly simple tasks. Most importantly, professionals must focus on developing deeper expertise in their fields — the better we understand our craft, the better able we are to evaluate and customize AI outputs. Generative AI is an incredibly powerful tool, but its most effective use requires maintaining our own capabilities rather than becoming dependent on it. The goal remains using AI to augment our abilities, not replace them.
Brief Thoughts on OpenAI’s Operator
It’s more novelty than utility, but still merits attention.
We mentioned OpenAI’s Operator the other week, and while we don’t think it’s as important in the short term as Deep Research, it still merits real attention. If Deep Research represents a high-utility, narrow agent, then Operator represents a low-utility, broad agent. The idea behind Operator, which you can access with a ChatGPT Pro subscription, is that you can give it a task, such as finding a dinner reservation or identifying an item for purchase that meets certain criteria, and it will go out and complete the task for you. Here’s a brief video of Operator looking for and finding a reservation at Parc, a well-known restaurant in Philadelphia:
Operator works in its own browser within the ChatGPT environment while giving you the option to take over at any time. And as you can see on the left, there will be moments where Operator asks for confirmation or additional information before proceeding.
All in all, it took Operator barely a minute to find a table within the given constraints, though it would have taken a bit more time to complete the reservation and, critically, it would have required that the user intervene in the process to provide personal information. While not all of our interactions with Operator have been as smooth or as quick as the example above, we’ve found that it 1) generally understands what you want it to do and 2) has the underlying capabilities to complete a wide range of tasks (even if it falters in specific instances).
Right now, we see Operator as more of a novelty than a utility (it is still easier and faster for us to book our own flights or dinner reservations), but it marks an important sign of progress. We now have a broad AI agent that can act on a user’s behalf to complete a diverse range of tasks online. Yes, it makes mistakes, can be slow, and still requires occasional human intervention, but this is merely the first iteration available to the public. If there’s anything we’ve learned from the past year and a half of writing Confluence, it’s that we shouldn’t view today’s limitations as persistent. Operator, or whatever it evolves into or whatever other AI labs call their general-purpose agents, will get better and better over time. While we won’t be using Operator to do real work in its current form, we will continue to watch for updates and to test its boundaries.
We’ll leave you with something cool: Topaz Labs has built a new model designed to enhance old, low-quality videos, upscaling them to much higher resolution. The results are impressive.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.