Confluence for 4.6.25
Google and Meta release new models. New data on how people are using Claude. AI’s future: An important unknown. AI passes the Turing Test.

Welcome to Confluence. You’ll notice that we’ve gone back to Midjourney for this week’s image. Earlier this week, Midjourney released its latest model, V7. It’s more an incremental step than it is a revelation, but if you use Midjourney, it’s worth checking out. With that said, here’s what has our attention this week at the intersection of generative AI and corporate communication:
Google and Meta Release New Models
New Data on How People Are Using Claude
AI’s Future: An Important Unknown
AI Passes the Turing Test
Google and Meta Release New Models
Gemini 2.5 Pro is very, very good, while Llama 4 brings open-weight models closer to the frontier.
While they don’t always attract the same level of attention as releases from Anthropic and OpenAI, both Google and Meta continue to train and release new models. We’ll start with Google, which released its latest model, Gemini 2.5 Pro, last week while most of us explored GPT-4o’s image generation capabilities. After testing it this week, we think it’s fair to say it is very, very good. Google has often lagged a half-step behind Anthropic and OpenAI, with Gemini proving capable but rarely pushing the frontier forward.
Gemini 2.5 Pro feels different. It’s a reasoning model, similar to OpenAI’s o1 and DeepSeek’s R1, meaning it “thinks” before it responds and can handle larger, more complex tasks. To give a sense of how Gemini 2.5 Pro compares to its competition, the model now sits at or near the top of nearly every LLM benchmark, and there are plenty of anecdotal accounts on X of Gemini 2.5 Pro solving problems previously unsolvable by LLMs. One of us is currently working with a non-profit to design a strategic planning session and wanted to see how Gemini 2.5 Pro would design a similar session with little context. The prompt we used is below, as is a link to download a PDF with the output. We believe the quality speaks for itself. When we tested the same prompt with other leading models (including Claude 3.7 Sonnet with Extended Thinking, GPT-4.5, and o1 Pro), only o1 Pro provided comparable output.
Design a strategic planning offsite for a non-profit focused on research, family services, and advocacy for a specific ultra-rare disease. It will begin Friday evening and end Sunday after lunch. The organization needs a better way of prioritizing its resources (both in terms of human attention and funding). Participants include the executive committee of the board and senior leadership for the foundation. Include any recommended pre-work or prep.
Unlike o1 Pro, you can try Gemini 2.5 Pro for free in Google’s AI Studio, or you can pay $20 per month for access through the Gemini app and web interface. One of the key differences between Google and the strictly AI-focused labs such as Anthropic and OpenAI is that they should have an easier time weaving the capabilities of their models into tools we use each day. This has yet to truly materialize, but if Google continues to push the frontier, we are eager to see how they bring the capabilities of Gemini 2.5 into the applications and services we use every day.
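For readers who would rather test a prompt like this through the API than in a chat window, here is a minimal sketch of sending it to Gemini 2.5 Pro with Google’s generative AI Python SDK. Treat the model identifier and the environment variable name as assumptions on our part; confirm the current values in AI Studio.

import os

import google.generativeai as genai

# Minimal sketch: send the strategic planning prompt to Gemini 2.5 Pro.
# The API key environment variable and model identifier are assumptions;
# check Google AI Studio for the values that apply to your account.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

prompt = (
    "Design a strategic planning offsite for a non-profit focused on research, "
    "family services, and advocacy for a specific ultra-rare disease. It will "
    "begin Friday evening and end Sunday after lunch. The organization needs a "
    "better way of prioritizing its resources (both in terms of human attention "
    "and funding). Participants include the executive committee of the board and "
    "senior leadership for the foundation. Include any recommended pre-work or prep."
)

model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")  # assumed model ID
response = model.generate_content(prompt)
print(response.text)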
Though we’ve had a few days to experiment with Gemini 2.5 Pro, Meta’s latest models, three different iterations of Llama 4, dropped just yesterday. We don’t have much to add about their capabilities beyond what’s in Meta’s own write-up, but at a minimum the release brings Meta’s models closer to parity with those at the frontier. This matters less because we think Llama 4 will unlock new capabilities and more because it means we now have access to better open-weight models than ever before. For organizations with significant security concerns, there’s real value in being able to download and run a leading LLM in your own environment. Between Llama 4 and OpenAI’s recently announced plans to release an open model, there’s still plenty of intelligence to be gained from open models, even if they are a half-step behind the true frontier.
What does this all mean? First, and this is a common refrain we’ve repeated for years at this point: the models continue to get smarter and more capable. The leading AI labs continue to release new models, quickly and sometimes with little warning, that push the frontier of capabilities. And we have no reason to believe this will slow down. Second, although there are differences in the models, and the more you work with them the more likely you are to develop preferences, all the leading models are extremely capable. We’ve worked with clients who use a range of tools and models, and there are ways of getting utility out of each. Every serious lab’s leading models offer value.
It’s worth keeping up with and experimenting with models that push the frontier, and we’ll do our best to flag them for you. But don’t let the number of models, their confusing naming conventions, or anything else slow you down from experimenting. At this point, the newest models from the leading labs (Anthropic, Google, OpenAI, and, to a lesser extent, Meta) can help you in countless ways.
New Data on How People Are Using Claude
Users continue to favor augmentation over automation — especially for writing-intensive tasks.
In February, Anthropic published the initial Anthropic Economic Index with the aim of providing “the clearest picture yet of how AI is being incorporated into real-world tasks across the modern economy.” The initial report — which we read but did not cover in Confluence — broke down how people use Claude, based on data from anonymized conversations. Last week, Anthropic followed up with a second Economic Index report, this time focused specifically on how people are using Claude 3.7 Sonnet. The report examines how patterns of use have changed since the initial February report, how people are using Claude’s extended thinking mode, and differences in augmentation versus automation across tasks. While each of these topics is interesting, we find the third — “How does augmentation vs. automation vary by task and occupation?” — most relevant to our focus here in Confluence.
As we’ve integrated generative AI into our work as a firm and advised our clients on their own approaches, we’re constantly navigating the balance of automation and augmentation. In the initial Economic Index Report, Anthropic defined automation uses as “where AI directly performs tasks such as formatting a document” and augmentation uses as “where AI collaborates with a user to perform a task.” This categorization scheme echoes the “centaur” and “cyborg” approaches to using generative AI identified in the “Navigating the Jagged Technological Frontier” study we often reference. The authors of that paper define centaurs as generative AI users who, “like the mythical half-horse / half-human creature divide and delegate their solution-creation activities to the AI or to themselves.” Cyborgs, on the other hand, “completely integrate their task flow with the AI and continually interact with the technology.” Centaurs delegate and automate. Cyborgs integrate and augment. We’ve long favored the cyborg / augmentative approach for most of our work.
Anthropic’s findings suggest that 57% of tasks use Claude in a way that appears integrated, with the remaining 43% of tasks taking more of an outsourcing approach:
…in just over half of cases, AI was not being used to replace people doing tasks, but instead worked with them, engaging in tasks like validation (e.g., double-checking the user’s work), learning (e.g., helping the user acquire new knowledge and skills), and task iteration (e.g., helping the user brainstorm or otherwise doing repeated, generative tasks).
More interesting than the overall augmentation vs. automation data is how the data breaks down by occupational categories. Of all the occupations represented in the data, tasks associated with copywriting and editing “show the highest amount of task iteration, where the user iterates on various writing tasks with the model.” That these writing-intensive tasks suit an iterative approach with generative AI mirrors our own experience and the counsel we give clients. It’s useful to see this reflected in real-world data, and we look forward to keeping up with the Anthropic Economic Index as the data continues to evolve.
AI’s Future: An Important Unknown
The answer isn’t clear, but the conversation matters.
If there’s one thing everyone seems to agree on about generative AI, it’s that we’re in uncharted territory. What we build, how it’s used, and where it takes us — these are big, open questions. But even in the face of that uncertainty, one thing feels increasingly clear: it’s time to start treating these conversations with the seriousness they deserve.
Lately, three very different pieces have caught our attention, each underscoring the value of engaging with the future of AI more seriously and collectively. A New Yorker essay by Joshua Rothman, a new technical paper from DeepMind, and a speculative near-future narrative called AI 2027 all make the case, in their own ways, for deeper reflection and broader participation.
Rothman’s article, titled “Are We Taking AI Seriously Enough?” delivers a pointed call to action. He argues that shaping the future of AI shouldn’t be left to the small circle of insiders who currently dominate the conversation. Many of those voices are deeply embedded in the technical minutiae — and while their expertise is valuable, it’s far from the whole picture. We need more diversity of thought: people with different priorities, values, and lived experiences, asking different questions. “We need to have more arguments,” Rothman writes in the closing lines. “We need to get imaginative and serious, and dream up many futures — not just one, or two, or none.”
The DeepMind paper, “An Approach to Technical AGI Safety and Security,” takes a different tack, laying out a framework for identifying and mitigating risks associated with more advanced AI systems. It focuses on two areas — misuse and misalignment — and proposes both model-level and system-level strategies for managing them. The paper doesn’t claim to have all the answers, but it offers a thoughtful structure for how we might approach these challenges with rigor, before they become urgent. It reminds us that safety isn’t just about hard stops — it’s about building habits of foresight. If the New Yorker piece explains why we should be having this conversation, DeepMind begins to suggest how we might do so.
And then there’s AI 2027, a collaborative science fiction scenario written by a group of experienced AI thinkers and forecasters. This piece, unusual in format, might show us the “what” — what does a thoughtful, well-researched, yet imaginative forecast look like? While AI 2027 doesn’t aim to predict the future with certainty, it offers a rich and carefully researched imagining of how the next few years could unfold. What makes it valuable isn’t whether the timeline turns out to be accurate — that would be an incredible feat of forecasting — but rather, the exercise itself. This is the kind of grounded, creative thinking we need to engage with — informed by expertise, open to uncertainty, and willing to explore a range of plausible outcomes.
Each of these contributions comes from a different vantage point, but together they reinforce a simple idea: engaging with the future of AI isn’t just for experts, it’s for all of us. Whether through technical research, public dialogue, policy, or imagination, the time to start thinking ahead is now.
AI Passes the Turing Test
A new study shows generative AI convincingly passing as human in controlled experiments.
Last week, UC San Diego researchers published a study documenting generative AI systems definitively passing the Turing Test for the first time.1 In controlled experiments with two independent populations, GPT-4.5 (when prompted to adopt a human-like persona) passed as human 73% of the time — remarkably, more often than actual humans themselves. LLaMA-3.1-405B achieved a 56% human-judgment rate with similar prompting.
This milestone represents the first empirical evidence of artificial systems passing the standard three-party Turing test, where interrogators converse simultaneously with both a human and AI before determining which is which. The crossing of this symbolic threshold isn’t just academically interesting — it signals the acceleration of a trend we’ve watched unfold since generative AI arrived on the scene: the increasingly blurred line between human and AI communication.
The idea of the “Turing Trap” we explored last October remains relevant — that is, that we can limit ourselves by focusing solely on AI’s ability to imitate humans rather than create new value. Yet this milestone still deserves recognition as a pivotal moment in the evolution of these technologies, even as we maintain that the most powerful applications will come from collaboration rather than imitation.
The question is no longer “Can AI pass as human?” (and hasn't been for some time, in our opinion) but the more consequential question: “How can we harness these capabilities to achieve what neither humans nor machines could accomplish alone?”
We’ll leave you with something cool: 16 more ways people are using GPT-4o for image generation (which, by the way, is now available on ChatGPT’s free tier).
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
Here is Claude 3.7 Sonnet’s summary of the Turing Test.
The Turing Test is a measure of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. It was proposed by mathematician and computer scientist Alan Turing in 1950 in his paper "Computing Machinery and Intelligence," where he explored the question "Can machines think?"
In the original version of the test, a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two participants is a machine, but would not know which one. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test.
The three-party version of the Turing Test, sometimes called the "standard interpretation," involves:
The interrogator (human judge)
A human participant
A machine participant
In this setup, the interrogator attempts to determine which is the human and which is the machine by asking questions of both. The participants are isolated from each other, typically communicating through text interfaces to remove physical cues. The machine passes the test if the judge cannot consistently identify which participant is the machine better than by random chance.
This three-party format creates a more rigorous test as the machine must not only appear human-like in its responses but must also be indistinguishable from an actual human participant who is actively trying to demonstrate their humanity.
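To make that structure concrete, here is a minimal sketch in Python (our own illustration, not code from the UC San Diego study) of how a three-party trial and the chance-level pass criterion could be organized. The ask_human, ask_machine, interrogate, and judge_verdict helpers are hypothetical stand-ins for the participants.

import random

# Illustrative sketch of a three-party Turing test trial (not from the study).
# ask_human and ask_machine return replies from the hidden human and machine;
# interrogate produces the judge's next question; judge_verdict is the judge's
# guess about which witness slot holds the machine.
def run_trial(ask_human, ask_machine, interrogate, judge_verdict, n_turns=5):
    # Randomly assign the human and machine to witness slots "A" and "B"
    # so the judge cannot rely on position.
    slots = {"A": ask_human, "B": ask_machine}
    if random.random() < 0.5:
        slots = {"A": ask_machine, "B": ask_human}

    transcript = []
    for _ in range(n_turns):
        question = interrogate(transcript)  # judge questions both witnesses
        for slot, witness in slots.items():
            transcript.append((slot, question, witness(question)))

    guess = judge_verdict(transcript)  # slot the judge believes is the machine
    machine_slot = "A" if slots["A"] is ask_machine else "B"
    return guess == machine_slot  # True if the judge caught the machine

def machine_passes(trial_results):
    # The machine passes if judges identify it no better than chance (50%).
    accuracy = sum(trial_results) / len(trial_results)
    return accuracy <= 0.5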