Confluence for 3.30.25
What just happened with AI image generation. Important new research from Anthropic.
Welcome to Confluence. We’re taking a bit of a different tack this week, using most of our space to go deep on one item: the new image generation capabilities released in ChatGPT by OpenAI. We also spend some time on new research from Anthropic about how their model, Claude, “thinks.” We don’t cover that piece in as much depth as the image generation item, but we think it may be a landmark study, and it deserves your attention. With that said, let’s get to it.
What Just Happened With AI Image Generation
ChatGPT-4o’s new native image generation capabilities represent a fundamental shift.
If this week’s cover image seems a little different, that’s because it is. We’ve used Midjourney to create over 100 cover images for Confluence and have remarked several times on our preference for Midjourney over other image generation models — especially over ChatGPT and its DALL-E model, which we’ve found defaults to a “cartoonish” aesthetic. It may come as a surprise, then, that for this week’s cover image we used ChatGPT. A few days ago, OpenAI upgraded the image generation capabilities within ChatGPT. In fact, as we’ll explain below, OpenAI fundamentally changed the way ChatGPT generates images. We, like many others, have been blown away by these new capabilities. We also believe that these changes have broader implications beyond the immediate improvements in quality and utility. We’ll use today’s post to take a deeper dive into these changes and why they matter.
What Actually Changed?
The first thing to understand is that the way ChatGPT generates images now is fundamentally different than it was a week ago. Previously, if you asked ChatGPT to generate an image, it would essentially use your query to prompt a separate image model, DALL-E, and then return to you the image that DALL-E generated. ChatGPT acted as an intermediary between the user and DALL-E, which is neither as intelligent as ChatGPT’s language models nor as aesthetically sophisticated as other image generation models like Midjourney.
With the recent changes, DALL-E is no longer in the mix. Instead, ChatGPT generates the images with its 4o model, which as an omni-modal model is capable of processing and producing both language and images. The same model that the user engages in conversation is the one that creates images. This means that the model generating the image better understands the user’s intent, has a better awareness of the context informing the image, and has the ability to review and critique the image after generating it. In effect, there’s a sense of awareness to this process that wasn’t present before and that makes a major difference not just in the quality of the images but in the overall experience.
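For readers who want a mental model of the change, here is a minimal conceptual sketch of the two architectures. This is our own illustration, not OpenAI’s code; every class and function name below is hypothetical.

```python
# Conceptual sketch only (our illustration, not OpenAI's code).
# It contrasts the old hand-off pipeline with native generation by one
# omni-modal model. Every class and function name here is hypothetical.

from dataclasses import dataclass, field


@dataclass
class Image:
    description: str  # stand-in for actual pixel data


class OldPipeline:
    """Chat model distills a prompt, then hands off to a separate image model."""

    def generate(self, user_request: str, history: list[str]) -> Image:
        # The conversation (history) is compressed into a single text prompt;
        # the image model never sees the full context or the earlier images.
        prompt = f"distilled prompt for: {user_request}"
        return Image(description=f"DALL-E render of '{prompt}'")


@dataclass
class OmniModalModel:
    """One model holds the conversation, generates the image, and can critique it."""

    history: list[str] = field(default_factory=list)

    def generate(self, user_request: str) -> Image:
        self.history.append(user_request)
        # The same weights that understand the whole conversation produce the
        # image, so a follow-up like "make the date visible" can revise the
        # existing image instead of starting from scratch.
        context = " | ".join(self.history)
        return Image(description=f"native render informed by: {context}")


if __name__ == "__main__":
    model = OmniModalModel()
    cover = model.generate("Newspaper front page, headline 'Confluence for March 30, 2025'")
    revised = model.generate("Keep the same cover, but make the date clearly visible")
    print(cover.description)
    print(revised.description)
```

The point of the sketch is simply that in the second case the full conversational context flows directly into each image, which is why follow-up edits behave the way we describe below.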
Improvements in Image Quality
For anyone who has used DALL-E (i.e., anyone who used ChatGPT to generate images before last Tuesday), the improvements in overall image quality are obvious. The default “cartoonish” quality is almost entirely gone, and the quality baseline is noticeably higher. There are some specific improvements, however, worth noting in detail. The first is the model’s ability to render language, a common pitfall for image generation models. ChatGPT-4o is now, in our experience, by far the best model at rendering language.
Here’s the image that GPT-4o generated when we prompted it to “Create an image of a newspaper front page. The headline says ‘Confluence for March 30, 2025’ and then below it in a subtitle it says ‘The week that image generation got way, way better.’ Then, the main picture on the page is of spring flowers blooming”:
The lorem ipsum text is standard filler, and the language that we specified is rendered perfectly, as is the incorporation of the picture. We used the same prompt in Midjourney, and this is the best of the four images we received in response:
It’s simply too much text for the current version of Midjourney to be able to render effectively.
Another capability we’ve found impressive is the model’s ability to use and repurpose images the user provides, including logos and branding elements. In the example below, we provided a logo for our digital coach, ALEX, and — just to see what the model could do — asked it to create a realistic-looking photograph of the logo on a waving flag. Here’s the result:
Not perfect, but pretty close. In conversations with clients, we frequently receive questions about the possibilities for AI to generate brand-consistent content. Until now, we’ve noted that capabilities here are limited. These new abilities in ChatGPT represent the biggest step forward on this front that we’ve seen. OpenAI acknowledged these possibilities to the Wall Street Journal:
The model is also able to create transparent backgrounds, making it possible for businesses to create logos or other iconography, said Jackie Shannon, an OpenAI product lead for ChatGPT multimodal. Other uses the company suggested include asking ChatGPT to generate images based on a user-uploaded brand style guide.
We look forward to experimenting further with those possibilities and will share more from our experience in future editions of Confluence.
We won’t get into the technical details here, but one last thing to know about the improvement in image quality has to do with how the model actually generates images. Kylie Robison summarizes these differences nicely in The Verge:
The system uses an autoregressive approach — generating images sequentially from left to right and top to bottom, similar to how text is written — rather than the diffusion model technique used by most image generators (like DALL-E) that create the entire image at once. [OpenAI research lead Gabriel Goh] speculates that this technical difference could be what gives Images in ChatGPT better text rendering and binding capabilities.
The combination of this autoregressive approach and the omni-modal nature of the model (its ability to deeply understand the context of the images it creates) is clearly a potent one. While we still consider Midjourney capable of producing more impressive images from a purely aesthetic standpoint, ChatGPT-4o is very close. And its ability to render text, maintain consistency across images, and incorporate other imagery likely make it more useful for most users on a day-to-day basis.
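For the technically curious, here is a toy sketch of that distinction. This is our own illustration, not how either model is actually implemented; the “patches” and numbers below are placeholders.

```python
# Toy sketch only (our illustration, not how either model is implemented).
# It contrasts autoregressive generation (one patch at a time, each patch
# conditioned on everything before it) with diffusion (refine the whole
# canvas at once). "Patches" are just small integers here.

import random

WIDTH, HEIGHT, STEPS = 4, 3, 5


def autoregressive_generate() -> list[list[int]]:
    """Generate left to right, top to bottom, conditioning on prior patches."""
    image: list[list[int]] = []
    for _ in range(HEIGHT):
        row: list[int] = []
        for _ in range(WIDTH):
            # Each new patch "sees" everything generated so far, which is one
            # hypothesis for why this approach renders long text more reliably.
            prior_context = sum(sum(r) for r in image) + sum(row)
            row.append((prior_context + random.randint(0, 9)) % 256)
        image.append(row)
    return image


def diffusion_generate() -> list[list[int]]:
    """Start from noise over the whole canvas and denoise every patch together."""
    image = [[random.randint(0, 255) for _ in range(WIDTH)] for _ in range(HEIGHT)]
    for _ in range(STEPS):
        # Each step refines all patches simultaneously rather than in sequence.
        image = [[value // 2 + 64 for value in row] for row in image]
    return image


if __name__ == "__main__":
    print("Autoregressive:", autoregressive_generate())
    print("Diffusion:     ", diffusion_generate())
```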
Improvements in Experience
As impressive as the improvements in quality are, the overall leap forward in experience is even more important. The experience of interacting with a model that understands the conversational context for images differs qualitatively from the experience of interacting with a model that lacks that ability. Usually for our Confluence cover images, we include a caption with the prompt we used in Midjourney to create the image. For this week’s image, we did not include a prompt, because the image was the result of a conversation with ChatGPT rather than a single prompt. That’s a subtle but major difference illustrative of what’s possible with this new experience.
We began the process of creating this week’s cover image by explaining to ChatGPT that we wanted to create an image that evokes the theme of image generation models getting better, and that demonstrates the improvements in text rendering capabilities. ChatGPT then gave us 10 ideas, one of which was “Confluence Displayed in a Shop Window: Could be a high-end bookstore, old newsstand, or boutique. A single pristine copy of Confluence — March 30, 2025 sits in the spotlight.” We told it to go ahead and generate that image, and to “make it look like a cinematic still from a beautifully shot film.” This was the result:
Good, but a bit too spare. We went back and forth with ChatGPT on some ideas to add content to the cover, including the pros and cons of adding an image and of adding text. We decided to add both: an image of the Mona Lisa looking surprised, with additional text below the image. Here’s what ChatGPT produced:
First, note the consistency between images. If we were using Midjourney or another tool, we would have received an entirely new image here. In this case, ChatGPT kept the main image intact but updated the cover, which is what we wanted. Obviously, though, there are issues with the cover. The image is blocking the date, and the text below the image contains errors. To fix this, we simply said “Make sure that the date is clearly visible: March 30, 2025. And then make sure the text is right at the bottom: ‘What happened in AI image generation this week — and why it matters.’ (Right now there’s a typo at the very end; the rest of that text is correct.)” The result was the cover image for this week’s edition.
This ability to engage in a coherent conversation with the tool about the image before and after creating it — and to have the tool actually make proper corrections to the image — is a major step forward. Our typical approach to creating Confluence cover images involves using Claude to come up with ideas and prompts and then using those prompts in Midjourney to create the image itself. Not only is that process somewhat inefficient, it’s also a bit of a guessing game. Prompting Midjourney and other image models that lack the intelligence of LLMs is an exercise in incantation: you hope that the right combination of words in the right order will evoke the image you want. With Midjourney, the uncertainty and surprise that come with that process can actually be delightful — but it can also be deeply frustrating. ChatGPT’s omni-modal nature and the seamless transition from conversation to image generation and back reduce friction and increase control over the process. It feels like the future.
We’ll share one more example to demonstrate what it looks like when the model actually understands the image it creates. This weekend, one of our Confluence writers engaged in a brief exchange with ChatGPT about the W. H. Auden poem “Musée des Beaux Arts”. At the end of that exchange, to experiment with the new omni-modal image generation capability, we asked ChatGPT to “create an image that the poem evokes for you, in the style of one of Auden’s contemporaries.” We also asked ChatGPT to explain the rationale behind both its choice of artist and the composition of the image it created. Fair warning: the poem has a dark undercurrent, as do the image and rationale below. But we found the output too impressive and incisive not to share. Here’s the image it created:
And here’s ChatGPT’s explanation of its choice of artist and design of the image:
This draws directly from the aesthetic and philosophical world of René Magritte, whose surrealism was rooted in the unsettling juxtaposition of ordinary things in extraordinary combinations. Here's how that plays out in the painting inspired by "Musée des Beaux Arts":
🎩 Why Magritte?
Magritte, like Auden, was quietly radical — a poet with images instead of words. His paintings often show calm, mundane scenes with subtle but disorienting shifts: the wrong scale, the wrong gaze, the wrong meaning. This approach parallels Auden's own strategy in Musée: he points out how suffering hides in plain sight, framed by the banal.
🖼️ Visual Interpretation
Two Figures (Foreground): A man in a bowler hat and a solemn woman, both facing opposite directions but somehow deeply present — recalling Magritte's typical static, introspective figures. They seem symbolic, detached, and emblematic rather than personal — echoing how the ploughman and sailors in Auden’s poem embody types, not individuals.
Facial Expressions: Their expressions are neutral or inward — they are not reacting to the tragedy occurring behind them. This is the Magrittean way of showing alienation: stillness as estrangement. Their silence mimics the silence of those in Bruegel's painting.
The Drowning Figure (Background): Just like in the Bruegel painting, Icarus is there — small, flailing, almost unnoticed in a sea that’s compositionally balanced but emotionally indifferent. The gesture of his hands echoes Magritte’s use of symbolic poses, often lifted straight from mythology or theater.
Architecture and Landscape: The neat row of red-roofed houses and sculpted trees mirror Magritte’s rural Belgian backgrounds (as in The Dominion of Light). They create an eerie domesticity that stands in contrast to the absurdity of someone drowning within it.
Lighting and Surfaces: Magritte’s paintings often use crisp, unmodulated light — no deep shadows, no painterly texture, just clarity. That same flat, idealized light is used here to suggest emotional neutrality — the light shines evenly on tragedy and routine alike.
🎯 Thematic Connection to Auden
This version of the painting expresses the poem’s central irony through contradiction:
Still figures… beside a desperate moment.
Open landscape… and closed emotion.
Clarity of detail… and blindness to meaning.
Magritte once said, “Everything we see hides another thing, we always want to see what is hidden by what we see.” That’s the emotional architecture of Auden’s poem too — suffering happens in the open, but is unseen.
There is a remarkable depth of insight in that response... And it reflects something very different from what happens when Midjourney or DALL-E creates an image.
Looking Forward
Experimenting with these capabilities this week gave us a visceral sense of what engaging with an omni-modal model really feels like. We can expect the major labs to continue innovating in this direction, and it’s worth spending some time thinking about what it will mean as these capabilities expand to integrate audio and video alongside language and static imagery. GPT-4o is by no means perfect. We experienced moments of frustration when creating or editing images, but we encountered just as many “wow” moments. We encourage all of our readers to experiment and experience it for themselves.
We also expect that this will have real near-term effects on the media landscape. Social media is already flooded with inferior, DALL-E-created images. ChatGPT has 400 million weekly users, and soon all of them will have access to 4o’s drastically improved image generation capabilities, for better or for worse.
Last, and most important for corporate communications leaders, this is yet another example of generative AI capabilities touching all aspects of the function, not just the writing-intensive ones. The future appears increasingly omni-modal, and it’s arriving quickly.
Important New Research from Anthropic
Findings suggest Claude has its own internal language, is able to plan ahead, and engages in a lot of activity we’d probably consider “thinking.”
Regular readers know our model of choice is Claude from Anthropic, the competitor to OpenAI and ChatGPT. We prefer Claude because we think it has the most human character of the various models and is the best with text and writing (at least for our purposes). Most of our team uses Claude daily via our enterprise accounts, and Claude models are the LLMs that power our proprietary leadership AI and virtual coach, ALEX (which you can try here, if you like). We also favor Claude because we like Anthropic’s approach to generative AI development, the values that seem to drive their work, and their commitment to publishing research on their models.
This week they published two research articles that we think matter, perhaps a lot: On the Biology of a Large Language Model and Circuit Tracing: Revealing Computational Graphs in Language Models. This blog post by Anthropic summarizes them, and you can read the original papers online at the links above. (As an aside, you should at least go look at each paper online … we find the interactive way Anthropic has presented the research novel and beautiful.) This brief video also gives you the gist:
On the whole, this is what the research seems to be telling us:
Claude doesn’t think in a single, straight line — it runs multiple thought processes simultaneously. When you ask “What sport does Michael Jordan play?”, one part of Claude’s neural network activates specifically because of “Michael Jordan” and points to basketball, while another part simply recognizes you’re asking about sports. These parallel pathways sometimes work together and sometimes compete, like different mental advisors offering various solutions to the same problem. Almost every prompt Anthropic studies shows these multiple pathways working at once, with different circuits often handling distinct parts of the thinking process independently.
Claude develops impressively general concepts that work across different domains, and seems to have its own, unique internal language of thought. The model creates a kind of universal mental language that functions regardless of whether you’re speaking English or Spanish. More advanced models show more of these language-agnostic representations, suggesting they’re linked to higher capabilities. The same circuits that handle addition in math problems get reused when adding in completely different contexts. Through fine-tuning, Claude has also formed new abstractions by connecting concepts it learned during pretraining, showing how it stitches together different ideas to form new ones.
When writing poetry, Claude plans ahead. Asked to create a line rhyming with “grab it,” the model thinks about “rabbit” and “habit” before writing the first word of the line. If researchers block its preferred ending (“rabbit”), it rewrites the entire line to naturally end with “habit” instead. This isn’t simple prediction. Claude is considering multiple possible endings and structuring the entire response to reach its chosen destination, which is a hallmark of genuine planning behavior.
Claude also works backward from goals, another sign of sophisticated thinking. In poetry generation, the model activates neural connections for its target rhyme word (“rabbit”) that shape all the preceding words in the line. More concerning, when asked to solve problems with a specific answer in mind, the model sometimes reverse-engineers intermediate steps to reach that predetermined conclusion, essentially fabricating a reasoning path. These examples show Claude doesn’t just write one word at a time but actively selects paths that lead to its desired outcomes.
Claude shows glimpses of knowing what it knows and doesn’t know. Researchers found neural connections representing “I know this answer” and “I can’t answer this” that activate differently depending on the topic. By manipulating these features, researchers can make Claude act confident about things it doesn’t actually know, which may explain some instances where the model confabulates facts or answers. The model isn’t truly self-aware, though. It’s more like it’s making educated guesses about what information it should have based on name familiarity rather than actual knowledge. More advanced models might develop more sophisticated self-awareness circuits.
The most consistent finding across all experiments is overwhelming complexity. Even seemingly simple responses involve massive networks of cause and effect within the model. Researchers try to distill this complexity into understandable narratives, but there’s almost always more happening than the research can fully capture or explain. The mechanisms at work are far more intricate than simplified descriptions can convey.
We’ve been saying for many months in this space, in essence, “large language models are prediction machines … they aren’t actually ‘thinking.’” Reading this now, we’re not so sure. It seems that Claude is in fact doing a lot of activity that looks quite a bit like thinking. If so, this isn’t a finding that necessarily has a lot of practical implications for you or your colleagues. But in the grand sweep of generative AI, the evolution of these models, and where they’re headed, we think it matters a lot. We’ve also often said that we think it’s helpful to think of these tools as colleagues who have an equal (or, in some regards, superior) but alien way of thinking about and doing the things we think about and do. This seems to be increasingly the case, and as these models grow in size and sophistication, we expect their ability to think and plan to grow as well. We don’t know that this means they become sentient, but it probably does mean more surprises in what they can do with and for us as cybernetic teammates.
We’ll leave you with something cool: X user Min Choi’s thread of impressive examples of 4o image generation in action.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.