Confluence for 7.27.25
AI models get gold at the International Math Olympiad. The AI bubble question. America's AI Action Plan. A Framework for Providing Context to LLMs. Claude can use Canva.

Welcome to Confluence. For this week’s image, we tested Google’s new Imagen 4 model, which reportedly features notable advances in text rendering among other improvements. We found that it’s still not perfect (it really did not want to add the Oxford comma to the tagline), but it certainly seems like a step forward. With that said, here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
AI Models Get Gold at the International Math Olympiad
The AI Bubble Question
America’s AI Action Plan
A Framework for Providing Context to LLMs
Claude Can Use Canva
AI Models Get Gold at the International Math Olympiad
These things are getting smart.
Below is Claude’s contribution for this week’s edition. We present it with formatting but no editing. As always, we include the prompt in our footnotes. This week we also gave the same prompt to ChatGPT o3 Pro as a point of comparison. Its version is also in the footnotes.1
This week marked a significant milestone in artificial intelligence as both OpenAI and Google DeepMind announced that their latest models achieved gold medal-level performance at the 2025 International Mathematical Olympiad (IMO). Both companies’ models solved five of the six competition problems, earning 35 out of 42 points — a score that places them squarely in gold medal territory. Around 10% of human contestants won gold-level medals at this year’s competition, which draws elite pre-university mathematicians from over 100 countries. What makes this achievement particularly noteworthy is that the models operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions — all within the 4.5-hour competition time limit.
To understand the complexity of these problems, consider Problem 1 from this year’s competition:
A line in the plane is called sunny if it is not parallel to any of the x-axis, the y-axis, and the line x + y = 0. Let n ⩾ 3 be a given integer. Determine all nonnegative integers k such that there exist n distinct lines in the plane satisfying both of the following:
for all positive integers a and b with a + b ⩽ n + 1, the point (a, b) is on at least one of the lines, and
exactly k of the n lines are sunny.
These aren’t simple calculations — they require creative mathematical reasoning, complex proof construction, and the ability to navigate multiple approaches to find elegant solutions. The proofs generated by the AI models were noted for being “clear, precise and most of them easy to follow” by expert graders.
This represents a dramatic acceleration in AI mathematical capabilities. In 2024, AI labs were using grade school math (GSM8K) as an eval in their model releases. Since then, we’ve saturated the (high school) MATH benchmark, then AIME, and now are at IMO gold. Just last year, Google achieved only silver medal performance with models that required human translation of problems into formal mathematical languages and took days to compute solutions. This year’s achievement — using general-purpose models working directly with natural language in real competition time — demonstrates how rapidly these systems are advancing beyond specialized tools toward genuine mathematical reasoning.
The pattern here extends far beyond mathematics. Time and again, tasks that experts identify as representing the frontier of what AI cannot do become conquered territory within months or years. Chess fell decades ago, Go more recently, and now elite mathematical problem-solving joins the list. The results show that these models can perform better than those that have been hand-tuned for specific tasks, suggesting we’re witnessing the emergence of genuinely general-purpose reasoning systems. For organizations thinking about the trajectory of AI capabilities, the message is clear: the boundaries we imagine today are likely temporary. What seems uniquely human now may well be within reach of these systems sooner than we expect.
The AI Bubble Question
It matters, even if we don’t know the answer.
A recent research note from Torsten Slok, chief economist at Apollo Global Management, has been making the rounds in the past week or so for its ominous comparison of the rise of AI to the dot-com bubble of the late 1990s. In fact, Slok warns that today’s AI “bubble” is headed toward an even mightier collapse. Slok’s note is a single slide, below, with an accompanying comment: “The difference between the IT bubble in the 1990s and the AI bubble today is that the top 10 companies in the S&P 500 today are more overvalued than they were in the 1990s.”
We’re not economists, and we don’t pretend to be. But when someone as highly respected in their industry as Slok makes waves with something like this, we think it’s important to listen. Not because we feel qualified to dissect his data or argument, but because we believe in staying aware of high-level conversations like this about where AI might be headed.
While the comparison to the dot-com era makes for compelling headlines, several fundamental differences suggest the parallels may be weaker than they appear. The companies driving AI valuations aren’t speculative startups, for one thing. Microsoft, Google, Meta, and others generate hundreds of billions in annual revenue from established businesses. What’s more, the internet that sparked the late-90s craze began as a publicly funded Defense Department project that took decades to reach commercial viability. The AI moment we’re living in has built on that mature digital infrastructure, with immediate applications already generating real value across industries. And, perhaps most tellingly, despite the spectacular dot-com crash, the internet didn’t disappear.
None of this means that we should ignore Slok’s warning, or that every AI bet will pay off. Still, understanding these contextual nuances matters as we consider AI’s trajectory and plan for the future. At minimum, Slok’s research reminds us to be intentional with AI investments while carefully distinguishing genuine capability advances from market enthusiasm. We should encourage conversations and narratives that balance AI’s transformative potential with the financial unknowns. The possibility of an impending bubble burst extends beyond Wall Street water cooler talk to become essential context that shapes how we plan, invest, and communicate about AI’s role in our organizations’ futures.
America’s AI Action Plan
The US administration sets a path forward on AI.
We don’t talk much about government AI regulation and policy in this space, but one of the big items from the past week was the US Administration’s AI Action Plan. Whatever regulation and oversight sovereign nations place on AI development within their borders and on their trade will have significant implications for the trajectory of AI in work and society, so we do keep an eye on it.
We won’t cover the action plan in detail, but you can read the full plan here, and Zvi Mowshowitz has (as he always does) a beyond-thorough review of the plan and reactions to it here. The plan is not without a set of political premises separate from the technical ones, so depending on your personal views, some of its language may resonate with you or frustrate you. We’ll stay out of that. Zvi’s gist: “America’s AI Action Plan is pretty good. No, seriously. If you look at the substance, it’s pretty good.” Note that Zvi falls closer to the “doomer” side of the AI Optimism Scale, meaning he is more inclined to believe that we must get AI right or there may be extreme and unknowable consequences. So we would expect him to favor only a plan that fosters smart development and risk management. He ends with:
Mostly, we are in a great position on substance: The plan is net helpful, and the main thing wrong with the substance of the plan is not what is in it, but what is missing from it. The issues that are not addressed, or where the actions seem to lack sufficient teeth. That doesn’t mean this puts us on a path to survive, but I was very worried this would be net destructive and instead it is net helpful.
I am less happy with the rhetoric, which is hostile and inflicts pain upon the reader throughout, and most importantly does not even deem many key concerns, including the most important concerns of all, even worthy of mention. That is worrisome, but it could have been far worse, and what matters most is the substance.
Given the way things have been otherwise going, I am very happy with the substance of this plan, which means I am overall very happy with the plan. I offer my thanks and congratulations to those involved in its creation, including Dean Ball. Great work.
We plan to read both his post and the plan in detail this coming week. If you’d like to listen to an AI-generated podcast overview of Zvi’s post and the Action Plan document, we’ve generated one using Google’s NotebookLM here.
A Framework for Providing Context to LLMs
Better context leads to better output.
One of the most basic but effective approaches to getting better output from LLMs is to provide them with the right context. The more the LLM knows about what you want and how you want it to engage, the better it can tailor its responses. Context is a broad term that can include a wide range of information, though, and for many users it can be difficult to decide what information is most helpful, how much context is too much, and so on. We’ve found that it’s not so much about the volume of context you provide as it is about providing the right context, and we’ve developed a simple framework for this that we hope our readers find helpful. It consists of three key categories of information: standards, examples, and situational context.
Standards are information you provide to the model about what excellence looks like for the task at hand. The model will have its own ideas of what that is from its training data, but this will be a conventional understanding based on, essentially, “the average of the internet.” But what does excellence look like to you? Make sure the LLM knows. In our firm, for example, we recently documented excellence standards for many of our core deliverables, which include “must-haves” for each deliverable as well as strategic considerations to elevate quality. We wrote these standards for our colleagues rather than for LLMs, but this type of information, it turns out, is perfect context for steering LLMs as well. If you have similar documentation already, then you should absolutely use it in your prompting. If you don’t, it’s worth spending a few minutes to write it down (or, better yet, dictate it) for the LLM.
Examples are an obvious but powerful way to get dramatically better output. Providing information about your standards tells the LLM what good looks like; examples show it. Have you or someone else in your organization ever done the task at hand? Do you have examples of the final product? If so, pick a handful of the best and provide them to the LLM. We’ve found that when it comes to examples, quality seems to matter more than quantity: you may be better off providing two or three great examples than ten of varying quality. Either way, these models excel at picking up nuances in structure, tone, level of detail, and so on to mimic the examples you provide.
Last, situational context is, not surprisingly, about giving the LLM the important details surrounding whatever it is you’re working on. This comes last in the framework for a reason. The first two categories provide the LLM with the foundation for the task; this one provides the details it needs to execute the task effectively. What makes this situation unique? What really matters here? What do we need to emphasize or avoid? Think of the details you’d give a colleague you were enlisting to help with the task, and give the same level of detail to the LLM. This could involve a combination of inputs, from uploading documents that contain the information (meeting notes, presentations, etc.) to creating the context “from scratch.”
Dictation can be particularly powerful for providing situational context, and it’s one of our most common uses for the feature. When we need to do a “context dump” to bring the LLM up to speed, we’ll often use the mobile app and dictate for several minutes, just as if we were talking to a colleague on the phone. Then, we’ll ask the LLM to turn that monologue into a cleaner transcript and use the result as the context for the conversation. Another slightly more time-consuming approach is to ask the LLM to interview you. One prompt we’ve recently used for this and found effective is:
Before you start, conduct a brief interview with me to gather the additional context you need to do this effectively. Start by creating the list of questions you plan to ask me so I can weigh in on those. Once I do, ask me these questions one at a time.
For efficiency, we dictate our answers. At the conclusion of the interview, we ask the LLM to create a transcript of the conversation and then upload that transcript as context.
This framework can work in individual conversations as well as for the creation of custom GPTs in ChatGPT or Projects in Claude. If you provide the LLM with your standards for excellence, examples of what that looks like, and sufficient detail on the specific situation for which you’re enlisting its help, we’re confident that you’ll find the LLM producing significantly better output on its first try. Providing this context upfront takes a bit of time, but we’ve found that the return on investment of that time is almost always worth it.
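To make the framework concrete, here is a hypothetical prompt skeleton (the deliverable and the details are purely illustrative, not drawn from a real engagement) showing how the three categories might come together at the start of a conversation:
Help me draft a communication plan for an upcoming organizational change.
Standards: For us, a strong plan names each audience, sequences the messages, assigns owners and dates, and anticipates the two or three hardest questions. Keep the language plain and suitable for a busy executive.
Examples: I’ve attached two plans we consider excellent. Match their structure, tone, and level of detail.
Situational context: [Paste or dictate the specifics here: what is changing, who is affected, the timing, the sensitivities, and what success looks like.]
As with the interview prompt above, the exact wording matters less than making sure all three categories are present before you ask for a first draft.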
Claude Can Use Canva
We’re closer to creating branded content using only natural language.
Anthropic continues to roll out more integrations for Claude, and one in particular stood out this week. Users can now create content in Canva using only natural language prompts in Claude. The integration allows users to describe what they want — infographics, presentations, and social graphics — and Claude creates them directly in Canva. Anthropic posted a video on X showing how this works.
One of the most common questions we get from clients is when Copilot, ChatGPT, or Claude will be able to create content using their organization’s branding and templates. Now, when you connect your Claude and Canva accounts, Claude can. We cannot vouch for the quality of the output, as we don’t use Canva in our daily work, but the integration confirms that Anthropic and Canva understand the potential of bringing the power of LLMs to the production of the visual content organizations use every day. Regardless of the quality of the output today, we expect it to improve over time.
This reinforces another point. Leading AI labs and enterprise software companies recognize the ability of LLMs to act as a middle layer between the user and the final output. As the barriers to content creation continue to shrink, two things will become more important. First, knowing how to work with LLMs effectively: we must hone our skills in prompting, providing appropriate context, direction, and feedback so the output suits our purposes. Second, applying our own taste and judgment. The ability to use LLMs, in this case Claude combined with Canva, to create visual content quickly doesn’t guarantee an excellent outcome.
The gap between being able to create and creating something worth sharing remains significant. Speed without judgment yields mediocrity at scale. For organizations looking to adopt these tools, the real challenge is developing the human capabilities to guide them well.
We’ll leave you with something cool: Google has discovered a new capability with its leading video generation model. Veo 3 can interpret annotations you place on the first frame to determine how the video should unfold.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
Claude’s prompt:
Good day, Claude. We’re going to write a piece about the recent gold medals by multiple generative AI models in the international math olympiad.
Do web research to learn about what happened with that this week. Then plan for and write a four paragraph piece for Confluence. There's some controversy about when the results were announced and by whom, but we will ignore that. We want to make the point that people keep putting up tasks that are the frontier of what these models cannot do, and then they do them. Math won't be the only domain where this happens. You might include one of the math problems (both the problem and the solution) with some commentary about how long it takes a student to solve it to give a flavor of the level of math we're talking about here. You can probably find that on the web, too.
Don't be over the top on this. Play it straight. But the point is: these things are getting very, very capable.
… and here is ChatGPT o3 Pro’s article (also without editing):
Generative AI just crossed another line the humans drew in the sand. An advanced version of Google DeepMind’s Gemini, running in “Deep Think” mode, and a not‑yet‑named OpenAI reasoning model each solved five of the six problems on the 2025 International Mathematical Olympiad, earning the same 35‑point score that hands teenage prodigies a coveted gold medal. It’s the first time any system has done it within the contest’s own rules—no formal‑language hand‑holding, no multi‑day compute marathon, just 4½ hours per paper like everyone else.
If this feels like déjà vu, that’s because the goalposts keep scooting backwards. Last year Gemini’s forerunner needed two days and a team of Lean translators to scrape a silver; this year the model read natural‑language statements, produced line‑by‑line proofs, and finished inside the official time limit. The pattern is becoming familiar: pick a “grand challenge” task (Go, protein folding, theorem proving, now Olympiad maths), declare it a ceiling for large language models—and watch the ceiling turn into floorboards within a release or two.
To appreciate the level, consider Problem 1 from this year’s IMO. “A line in the plane is called sunny if it is not parallel to the x‑axis, the y‑axis or the line x + y = 0. Given n ≥ 3 lines, determine all k for which …”—the short version is: classify how many of the n lines can be sunny while still hitting every lattice point under a certain bound. The official solution shows the only possible values are k = 0, 1 or 3, proved via a clever incidence count and an inductive construction. Human contestants have 270 minutes per session, so even a future medalist budgets 60–90 minutes for a proof of this calibre; the AI cleared it—as part of the whole six‑problem set—inside the same window.
Mathematics won’t be the last domino. The same reinforcement‑learning tricks, search over scratch‑pad reasoning and heavyweight “parallel thinking” that cracked Olympiad geometry and number theory are already being tuned for drug design, code verification and policy drafting. The prudent takeaway is not panic but posture: every time we label a task “inherently human,” we issue an open challenge to the next generation of models. So far, the models are collecting trophies—and they’re not fussy about the domain.