Confluence for 2.15.26
How to better steer your AI. The Chief Question Officer. Evaluating talent with AI. The generative AI workload trap.
Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI, leadership, and corporate communication:
How to Better Steer Your AI
The Chief Question Officer
Evaluating Talent with AI
The Generative AI Workload Trap
How to Better Steer Your AI
Something everyone can learn from claude.md files.
Two years ago, much of the conversation about how to use large language models (LLMs) to best effect focused on prompt design. What you asked the model, and how you asked it, had a large influence on performance, and “prompt engineering” mattered so much that people felt it would become a new professional field.
Turns out that’s not what happened. As the models became smarter, they also became better at discerning intent and using their own reasoning threads to create output that was very good even with simple prompts. Indeed, the wellspring of effectiveness seems to have shifted from prompts to context. With LLMs, every new conversation is a blank slate. They carry over no real knowledge of prior conversations, and know nothing about you, your job, or your request other than what you’ve provided in the chat window. While they are smart enough to work out how to give you what you want most of the time regardless of how you ask, they have no real context. And context is critical to doing great work in any endeavor. Because of this, the focus has shifted from prompt engineering — deciding what to ask the model — to context engineering — deciding what additional context to provide the model so that it can do great work. While prompts direct a model, context steers it in all kinds of helpful ways.
In the world of software engineering, developers using tools like OpenAI’s Codex and Anthropic’s Claude Code have developed all manner of means to provide the model with additional context, including full access to code bases, file structures, data sources, and more. The more context the model has, the better it is at working on the task. Most users, though, work in web-based chatbots like Gemini, ChatGPT, and Claude.ai, and may not appreciate how much context shapes model behavior. There are ways to provide context that better steers these tools as well, and we can take a page from the developer handbook to do so: the claude.md file.
For developers using Claude Code, a claude.md file is a simple text file that Claude usually creates on its own when it starts to work on a body of code. Claude reads the code, and then writes a memo to itself that describes what that code is, what the user is trying to accomplish with it, and how Claude can be more helpful in the future. Then, whenever the user starts a new session in that code, the first thing Claude does (before any chatting takes place) is read that claude.md file and pick up whatever context it provides. The file is like a memo to future Claudes, telling them “here’s what you need to know before you even get started in this work.” Claude Code does this for any project a user is working on, and also at the “user” level, writing a claude.md file that covers all interaction with that user; it reads both on startup. This gives the model context for its work with the person as a whole, and for whatever project they are working on in particular. This context makes the model far more effective, and by adding key things to these files over time, users create a very context-aware version of Claude. The adage among developers: “If there is anything you want Claude to do, not do, know, pay attention to, or ignore in the future, add it to the claude.md file.”
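To make this concrete, here is a sketch of the kind of notes a project-level claude.md might accumulate. The project and instructions below are invented for illustration, not taken from a real file:

This repository is a Python pipeline that builds our monthly reporting deck. The user is a communications analyst, not a professional developer, so explain changes in plain language. The scripts in the ingest folder pull survey data and should not be modified without asking first. Run the test suite before proposing any change, keep edits small and well commented, and summarize what you did at the end of each session.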
The good news is that you can get much of this benefit in your own uses of Gemini, ChatGPT, and Claude.ai.
Every major AI service now offers two levels of persistent context that mirror the developer’s claude.md approach: account-level preferences that apply to every conversation, and project-level instructions that apply to specific areas of work.
In Claude.ai, these are called User Preferences (found in Settings) and Project Custom Instructions. ChatGPT offers Custom Instructions at the account level and project- or GPT-specific instructions for focused work. Gemini provides similar personalization settings and Gems with custom instructions for specific tasks. The names differ, but the architecture is the same: one layer for “who I am and how I work,” and another for “what this specific body of work is about.”
The account level is where you tell the model about yourself. Your role, your industry, your communication style, and anything about model behavior that you want it to know in every conversation you have. If you’re a CFO at a healthcare company, say so. If you prefer concise answers over exhaustive ones, say so. If you hate bullet points, or love them, say so. This is your standing brief to every new AI conversation: before we even start talking, here’s what you should know about me. Developers call this “user-level context,” and it’s remarkably effective at shifting a model from generic to genuinely useful.
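For illustration, an account-level entry along these lines (the specifics are invented) might read:

I lead corporate communications at a mid-size healthcare company and spend most of my time on executive and board materials. Keep answers concise and lead with the recommendation. Use plain language, avoid heavy bullet-point formatting unless I ask for it, and if you are unsure of a fact, say so rather than guessing.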
The project level is where you tell the model about the work. If you’re preparing a board presentation, a project instruction might describe the audience, the company’s strategic priorities, the tone you need, and key data sources. If you’re using AI to help with a recurring deliverable like a weekly report, the project instruction might include the format, the distribution list’s expectations, and examples of reports that hit the mark. This is the equivalent of a developer’s project-specific claude.md: everything the model needs to know to do great work in this particular context. You can use the AI to help you craft these, and here’s a sample prompt to develop project-level context. Note that with ChatGPT, Gemini, and Claude you can upload reference files to projects, and we suggest doing this first so the AI can review those files in crafting the instructions:
I'm working on [describe the project or recurring task] and have uploaded files about it that you should read. My role is [your title/function], and the audience for this work is [who will see or use it]. The output I need is typically [format: memo, presentation, analysis, email, brainstorming, collaboration, etc.], and the tone should be [formal, conversational, direct, etc.]. Here are a few things that matter for this work: [list 2-3 specifics, like "regulatory language must be precise" or "keep it under one page" or "always consider budget implications"]. Based on this, write me a concise set of project instructions I can save and reuse. Keep it under 500 words. Focus on what a smart assistant would need to know to do great work on this project without being told twice.

What should you actually put in these? Start simple. A few sentences about who you are and what you do. A line about the kind of output you find most useful. Then build the habit that developers have learned: whenever you find yourself correcting the model, or wishing it already knew something, add it to your instructions. “Don’t use jargon.” “Always consider regulatory implications.” “I present to the board quarterly; keep recommendations at that altitude.” These small additions compound. Within a few weeks, you’ll have a set of instructions that makes every new conversation meaningfully better than starting from a blank slate.
The key insight from the developer world is that the best claude.md files are living documents. They evolve as the user and the model learn what works together. The same should be true for your preferences and project instructions. Treat them as working notes. Update them as you notice how the model responds to you, what it gets right without being told, and where it keeps missing. The gap between what the model produces and what you actually need gets smaller every time you do.
Context, it turns out, is the real unlock. The models are plenty smart. The question is whether you’ve given them enough to work with.
The Chief Question Officer
A way to think about applying experience and expertise.
As generative AI takes on more of the execution in day-to-day work, we’re increasingly hearing a version of the same question from clients: How do I make sure I’m adding the most value? People want to know where their time and expertise matter most.
Erik Brynjolfsson, director of Stanford’s Digital Economy Lab, offered a useful framework at the Wall Street Journal’s Tech Council Summit. The interview is worth the full 21 minutes, but one idea in particular stood out. He calls it the “Chief Question Officer.” It doesn’t need to be a formal role, but the way he describes it is instructive for any team weaving generative AI into how they work. Brynjolfsson points out that in almost every project, there are three things you have to do:
You have to define the question. Figure out what you’re actually working on and why. If you ask a model to build a marketing plan, it will build you one. It won’t stop to consider whether a marketing plan is the right call in the first place.
You have to do the work. This is where today’s models help the most. They can draft the timelines for that plan, create the copy, mock up the collateral, and handle much of the labor involved.
You have to evaluate the output. Does the plan actually advance the goals you set at the beginning? This is where taste (understanding what good looks like) and judgment (the ability to draw on expertise and experience to decide whether this will work and what needs to change) come in.
AI is increasingly taking over the middle phase. That makes the bookends more important than they’ve ever been. How do you point the technology toward the right ends, the right goals? How do you evaluate whether what comes back will actually help you reach them? The constraint used to be the work itself. Now that it’s not, the questions you ask at the front of a project and the judgment you apply at the end carry more weight than ever.
Evaluating Talent with AI
As Claude gets smarter, Anthropic’s test for applicants gets weirder.
Tristan Hume, lead on Anthropic’s performance optimization team, recently published a piece about the take-home test he uses to evaluate and hire performance engineers. Hume has had to redesign the test as each new Claude model has outperformed the last, and find increasingly creative ways to measure the skills that matter.
Hume originally designed Anthropic’s take-home test in 2024, when the company was growing rapidly and needed to evaluate more candidates faster. The test was built to assess candidates’ creative problem-solving over a realistic timeframe, with assistance from Claude. Hume wanted to test candidates’ raw technical skills, but also how well they could partner with Claude to surpass what either could do alone.
The test proved effective: the candidates who excelled on it continued to impress after they were hired. Great. But then things got tricky:
By May 2025, Claude 3.7 Sonnet had already crept up to the point where over 50% of candidates would have been better off delegating to Claude Code entirely. I then tested a pre-release version of Claude Opus 4 on the take-home. It came up with a more optimized solution than almost all humans did within the 4-hour limit.
So Hume created a new test that Opus 4 couldn’t beat. Then, Opus 4.5 changed the game again: “it could both beat humans in 2 hours and continue climbing with time.” Hume realized he needed to go “weirder,” to find a problem that would require a level of ingenuity and reasoning Claude couldn’t match. The latest version looks radically different from early ones, presenting candidates with small puzzles that test judgment and creativity but bear little resemblance to the actual job. “Realism,” Hume admits, “may be a luxury we no longer have.”
No word yet on whether Hume will have to update the test yet again for Opus 4.6, but we suspect he will. Still, it’s worth noting that Anthropic’s best candidates not only performed well on the test, but also seemed to enjoy it. Many of them kept working on it long after the official time limit had expired, and they outperformed Claude more and more the longer they went on. That’s partly because LLMs still struggle with longer tasks (though that gap is quickly closing), but also because humans do well when we have a lot of time to play, ruminate, and think outside the assumed bounds of the task at hand.
All in all, Anthropic’s approach to talent evaluation has an ongoing rigor that we expect will become a requirement as the rest of us catch up to the software engineering industry. We’ll all need to find new and increasingly creative methods to both define and evaluate the human skills that matter most for our work — even if those methods have to keep evolving, or get a little weird.
The Generative AI Workload Trap
When AI makes everything feel doable, some people do everything.
We wrote last week about research demonstrating that how leaders talk about AI shapes whether employees support or resist it. It seems to also be true, though, that whether leaders talk about AI at all might matter just as much. In an eight-month field study at a U.S.-based technology company, researchers from UC Berkeley found that “AI Doesn’t Reduce Work – It Intensifies It.” Workers moved faster, took on tasks outside their usual scope, and let work bleed into what used to be breaks and off-hours, often without being asked. The company didn’t mandate AI use; it simply offered enterprise subscriptions and let people experiment. Creating this space for experimentation allowed individuals across the organization to attempt work they previously would have deferred or delegated. That sounds like a good thing, right?
Too much of a good thing, perhaps, as what initially felt like empowerment quietly became overload. As one engineer put it: “You had thought that maybe, oh, because you could be more productive with AI, then you save some time, you can work less. But then really, you don’t work less. You just work the same amount or even more.” When AI makes everything feel a little more doable, we have to consider more carefully whether each new task is something we really should do.
The researchers call for organizations to develop an “AI practice,” or intentional norms around when AI use is appropriate, when to stop, and how work should and shouldn’t expand in response to new capability. Their tactical recommendations are sound. Teams should build in intentional pauses before major decisions, sequence work in phases rather than reacting to every piece of AI-produced content in real time, and protect time for human connection so that work doesn’t become entirely solo and AI-mediated. These are worth adopting. But we think that workload creep can best be acknowledged and managed through simple conversation.
If AI is quietly reshaping how much people work, how many tasks they take on, and when they stop, then leaders need to create space for those dynamics to surface. That means normalizing conversations about AI at every level of the organization. More than policy enforcement or adoption cheerleading, it requires an ongoing, honest dialogue about how these tools are actually changing the experience of work. When AI makes starting a task frictionless, the old signals that regulated workload (think: waiting for a colleague’s input, deferring to someone with more expertise, or, simply, not knowing where to begin) disappear. The new pace feels productive, even exciting, until it doesn’t. Leaders who make AI a regular topic of conversation will give their people permission to say so before burnout is on the horizon.
We’ll leave you with something cool: ByteDance’s new Seedance 2.0 video generation model is wowing people with its capabilities – but not everyone is happy about it.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
