Confluence for 10.22.23
The Best Available Human standard. Capitalizing on generative AI. Multilingual robocalls. An introduction to AI terms. The AGI debate. PSA: Large language models still hallucinate.

Welcome back to Confluence. We have a particularly rich issue this time around, but before we get into it we’d like to direct you to the most recent revision of the Eloundou, Manning, Mishkin, and Rock paper on the labor market impact potential of large language models (LLMs). We’ve posted about this paper more than once and consider it a must-read for anyone wanting to understand some of what’s likely coming as generative AI diffuses into the workforce. With that said, here’s what has our attention at the intersection of AI and corporate communication:
The Best Available Human Standard
Capitalizing on Generative AI
Multilingual Robocalls
An Introduction to AI Terms
The Artificial General Intelligence Debate
PSA: Large Language Models Still Hallucinate
The Best Available Human Standard
Ethan Mollick suggests an interesting rule of thumb for generative AI use.
We try to have no more than five items in each Confluence, but today we have six, as this morning Ethan Mollick published a post on his Substack that we think is worth reading in full. We reference Ethan a lot. If you’re not reading One Useful Thing, you should be. In this most recent post he reminds us of three things about AI: that AI is ubiquitous, that AI is extremely capable in ways that are not immediately clear to users, and that AI is limited and risky in ways that are also not immediately clear to users. As a result, he suggests a new standard for its use. In his words:
So, we have a tool that is capable of great benefit, but also of considerable harm, that is available to billions. The creators of these technologies are not going to be able to tell us how to maximize the gain while avoiding the risk, because they don’t know the answers themselves. Making it all more complicated, we don’t actually know how good AI is at various practical tasks, especially compared to real human performance. After all, AI makes mistakes all the time, but so do people.
Given this confusion, I would like to propose a pragmatic way to consider when AI might be helpful, called Best Available Human (BAH) standard. The standard asks the following question: would the best available AI in a particular moment, in a particular place, do a better job solving a problem than the best available human that is actually able to help in a particular situation? I suspect there are many use cases where BAH is clarifying, for better and worse. I want to start with two examples that I feel qualified to offer, and then some speculation (and a call to action!) for others.
We find this standard fits well with how we’ve been implementing generative AI in our own firm, as one of our rules of thumb has been, “If it’s something you can’t do yourself, and nobody else is available, and the AI can do it at least as well as a colleague, use generative AI for that.” There are some caveats for this rule of thumb related to skill development and skill erosion, and we will discuss those in a future issue, but so far this general rule has served us well. In practical terms for communication professionals and leaders, think of all the instances in which the Best Available Human standard could serve you: proofreading text, generating ideas, giving you briefings on topics you don’t know, tutoring you in areas you want to develop, writing talking points from the slides of a PowerPoint … there are many, many possible applications where it could do at least as well as the best person at your immediate disposal.
The question to ask is, “Could GPT-4 (or Bing, or Claude, or Midjourney) do a better job of helping me with this than the best person available to me?” If so, it makes sense to use the generative AI. Just be sure to keep a human in the loop for quality assurance, and be mindful of what generative AI does poorly and not just well (as we write about in more detail in the final item of this issue).
Capitalizing on Generative AI
Dan Rock and his co-authors offer new insights and a practical approach for companies to get started with generative AI.
A recent Harvard Business Review article by our friend Dan Rock of Wharton and co-authors Andrew McAfee and Erik Brynjolfsson sheds light on how generative AI is beginning to play out in organizations today and, for those organizations still on the sidelines, offers some practical guidance for where to start. The article includes a compelling case study, straightforward guidance for companies on how to assess the potential impact of generative AI on jobs, and guidance for dealing with current challenges and limitations like “confabulation,” bias, invasion of privacy, and intellectual-property problems. It’s an insightful, nuanced piece worth reading in its entirety.
Every section of the article has implications for leadership and corporate communication. Perhaps the most important, in our view, is the case study. While the case study centers on generative AI adoption by customer service agents at an enterprise software company, its conclusions are likely to be applicable to corporate communication (among many other fields). The first finding is about upskilling and aligns with research we’ve covered in a previous edition (emphasis ours):
It’s especially interesting that the least-skilled agents, who were also often the newest, benefited most. For example, resolutions per hour by agents who had been among the slowest 20% before introduction of the new system increased by 35%. (The resolution rate of the fastest 20% didn’t change.) The generative AI system was a fast-acting upskilling technology. It made available to all agents knowledge that had previously come only with experience or training.
The second finding is about retention (again, the emphasis below is ours):
What’s more, agent turnover fell, especially among those with less than six months of experience—perhaps because people are more likely to stick around when they have powerful tools to help them do their jobs better.
As for the authors’ guidance on where organizations should start? “Find a project with an attractive benefit-to-cost ratio and low risks and start trying things.”
Multilingual Robocalls
New York Mayor Eric Adams leans in to voice translation — for better or for worse.
In a recent edition of Confluence, we noted that “advances in AI audio and video [are] pointing to a future where the audio of leaders’ messages can be translated into multiple languages, in their own voice.” Well, that future is here, at least in New York City. In the same week as New York released its Artificial Intelligence Action Plan, Mayor Eric Adams came under scrutiny for making robocalls to residents in a number of non-English languages, with critics accusing him of misleading residents by suggesting he speaks languages that he does not in fact speak.
The criticism is bolstered by the fact that the (very convincing) calls did not disclose that the voice was AI-generated. As we’ve discussed before, the questions surrounding disclosure of AI use are not going away any time soon. We continue to advise leaders and organizations to err on the side of over-disclosure, at least until new norms are established.
It’s impossible to predict how this will play out in politics and government, though that will be fascinating to watch. In the business world, though, we expect voice translation to become much more common, and to meet with far less pushback than in the political arena. The value of employees hearing leaders in real time, in their own native languages, is simply too high. With the requisite disclosures and safeguards in place, we believe the benefits will outweigh the costs.
An Introduction to AI Terms
The Wall Street Journal has a nice primer on AI terminology for the uninitiated.
In previous editions of Confluence we’ve shared several valuable primers for understanding generative AI, including Cal Newport’s New Yorker article “What Kind of Mind Does ChatGPT Have?” and Madhumita Murgia’s “Generative AI exists because of the transformer” visualization in the Financial Times. You can add The Wall Street Journal’s new overview of AI terminology to that list.
The article breaks down essential AI terms, from basic algorithms to complex neural networks and large language models like OpenAI’s GPT series. It’s a nice resource for anyone looking to read or share a primer on the foundational terms and concepts in AI, and we’ve added it to the list of primers we share with clients.
The Artificial General Intelligence Debate
Noema Magazine provides a helpful overview of a concept that will be increasingly central to the discourse surrounding AI.
The WSJ piece we discuss above concludes with a definition of artificial general intelligence, or AGI. AGI has long been considered the “holy grail” of artificial intelligence research, and as the rate of technological improvement continues to accelerate, it will be increasingly central to the AI conversation. In a recent essay in Noema Magazine, Blaise Agüera y Arcas (vice president and fellow at Google Research) and Peter Norvig (Distinguished Education Fellow at the Stanford Institute for Human-Centered AI) argue that “the most important parts of [AGI] have already been achieved by the current generation of advanced AI large language models.”
The piece provides an overview of general intelligence, an argument for why the current large language models are showing signs of AGI, and a rundown of four prominent areas of disagreement. Whether or not you agree with the authors’ conclusion — that “Decades from now, [current frontier models] will be recognized as the first true examples of AGI, just as the 1945 ENIAC is now recognized as the first true general-purpose electronic computer” — this essay will give you a better understanding of AGI and the contours of the debate surrounding it, which is sure to intensify for the foreseeable future.
The conversation surrounding AI will evolve as quickly as the technology itself, and staying current with that conversation will be important if corporate communication professionals are to credibly advise the business on AI’s implications. AGI will be a growing part of that conversation, and it is wise to begin understanding what that means now.
PSA: Large Language Models Still Hallucinate
Several recent items remind us how often tools like GPT-4 make things up and get things wrong.
We are bullish on the potential of generative AI, and we believe our clients should be as well. We believe we stand on the frontier of a transformation in technology, business, and likely society, driven by large language models and generative AI as general purpose technologies. We’re working hard to be far ahead in making use of these tools ourselves, and in helping our clients do the same. These tools are remarkable, and will only get better — the AI you are using today is the worst AI you will ever use. That said, we work hard to remind ourselves, and here want to remind you, of their significant flaws and potential for error. They hallucinate, confabulate, get math wrong, and just make things up. A lot. And three items today remind us of this fact.
The first is a post by Gary Marcus on “multimodal hallucinations,” or hallucinations in imagery. See his full post here, but the gist is that showing GPT-4 an image doesn’t mean it can correctly interpret what’s in that image. Show it a photo of a kitchen and, for all it can get right, it miscounts chairs, miscounts drawers, and sees a refrigerator that isn’t there. Last week we showed it a photo of a column of numbers from a spreadsheet, and while it read the numbers correctly, it could not sum them correctly (even with repeated attempts).
Second and third are two new pieces of research: “GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems” [1] and “Can Large Language Models Really Improve by Self-critiquing Their Own Plans?” [2]
Study one sought to understand whether “iterative prompting” — working with an LLM across successive prompts and asking it to critique its own responses — actually improves accuracy. It seems not, according to this research, which questions claims that iterative prompting enables LLMs to self-critique and meaningfully improve their reasoning. Study two investigates the ability of models like GPT-4 to self-critique and iteratively improve their own solutions when solving classical planning problems. The results suggest that some forms of self-critiquing actually degrade performance, and they call into question the efficacy of LLM self-critique for iteratively improving planning abilities.
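To make the setup in these studies concrete, here is a minimal sketch of the generate-critique-revise loop they examine. This is our own illustration, not the papers’ code: `llm` stands in for any text-in, text-out chat-model call, and the prompts, function name, and stopping rule are simplified assumptions.

```python
# A minimal sketch of iterative prompting with self-critique.
# `llm` stands in for any chat-model call; the prompts and loop
# structure here are illustrative assumptions, not the papers' code.

def solve_with_self_critique(problem: str, llm, max_rounds: int = 5) -> str:
    """Generate a solution, then repeatedly ask the same model to
    critique and revise it. Per the research above, this loop often
    fails: the model can't reliably tell correct answers from wrong ones."""
    candidate = llm(f"Solve this problem:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(
            f"Problem:\n{problem}\n\nProposed solution:\n{candidate}\n\n"
            "Critique this solution. If it is correct, reply only with CORRECT."
        )
        if critique.strip() == "CORRECT":
            break  # the model endorses its own answer; often a false positive
        candidate = llm(
            f"Problem:\n{problem}\n\nPrevious attempt:\n{candidate}\n\n"
            f"Critique:\n{critique}\n\nGive a revised solution."
        )
    return candidate
```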
All three items should serve as reminders that, at least today, large language models are capable of all sorts of errors and fabrications, and techniques like self-critique and iterative prompting don’t necessarily overcome those limitations. Gary Marcus is of the opinion that hallucinations are central to the architecture of large language models themselves — that hallucination is an inevitable result of this form of generative AI, and that living with it is a dilemma, not a problem to solve. We have nowhere near the expertise to comment on that, but in the meantime, for all the amazing capabilities these tools afford to save time and augment your ability to work and learn, caveat emptor on accuracy, and be certain to keep humans in the quality loop every step of the way.
To be sure, large language models are, in lots of ways, an incredible advance over the AI of the 1970s. They can engage in a much broader range of tasks, and generally they need less hand-tinkering. (Though they still need some – that’s what reinforcement learning with human feedback is really about.) And it’s lovely how they can quickly absorb vast quantities of data.
But in other ways, large language models are a profound step backwards; they lack stable models of the world, which means that they are poor at planning, and unable to reason reliably; everything becomes hit or miss. And no matter what investors and CEOs might tell you, hallucinations will remain inevitable.
— Gary Marcus
We’ll leave you with something cool: a Twitter thread by Chase Lean with 10 tutorials on how to uncover DALL-E’s full potential …
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.

[1] Here’s the Claude 2 summary of this paper:
The paper analyzes the reasoning abilities of large language models (LLMs) like GPT-4 when solving graph coloring problems through iterative prompting. The main findings are:
LLMs are poor at directly solving random graph coloring instances, with only a 16% success rate.
LLMs are also weak at verifying candidate colorings, frequently making mistakes like hallucinating non-existent edges between vertices.
Using LLMs to critique their own solutions via iterative prompting does even worse, with a success rate below 1%. The LLM fails to recognize when it generates a correct solution.
Iterative prompting with an external correct verifier improves performance to ~40%, but this occurs regardless of whether meaningful feedback is provided in the prompts.
The gains from iterative prompting appear to simply come from having multiple tries and getting lucky that a correct coloring appears in the candidates, which the external verifier recognizes.
Directly having the LLM generate multiple candidate solutions also achieves ~40% success if the external verifier checks them.
Overall, the results cast doubt on claims that iterative prompting enables LLMs to self-critique and improve their reasoning. The feedback does not appear to be used meaningfully. The study raises questions about the reasoning capabilities of current LLMs.
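For contrast with the LLM verifier’s hallucinated edges, the external verifier in this study is the kind of check that is trivial to make sound in ordinary code, which is why routing candidates through it recovers performance. Here is a minimal sketch, with our own illustrative names and data layout (not the paper’s code):

```python
# A sound external verifier for graph coloring: unlike an LLM critic,
# it cannot hallucinate edges or approve an invalid coloring.

def is_valid_coloring(edges: list[tuple[int, int]], coloring: dict[int, int]) -> bool:
    """Return True iff no edge joins two vertices of the same color."""
    return all(coloring[u] != coloring[v] for u, v in edges)

# Example: a triangle needs three distinct colors.
triangle = [(0, 1), (1, 2), (0, 2)]
print(is_valid_coloring(triangle, {0: 0, 1: 1, 2: 2}))  # True
print(is_valid_coloring(triangle, {0: 0, 1: 1, 2: 0}))  # False
```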
[2] Here’s the Claude 2 summary of this paper:
The paper investigates whether large language models (LLMs) like GPT-3 and GPT-4 can effectively self-critique and iteratively improve their own candidate solutions for reasoning tasks, specifically in the context of planning.
The authors evaluate a planning system where the same LLM serves as both the plan generator and plan verifier. The verifier critiques and provides feedback on the generator's candidate plans.
Experiments were conducted using GPT-4 on deterministic planning problems in the Blocksworld domain. The LLM+LLM system was compared to two baselines: LLM with an external sound verifier, and LLM without any critiquing.
Key findings:
The LLM+LLM system performed worse than using an external sound verifier, indicating issues with relying on the LLM's self-critiquing abilities.
The LLM verifier had high false positive rates in verifying plans, approving invalid plans as valid. This undermines the system's reliability.
Varying the feedback from binary to detailed had minimal impact on improving the generator LLM's performance.
Overall, the results cast doubt on the effectiveness of using LLMs for self-critiquing and iterative improvement for planning tasks. The authors suggest the core issue stems from deficiencies in the LLM's verification capabilities rather than the feedback provided.
The authors recommend further studies across more domains, instances, and prompting methods to further analyze these capabilities of LLMs.
