Confluence for 6.16.24
Why Apple AI is a big deal. Ethan Mollick’s latest thinking on doing stuff with AI. Revisiting NotebookLM after Gemini 1.5. Does generative AI have Theory of Mind?
Welcome to Confluence. Here’s what has our attention this week at the intersection of generative AI and corporate communication:
Why Apple AI Is a Big Deal
Ethan Mollick’s Latest Thinking on Doing Stuff with AI
Revisiting NotebookLM After Gemini 1.5
Does Generative AI Have Theory of Mind?
Why Apple AI Is a Big Deal
The future is about to get much more evenly distributed.
It’s official: more AI is coming to your iPhone (and iPad, and Mac). And it’s not just going to exist as a single app or utility. Last week, Apple shared its plans for Apple Intelligence, the company’s overarching approach to bringing advanced generative AI capabilities to its products. In a recent edition of Confluence, we asked, “Are devices the next frontier?” Apple’s announcements suggest that the answer to that question is a resounding yes.
Readers interested in seeing everything Apple announced can watch the entire keynote video from the Worldwide Developers Conference here or jump straight to the Apple Intelligence segment at the 1:04:30 mark. In the introduction to that segment, Tim Cook highlights the five principles at the core of Apple Intelligence: powerful, intuitive, integrated, personal, and private. These principles are evident in four features of Apple Intelligence which, if executed well, may lead to a significant shift in how we interact with AI.
Most of the AI processing will occur directly on your device, with more advanced processes taking place in a “private cloud” or being securely shared with OpenAI (more on that below). As we noted last week, recent surveys indicate that privacy remains a major concern for many individuals and organizations. Apple’s privacy architecture and assurances are likely to assuage some of those concerns and set the standard for others to follow.
Siri will soon be grounded in your personal context, with the ability to access and reference any relevant data from across all of your apps. Apple will use an “on-device semantic index that can organize and surface information from across your apps” to make this possible. One of the major pain points with popular tools like ChatGPT or Claude is that they don’t really “know” the user (even with ChatGPT’s custom instructions and memory). The ability to interact with an AI that seamlessly accesses your data could redefine the user experience.
Siri will be able to take action on your behalf within and across your apps. One of the major trends we’re observing in the generative AI landscape is the shift toward agents, or models that can not only interact with information and generate content, but also take action. Apple’s announcement marks the most significant instance to date of an AI agent becoming available at scale.
Apple Intelligence will be able to use GPT-4o for more complex tasks. One of the most anticipated parts of Apple’s Worldwide Developers Conference was the announcement of its partnership with OpenAI, and we finally got some of the details on what that entails. GPT-4o, the model that powers ChatGPT and arguably the most powerful AI model in the world, will be available to iPhone users — for free and without requiring a ChatGPT account. Apple’s own LLMs will handle simpler queries while Siri can tap into ChatGPT for more complex use cases (and will always ask for permission before doing so).
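Apple shared few implementation details, but the routing it described (simple requests handled on the device, more complex ones sent to GPT-4o only with the user’s permission) is easy to picture. Here is a minimal, hypothetical sketch of that hand-off; every name and threshold in it is invented for illustration, not drawn from Apple’s implementation:

```python
# Illustrative only: Apple has not published how Apple Intelligence routes requests.
# Every class, function, and threshold below is a hypothetical stand-in.

class OnDeviceModel:
    """Stand-in for a small local model that handles simple requests."""

    def can_handle(self, prompt: str) -> bool:
        # Crude proxy for "simple enough to stay on the device."
        return len(prompt.split()) < 30

    def generate(self, prompt: str) -> str:
        return f"[on-device answer to: {prompt}]"


def ask_permission(question: str) -> bool:
    """Mimic the consent step Apple described before anything goes to ChatGPT."""
    return input(f"{question} (y/n) ").strip().lower() == "y"


def call_chatgpt(prompt: str) -> str:
    return f"[GPT-4o answer to: {prompt}]"  # placeholder; no real API call


def handle_request(prompt: str, model: OnDeviceModel) -> str:
    if model.can_handle(prompt):
        return model.generate(prompt)          # most processing stays local
    if ask_permission("Send this request to ChatGPT?"):
        return call_chatgpt(prompt)            # complex requests go to GPT-4o
    return "Okay, I won't send that to ChatGPT."


print(handle_request("What's on my calendar today?", OnDeviceModel()))
```

The interesting part isn’t the routing itself but the default it implies: local first, cloud only with the user’s consent.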
That’s not an exhaustive list, but each of these items marks a major step and has implications for what we can expect as we move forward. If we take Apple’s announcements as a directional indication of where we are heading, we can begin to envision the near future of human-AI interaction. We’re moving toward a world where you don’t “go to” AI; instead, AI will be where you are: on the devices and in the applications you use every day. As Will Knight put it in Wired, Apple showed a vision of “AI as a feature, not a product.” Just as important, the AI will “know” you in a way that most tools don’t today, which will not only save an incredible amount of time, but should lead to more powerful, relevant output. And lastly, AI will shift from something that interacts with information to something that takes action. Even if the initial launch of Apple Intelligence falls short of the promises Apple laid out last week, we expect companies will continue moving in this direction and that the experience will only get better over time.
All of that is coming to anyone with an iPhone, iPad, or Mac later this year. As we’ve reiterated many times in recent weeks, once employees get a taste of these capabilities in their personal lives, they’re going to want to use them in their work. Apple’s announcements mean that hundreds of millions more people are about to get a real taste of these capabilities very soon.
Ethan Mollick’s Latest Thinking on “Doing Stuff with AI”
Keep inviting AI to the table — but have fun as you do.
Regular readers know that we often point to the work of Ethan Mollick, a professor at Wharton and a prominent voice in the generative AI space. He’s also the author of the recent book “Co-Intelligence: Living and Working with AI” — our summary of which you can find here, although we do think people should read the entire book. We continue to consider his practical, opinionated guidance an invaluable resource.
About every six months Mollick writes a post about practical uses for generative AI, and he’s posted his latest iteration: Doing Stuff with AI: Opinionated Midyear Edition. He suggests starting with fun applications like creating songs with Suno or Udio, or listening to AI-generated NPR-style radio interviews on Google’s Illuminate demo. These “fun” entry points not only delight, but also reveal the quirks and limitations of AI systems in a low-stakes environment — an essential step in understanding how to effectively incorporate AI into your work.
When it comes to more serious applications, Mollick reiterates the importance of using the most advanced models, such as Claude 3 Opus (our current favorite), Gemini 1.5, and GPT-4o. He notes that these models have significantly expanded capabilities compared to just six months ago, including internet connectivity, image generation, code execution, data analysis, image and video understanding, and document processing — some of which we discussed last week.
The heart of Mollick’s advice: “Always invite AI to the table.” He encourages users to engage AI in conversation and use it for anything that would be helpful in their day-to-day, from summarizing meetings and generating ideas to writing reports and discussing strategy. The best way to understand AI’s capabilities and limitations, Mollick reminds us, is by bringing the technology to every aspect of our work.
Mollick closes by acknowledging that even this latest comprehensive guide may soon be outdated as AI systems continue to improve at a remarkable pace — all the more reason for people to put some of Mollick’s insights into practice.
Revisiting NotebookLM After Gemini 1.5
Another example of frontier models making their way into other tools.
Late last year, we highlighted Google’s NotebookLM as an example of generative AI extending beyond the typical chatbot interface. What’s NotebookLM? From Google’s post introducing the product:
NotebookLM is an experimental product designed to use the power and promise of language models paired with your existing content to gain critical insights, faster. Think of it as a virtual research assistant that can summarize facts, explain complex ideas, and brainstorm new connections — all based on the sources you select.
A key difference between NotebookLM and traditional AI chatbots is that NotebookLM lets you “ground” the language model in your notes and sources. Source-grounding effectively creates a personalized AI that’s versed in the information relevant to you … you can ground NotebookLM in specific Google Docs that you choose, and we’ll be adding additional formats soon.
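NotebookLM’s internals aren’t public, but the core idea of source-grounding is simple to picture: the model is instructed to answer only from the documents you select. Here is a rough, hypothetical sketch of that pattern; the generate() call at the end stands in for whatever model you have access to, and the sample sources are invented:

```python
# Conceptual sketch of source-grounding, not NotebookLM's actual implementation.

def build_grounded_prompt(question: str, sources: dict[str, str]) -> str:
    """Assemble a prompt that confines the model to user-selected sources."""
    source_text = "\n\n".join(
        f"[Source: {title}]\n{content}" for title, content in sources.items()
    )
    return (
        "Answer the question using ONLY the sources below. "
        "If the sources don't contain the answer, say so.\n\n"
        f"{source_text}\n\nQuestion: {question}"
    )


sources = {
    "Q2 comms plan": "Our Q2 priorities are the intranet relaunch and the town hall series...",
    "Town hall transcript": "We will brief managers on the new operating model in May...",
}

prompt = build_grounded_prompt("What are our Q2 communication priorities?", sources)
# response = generate(prompt)  # hypothetical call to whichever model you use
print(prompt)
```

The grounding is what makes the output checkable: every claim should trace back to a source you supplied.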
The other week, Google updated NotebookLM so that it is now powered by Gemini 1.5, its leading model. In addition to improved capabilities and intelligence, the update introduces several new features to NotebookLM. Google’s overview provides more details and valuable use cases, but a few key points stood out to us.
One notable improvement is that NotebookLM now offers more guidance when uploading sources. By clicking “Notebook guide” near the bottom of the screen, users can access suggested questions based on the content within the sources, a summary of the sources, and options to generate FAQs, briefing documents, and more.
While the quality of output can be uneven, we noticed that it tends to perform better when drawing from a single source rather than analyzing multiple sources simultaneously. At the same time, you can continue to nudge NotebookLM to provide more and richer content. If we were to use these features for our work, we’d likely take output from NotebookLM (which we’ve found to be quite accurate) and shift to using Claude 3 Opus to refine the text itself.
Another new feature worth mentioning is NotebookLM's ability to extract information from webpages and analyze images within its sources (although it doesn't accept image files directly). It handles webpages effectively, with one caveat — many popular websites have blocked AI access, preventing their use as sources in NotebookLM. The image analysis feature, however, shows real promise and utility. In the example below, we asked NotebookLM to describe a graph from a research paper and explain its significance. The response is spot on.
If you haven’t spent time with NotebookLM since Google’s update, or haven’t used it at all, it’s worth exploring now. Though NotebookLM’s immediate utility may be limited, it provides a further glimpse into the future of AI-powered tools. Generative AI excels at working with and referencing existing content, potentially allowing employees to interact with dense, complex materials in natural language, ask questions, and gain clarity on what those materials mean for their work. While we’re not quite there yet, the advancements seen in NotebookLM, Office 365 Copilot, and other tools show we are taking real steps toward that future.
Does Generative AI Have Theory of Mind?
Gen AI might think what you think that it thinks you think better than you think.
Does generative AI have Theory of Mind? We suspect you might think that’s an obscure question for a Sunday morning, but we believe it’s worth mulling over with your morning cup.
Let’s start with this: What is “Theory of Mind”? We actually exercised it in the first paragraph. Theory of Mind is the ability to understand and attribute mental states — such as thoughts, beliefs, desires, intentions, and emotions — to oneself and others. It is the capacity to recognize that other people have their own unique perspectives, knowledge, and feelings that may differ from one’s own. Our inferring that you might think Theory of Mind is an abstract topic for a Sunday morning is an example of us exercising Theory of Mind. Here’s what Claude 3 Opus has to say on the topic:
Theory of Mind is the awareness that:
People have thoughts and feelings that guide their behavior.
Different people can have different thoughts and feelings about the same situation.
People’s thoughts and feelings can differ from reality.
People’s actions are influenced by their beliefs, even if those beliefs are false.
Having a well-developed Theory of Mind allows individuals to better understand, predict, and respond to the behavior of others in social situations. It is a crucial aspect of social cognition and plays a significant role in effective communication, empathy, and social interaction.
Children typically develop Theory of Mind gradually, with a major milestone occurring around the age of 4-5 years old when they begin to understand that others can have beliefs that differ from their own and from reality.
An example of Theory of Mind in action could be a child playing hide-and-seek with their friend. Let's call the child playing the game “Sarah.”
Sarah and her friend decide that Sarah will hide first, and her friend will count to 20 before trying to find her. Sarah runs into the bedroom and hides under the bed. While hiding, Sarah giggles, thinking about how her friend will never find her in this great hiding spot.
However, Sarah’s giggle was loud enough for her friend to hear. When her friend finishes counting, they follow the sound and easily find Sarah under the bed.
In this scenario, Sarah demonstrates an incomplete Theory of Mind. She understands that her friend doesn’t know where she is hiding (which is why she chose to hide there), but she fails to consider that her giggling could reveal her location to her friend. Sarah’s friend, on the other hand, uses their Theory of Mind to infer that the giggling sound is likely coming from Sarah, and uses this information to guide their search.
Theory of Mind also operates at higher and higher orders: I can infer what you think, but I can also infer what you think I think. And I can infer what you think I think you think I think. And so on.
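To make the recursion concrete, here is a small illustrative snippet (ours, not from any paper) that mechanically nests belief clauses to an arbitrary depth:

```python
# Illustrative toy: build an nth-order "I think that you believe that..." statement.

def nested_belief(depth: int, proposition: str = "it will rain") -> str:
    """Wrap a proposition in `depth` layers of mental-state attribution,
    alternating between "I" and "you", with "I" as the outermost speaker."""
    verbs = ["think", "believe", "know", "imagine", "suspect"]
    clause = proposition
    for i in range(1, depth + 1):  # i = 1 is the innermost layer
        speaker = "I" if (depth - i) % 2 == 0 else "you"
        verb = verbs[(i - 1) % len(verbs)]
        clause = f"{speaker} {verb} that {clause}"
    return clause


for d in range(1, 6):
    print(f"Order {d}: {nested_belief(d)}")
```

Generating the sentence is trivial; the hard part, for people and for models alike, is actually tracking who believes what as the nesting deepens.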
Why does Theory of Mind matter? Because it’s central to making sense of and navigating the world. As we develop a more advanced Theory of Mind, we become better at considering what others might know or perceive based on the information available to them. This allows us to engage in more complex social interactions, such as anticipating how others might react or what people may assume — as well as things like keeping secrets and understanding and engaging in deception.
An open question in the generative AI community has been whether large language models have Theory of Mind. And it’s not merely an academic consideration. As these tools gain greater practical influence, and as people engage with them in more conversational ways, their ability to make inferences about a user’s internal states bears directly on how helpful (and perhaps how deceptive) they can be.
Which is why the title of this recent paper, “LLMs Achieve Adult Human-Level Performance on Higher-Order Theory of Mind Tasks,” got our attention. Here’s the abstract:
This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM); the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows) … We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.
We’ll save the details for the footnotes1, but in brief, the study asked both people and models like GPT-4 to read short stories of about 200 words describing social interactions among three to five characters, and then answer 20 true-or-false statements. The statements tested for Theory of Mind in increasing orders of complexity, from one to six (level five, for example, is “I believe that you think that I imagine that you want me to believe,” so it gets complicated fast). This chart summarizes the results:
As you can see, the most powerful model in the study, GPT-4, closely matched human performance across the board. What does this mean, in simple terms? That at least two large language models, GPT-4 and Flan-PaLM, have demonstrated an ability to understand and reason about the thoughts and perspectives of others at a level comparable to adult humans. The authors believe these models are exercising genuine Theory of Mind reasoning rather than simply relying on superficial patterns in the data they were trained on, although they are cautious about definitively concluding that the models possess Theory of Mind in the same way humans do. Recognizing this capability is an important step in assessing the potential risks and benefits that come with advanced language models.
What it means for you and us: The best of these models are already as good as we are at reasoning about internal states, with high levels of complexity. This is part of why they can seem so human, and why research continues to come out suggesting that blind testers can find LLMs more empathetic than real people. In terms of having tools that can generate excellent content, act as collaborators, play the role of an audience, be your friendly skeptic, and more, this is great news. But a strong Theory of Mind is also essential to deception, and increases the potential for these models to be used for harm rather than good. It’s a frontier of their development we watch closely, and encourage you to do the same.
We’ll leave you with something cool: an AI-generated song, Signal and the Noise, about a “mission to more broadly share the ideas, insights, and thinking at the intersection of AI and communication that capture our attention.”
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
The summary of the paper, courtesy of our Digest Bot:
Introduction
The paper investigates the capability of large language models (LLMs) in developing higher-order Theory of Mind (ToM), which involves understanding and reasoning about multiple mental and emotional states in a recursive manner (e.g., "I think that you believe that she knows"). It introduces a novel test suite called Multi-Order Theory of Mind Question & Answer (MoToMQA) to evaluate the performance of five LLMs against an adult human benchmark. The results indicate that GPT-4 and Flan-PaLM achieve adult-level or near-adult-level performance on ToM tasks, with GPT-4 exceeding adult performance on 6th-order inferences. The findings have implications for the use of LLMs in user-facing applications.
Materials and Methods
The study introduces MoToMQA, based on a ToM test for human adults, consisting of short stories and true/false questions about characters in the stories. The test assesses ToM from orders 2-6 and compares LLM performance on these tasks to human performance and factual tasks of equivalent complexity. The study includes a detailed description of the procedures for both human participants and LLMs, highlighting differences in their testing conditions.
Results
The results show significant performance differences among the models, with GPT-4 and Flan-PaLM outperforming others. Humans performed better than Flan-PaLM but not significantly different from GPT-4. The study also reveals performance differences across ToM orders and factual levels, with GPT-4 and Flan-PaLM showing strong performance at higher orders. The paper notes that factual recall tasks are generally easier for both humans and LLMs compared to ToM tasks.
Discussion
The discussion highlights the implications of LLMs achieving high ToM capabilities, including potential benefits and ethical concerns. The paper suggests that the ability of LLMs to understand and infer mental states could enhance their interaction with humans but also poses risks of manipulation and exploitation. It calls for further research to understand how LLM ToM manifests in real-world interactions and to develop safeguards against potential misuse.
Limitations and Future Research
The study acknowledges limitations such as the scope and size of the benchmark, the use of only English language materials, and the maximum ToM order tested being 6. It suggests future research should include culturally diverse and comprehensive benchmarks, extend beyond 6th-order ToM, and adopt multimodal paradigms that reflect the embodied nature of human ToM.
Conclusion
The paper concludes that GPT-4 and Flan-PaLM exhibit higher-order ToM capabilities comparable to adult humans, with GPT-4 showing better-than-human performance on 6th-order ToM tasks. The findings suggest that these LLMs have developed ToM reasoning abilities beyond superficial statistical manipulation. However, the study refrains from making strong claims about whether this indicates true cognitive ability akin to human ToM.
Critique
Methodology
The methodology of the study is robust, employing a novel benchmark (MoToMQA) to evaluate the ToM capabilities of LLMs and compare them to human performance. The use of short stories and true/false questions to assess ToM from orders 2-6 is well-designed to test recursive reasoning abilities. However, there are a few areas where the methodology could be improved:
Control for Language Proficiency: The study screens human participants for English proficiency, but it does not address potential variations in linguistic complexity that might affect LLM performance. Ensuring that the language used in the stories and questions is uniformly complex across all orders and levels would provide a more rigorous assessment.
Diversity of LLMs: The study tests five LLMs, but it does not include some of the latest models like Google’s Gemini due to technical limitations. Including a wider variety of models, particularly those with different architectures and training paradigms, could yield more comprehensive insights into the factors contributing to ToM capabilities.
Evaluation Metrics: The study primarily uses log probabilities (logprobs) to measure LLM performance. While this approach adds robustness by considering multiple ways of providing correct responses, it might still miss nuances in model comprehension and reasoning. Incorporating qualitative analysis of LLM responses could provide deeper insights into their reasoning processes.
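A note from us rather than the Digest Bot: for readers unfamiliar with logprob scoring, the general idea is to compare how much probability the model assigns to answering “true” versus “false” for each statement. Here is a rough sketch of that comparison (our illustration, not the authors’ code):

```python
# Rough illustration of logprob scoring: pool the probability the model assigns
# to "true"-style answers and compare it with the "false"-style answers.
import math


def prefers_true(label_logprobs: dict[str, float]) -> bool:
    """label_logprobs maps candidate answer strings to the model's log probabilities,
    e.g. {"true": -0.4, "True": -0.9, "false": -2.3, "False": -2.8}."""
    def pooled(prefix: str) -> float:
        return sum(
            math.exp(lp)
            for token, lp in label_logprobs.items()
            if token.strip().lower().startswith(prefix)
        )
    return pooled("t") > pooled("f")


print(prefers_true({"true": -0.4, "True": -0.9, "false": -2.3, "False": -2.8}))  # True
```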
Arguments and Evidence
The paper presents strong evidence that GPT-4 and Flan-PaLM exhibit higher-order ToM capabilities comparable to adult humans. The use of a large sample size (29,259 human responses) and rigorous statistical analysis (Cochran’s Q test, McNemar’s test) strengthens the validity of the findings. However, the arguments could be further supported by:
Longitudinal Analysis: Including a longitudinal component to track changes in LLM performance on ToM tasks over time, particularly as models undergo further training and fine-tuning, would provide valuable insights into the development of ToM capabilities.
Comparative Analysis with Children: Comparing LLM performance not only with adults but also with children of different age groups, as done in some referenced studies, would offer a more nuanced understanding of how LLM ToM capabilities align with human cognitive development stages.
Potential Biases and Limitations
The study acknowledges several limitations, including the scope and size of the benchmark, the use of English-only materials, and the maximum ToM order tested being 6. Additionally, potential biases include:
Pretraining Data Contamination: There is a possibility that some of the test materials might overlap with the pretraining data of the LLMs, which could artificially inflate performance results. Conducting thorough contamination checks and using entirely novel test materials could mitigate this risk.
Task-Specific Performance: The study focuses on ToM tasks within a specific experimental setup. The LLMs' performance might vary in real-world scenarios where social interactions are more dynamic and less structured. Future studies should explore LLM ToM capabilities in more naturalistic settings.
Implications
The study highlights significant practical and ethical implications of LLMs achieving high ToM capabilities. These include potential benefits in user-facing applications and risks related to manipulation and exploitation. However, the paper could delve deeper into:
Ethical Safeguards: Proposing specific technical guardrails and design principles to mitigate the risks associated with advanced LLM ToM capabilities would strengthen the discussion on ethical implications.
Applications in Various Domains: Exploring the potential applications of LLM ToM capabilities in diverse domains such as education, mental health, and conflict resolution would provide a broader perspective on the societal impact of these advancements.
Interpretations, Observations, and Inferences
Interpretations
Higher-Order ToM Capabilities in LLMs: The study suggests that LLMs, specifically GPT-4 and Flan-PaLM, have developed higher-order ToM capabilities that are comparable to adult human performance. This indicates that with sufficient scale and fine-tuning, LLMs can generalize complex social reasoning skills. The models' performance on 6th-order ToM tasks, where GPT-4 even surpasses human accuracy, highlights the potential for these models to handle intricate social interactions and mental state attributions.
Role of Model Size and Fine-Tuning: The positive correlation between model size, fine-tuning, and ToM performance underscores the importance of extensive training and parameter tuning. The distinction between base models (like PaLM) and instruction-tuned models (like Flan-PaLM) suggests that specific training methodologies significantly enhance ToM capabilities. This reflects the intricate interplay between computational resources and algorithmic training in achieving advanced cognitive functions in LLMs.
Observations
Variability Across Models: While GPT-4 and Flan-PaLM show impressive ToM performance, other models like GPT-3.5 and LaMDA exhibit significantly lower capabilities. This variability highlights the uneven progress across different LLMs and points to the critical factors of model architecture and training processes in achieving higher-order cognitive abilities.
Human-Like Cognitive Patterns: The performance patterns observed in GPT-4 and Flan-PaLM, particularly the decline and subsequent improvement in ToM task accuracy at higher orders, mirror human cognitive processes. This suggests that these models might be learning and applying recursive reasoning strategies in a manner similar to humans, further emphasizing their potential to simulate human-like understanding and reasoning.
Inferences
Implications for AI-Human Interaction: The development of advanced ToM capabilities in LLMs can significantly enhance their utility in applications requiring nuanced social interactions, such as virtual assistants, educational tools, and therapeutic aids. These models can better understand user intentions, adapt responses based on inferred mental states, and provide more personalized and effective interactions.
Ethical and Regulatory Considerations: The ability of LLMs to infer and manipulate human mental states raises critical ethical concerns. There is a need for robust regulatory frameworks and ethical guidelines to prevent misuse, such as manipulation or exploitation of users. Developing transparent and accountable AI systems, along with technical safeguards, will be essential to balance the benefits and risks associated with these capabilities.