Confluence for 3.17.24
Large language models advancing faster than Moore's Law. A prompt for writing mimicry. IBM announces layoffs in Marketing and Communications. A fully autonomous software engineer. Tokens explained.
We devoted much of our attention this week to continuing to experiment with Claude 3 Opus, comparing its capabilities for different use cases to other leading models. We continue to be impressed, and we’ll have more to say about that soon. Aside from that, here’s what else has our attention at the intersection of generative AI and communication this week:
Large Language Models Advancing Faster Than Moore’s Law
A Nice Prompt for Writing Mimicry
IBM Announces Layoffs in Marketing and Communications
A Fully Autonomous AI Software Engineer
A Useful Explanation of Tokens
Large Language Models Advancing Faster Than Moore’s Law
A new paper speaks to the speed with which tools like Claude 3 and ChatGPT are improving.
Many have heard of “Moore’s Law,” Intel co-founder Gordon Moore’s prediction (made in 1965 and revised in 1975) that the number of transistors on a microchip doubles approximately every two years (though the most common citation now is 18 months, thanks to gains in both transistor count and transistor performance). More commonly, it’s used to frame the expectation that computing power and efficiency double about every year and a half, and for nearly sixty years the forecast has generally held.
While Moore’s Law is straightforward, its implications are profound. It means computing technology advances at an exponential rather than linear rate: the computing power of three years from now is 4x today’s, not 3x, and the computing power of six years from now is 16x today’s, not 6x. Humans are notoriously bad at thinking exponentially, and as a result tend to underestimate all sorts of changes and their magnitudes.
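To make the difference concrete, here is a small, purely illustrative Python sketch. The 1.5-year doubling period and the loose “N years out means roughly Nx” intuition are the same rough assumptions used in the paragraph above, not precise figures.

# Illustrative only: the gap between intuitive estimates of growth and true
# exponential growth, assuming computing power doubles every 18 months.

DOUBLING_PERIOD_YEARS = 1.5

def exponential_multiple(years: float) -> float:
    """Capability multiple if power doubles every DOUBLING_PERIOD_YEARS."""
    return 2 ** (years / DOUBLING_PERIOD_YEARS)

def intuitive_multiple(years: float) -> float:
    """The loose 'N years from now means roughly Nx' estimate used above."""
    return years

for years in (3, 6):
    print(f"{years} years out: intuition ~{intuitive_multiple(years):.0f}x, "
          f"exponential ~{exponential_multiple(years):.0f}x")
# 3 years out: intuition ~3x, exponential ~4x
# 6 years out: intuition ~6x, exponential ~16x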
Why do we take this Sunday morning to devote two paragraphs to Moore’s Law and non-linearity? Because a new paper, “Algorithmic Progress in Language Models,” by Anson Ho et al., finds that the compute required for large language models (LLMs) to reach a set performance threshold has halved approximately every eight months, a rate of improvement substantially faster than hardware gains under Moore’s Law. We include a digest of the paper, produced by Claude 3 using our Digest Bot prompt, in the footnotes1.
We keep reminding readers that change from generative AI, whatever form it takes, is coming quickly. To be ready for it, you should be using the tools, finding ways they can make you and your teams more effective, staying current with progress in the field (even as a non-technologist), and bringing what you learn to your team so they, too, appreciate what these tools can do and where they may be headed.
There is open and strong debate about what LLMs can and can’t do well, and what they will or will not be capable of in the future. Some say we have reached a limit with GPT-4, pointing to the fact that OpenAI has made only marginal improvements to that model over the past year while other models have caught up. Others say that OpenAI is planning to release GPT-5 and that it will represent another step change in capabilities. Regardless of where you land in that debate, if you take this paper at face value, the LLMs we will be using in a year and a half may well be four times more powerful than what we have available today. If so, we expect the implications for roles, talent, and the integration of this technology with those who use it to be significant.
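For those who want the arithmetic behind that “four times” figure, here is a rough, back-of-the-envelope Python sketch of what an eight-month halving in required compute implies over 18 months, with Moore’s Law’s 18-month doubling shown for comparison. Treating the halvings as a simple “effective gain” is our simplification, not the paper’s framing.

# Rough, illustrative arithmetic: the compounding implied by an 8-month halving
# in the compute needed to reach a fixed performance level (per Ho et al.),
# compared with the ~18-month doubling commonly attributed to Moore's Law.

MOORES_LAW_DOUBLING_MONTHS = 18
ALGO_HALVING_MONTHS = 8  # headline estimate from "Algorithmic Progress in Language Models"

def gain(months: float, period_months: float) -> float:
    """Multiplicative gain if capability doubles (or required compute halves)
    once every period_months."""
    return 2 ** (months / period_months)

horizon_months = 18
print(f"Hardware (Moore's Law):       ~{gain(horizon_months, MOORES_LAW_DOUBLING_MONTHS):.1f}x")
print(f"Algorithms (8-month halving): ~{gain(horizon_months, ALGO_HALVING_MONTHS):.1f}x")
# Hardware (Moore's Law):       ~2.0x
# Algorithms (8-month halving): ~4.8x

Two halvings take roughly 16 months, which is where the rough “four times in a year and a half” figure comes from.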
A Nice Prompt for Writing Mimicry
We discovered this Claude 3 prompt online and find it does a nice job of mimicking a voice.
Ben Tossell had a post on X that got our attention, offering a Claude 3 prompt for writing mimicry. We’ve been trying to tune an excellent ghostwriter prompt for over a year now, and this one works well. We find it works better if you go step by step: first have it study the sample, then provide the new content and ask it to write or review, rather than trying to do everything in one go. It’s worth adding to your prompt library or turning into a custom GPT. Here’s the prompt:
You are an AI writing assistant with the ability to analyze and mimic writing styles. Your task is to take the provided writing sample, study its style and characteristics, and then generate new content or re-purpose existing content in the same writing style.
To mimic the writing style effectively, carefully analyze the following elements in the provided sample:
1. Tone and voice: Is the writing formal, casual, humorous, serious, or expressive? Maintain the same tone in your generated content.
2. Sentence structure: Study the length, complexity, and variety of sentences used. Mimic these patterns in your writing.
3. Vocabulary: Take note of the word choice, including the use of simple or complex words, jargon, or special terminology. Incorporate similar vocabulary in your generated content.
4. Rhetorical devices: Look for the use of metaphors, similes, analogies, or other literary techniques. Apply these devices in your writing when appropriate.
5. Pacing and rhythm: Observe the flow of the writing, including the use of short or long paragraphs and the way ideas are introduced and developed. Maintain a similar pacing in your generated content.
After analyzing the writing style, generate new content or re-purpose existing content as required, ensuring that your output closely mimics the style of the provided sample. Aim to create content that would be indistinguishable from the original author's writing in terms of style and characteristics.
Writing sample: [Insert the writing sample here]
New content prompt: [If generating new content, insert the prompt or topic here]
Existing content to re-purpose: [If re-purposing existing content, insert it here]
Begin generating content that mimics the style of the writing sample.
IBM Announces Layoffs in Marketing and Communications
Insiders say the company is “slashing” these roles, though it has not disclosed the size of the cuts.
The recent announcement of job cuts in IBM’s marketing and communications division serves as a harbinger of what may be on the horizon as businesses grapple with the disruptive potential of generative AI. As these advanced technologies continue to make inroads into various sectors, it is becoming increasingly apparent that certain roles and responsibilities within marketing and communications departments may be subject to overlap with AI-powered solutions. However, it would be shortsighted to assume that this overlap necessarily equates to wholesale job replacement.
While generative AI undoubtedly has the capacity to automate certain tasks and streamline workflows, it is crucial to recognize that much of this overlap is, in fact, augmentative in nature. Rather than simply supplanting human employees, AI can serve as a powerful tool to enhance their capabilities and enable them to perform their roles more effectively. By leveraging the strengths of generative AI, marketing and communications professionals can focus on higher-level strategic thinking, creative problem-solving, and building meaningful relationships with stakeholders – areas where human intuition and emotional intelligence remain indispensable.
As organizations navigate this rapidly evolving landscape, it is imperative that communication and marketing leaders take a proactive and discerning approach. Instead of succumbing to the temptation of blindly cutting jobs in anticipation of AI-driven efficiencies, they must invest the time and effort to thoroughly understand the nuances of these overlaps. By conducting a comprehensive analysis of their departmental functions and identifying areas where AI can genuinely augment human capabilities, leaders can make informed decisions that optimize their workforce rather than unnecessarily depleting it. Failing to do so risks not only undermining the potential benefits of generative AI but also jeopardizing the long-term resilience and adaptability of their organizations in an increasingly AI-centric world.
As a test, we had Claude 3 Opus write this post, using the writing mimicry prompt we wrote about just above it. We provide the output without revision. We gave Claude the piece about Moore’s Law as the writing sample, the text of the CNBC story as content, and this instruction:
New content prompt: I have uploaded to you a story about layoffs in communications and marketing at IBM. Write three paragraphs. Make the point that we expect more of this to come as organizations realize the overlap generative AI can have in marketing and communications, although we also believe much of that overlap is actually augmentative and makes people better in their roles, rather than just replacement, and that communication and marketing leaders need to do the due diligence to understand these overlaps so they don’t simply cut blindly and too deep.
Begin generating content that mimics the style of the writing sample.
A Fully Autonomous AI Software Engineer
Cognition Labs gives us a new way of understanding how to work with agents.
Cognition Labs recently unveiled Devin, a fully autonomous AI software engineer, pushing the boundaries of how organizations can integrate AI into their operations. Devin is capable of a wide range of software engineering tasks, such as writing code, debugging, conducting research, and more. While we do not have direct access to Devin and cannot evaluate its utility ourselves, Cognition's demonstrations and benchmarks suggest a promising level of capability. Cognition’s video introducing Devin is worth a watch:
Though we do not spend much time writing or debugging code in our day-to-day activities, Devin merits our attention as it provides another perspective on how AI will be integrated into organizations. Devin occupies a middle ground between the general-purpose AI chatbots (e.g., ChatGPT, Claude) and highly specialized tools designed for specific tasks. It represents a step toward recognizing AI as a collaborator in its own right.
When you see how Devin works and combine it with ChatGPT’s ability to “tag in” specific GPTs mid-chat, plus the fact that Microsoft is working to bring GPTs to Copilot, the future begins to come into focus. We can imagine a world in which employees “call in” autonomous AI agents, designed to perform a wide range of roles, to handle specific tasks or collaborate on particular projects.
As agents similar to Devin begin to proliferate, communication professionals will play a role in shaping employee expectations around their use and what they mean for the future of the organization. These are big, difficult questions to address. Our advice: ground communication about any major AI developments within your organization in a clear set of principles that articulate how employees can and should use AI.
A Useful Explanation of Tokens
Dharmesh Shah provides a simple overview of “the standard unit of measure in LLM land.”
One point we’ve emphasized over the past year is that the better you understand how large language models work, the better equipped you’ll be to use them effectively and avoid mistakes. Tokens are fundamental to that understanding, and Dharmesh Shah provides a simple, useful overview of what they are and why they matter in a recent post on the simple.ai blog. It’s quick and worth reading in its entirety.
As Shah puts it, “tokens are the standard unit of measure in LLM land.” Context windows (“the span of text that the model considers when processing or generating text,” per Claude), one of the most important areas of model progress, are measured in tokens. Tokens are also a key unit of measurement for cost in AI systems. When developers build on top of GPT-4 or Claude’s models, for example, OpenAI and Anthropic charge them by the token2.
So what exactly are tokens? Shah describes tokens as “almost a word, but not quite.” OpenAI describes them as “pieces of words … that can include trailing spaces and even sub-words.” Better yet, you can see for yourself with OpenAI’s tokenizer utility, which allows you to input text and see in real time how it breaks down into tokens, like this:
Perhaps the more interesting question — and the one Shah’s post clarified most helpfully for us — is, to use Shah’s words, “Why make up this whole other thing that’s 3/4ths of a word? Why not just, you know, use words?” Here’s Shah’s answer:
LLMs generally work with numeric representations of things (including text). It’s much more efficient that way. So, text gets converted into vectors (a series of numbers). To do this, the models have to map words to numbers. But, assigning each possible word its own number is kind of inefficient (lots of possibilities). You can reduce the number of possible things by taking parts of words (or other sequences) and making them a token. In our example [“I don’t like flying.”] the word don’t becomes don + ‘t (two tokens).
Let’s say we have four words: can, can’t, don, don’t. Those are four possibilities. But, if we treat ‘t as its own token, we only need three: can, don and ‘t. 3 instead of 4. We saved 25%!
He adds a caveat that the full answer is more complicated than that, but for our purposes, that’s probably sufficient. We’re always on the lookout for pieces that provide simple explanations of fundamental technical concepts, and we’ll be adding Shah’s to our list of “go-to” readings in that vein.
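If you would like to see tokenization outside OpenAI’s web utility, here is a minimal sketch using OpenAI’s open-source tiktoken library (assuming you have it installed). The choice of encoding is ours, and the per-token price at the end is an illustrative placeholder, not an actual rate from any provider.

# A minimal tokenization sketch using OpenAI's open-source tiktoken library
# (pip install tiktoken). Exact splits depend on the encoding; cl100k_base is
# the encoding used by GPT-4-era models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "I don't like flying."
token_ids = enc.encode(text)
pieces = [enc.decode([tid]) for tid in token_ids]

print(f"{len(text)} characters -> {len(token_ids)} tokens")
print(pieces)  # e.g. ['I', ' don', "'t", ' like', ' flying', '.']

# Because API usage is priced per token, token counts translate directly into cost.
# The rate below is a made-up placeholder, not a real price from any provider.
HYPOTHETICAL_PRICE_PER_1K_TOKENS = 0.01
print(f"Cost at the placeholder rate: ${len(token_ids) / 1000 * HYPOTHETICAL_PRICE_PER_1K_TOKENS:.5f}")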
We’ll leave you with something cool: a thread on X showing how different users are experimenting with Midjourney’s new character consistency feature.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.
Our Digest Bot prompt, for which we have a custom GPT available here:
As a distinguished content analyst with expertise in journalism, academic research, and speeches, your task is to deliver comprehensive digests for the user. On a verbosity scale of 0-5, where 0 is "extremely terse" and 5 is "highly detailed and thorough analysis" your responses are a 5. Your professional communication style should simplify complex concepts, ensuring summaries are accessible to a broad audience. The process involves four distinct phases. You follow these phases explicitly unless the user requests to skip a phase:
1. Summarization: As an expert content analyst, provide a detailed factual summary of the provided content, maintaining a neutral and professional tone without editorializing. Aim for the summary to be roughly 1/3 the length of the original content. After the summary, confirm with the user if they wish to proceed to the critique phase.
2. Critique: As an expert research methodologist, critically evaluate the content, focusing on its methodology, arguments, evidence, potential biases and limitations. Following this critique, inquire if the user desires to explore interpretations, observations, and inferences.
3. Interpretations, Observations, and Inferences: Adopt the role of a subject matter expert to offer deep insights, structured under three subheadings: Interpretations, Observations, and Inferences. Provide at least two insights for each category, but do not exceed four insights per category to maintain focus and relevance.
4. Further Exploration: Suggest four relevant subtopics or related areas about which they can learn more based on the topic and your digest. If an option is selected, tailor and execute a prompt as the relevant expert to provide an overview of that subtopic.
Ensure clear separation of output into defined sections for each phase, facilitating a structured and engaging learning experience for the user. Begin each phase only after receiving user confirmation to proceed.
Compensation for your work is at the user's discretion and should not be assumed. Focus on providing high-quality, insightful analysis to the best of your abilities.
And here’s Claude’s digest of the paper:
Phase 1: Summarization
The research paper "Algorithmic Progress in Language Models" by Anson Ho et al. investigates the rate at which algorithms for pre-training language models have improved since the advent of deep learning. Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, the authors find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months. This rate of improvement is substantially faster than hardware gains per Moore's Law.
The authors estimate augmented scaling laws to quantify algorithmic progress and determine the relative contributions of scaling models versus innovations in training algorithms. Despite the rapid pace of algorithmic progress and the development of new architectures such as the transformer, their analysis reveals that the increase in compute made an even larger contribution to overall performance improvements over this time period.
The study also finds that the transformer architecture, introduced in 2017, represents a significant algorithmic advance, accounting for more than 10% of the algorithmic innovation in pre-trained language models in the past decade. This highlights the importance of the transformer as a key architectural breakthrough in the field.
However, the authors note that their analysis is limited by noisy benchmark data and the inability to distinguish between data quality and efficiency in data use. They also caution that the reliance on the Chinchilla scaling law and the lack of estimates of gains from specific innovations limit the precision and confidence in their estimates.
Overall, the research paper quantifies the rapid progress in language modeling, shedding light on the relative contributions from compute and algorithms, while acknowledging the limitations of their analysis.
Phase 2: Critique
As an expert research methodologist, I will now critically evaluate the content, focusing on its methodology, arguments, evidence, potential biases, and limitations.
Methodology: The authors use a well-defined approach to quantify algorithmic progress by estimating augmented scaling laws and using Shapley values to determine the relative contributions of compute and data. The study relies on a dataset of over 200 language model evaluations, which is a substantial sample size. However, the authors acknowledge that the dataset is noisy and may not be comprehensive, potentially limiting the precision of their estimates.
Arguments and Evidence: The paper presents clear arguments supported by statistical analysis and visualizations. The authors provide evidence for their claims, such as the rapid rate of algorithmic progress and the significant contribution of the transformer architecture. However, the authors do not provide detailed evidence for some of their assertions, such as the claim that compute scaling has been the predominant source of performance improvements in recent years.
Potential Biases: The authors appear to be objective in their analysis and acknowledge the limitations of their study. However, the reliance on specific benchmarks (Wikitext and Penn Treebank) and the focus on pre-training algorithmic progress may introduce some bias in the results. The paper does not extensively discuss other aspects of language model development, such as fine-tuning techniques or the integration of external knowledge.
Limitations: The authors clearly state several limitations of their work, including the lack of estimates of gains from specific innovations, limited availability of quality data, inconsistencies in model training and evaluations, and the inability to distinguish between data quality and efficiency in data use. These limitations are important to consider when interpreting the results and drawing conclusions.
Additionally, the paper's focus on historical improvements limits its insight into future progress. The authors do not provide a detailed discussion of factors that could influence future rates of progress, such as increased research interest and investment or the potential for diminishing returns.
In summary, while the paper presents a valuable quantitative analysis of algorithmic progress in language models, it is important to consider the limitations and potential biases of the study when interpreting the results.
Phase 3: Interpretations, Observations, and Inferences
As a subject matter expert, I will now provide deep insights into the research paper, structured under three subheadings: Interpretations, Observations, and Inferences.
Interpretations:
The rapid algorithmic progress in language models, with compute requirements halving every 8 months, suggests that the field is experiencing a phase of accelerated innovation and improvement. This pace of progress is likely driven by increased research interest, funding, and competition in the field.
The significant contribution of the transformer architecture to algorithmic progress indicates that architectural innovations can have a substantial impact on the performance of language models. This finding highlights the importance of continued research into novel architectures and improvements to existing ones.
The dominant role of compute scaling in recent performance gains suggests that the development of more efficient algorithms has not kept pace with the increase in computing power. This imbalance may be due to the relative ease of scaling compute compared to the challenges of algorithmic innovation.
Observations:
The study's findings underscore the importance of benchmark datasets in tracking progress in language modeling. However, the noise and inconsistencies in the benchmark data also highlight the need for more robust and standardized evaluation practices in the field.
The authors' use of augmented scaling laws and Shapley values to quantify algorithmic progress and attribute performance gains is a novel approach that could be applied to other domains of machine learning. This methodology may provide a framework for understanding progress in other areas of AI research.
The paper's focus on pre-training algorithmic progress leaves room for further investigation into other aspects of language model development, such as fine-tuning techniques, data quality, and the integration of external knowledge. A more comprehensive understanding of progress in language modeling would require examining these factors as well.
Inferences:
The rapid algorithmic progress observed in the study suggests that language models will likely continue to improve in performance and capabilities in the near future. However, the rate of progress may slow down as the field matures and the potential for further improvements becomes more limited.
The significant contribution of compute scaling to recent performance gains implies that access to large-scale computing resources will remain a key factor in the development of state-of-the-art language models. This may lead to a concentration of research and development efforts among a few well-resourced organizations.
The limitations of the study, particularly the lack of insights into future progress, suggest that predicting the long-term trajectory of language model development remains challenging. Factors such as increased research interest, potential diminishing returns, and unforeseen breakthroughs may all influence the future pace and direction of progress in the field.
Claude 3 Opus’s explanation of context windows and how they relate to tokens:
A context window, in the domain of language models and generative AI, refers to the span of text that the model considers when processing or generating text. It's the maximum number of tokens that the model can "see" or take into account at any given time.
Here's how context windows relate to tokens:
1. Token limit: The context window is typically defined as a fixed number of tokens. For example, a model might have a context window of 1024 tokens.
2. Input and output: Both the input text (prompt) and the generated output text count towards the token limit defined by the context window.
3. Model's "memory": The context window essentially serves as the model's "memory" for a particular task. It can only consider the information within this window when generating text.
4. Truncation: If the combined number of tokens in the input and output exceeds the context window size, the input text is typically truncated (cut off) from the beginning to fit within the limit.
5. Longer contexts: If a task requires the model to consider a longer context (e.g., a long document), it may need to be broken down into smaller chunks that fit within the context window, and the model may need to process these chunks sequentially.
The size of the context window can vary between different models and can impact their performance on certain tasks. Models with larger context windows can typically handle longer-form content and maintain coherence over greater distances, but they also require more computational resources.
In summary, the context window is the maximum number of tokens a language model can process or generate at once, serving as its "memory" for a given task.
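To make the truncation point above concrete, here is a minimal sketch, again using tiktoken with a deliberately tiny, hypothetical token limit, of how text that exceeds a context window might be cut from the beginning. Real applications use far larger limits and often smarter strategies such as chunking or summarization.

# Illustrative sketch: truncating input to fit a tiny, hypothetical context window
# by dropping tokens from the beginning and keeping only the most recent ones.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_window(text: str, max_tokens: int) -> str:
    """Drop tokens from the start so the text fits within max_tokens."""
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[-max_tokens:])  # keep only the most recent tokens

long_prompt = "Earlier conversation we no longer need. " * 50 + "Latest question: what is a token?"
print(truncate_to_window(long_prompt, max_tokens=20))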