“Speed is a feature,” Google cofounder Larry Page once told me. “Speed can drive usage as much as having bells and whistles on your product. People really underappreciate it.”
I thought of Page’s remark when I tried out a chatbot from the startup Groq last week. (The name comes from grok, golden-era sci-fi writer Robert Heinlein’s term for deep understanding; it’s unrelated to Elon Musk’s chatbot Grok—more on which later.) Groq makes chips optimized to speed up the large language models that have captured our imaginations and stoked our fears in the past year. You might never have thought of these LLMs as particularly slow. In fact, it’s pretty impressive that when you offer a prompt, even a relatively complicated one, a detailed answer comes within seconds, not minutes.
The experience of using a chatbot that doesn’t need even a few seconds to generate a response is shocking. I typed in a straightforward request, as you do with LLMs these days: Write a musical about AI and dentistry. I had hardly stopped typing before my screen was filled with a detailed blueprint for the two-act Mysteries of the Mouth. It included a dramaturgically complete book, descriptions of the full cast, and the order of the songs, each of which advanced the action and defined the characters. It was something a clever theater kid might have handed in at the end of a full-semester course in Outlining the Broadway Musical. It’s no longer surprising to get stuff like that from a chatbot, and Groq uses modified versions of several open-source LLMs, from places like Mistral or Meta. The revelation was how quickly The Mysteries appeared, fully developed, on my screen. It took all of a second. (OpenAI’s ChatGPT, which proposed a musical called The AIgnificent Smile, took around four seconds.)
That speedy turnaround left me disoriented. When there’s a pause between prompt and output, the feeling is that some artificial brain is cranking away at the answer, which comes as the result of gobs of computation—a process similar to human thought but faster. But when the answers just … show up, you get a different feeling. Was the musical there all along? Did Groq have a day pass to all possible versions of the multiverse?
When I described my impressions to Jonathan Ross, the CEO and primary inventor of Groq’s hardware, he was delighted. Formerly at Google, he was the key inventor of its Tensor Processing Unit AI chips, which have helped the company make leaps in AI research and products. On a Zoom with him, I asked how Groq worked; he went straight to the source and asked the chatbot powered by his creation. This is usually an annoying ploy in an interview, but since Groq is super fast I tolerated it, listening as the model explained that graphics chips like Nvidia’s, which work in parallel, are ideal for delivering images. Think of a lot of people filling in a paint-by-numbers picture at the same time, the bot said. But they’re not as efficient in crunching through or generating language, which proceeds sequentially.
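To make that distinction concrete, here’s a toy sketch in Python. It’s my own illustration of the general idea, not code from Groq or Nvidia.

```python
# Toy sketch of the difference the chatbot described. Generating language is
# autoregressive: each new token depends on all the tokens before it, so the
# steps have to run one after another.

def next_token(context: list[str]) -> str:
    # Stand-in for a full forward pass through a language model.
    return f"token_{len(context)}"

def generate(prompt: list[str], n_tokens: int) -> list[str]:
    output = list(prompt)
    for _ in range(n_tokens):              # inherently sequential
        output.append(next_token(output))  # step N needs the result of step N-1
    return output

# Rendering an image is closer to the paint-by-numbers analogy: every pixel
# can be computed independently, so the work spreads across many cores at once.
def shade(x: int, y: int) -> int:
    return (x + y) % 256

def fill_pixels(width: int, height: int) -> list[int]:
    return [shade(x, y) for y in range(height) for x in range(width)]

print(generate(["Write", "a", "musical"], 5))
```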
Ross then cut Groq off, asking what would happen if you put an entire LLM into the memory onboard a chip. “Oh, you’re talking about something like a dedicated AI chip?” said Groq, reacting as nimbly as an alert human conversant. “That’s actually a thing! If you had a chip specifically designed for large language models, and you could program it, you could potentially make it faster.”
Ross confirmed that Groq had grokked the question correctly. His chip, he says, is simpler, streamlined, and custom-built to optimize for LLMs. He also noted that the conversation he had just had with his creation was so fast that he and the chatbot were almost overlapping, even though he was in Brussels at the time and the Groq Language Processing Unit (LPU) chips running the model—792 of them, all interconnected—were in Washington state. When you get that kind of speed, he says, you will be motivated to try things that you would otherwise not bother with.
Speed opens all sorts of possibilities. Groq developers are working on apps where you can jam with AI-generated music in real time, or speed up the process of chip development. Others in the AI community envision even deeper applications. Andrew Ng, a founder of the Google Brain research group and now CEO and founder of Landing AI, says that the biggest consumers of a superfast LLM might not be humans but autonomous agents. “People can read only so fast, so beyond a certain point there’s little value to generating fast text if it’s intended only for human consumption,” he wrote me in an email. “But LLM-based agents are using text generation to ‘think’ through multiple steps of reasoning before reaching a conclusion, and being able to speed this up would be immensely beneficial to speeding up the thinking speed of agents.”
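The rough arithmetic behind Ng’s point is easy to see. The sketch below uses made-up numbers for illustration; it isn’t a benchmark of any particular model or chip.

```python
# An agent that chains many LLM calls feels every call's latency, because each
# step's output becomes the next step's prompt. The figures here are
# illustrative assumptions, not measurements.

def agent_thinking_time(steps: int, seconds_per_call: float) -> float:
    # Sequential chain of reasoning: total time grows with both factors.
    return steps * seconds_per_call

chain_length = 20  # hypothetical plan: search, summarize, critique, revise, ...

for seconds_per_call in (4.0, 1.0, 0.25):  # loosely echoing the demo timings above
    total = agent_thinking_time(chain_length, seconds_per_call)
    print(f"{seconds_per_call}s per call -> {total}s of 'thinking'")

# 4.0s per call -> 80.0s of 'thinking'
# 1.0s per call -> 20.0s of 'thinking'
# 0.25s per call -> 5.0s of 'thinking'
```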
I probed Ross to see if there’s a hidden catch. Critics who have looked at Groq’s scheme, which requires more chips per user than the standard approach, say it will be too costly to implement once it’s out of the demo stage. ChatGPT Plus costs $20 a month—how much more would anyone pay for faster responses? But Ross argues that the lower operating cost of Groq chips will more than cancel out the initial expenses. “Everyone looks at the size of the deployment and goes, gosh, that must be expensive,” he says. “But that’s like saying that a factory is expensive and therefore the car must be expensive.” He says his approach will save money, because it’s more power-efficient. “If you take all the power needed to deploy the GPUs anticipated over the next year or two, you need all the power of Argentina. But doing it with LPUs instead of GPUs, you’d only need the power of Guatemala.” For those not familiar with the CIA World Factbook, Argentina’s population is more than twice as large.
Another potential obstacle: If Groq chips catch on, how will the company manage to provide enough computing for its customers, given that we’re in a global chip shortage? Right now, Groq is contracted to build chips at a foundry in Malta, New York—one that recently got a billion-dollar US government grant courtesy of the Biden-backed CHIPS and Science Act. But that fab can’t provide chips at the scale of Taiwan-based TSMC, the world’s largest foundry. Ross professes to be unworried, saying that while Groq is on track to increase the number of chips it produces, a much bigger boost will come from improved software on the chip.
If all that is true—and there’s no way to tell right now—Groq might well pose a threat to the dominance of Nvidia. Ross is careful when discussing this. “Let’s be clear—they’re Goliath, and we’re David,” he says. “It would be very, very foolish to say that Nvidia is worried about us.” When asked about Groq, though, Nvidia’s prompt response indicates that the startup is indeed on its radar. With near-Groq-like speed, the Goliath’s PR team sent me a statement indicating that Nvidia’s AI advantage lies not only in its chips but also in other services it provides to customers, like AI software, memory, networking, and other goodies. “AI compute in the data center is a complex challenge that requires a full-stack solution,” it says, implying that its unnamed competitor might be stack-challenged.
In any case, Ross says he’s not competing with Nvidia but offering an alternative experience—and not just in terms of speed. He’s on a mission to make sure that Groq will deliver fair results unsullied by political points of view or pressure from commercial interests. “Groq will never be involved in advertising, ever,” he says. “Because that’s influencing people. AI should always be neutral, it should never tell you what you should be thinking. Groq exists to make sure everyone has access. It’s helping you make your decision, not its decisions.” Great sentiments, but even the Groq chatbot, when I quizzed it about early-stage idealism, is skeptical about such claims. “The pressure to generate profits and scale can lead even well-intentioned founders to compromise on their ideals,” it promptly replied.
One other thing. You may have heard that Elon Musk has given the name “Grok” to the LLM created by his AI company. This took Ross by surprise, since he says he trademarked the name of his company when he founded it in 2016, and he believes it covers the phonetically identical original term. “We called dibs,” he says. “He can’t have it. We’ve sent a cease-and-desist letter.” So far he hasn’t gotten a response from Musk.
When I asked Groq about the name dispute, it first cautioned me that it doesn’t provide legal opinions. “However, I can provide some context that may help you understand the situation better,” it said. The bot explained that the term grok has been used in the industry for decades, so Musk would be within his rights to use it. On the other hand, if Groq trademarked the term, it might well have an exclusive claim. All accurate and on the mark—everything you’d expect from a modern LLM. What you would not expect was that the reply appeared in less than a second.
In my book on Google, In the Plex, I explained how the company, and its cofounder Larry Page, prioritized speed and recognized that faster products are used not only more often, but differently. It became an obsession within Google.
Engineers working for Page learned quickly enough of [his speed] priority. “When people do demos and they’re slow, I’m known to count sometimes,” he says. “One one-thousand, two one-thousand. That tends to get people’s attention.” Actually, if your product could be measured in seconds, you’d already failed. Paul Buchheit remembers one time when he was doing an early Gmail demo in Larry’s office. Page made a face and told him it was way too slow. Buchheit objected, but Page reiterated his complaint, charging that the reload took at least 600 milliseconds. (That’s six-tenths of a second.) Buchheit thought, You can’t know that, but when he got back to his own office he checked the server logs. Six hundred milliseconds. “He nailed it,” says Buchheit.
The data in Google’s logs justified the obsession with speed. When things go slowly, says Urs Hölzle, “people are unconsciously afraid of doing another search, because it’s slow. Or they are more likely to try another result than rephrase the query. I’m sure if you ask them, none of them would tell you, but in aggregate you really see that.” On the other hand, when you speed things up, they search more. Hölzle would cite Google’s experience when the company boosted the performance of its Picasa web-based photo service, making slideshows run three times as fast. Even though there was no announcement of the improvement, traffic on the site increased 40 percent the first day it was implemented. “It just happened,” says Hölzle. “The only thing we changed was the speed.”
In 2007, Google conducted some user studies that measured the behavior of people whose search results were artificially delayed. One might think that the minuscule amounts of latency involved in the experiment would be negligible—they ranged between 100 and 400 milliseconds. But even those tiny hiccups in delivering search results acted as a deterrent to future searches. The reduction in the number of searches was small but significant, and was measurable even with 100 milliseconds (one-tenth of a second) of latency. What’s more, even after the delays were removed, the people exposed to the slower results would take a long time to resume their previous level of searching.
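The mechanics of such an experiment are simple. Here’s a minimal sketch of the idea in Python; it’s my own illustration, not Google’s actual experimental setup.

```python
import time

# Sketch of a latency experiment in the spirit of the 2007 study: assign each
# user a fixed artificial delay, then later compare searches per user by bucket.
# Bucket values mirror the 100-400 ms range described above.

DELAY_BUCKETS_MS = [0, 100, 200, 400]

def serve_results(user_id: int, query: str) -> str:
    delay_ms = DELAY_BUCKETS_MS[user_id % len(DELAY_BUCKETS_MS)]  # fixed per user
    time.sleep(delay_ms / 1000)                                   # hold the response back
    log_event(user_id, delay_ms)                                  # analysis happens offline
    return f"results for {query!r}"

def log_event(user_id: int, delay_ms: int) -> None:
    print(f"user={user_id} delay_ms={delay_ms}")

serve_results(7, "groq lpu")
```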
Walt asks, “Will we ever see another astonishing run of new hardware (and some software) products like we saw from Apple under Steve Jobs from 1998 through 2010?”
Thanks for the question, Walt. No one knows better than you about that run of products. Right now it’s hard to identify a single company that could do what Apple accomplished under Steve Jobs in that time frame. But I do think we are on the cusp of a Cambrian-style explosion of groundbreaking products arising from generative AI. They will probably come from different companies but in the aggregate will have an impact exceeding the amazing devices from Apple’s run. A decade from now the remarkable multimodal LLMs of today (which give output in different media) will look like crude test items. And there will be an AI-powered successor to the smartphone.
It’s reasonable to expect that in the 12-year horizon you reference, there will be breakthroughs on the level of transformers to make generative AI even more accomplished. The product of the year in 2034 will be something that is unimaginable today. I can’t wait for the review of that gadget from that era’s Walt Mossberg. The only question will be whether the reviewer is human.
You can submit questions to [email protected]. Write ASK LEVY in the subject line.
Stop humming the song “It Never Rains in Southern California.” Instead give a spin to: “Who’ll Stop the Rain?” Or “Mudslide Slim.”
Our Gadget Lab podcast explains how Nvidia got to be so powerful.
Here’s all the crazy stuff that debuted at the Mobile World Congress.
We know Meta loves ads targeted to its users. But who knew that they are adored by both the Pentagon and Vladimir Putin?
Why the Apple Car died. RIP.