The explosion of audio and video content and interfaces over the past few years is obvious, but the means to manage all of this media behind the scenes hasn’t quite caught up. AssemblyAI, powered by $28 million in new funding, aims to become the go-to solution for analyzing speech, providing ultra-simple API access to transcribe, summarize, and otherwise understand what’s happening in thousands of audio streams at once.
Multimedia has become the norm for so many things in an incredibly short time: phone calls and meetings have become video calls, social media posts have become 10-second clips, chatbots have learned to speak and understand speech. Countless new applications are emerging, and like any new and growing industry, users need to be able to work with the data produced by these applications in order to run them well or create something new on top of them.
The problem is that audio isn’t naturally easy to work with. How do you “search” an audio stream? You could look at the waveform or scrub through it, but you’re more likely to want to transcribe it first and then search the resulting text. This is where AssemblyAI comes in: although there are many transcription services out there, they are often not easy to integrate into your own application or business process.
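To make the transcribe-then-search idea concrete, here is a minimal sketch in Python. It assumes a transcript is already available as a list of words with start times in milliseconds; the `Word` type, the `find_phrase` helper, and the sample data are hypothetical illustrations, not part of AssemblyAI's API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcribed word (hypothetical shape, for illustration only)."""
    text: str
    start_ms: int  # when the word begins in the audio, in milliseconds


def find_phrase(words, phrase):
    """Return the start time (ms) of every occurrence of `phrase`."""
    target = [t.lower() for t in phrase.split()]
    if not target:
        return []
    # Normalize case and trailing punctuation so "show." matches "show".
    normalized = [w.text.lower().strip(".,!?") for w in words]
    hits = []
    for i in range(len(normalized) - len(target) + 1):
        if normalized[i:i + len(target)] == target:
            hits.append(words[i].start_ms)
    return hits


# Made-up transcript standing in for the output of a transcription service.
transcript = [
    Word("Welcome", 0), Word("to", 400), Word("the", 550), Word("show.", 700),
    Word("The", 1500), Word("show", 1700), Word("starts", 2000), Word("now.", 2400),
]

print(find_phrase(transcript, "the show"))
```

Once the audio is text with timestamps, finding a moment in a recording becomes an ordinary string-matching problem, and each hit can be mapped back to a position in the audio.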
“If you want to moderate content, search or summarize audio data, you need to transform that data into a format that is more flexible and upon which you can build business functionality and processes,” said AssemblyAI CEO and co-founder Dylan Fox. “So we said to ourselves, let’s create a super-accurate voice analysis API that anyone can call, even at a hackathon – like a Twilio or Stripe-style integration. People need a lot of help to create these features, but they don’t want to bundle a bunch of vendors together.”
AssemblyAI offers a handful of different APIs that you can call extremely simply (a line or two of code) to do things like “check this podcast for banned content,” “identify the speakers in this conversation,” or “summarize this meeting in less than 100 words.”
You may very well, like me, be skeptical that a single small company could produce working tools that accomplish so many tasks so simply, considering how complex those tasks turn out to be once you get into them. Fox acknowledged it was a challenge, but said the technology has come a long way in a short time.
“There has been a rapid increase in the accuracy of these models, especially in recent years,” he said. “Summarization, sentiment detection… they’re all really good now. And we’re actually pushing the state of the art – our models are better than what’s out there, because we’re one of the few startups that’s really doing large-scale deep learning research. We will be spending over $1 million on GPUs and compute for R&D and training in the next few months alone.”
This may be more difficult to grasp intuitively because it’s not so easily demonstrable, but speech models have advanced just as much as things like image generation (“This ___ Does Not Exist”) and computer vision (Face ID, security cameras). Of course, GPT-3 is a familiar example, but Fox pointed out that understanding and generating the written word is practically an entirely different area of research from analyzing conversation and informal speech. So while the same advances in machine learning techniques (like transformers and new, more efficient training frameworks) have helped both, they are apples and oranges in many ways.
The result, in any case, is that it is possible to run effective moderation or summarization on an audio clip of a few seconds or an hour, simply by calling the API. This is extremely useful when you’re building or integrating a feature like, say, short-form video: if you expect a hundred thousand clips to be uploaded every hour, what’s your process for a first pass to ensure that they are not porn, or scams, or duplicates? And how long will the launch be delayed while you build this process?
Instead, Fox hopes companies in this position will look for a simple and efficient route, as they would if they needed to add a checkout process. Of course, you can build one from scratch, or you can add Stripe in about 15 minutes. Not only is this fundamentally desirable, it also clearly separates AssemblyAI from the more complex multi-service packages that define audio analysis products from major vendors such as Microsoft and Amazon.
The company already has hundreds of paying customers, tripled its revenue last year and now processes one million audio streams per day. “We are 100% live. There’s a huge market and a huge need, and the customer spending is there,” Fox said.
The $28 million round was “led by Accel, with participation from Y Combinator, John and Patrick Collison (Stripe), Nat Friedman (GitHub), and Daniel Gross (Pioneer).” The plan is to spread all those zeros across recruiting, R&D infrastructure, and building out the product pipeline. As Fox noted, the company is spending $1 million on GPUs and servers over the next few months: a bunch of Nvidia A100s that will power its incredibly compute-intensive research and training processes. The alternative is being stuck paying for cloud services, so it’s best to rip off that band-aid early.
As for recruiting, I suggested that they might struggle to recruit in direct competition with Google and Facebook, who of course are hard at work on their own audio analytics pipelines. Fox was optimistic, however, believing the culture there can be slow and stifling.
“I think there’s definitely a desire among really good AI researchers and engineers to want to work at the cutting edge of technology – and at the cutting edge of technology in production,” he said. “You come up with something innovative, and a few weeks later you put it into production…a startup is the only place you can do things like that.”