OpenAI's Whisper And The Voice Interface Revolution

Is writing with keyboards and touchscreens on the way out?

Aug 01, 2023

“St. Mark writes his Gospel at the dictation of St. Peter”. Attributed to Pasquale Ottini, Public domain, via Wikimedia Commons.

The Hidden Revolution

In the past seven months, the world’s imagination has been captured by language generation technologies like ChatGPT and Claude. In the meantime, another AI-enabled revolution is quietly unfolding.

It is the advent of unprecedented speech recognition capabilities, thanks to AI models such as OpenAI’s Whisper. These models allow us to easily communicate with our devices and AI helpers via voice.

The importance of OpenAI’s Whisper was first brought to my attention by Ethan Mollick’s tweets and this video by AI Explained, which I strongly recommend.

In this article:

I explore what Whisper can do and why it is significantly better than other speech recognition tools.
I show you how you can start using Whisper in your daily life to completely transform the way you work and interact with your devices.
I present three prototypes that I’ve personally built to fully exploit Whisper and integrate it into my work.
Finally, I explore the broader implications for the future of communication technology.

What Whisper Can Do

Whisper can transcribe and translate audio in multiple languages. Source: https://openai.com/research/whisper

For the past seventy years, writing on a keyboard (or more recently, tapping on a touchscreen) has been the main interface for communicating with computers.

Clearly, this is not how humans naturally communicate. We are biologically and culturally primed to communicate with our voices. And in every respectable sci-fi production, people talk to their computers.

In fact, we’ve had ‘decent’ speech-to-text solutions for quite a while now, like Apple’s Siri and Amazon’s Alexa. But they were never quite good enough to ignite an interface revolution. With interface technology, a good enough solution is usually not sufficient. To get people to change their habits, you need a solution that is almost perfect.

Is OpenAI’s Whisper ‘almost perfect’? The best way to find out is to experiment for yourself. For my part, I find Whisper to be extremely accurate. For instance, 95% of the text you are reading was directly dictated to Whisper, with very little manual intervention.

Whisper has several features which make it a superior speech recognition tool:

It can transcribe at least 57 languages besides English. I have tried transcribing Italian, Spanish, German, and Romanian; the transcriptions seemed quite accurate.
When you talk to Whisper, the text comes out fully structured, with correct punctuation and syntax. It is not a shapeless blob of words like with most transcription tools. That is much less time invested in revising and correcting the text.
Whisper is able to accurately transcribe technical names such as ChatGPT, PyPI package management, and SSRIs. To do this, it uses contextual knowledge about the world that it has acquired during training.
You can direct and personalize Whisper’s output by providing a written prompt. For example, you can choose between a well-structured text and a more literal transcription that involves speech artifacts like “mmm”, “ahh”, and so on.
Whisper can deal with noisy environments and overlapping voices. From what I’ve experienced, Whisper can identify the main voice in the audio and transcribe it without getting distracted by background noise.
Like ChatGPT, Whisper is available via API for developers to build on it. Whisper is also open-source, so you could host the model locally without depending on OpenAI.
Whisper is relatively cheap. Currently, it costs $0.36 to transcribe one hour of audio via OpenAI’s Whisper endpoint.
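
To make the last few points concrete, here is a minimal sketch of calling the hosted model with a prompt and estimating what it would cost. It assumes the official `openai` Python package (version 1 or later) and an `OPENAI_API_KEY` environment variable. The helper names, the file name, and the prompt text are hypothetical, and the price constant is just the $0.36 per hour quoted above, which may change.

```python
from pathlib import Path

# Price quoted in this article: $0.36 per hour of audio (may change).
PRICE_PER_HOUR_USD = 0.36


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimate the transcription cost for a clip of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return round(duration_seconds / 3600 * PRICE_PER_HOUR_USD, 4)


def transcribe(audio_path: str, prompt: str = "") -> str:
    """Send an audio file to OpenAI's hosted Whisper model and return the text.

    The optional prompt nudges spelling and style, for example by listing
    technical terms that are likely to appear in the recording.
    """
    # Imported lazily so the cost helper works without the package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with Path(audio_path).open("rb") as audio_file:
        kwargs = {"model": "whisper-1", "file": audio_file}
        if prompt:
            kwargs["prompt"] = prompt
        result = client.audio.transcriptions.create(**kwargs)
    return result.text


if __name__ == "__main__":
    # Cost of a 3-minute voice message at the quoted price.
    print(estimate_cost_usd(3 * 60))
    # Hypothetical call; needs an API key and a real audio file:
    # print(transcribe("memo.m4a", prompt="ChatGPT, SSRIs, Whisper"))
```

At the quoted price, even long recordings cost cents rather than dollars, which is what makes the "transcribe everything" habits described below practical.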

How To Use Whisper

ChatGPT Phone Applications

The ChatGPT Android/iOS app uses Whisper for voice input.

Whisper handles voice input in the ChatGPT app for Android and iOS. This is the best way to try Whisper for free. These apps have been released very recently, and not many users know that they contain a state-of-the-art speech recognition model. I’ve already made extensive use of this feature: I find that it allows me to quickly explain my problem to ChatGPT and get a response in under one minute. This makes ChatGPT feel a bit less like a lifeless chat and more like a trusted assistant or a support agent.

Given Whisper’s frontier capabilities, I was excited to integrate it more tightly with my workflows. Since I couldn’t find any apps that do this, I built a few prototypes that showcase Whisper’s power.

Prototype 1: Whisper As Complementary Keyboard

I wanted Whisper to be a second keyboard for my laptop. I built wkey, a Python plugin, to make it happen. It is best to see it in action in this one-minute video demo.

At any time I can press a button on my laptop, talk, and the transcription will be sent directly where my cursor is. This allows me to use Whisper to write documents, chat, or run Google searches: in short, for everything. In fact, that’s how I wrote 95% of this article!

wkey is now available as a Python package.
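
For readers curious how a push-to-talk tool of this kind could be structured, here is a minimal sketch. It is not wkey's actual code: the class and method names are hypothetical, and the microphone, the Whisper call, and the keyboard emulation are passed in as plain functions so the logic can be read and tested on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PushToTalk:
    """Collect audio while a hotkey is held, then type out the transcription."""

    transcribe: Callable[[bytes], str]  # e.g. a call to Whisper
    type_text: Callable[[str], None]    # e.g. a keyboard-emulation library
    _chunks: List[bytes] = field(default_factory=list)
    _recording: bool = False

    def on_key_press(self) -> None:
        # Start a fresh recording; repeated press events while held are ignored.
        if not self._recording:
            self._chunks.clear()
            self._recording = True

    def on_audio_chunk(self, chunk: bytes) -> None:
        # Called by the microphone callback; keep audio only while the key is held.
        if self._recording:
            self._chunks.append(chunk)

    def on_key_release(self) -> None:
        # Stop recording, transcribe what was captured, and type it at the cursor.
        if not self._recording:
            return
        self._recording = False
        audio = b"".join(self._chunks)
        if audio:
            text = self.transcribe(audio).strip()
            if text:
                self.type_text(text)


if __name__ == "__main__":
    typed: List[str] = []
    ptt = PushToTalk(
        transcribe=lambda audio: f" heard {len(audio)} bytes ",  # stand-in for Whisper
        type_text=typed.append,                                  # stand-in for typing
    )
    ptt.on_audio_chunk(b"ignored")  # key not held yet, so this chunk is dropped
    ptt.on_key_press()
    ptt.on_audio_chunk(b"abc")
    ptt.on_audio_chunk(b"de")
    ptt.on_key_release()
    print(typed)  # ['heard 5 bytes']
```

In a real tool, the three callbacks would be wired to a global hotkey listener and an audio input stream. Keeping them separate from the state handling makes it easy to swap the hosted Whisper endpoint for a locally hosted model.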

Prototype 2: Whisper As Chat Companion

While having Whisper in the ChatGPT app is great, I wanted to integrate it with all my chats. I built a Telegram bot for this purpose, which I call WhisperBot. I send voice messages to WhisperBot and it responds with a transcription.

WhisperBot transcribing my voice message on Telegram.

Whisper can directly transcribe or translate at least 57 languages, so you can probably talk to it in your language.

Here’s how I’ve been using WhisperBot:

One semi-hidden source of social conflict in chats is that people prefer sending voice messages, but they prefer receiving text messages. Talking is more convenient than typing, and reading is more convenient than listening. With WhisperBot I can:
- Turn my voice messages into text. I personally like to make long voice messages with my thoughts, but people are usually not excited to get a 3-minute voice message from me. Now, I send my voice message to WhisperBot and then forward the transcription. It usually doesn’t require any manual revision.
- Transcribe other people’s voice messages. When I get a voice message from someone, I usually forward it to WhisperBot and just read the transcription.
Note down my thoughts. I already had a private Telegram chat where I wrote down thoughts and ideas. It is more fun to do it with my voice and have it transcribed immediately. Once your thoughts are turned into text, you can send them to your knowledge management system or to ChatGPT for further refinement.
Write while I’m on the go. I can now write an entire blog post or email during a walk. I explain my thoughts in a voice message to WhisperBot, get the transcript, put it into ChatGPT, and have it reformatted as a post. I can also include direct indications, such as “ChatGPT, add two more examples of situations in which communication is crucial”. This system greatly reduces the time to produce a document, and it is very fun to write while on a walk.

The code for WhisperBot is freely available in my repository.
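
The core loop of a bot like this is small: receive a voice message, transcribe it, reply with the text. The sketch below illustrates that flow and is not the code from my repository. The function names are hypothetical, and the download, transcription, and reply steps are stand-ins passed in as plain functions. It also splits long transcripts, because Telegram caps a single text message at 4096 characters.

```python
from typing import Callable, List

TELEGRAM_MAX_CHARS = 4096  # Telegram's limit for a single text message


def split_for_telegram(text: str, limit: int = TELEGRAM_MAX_CHARS) -> List[str]:
    """Split a transcript into chunks of at most `limit` characters.

    Splits on whitespace, so runs of spaces and newlines are normalized.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # Hard-split any single word that is longer than the limit.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def handle_voice_message(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    send_reply: Callable[[str], None],
) -> int:
    """Transcribe one voice message and reply with the text.

    Returns the number of messages sent.
    """
    transcript = transcribe(audio).strip()
    parts = split_for_telegram(transcript) or ["(no speech detected)"]
    for part in parts:
        send_reply(part)
    return len(parts)


if __name__ == "__main__":
    sent: List[str] = []
    handle_voice_message(b"fake-audio", lambda audio: "hello world", sent.append)
    print(sent)  # ['hello world']
```

A 3-minute voice message usually fits in one reply, so the splitting only matters for long dictations such as a whole blog post recorded during a walk.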

Prototype 3: Whisper As Translator and Interpreter

Since Whisper can easily understand 57+ languages and translate them to English, it is clear that it can be used to build an amazing translator and interpreter application. This is important for me because my parents and partner don’t speak the same language, and I want them to be able to communicate independently.

Therefore, I’ve built a prototype for an app that allows people to speak on Telegram without knowing each other’s language, crossing cultural and language barriers. That is a topic for future posts, but I wanted to mention it here since it’s a great use case for Whisper.
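
One design detail is worth spelling out: Whisper's translation mode outputs English only, so a conversation between two non-English speakers needs a second step that translates the English text into the listener's language. The sketch below shows one possible way to route a message. It is an assumption for illustration, not the design of my prototype; the step names are hypothetical labels, and the `text_translate` step stands for whatever text translation tool you choose (for example, a language model).

```python
from typing import List


def plan_steps(speaker_lang: str, listener_lang: str) -> List[str]:
    """Return the processing steps for one voice message.

    Languages are given as short codes such as 'it', 'ro', or 'en'.
    """
    speaker, listener = speaker_lang.lower(), listener_lang.lower()
    if speaker == listener:
        # Same language: a plain transcription is enough.
        return ["whisper:transcribe"]
    if listener == "en":
        # Whisper can translate speech directly into English.
        return ["whisper:translate_to_english"]
    if speaker == "en":
        # English speech: transcribe, then translate the text.
        return ["whisper:transcribe", f"text_translate:en->{listener}"]
    # Neither side is English: go through English as a pivot language.
    return ["whisper:translate_to_english", f"text_translate:en->{listener}"]


if __name__ == "__main__":
    print(plan_steps("ro", "it"))
```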

GitHub Copilot Voice: A New Programming Paradigm

Copilot Voice is an upcoming GitHub Copilot feature. I warmly recommend watching the demo.

Copilot Voice will allow programmers to vocally describe what they want, while the AI takes charge of the low-level task of writing the code. It combines Whisper with Copilot, which uses GPT to translate the programmer’s intent into code.

This is a revolutionary way to rethink coding. As a professional developer, I believe that it will radically transform how we work, and this transformation will occur within the next year.

I have applied for the Copilot Voice waitlist, but I’m still waiting to try it. Hear that, GitHub? 😊

The Future of Computing Interfaces

We have explored how powerful Whisper is and how it can be used for daily life. Now let us zoom out a little and reflect on what this means for the future of computing interfaces.

First, note that writing on a keyboard or tapping a touchscreen is not an efficient way for a human to communicate. The average typing speed is about 40 words per minute. The average speaking speed can go up to 150 words per minute, so talking can be more than three times faster than typing.
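
To see what that gap means in practice, here is a small back-of-the-envelope calculation using the two rates quoted above, treating the 150 words per minute figure as a speaking rate. The 1,500-word draft length is a hypothetical example, and the numbers ignore the time spent thinking and revising.

```python
TYPING_WPM = 40     # average typing speed quoted above
SPEAKING_WPM = 150  # upper-end speaking speed quoted above


def minutes_to_produce(word_count: int, words_per_minute: float) -> float:
    """Minutes needed to produce `word_count` words at a steady rate."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_count / words_per_minute


draft_words = 1500  # hypothetical length of a blog post draft
typing = minutes_to_produce(draft_words, TYPING_WPM)
speaking = minutes_to_produce(draft_words, SPEAKING_WPM)

print(f"typing: {typing:.1f} min, speaking: {speaking:.1f} min, "
      f"speed-up: {typing / speaking:.2f}x")
# typing: 37.5 min, speaking: 10.0 min, speed-up: 3.75x
```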

Talking feels more fun and natural. People who are unfamiliar with technology must learn how to type on a keyboard. They don’t need to learn how to talk because they spent the first few years of their lives learning that. It is likely that we will spend more time producing text if we can do it with our voice.

An excellent speech recognition system allows you to communicate without your hands. You could be on a walk, driving, or you could be a doctor engaged in a delicate surgical operation. Again, this makes communication easier and more natural.

It is important to note that your voice contains much more information than some hits on a keyboard. Here’s ChatGPT’s take:

Voice messages possess a rich depth of communication attributes such as tone, pitch, volume, cadence, rhythm, and emphasis, which can significantly influence the meaning of a message. Moreover, voice messages also carry non-verbal sounds, such as laughter, sighs, pauses, or breaths, which can further augment the communicated sentiments. Other nuanced auditory elements like accent, pronunciation, and voice quality can reveal cultural background, education, or even health status. Furthermore, temporal aspects, such as the pace of speech, the length of pauses, and the use of filler words can offer insights into the speaker's thought process, comfort level, or their familiarity with the topic.

While Whisper doesn’t analyze this information now, you can bet that future speech recognition tools will differentiate themselves by their ability to use this information to better understand and assist the user.

Voice understanding capabilities will also lead to more immersion. It will be possible to speak with AI entities via live voice communication, phone, and video calls, or within a video game.

In fact, it is perhaps a historical accident that the main interface for communicating with AI models is currently a written chat. It is likely that a year from now, most people will interact with AI via voice.

Moreover, the lines between media types will blur because any content that exists in the form of audio will instantly be available in the form of text at almost zero cost. Language barriers will also blur because it will be possible to instantly and accurately transcribe and translate content from any language, whether written or audio. These changes will transform the way we produce and consume content.

Conclusion

In the near future, computer devices will become more usable, accessible, fun to use, and immersive. Of course, this also applies to AI entities like ChatGPT and Claude. We will still produce a lot of text in our digital lives, but most of this text will be made via voice and not via mechanical typing. Content barriers and language barriers will fall apart. Programmers will work with their voices more than with their hands, and their offices will become as noisy as salespeople’s offices.

These changes will require adaptation, especially for ‘dinosaurs’ like me who have been using the mechanical typing interface for more than 20 years. After I integrated Whisper with my keyboard, many activities became faster and more efficient. The biggest limit on exploiting this new capability was… my constant forgetting that it was there and that I should use it! The force of habit is extremely strong, and I am used to doing everything with the keyboard. If I want to leverage this power, I have to retrain myself and completely alter the way that I relate to my computer.

Finally, we will have to update our social norms and expectations to handle the reality of people constantly speaking to their devices in the office, at home, and on the street. If this seems weird to you, remember that wearing headphones or talking to a phone in public was once considered weird too.

If you’ve enjoyed this post, consider subscribing to support me and get updates.

AI Primer
