How Could You Use a Speech Interface?

Last month in San Francisco, my colleagues at Mozilla took to the streets to collect samples of spoken English from passers-by. It was the kickoff of our Common Voice Project, an effort to build an open database of audio files that developers can use to train new speech-to-text (STT) applications.

What’s the big deal about speech recognition?

Speech is fast becoming a preferred way to interact with personal electronics like phones, computers, tablets and televisions. Anyone who’s ever had to type in a movie title using their TV’s remote control can attest to the convenience of a speech interface. According to one study, it’s three times faster to talk to your phone or computer than to type a search query into a screen interface.

Plus, the number of speech-enabled devices is increasing daily, as Google Home, Amazon Echo and Apple HomePod gain traction in the market. Speech is also finding its way into multi-modal interfaces, in-car assistants, smart watches, lightbulbs, bicycles and thermostats. So speech interfaces are handy — and fast becoming ubiquitous.

The good news is that recent technical advances have made it simpler than ever to create production-quality STT and text-to-speech (TTS) engines. Machine learning, combined with today’s more capable speech algorithms, has changed the traditional approach to development. Programmers no longer need to build phoneme dictionaries or hand-design processing pipelines and custom components. Instead, speech engines can use deep learning techniques to handle varied speech patterns, accents and background noise – and deliver better accuracy than ever.

The Innovation Penalty

There are barriers to open innovation, however. Today’s speech recognition technologies are largely tied up in a few companies that have invested heavily in them. Developers who want to implement STT on the web are working against a fractured landscape of APIs and uneven browser support. Google Chrome supports an STT API that is different from the one Apple supports in Safari, which is different from Microsoft’s.

So if you want to create a speech interface for a web application that works across all browsers, you need to write separate code for each browser’s API. Writing and then maintaining that code for every browser isn’t feasible for many projects, especially when the code base is large or complex.
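As a rough sketch of what that looks like in practice, the snippet below feature-detects the prefixed and unprefixed SpeechRecognition constructors (the names Chrome uses for its implementation of the draft Web Speech API). Everything past the detection, including the fallback to a server-side STT service, is an illustrative assumption rather than a complete implementation.

```typescript
// Sketch: feature-detect whichever SpeechRecognition constructor the browser exposes.
// Chrome ships a prefixed webkitSpeechRecognition; other browsers may expose nothing
// at all, forcing you onto a server-side STT API instead.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

if (SpeechRecognitionCtor) {
  const recognition = new SpeechRecognitionCtor();
  recognition.lang = "en-US";
  recognition.interimResults = false;

  recognition.onresult = (event: any) => {
    // The first alternative of the first result is the engine's best transcript guess.
    const transcript = event.results[0][0].transcript;
    console.log("Heard:", transcript);
  };

  recognition.onerror = (event: any) => console.error("Recognition error:", event.error);
  recognition.start();
} else {
  // No native support: fall back to recording audio and posting it to a paid STT service.
  console.warn("No SpeechRecognition support in this browser; falling back to a cloud API.");
}
```

Multiply that kind of per-browser glue code across a large application and the maintenance burden becomes clear.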

There is a second option: You can purchase access to a non-browser-based API from Google, IBM or Nuance. Fees for these services run roughly one cent per invocation. If you go this route, you get one stable API to write to. But at one cent per utterance, those fees add up quickly, especially if your app is wildly popular and millions of people want to use it. This option has a success penalty built into it, so it’s not a solid foundation for any business that wants to grow and scale.
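To make the math concrete: at one cent per utterance, an app whose users make a million voice queries a day would accrue roughly $10,000 a day in API fees, or more than $3.5 million a year.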

Opening Up Speech on the Web

We think now is a good time to try to open up the still-young field of speech technology, so more people can get involved, innovate, and compete with the larger players. To help with that, the Machine Learning team in Mozilla Research is working on an open source STT engine. That engine will give Mozilla the ability to support STT in our Firefox browser, and we plan to make it freely available to the speech developer community, with no access or usage fees.

Second, we want to rally other browser makers to support the Web Speech API, a W3C community group specification that lets developers write speech-driven interfaces backed by whatever STT service they choose, rather than being locked into a proprietary or commercial one. That could also open up a competitive market for smart home hubs: devices like the Amazon Echo that could be configured to communicate with one another, and with other systems, for truly integrated, speech-responsive home environments.
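Here is a minimal sketch of that idea, assuming the draft spec’s serviceURI attribute (which no shipping browser currently honors) and a hypothetical open STT endpoint:

```typescript
// Minimal sketch of the Web Speech API's intent: point recognition at an STT
// service of the developer's choosing instead of the browser vendor's default.
// serviceURI comes from the draft spec and is not honored by today's browsers;
// the endpoint URL below is hypothetical.
const recognition = new ((window as any).SpeechRecognition ||
  (window as any).webkitSpeechRecognition)();
recognition.serviceURI = "https://stt.example.org/recognize"; // hypothetical open STT service
recognition.lang = "en-US";
recognition.onresult = (event: any) =>
  console.log("Transcript:", event.results[0][0].transcript);
recognition.start();
```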

Where Could Speech Take Us?

Voice-activated computing could do a lot of good. Home hubs could be used to provide safety and health monitoring for ill or elderly folks who want to stay in their homes. Adding Siri-like functionality to cars could make our roads safer, giving drivers hands-free access to a wide variety of services, like direction requests and chat, so eyes stay on the road ahead. Speech interfaces for the web could enhance browsing experiences for people with visual and physical limitations, giving them the option to talk to applications instead of having to type, read or move a mouse.

It’s fun to think about where this work might lead. For instance, how might we use silent speech interfaces to keep conversations private? If your phone could read your lips, you could share personal information without the person sitting next to you at a café or on the bus overhearing. Now that’s a perk for speakers and listeners alike.

Want to participate? We’re looking for contributors to both open source projects: STT engine development and the Common Voice application repository.

If programming is not your bag, you can always donate a few sentences to the Common Voice Project. You might read: “It made his heart rise into his throat” or “I have the diet of a kid who won $20.” Either way, it’s quick and fun. And it helps us offer developers an open source option that’s robust and affordable.
