{"id":62347,"date":"2017-07-28T00:00:00","date_gmt":"2017-07-28T00:00:00","guid":{"rendered":"http:\/\/blog.mozilla.org\/foxtail\/2017\/07\/28\/machine-learning-speech-recognition\/"},"modified":"2021-02-09T05:38:22","modified_gmt":"2021-02-09T05:38:22","slug":"machine-learning-speech-recognition","status":"publish","type":"post","link":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/","title":{"rendered":"How Could You Use a Speech Interface?"},"content":{"rendered":"<p>Last month in San Francisco, my colleagues at Mozilla took to the streets to collect samples of spoken English from passers-by. It was the kickoff of our <a href=\"https:\/\/voice.mozilla.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Common Voice Project<\/a>, an effort to build an open database of audio files that developers can use to train new speech-to-text (STT) applications.<\/p>\n<p>What\u2019s the big deal about speech recognition?<\/p>\n<p>Speech is fast becoming a preferred way to interact with personal electronics like phones, computers, tablets and televisions. Anyone who\u2019s ever had to type in a movie title using their TV\u2019s remote control can attest to the convenience of a speech interface. According to <a href=\"http:\/\/mashable.com\/2016\/08\/29\/baidu-deep-speech-2-fast-speech-recognition\/#FGjjswH73mqF\" target=\"_blank\" rel=\"noopener noreferrer\">one study<\/a>, it\u2019s three times faster to talk to your phone or computer than to type a search query into a screen interface.<\/p>\n<p>Plus, the number of speech-enabled devices is increasing daily, as Google Home, Amazon Echo and Apple HomePod gain traction in the market. Speech is also finding its way into multi-modal interfaces, in-car assistants, smart watches, lightbulbs, bicycles and thermostats. 
So speech interfaces are handy &#8212; and fast becoming ubiquitous.<\/p>\n<p>The good news is that technical advances in recent years have made it simpler than ever to create production-quality STT and text-to-speech (TTS) engines. Powerful tools like artificial intelligence and machine learning, combined with today\u2019s more advanced speech algorithms, have changed our traditional approach to development. Programmers no longer need to build phoneme dictionaries or hand-design processing pipelines or custom components. Instead, speech engines can use deep learning techniques to handle varied speech patterns, accents and background noise \u2013 and deliver better-than-ever accuracy.<\/p>\n<h3><b>The Innovation Penalty<\/b><\/h3>\n<p>There are barriers to open innovation, however. Today\u2019s speech recognition technologies are largely tied up in a few companies that have invested heavily in them. Developers who want to implement STT on the web are working against a fractured set of APIs and support. Google Chrome supports an STT API that is different from the one Apple supports in Safari, which is different from Microsoft\u2019s.<\/p>\n<p>So if you want to create a speech interface for a web application that works across all browsers, you need to write code that works with each of the various browser APIs. Writing and then rewriting code to work with every browser isn\u2019t feasible for many projects, especially if the code base is large or complex.<\/p>\n<p>There is a second option: You can purchase access to a non-browser-based API from Google, IBM or Nuance. Fees run roughly one cent per invocation. If you go this route, then you get one stable API to write to. But at one cent per utterance, those fees can add up quickly, especially if your app is wildly popular and millions of people want to use it. 
This option has a success penalty built into it, so it\u2019s not a solid foundation for any business that wants to grow and scale.<\/p>\n<h3><b>Opening Up Speech on the Web<\/b><\/h3>\n<p>We think now is a good time to try to open up the still-young field of speech technology, so more people can get involved, innovate, and compete with the larger players. To help with that, the <a href=\"https:\/\/research.mozilla.org\/machine-learning\/\" target=\"_blank\" rel=\"noopener noreferrer\">Machine Learning<\/a> team in Mozilla Research is working on an <a href=\"https:\/\/github.com\/mozilla\/DeepSpeech\" target=\"_blank\" rel=\"noopener noreferrer\">open source STT engine<\/a>. That engine will give Mozilla the ability to support STT in our Firefox browser, and we plan to make it\u00a0freely available to the speech developer community, with no access or usage fees.<\/p>\n<p>Secondly, we want to rally other browser companies to support the <a href=\"https:\/\/dvcs.w3.org\/hg\/speech-api\/raw-file\/tip\/speechapi.html\" target=\"_blank\" rel=\"noopener noreferrer\">Web Speech API<\/a>, a W3C community group specification that can allow developers to write speech-driven interfaces that utilize any STT service they choose, rather than having to select a proprietary or commercial service. That could open up a competitive market for smart home hubs\u2013devices like the Amazon Echo that could be configured to communicate with one another, and other systems, for truly integrated speech-responsive home environments.<\/p>\n<h3><b>Where Could Speech Take Us?<\/b><\/h3>\n<p>Voice-activated computing could do a lot of good. Home hubs could be used to provide safety and health monitoring for ill or elderly folks who want to stay in their homes. Adding Siri-like functionality to cars could make our roads safer, giving drivers hands-free access to a wide variety of services, like direction requests and chat, so eyes stay on the road ahead. 
Speech interfaces for the web could enhance browsing experiences for people with visual and physical limitations, giving them the option to talk to applications instead of having to type, read or move a mouse.<\/p>\n<p>It\u2019s fun to think about where this work might lead. For instance, how might we use <a href=\"https:\/\/en.wikipedia.org\/wiki\/Silent_speech_interface\" target=\"_blank\" rel=\"noopener noreferrer\">silent speech interfaces<\/a> to keep conversations private? If your phone could read your lips, you could share personal information without the person sitting next to you at a caf\u00e9 or on the bus overhearing. Now that\u2019s a perk for speakers and listeners alike.<a href=\"https:\/\/youtu.be\/fa5QGremQf8\" target=\"_blank\" rel=\"noopener noreferrer\"><img decoding=\"async\" loading=\"lazy\" class=\"alignright wp-image-10702 size-medium\" src=\"https:\/\/blog.mozilla.org\/wp-content\/uploads\/2017\/07\/Lipread-300x167.png\" alt=\"Speech recognition using lip-reading\" width=\"300\" height=\"167\" srcset=\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Lipread-300x167.png 300w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Lipread-768x428.png 768w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Lipread-600x334.png 600w, https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Lipread.png 860w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>Want to participate? 
We\u2019re looking for more folks to participate in both open source projects: <a href=\"https:\/\/github.com\/mozilla\/DeepSpeech\" target=\"_blank\" rel=\"noopener noreferrer\">STT engine development<\/a> and the <a href=\"https:\/\/github.com\/mozilla\/voice-web\" target=\"_blank\" rel=\"noopener noreferrer\">Common Voice application repository<\/a>.<\/p>\n<p>If programming is not your bag, you can always donate a few sentences to the <a href=\"https:\/\/voice.mozilla.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Common Voice Project<\/a>. You might read: \u201cIt made his heart rise into his throat\u201d or \u201cI have the diet of a kid who won $20.\u201d Either way, it\u2019s quick and fun. And it helps us offer developers an open source option that\u2019s robust and affordable.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last month in San Francisco, my colleagues at Mozilla took to the streets to collect samples of spoken English from passers-by. It was the kickoff of our Common Voice Project, an effort to build an open database of audio files that developers can use to train new speech-to-text (STT) applications. 
What\u2019s the big deal about [&hellip;]<\/p>\n","protected":false},"author":1495,"featured_media":10706,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"coauthors":[],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How Could You Use a Speech Interface?<\/title>\n<meta name=\"description\" content=\"Mozilla&#039;s Machine Learning team is working on an open source speech recognition engine, so more folks can develop speech interfaces for a range of devices.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/\",\"url\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/\",\"name\":\"How Could You Use a Speech Interface?\",\"isPartOf\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Common-Voice-large.png\",\"datePublished\":\"2017-07-28T00:00:00+00:00\",\"dateModified\":\"2021-02-09T05:38:22+00:00\",\"author\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/43ad8add452ec163df03646f5cc26eea\"},\"description\":\"Mozilla's Machine Learning team is working on an open source speech 
recognition engine, so more folks can develop speech interfaces for a range of devices.\",\"breadcrumb\":{\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#primaryimage\",\"url\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Common-Voice-large.png\",\"contentUrl\":\"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Common-Voice-large.png\",\"width\":1920,\"height\":1080},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/blog.mozilla.org\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How Could You Use a Speech Interface?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#website\",\"url\":\"https:\/\/blog.mozilla.org\/en\/\",\"name\":\"The Mozilla Blog\",\"description\":\"News and Updates about Mozilla\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blog.mozilla.org\/en\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/43ad8add452ec163df03646f5cc26eea\",\"name\":\"Kelly 
Davis\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/image\/37abd3db0af06a248bdeb4b708ff6160\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/57875eda51e3d83baa983b46c7041923?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/57875eda51e3d83baa983b46c7041923?s=96&d=mm&r=g\",\"caption\":\"Kelly Davis\"},\"url\":\"https:\/\/blog.mozilla.org\/en\/author\/kdavismozilla-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How Could You Use a Speech Interface?","description":"Mozilla's Machine Learning team is working on an open source speech recognition engine, so more folks can develop speech interfaces for a range of devices.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/","url":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/","name":"How Could You Use a Speech Interface?","isPartOf":{"@id":"https:\/\/blog.mozilla.org\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#primaryimage"},"image":{"@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#primaryimage"},"thumbnailUrl":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Common-Voice-large.png","datePublished":"2017-07-28T00:00:00+00:00","dateModified":"2021-02-09T05:38:22+00:00","author":{"@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/43ad8add452ec163df03646f5cc26eea"},"description":"Mozilla's Machine Learning team is working on an open source speech 
recognition engine, so more folks can develop speech interfaces for a range of devices.","breadcrumb":{"@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#primaryimage","url":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Common-Voice-large.png","contentUrl":"https:\/\/blog.mozilla.org\/wp-content\/blogs.dir\/278\/files\/2017\/07\/Common-Voice-large.png","width":1920,"height":1080},{"@type":"BreadcrumbList","@id":"https:\/\/blog.mozilla.org\/en\/mozilla\/machine-learning-speech-recognition\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/blog.mozilla.org\/en\/"},{"@type":"ListItem","position":2,"name":"How Could You Use a Speech Interface?"}]},{"@type":"WebSite","@id":"https:\/\/blog.mozilla.org\/en\/#website","url":"https:\/\/blog.mozilla.org\/en\/","name":"The Mozilla Blog","description":"News and Updates about Mozilla","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blog.mozilla.org\/en\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/43ad8add452ec163df03646f5cc26eea","name":"Kelly Davis","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/blog.mozilla.org\/en\/#\/schema\/person\/image\/37abd3db0af06a248bdeb4b708ff6160","url":"https:\/\/secure.gravatar.com\/avatar\/57875eda51e3d83baa983b46c7041923?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/57875eda51e3d83baa983b46c7041923?s=96&d=mm&r=g","caption":"Kelly 
Davis"},"url":"https:\/\/blog.mozilla.org\/en\/author\/kdavismozilla-com\/"}]}},"_links":{"self":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts\/62347"}],"collection":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/users\/1495"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/comments?post=62347"}],"version-history":[{"count":0,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/posts\/62347\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/media\/10706"}],"wp:attachment":[{"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/media?parent=62347"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/categories?post=62347"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/tags?post=62347"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/blog.mozilla.org\/en\/wp-json\/wp\/v2\/coauthors?post=62347"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}