AI for Speech Recognition – Current Companies, Technology, and Trends

Ayn de Jesus

Ayn serves as AI Analyst at Emerj - covering artificial intelligence use-cases and trends across industries. She previously held various roles at Accenture.


Speech recognition is technology that can recognize spoken words, which can then be converted to text. A subset of speech recognition is voice recognition, which is the technology for identifying a person based on their voice.

Facebook, Amazon, Microsoft, Google and Apple — five of the world’s top tech companies — are already offering this feature on various devices through services like Google Home, Amazon Echo and Siri.

With a number of voice recognition products on the market, we decided to look into the business implications of voice recognition. By researching the speech recognition technology of these companies, we try to answer the following questions for our readers:

  • How is speech recognition driving business value for these companies?
  • Why are they investing in speech recognition?
  • What could this technology look like in a few years?

We start with some context on how and why the tech giants are developing voice recognition technology, followed by a rundown of voice recognition technology from Facebook, Amazon, Microsoft, Google and Apple.

Potential Reasons for Developing Speech Recognition Technology

Technology companies are recognizing consumer interest in speech recognition and are working toward making voice recognition a standard feature for most products. One goal of these companies may be to make voice assistants speak and reply with greater accuracy around context and content.

Research shows that the use of virtual assistants with speech recognition capabilities is forecast to keep increasing, from 60.5 million people in the United States in 2017 to 62.4 million in 2018. By 2019, 66.6 million Americans are projected to be using speech or voice recognition technology.

To build a robust speech recognition experience, the artificial intelligence behind it has to become better at handling challenges such as accents and background noise. Developments in natural language processing and neural network technology have improved speech and voice technology so much that it is reportedly on par with humans. In 2017, for example, Microsoft recorded a word error rate of 5.1 percent for its voice technology, while Google reported that it had reduced its rate to 4.9 percent.

Research firm Research and Markets reported that the speech recognition market will be worth $18 billion by 2023. As voice recognition technology improves, the research estimates that it could be incorporated into everything from phones to refrigerators to cars. A glimpse of that was seen at the CES 2017 show in Las Vegas, where new voice-enabled devices were launched or announced.

To offer insights on how the leaders in voice recognition compare, we have created a list highlighting each company and its features.

While all applications have very similar features and integration opportunities, we have clustered them based on what our research points to as the primary focus areas of each. The two focus areas we will note in this piece are:

  • Smart Speaker and Smart Home: Highlighting Amazon, Google and Microsoft
  • Mobile Device Applications: Highlighting Apple’s Siri and Facebook’s speech recognition integrations.

Smart Speaker and Smart Home

Amazon Echo and Alexa

Until recently, Amazon’s voice-powered virtual assistant, Alexa, was available only on commercial products made by Amazon. However, Amazon Web Services has made the voice assistant available to other companies. Amazon partnered with Intel to launch the Alexa Voice Service Device Software Development Kit, which could allow third-party companies to embed Alexa capabilities into their devices. This partnership is a result of Amazon’s “Alexa Everywhere” strategy, which the company says aims to make the technology behind Alexa ubiquitously available to manufacturers of various smart and wearable devices.

At the CES 2018 in Las Vegas, Sony, TiVo and Hisense unveiled smart home skills that integrated Alexa, enabling customers to control the TV by voice. Home appliance makers such as Whirlpool, Delta, LG and Haier have also added Alexa’s voice-recognition skills to help people control all aspects of their home, from TVs and microwaves to air conditioning units and faucets. According to the Amazon Alexa site, more than 13,000 smart home devices from over 2,500 brands can be controlled with Alexa.

Including additions from other companies, Alexa now comes with 30,000 skills. While Apple has Siri and Google has its Assistant built into smartphones and speakers, Amazon integrated Alexa into its intelligent speaker, the Echo. While Amazon does not disclose final sales numbers, Forrester predicted that it would have sold 22 million Echo units by the end of 2017. Hitting this sales number would make Echo the best-selling voice assistant in the US, according to Forrester.

To enable an Alexa skill, users navigate to the Skills section of the Alexa app to view the catalog of available capabilities. Once a skill is selected, the user taps “Enable Skill.” Skills can also be enabled by voice.
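On the developer side, the skills users enable this way are ultimately web services (often AWS Lambda functions) that receive intent requests from the Alexa service and return JSON responses describing what Alexa should say. The Python sketch below is a minimal illustration of that response envelope, not Amazon's own code; the intent name and reply text are hypothetical.

```python
import json


def handle_intent(request_type, intent_name=None):
    """Return a minimal Alexa-style JSON response for an incoming request.

    A real skill would be registered with Amazon and invoked by the Alexa
    service; this sketch only shows the shape of the response envelope.
    """
    if request_type == "LaunchRequest":
        text = "Welcome to the demo skill. What would you like to do?"
        end_session = False  # keep the session open for a follow-up
    elif intent_name == "HelloIntent":  # hypothetical custom intent
        text = "Hello from the demo skill!"
        end_session = True
    else:
        text = "Sorry, I did not understand that."
        end_session = True
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


# The JSON Alexa would speak when the skill is launched:
print(json.dumps(handle_intent("LaunchRequest"), indent=2))
```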

Amazon claims that Alexa for Business can help professionals manage their schedules, keep track of tasks, and set reminders. When integrated into devices such as meeting consoles, the application can control conference room settings by voice. Alexa-enabled devices can also act as audio conferencing devices in smaller conference rooms, or control equipment in larger rooms.

Logitech built Alexa into its Harmony remote units to control home entertainment systems and smart home devices. The remote units are activated when customers say simple commands such as “Alexa, turn on the TV,” or, “Alexa, play a DVD.” Alexa then sends the request to Harmony, which relays the request to the home devices via infrared, Bluetooth or IP.

According to Amazon, the prototype team consisted of one senior software architect at Logitech, who took two hours to integrate Alexa into Harmony. Once the prototype was ready, teams from across Logitech prepared the skill for launch. Taking the skill from prototype to production reportedly took less than two weeks. No other details or numbers were provided in this case study.

Other products that integrate Alexa include Alarm.com, Ecobee and Haiku Home.

On a more basic level, Amazon also offers Transcribe, an automatic speech recognition (ASR) service that enables developers to add speech-to-text capability to their applications. Once the voice capability is integrated into an application, end users can submit audio files and receive a text file of the transcribed speech in return.
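At the code level, Transcribe is driven through the AWS SDK. The Python sketch below shows roughly how a developer might submit an audio file for transcription with boto3; the S3 path and job name are hypothetical placeholders, and the actual call requires configured AWS credentials.

```python
import uuid


def build_transcription_job(media_uri, language_code="en-US", media_format="mp3"):
    """Assemble the parameter dict for Transcribe's StartTranscriptionJob call."""
    return {
        "TranscriptionJobName": "demo-job-" + uuid.uuid4().hex[:8],
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": media_format,
        "LanguageCode": language_code,
    }


def transcribe_file(media_uri):
    """Submit a transcription job (requires AWS credentials and boto3)."""
    import boto3  # third-party AWS SDK, imported lazily so the helper above stays standalone

    client = boto3.client("transcribe")
    client.start_transcription_job(**build_transcription_job(media_uri))
    # When the job completes, it exposes a TranscriptFileUri pointing at the text output.


# Hypothetical S3 location of the audio to transcribe:
params = build_transcription_job("s3://example-bucket/support-call.mp3")
```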

Hassan Sawaf is the Director for Artificial Intelligence at Amazon Web Services, where he leads the building of service and technology initiatives related to human language technology and machine learning. He earned his doctorate in computer science, focusing on speech and language processing, from the RWTH Aachen University in Germany.

Google Home and Assistant

Google Assistant is Google’s voice-activated virtual assistant whose skills include tasks such as sending and requesting payments via Google Pay, or troubleshooting its Pixel 2 XL phone.

Assistant is available on devices such as Android or iOS phones, smart watches, Pixelbook laptops, Android smart TVs/displays and Android auto-enabled cars. Users can also type commands to Assistant when quiet is needed in places like libraries.

For children and families, the Google Assistant offers 50 voice-related games. For example, children can command Assistant to play space trivia with them.

Google and Target have also partnered to enable shoppers to buy products through Assistant.

Google’s line of smart speakers carrying Assistant includes the Home. Google claims that the speaker works with more than 5,000 smart home devices (such as coffee machines, lights, and thermostats) from more than 150 brands, including Sony, Philips, LG and Toshiba.

In Q1 of 2018, Google reportedly sold 3.2 million of its Home and Home Mini devices, outperforming Alexa-powered Echo devices at 2.5 million. Neither company has released official figures.

To make Assistant more ubiquitous, Google has opened its software development kit through Actions, which allows developers to build voice control into their own AI-enabled products.

The 3-minute video below shows how developers can create custom device actions with the Google Assistant interface and allow users to interact with devices using their voice.

Google also recently launched the Assistant Investments program, which invests in startups working to advance voice and assistance technologies, whether in hardware or software, and focused on the travel, games, or hospitality industries.

Under the program, Google will provide support from its technical, business development, and product leads. The startups will also receive first access to Assistant’s new features and programs; credits for Google products, including Google Cloud; and potential co-marketing opportunities, according to Google.

One company that has enlisted into this program is Botsociety, which designs chat applications using Google Assistant, Facebook Messenger and Slack.

Botsociety does not feature case studies on its website, but posts testimonials from Microsoft, Hubspot, Finn.ai, Convrg, and Black Ops, which the company claims as clients.

Botsociety also claims to serve AXA, Accenture and PWC.

Aside from Botsociety, other startups in this program are Go Moment, Edwin, and Pulse Labs.

Another of Google’s speech recognition products is the AI-driven Cloud Speech-to-Text tool, which enables developers to convert audio to text through deep learning neural network algorithms. Working in 120 languages, the tool enables voice command-and-control, transcribes audio from call centers, and processes real-time streaming or pre-recorded audio.

The 3-minute video below shows how developers can create voice commands. The first step is to record audio and create a request for the Speech-to-Text application programming interface (API) in JavaScript Object Notation (JSON) format. The developer then sends the JSON request to the Speech API and awaits the response.
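The request-building step described above can be sketched in Python using only the standard library. The body shape follows the v1 `speech:recognize` method of the Cloud Speech-to-Text REST API; the encoding settings and the bearer-token handling are simplifying assumptions rather than a full integration.

```python
import base64
import json


def build_recognize_request(audio_bytes, language_code="en-US", sample_rate_hertz=16000):
    """Build the JSON body for Cloud Speech-to-Text's speech:recognize method."""
    return {
        "config": {
            "encoding": "LINEAR16",  # raw 16-bit PCM audio
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
        },
        "audio": {
            # Short clips are sent inline as base64; longer audio can use a
            # Cloud Storage URI instead.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


def recognize(audio_bytes, access_token):
    """POST the JSON request to the Speech API (requires a valid OAuth token)."""
    import urllib.request

    req = urllib.request.Request(
        "https://speech.googleapis.com/v1/speech:recognize",
        data=json.dumps(build_recognize_request(audio_bytes)).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # response contains the transcript alternatives
```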

Ashwin Ram is the Technical Director of AI at Google. Prior to Google, he worked as an Adjunct Professor at the College of Computing, Georgia Institute of Technology for six years. He also served as senior manager for Alexa AI at Amazon for two years. Ashwin holds a doctorate in computer science from Yale University.

Cortana by Microsoft

Microsoft has also released its own voice-activated virtual assistant, Cortana, which debuted on a dedicated smart speaker in October 2017.

The Cortana home speaker and mobile device application gives a user reminders; keeps notes and lists; and can help manage a calendar, according to Microsoft. It is downloadable from the Apple Store and Google Play and can run on personal computers, smart speakers and mobile phones.

On Microsoft’s home speaker, the Invoke, Cortana is programmed to help users voice-control music: queue playlists, turn the volume up or down, and stop or start tracks. However, it does not support major music streaming services other than Spotify. Microsoft says the smart speaker also answers various questions, makes and receives Skype calls, and checks the latest news and weather.

On PC, Microsoft claims Cortana can manage the user’s emails across Office 365, Outlook.com and Gmail accounts. Cortana clients, or skill partners, include Domino’s, Spotify, Capital One, Philips and FitBit, according to Microsoft.

As a skill example, users could use Cortana to connect with Domino’s Pizza to place an order, reorder their most recent Domino’s order, and track their orders using Domino’s Tracker. Users can authorize the skill by signing into or signing up for a Domino’s profile.

Capital One says its users can also manage their account from a Cortana speaker. To use this feature, Capital One customers must connect their accounts by clicking “Connect” on the Capital One app interface within the Cortana web or mobile platform. Once they accept the terms and conditions, they are prompted to enter their Capital One username and password.

As explained in the 55-minute video below, developers looking to create new Cortana skills for business must first set up the development environment: cloud resources, development tools on their computers, an Android or iOS mobile device or Harman Kardon Invoke speaker, and the Cortana application itself.

A partnership between Cortana and Alexa is underway, allowing Amazon’s smart speakers to access Microsoft’s Office suite with the help of Cortana. Conversely, Microsoft says users will have access to Alexa’s vast skills and intelligence, and will be able to shop on Amazon. A launch date has not yet been announced.

The 4-minute video below demonstrates the integration of Cortana and Alexa in one device. To navigate between the two technologies, the user must say the virtual assistant’s name and then voice the skill. Alexa can be asked to activate Cortana, and vice versa.

At the core of Microsoft’s speech recognition technology is the Speech to Text interface, which transcribes audio streams into text. This is the same technology that powers Cortana, Office, and other Microsoft products. Microsoft says the service recognizes the end of speech and offers formatting options, including capitalization and punctuation, as well as language translation.
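For developers, the Speech to Text service can also be called over REST. The Python sketch below builds such a request; the region and subscription key are placeholders, and the endpoint layout follows the Azure Speech service's short-audio REST API as we understand it, so treat it as an illustration rather than a definitive integration.

```python
def build_stt_request(region, subscription_key, language="en-US"):
    """Build the URL and headers for Azure's short-audio speech-to-text REST call."""
    url = (
        f"https://{region}.stt.speech.microsoft.com/"
        f"speech/recognition/conversation/cognitiveservices/v1"
        f"?language={language}&format=detailed"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers


def recognize_wav(wav_bytes, region, subscription_key):
    """POST a short WAV clip and return the parsed JSON result."""
    import json
    import urllib.request

    url, headers = build_stt_request(region, subscription_key)
    req = urllib.request.Request(url, data=wav_bytes, headers=headers)
    with urllib.request.urlopen(req) as resp:
        # The detailed format includes DisplayText with punctuation and
        # capitalization applied, as described above.
        return json.load(resp)


# Placeholder values; a real call needs an Azure Speech resource:
url, headers = build_stt_request("westus", "YOUR_SUBSCRIPTION_KEY")
```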

Harry Shum, the Executive VP of Artificial Intelligence and Research at Microsoft, leads the company’s overall AI strategy and initiatives for Cortana and Bing. He received his PhD in Robotics from the Carnegie Mellon University School of Computer Science.

Mobile Device Applications

Siri by Apple

When Apple first integrated Siri into the iPhone 4 in 2011, the virtual assistant connected to a host of web services and offered voice-powered capabilities such as ordering taxis through TaxiMagic, pulling up concert details from StubHub, looking for movie reviews from Rotten Tomatoes, or sifting through restaurant data from Yelp.

Today, Siri’s capabilities include translating, playing a song, booking rides and transferring funds between bank accounts. Because of its machine learning capabilities, it can be programmed with new commands, according to Apple.

While Siri was launched ahead of Google Assistant and Amazon Alexa, there have still been concerns about its accuracy when responding to commands or questions compared to the other technologies on the market.

In a 2-minute video, Cnet.com tested Siri against Google Assistant and Amazon’s Alexa. At one point, Alexa responds more accurately and specifically to a command. In our research, we also found much longer video reviews that show Siri falling behind with accurate responses to questions asked to all three voice technologies.

In June 2018, Apple released changes to Siri, launching a new dedicated Shortcuts app that users can download. With these changes, Apple claims users can command Siri to perform more actions through voice command, text or tap. It is currently available on the iPhone, iPad, Apple Watch and HomePod. The actions include connecting with and activating third-party application functionalities, such as the Tile application to find keys, or obtaining travel information from the Kayak app.

Apple says that users can also use Shortcuts to remotely activate or control smart home gadgets like thermostats and fans, or save a podcast or radio station. Siri could also be asked by a user to inform family members when they are traveling and how long the trip will take, according to Apple.

The 2-minute video below demonstrates how a user can create a playlist shortcut with Siri.

According to the video, Siri asks the user to configure the parameters of the playlist shortcut. This could involve asking Siri to incorporate recently played music or a genre. The app also asks the user to further customize other settings such as the icon that will appear on the home screen. The user begins creating this shortcut by giving Siri a verbal command, such as “Make me a playlist.”

Siri Shortcuts is said to be able to read the user’s contextual data, such as calendar events and GPS locations, in order to offer new shortcuts. For instance, with one shortcut, Siri can be asked to go into Do Not Disturb mode if the user schedules a movie on a certain date; the time and location data determine that the user is indeed inside the theater. Another example is a reported shortcut that can notify a contact that the user is running late, based on a calendar event and the device’s location.

Third-party developers can create and integrate shortcuts into their own applications through SiriKit. Some have already created websites where the shortcuts they build can be shared with other users.

Other companies have used Siri for their own businesses. One is ClaraLabs, which paid Apple for Clara, a rebranded version of the Siri virtual assistant technology.

ClaraLabs management realized it took them over 9 hours and an average of 135 sent emails to schedule and reschedule 27 meetings among themselves and their recruiters, across a total of 18 employee schedules. The company says it sought Apple’s help to build a virtual assistant that could schedule interviews for recruiters and meetings with company stakeholders through simple voice commands.

In a ClaraLabs blog post, Head of Revenue at ClaraLabs Briana Burgess claims that Clara helped her company set 27 meetings with 14 companies, which nearly eliminated the 9 hours of writing and sending scheduling emails.

Other businesses that use Siri include Kasisto and DigitalGenius.

John Giannandrea is Chief of Machine Learning and AI Strategy at Apple, where he leads advancements in Core ML and Siri technologies. Prior to this, he was senior vice president at Google for eight years where he led the machine intelligence, research and search teams. He earned his Bachelor of Science with Honors in Computer Science from the University of Strathclyde in Scotland, where he was awarded a Doctorate Honoris Causa.

Facebook Speech Recognition Projects

While Facebook has expanded and refined its facial recognition capabilities, it also purchased Wit.ai, a company that offers a natural language development tool, in 2015.

At the time of the acquisition, Wit.ai was a 16-month-old startup. Since the acquisition, Wit.ai claims its speech recognition technology has been used by 160,000 developers and integrated into mobile applications, robots, wearable devices and smart home appliances such as thermostats, refrigerators and lighting.
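Wit.ai itself is exposed as a simple HTTP API: an application sends an utterance along with a server access token and receives the detected intents and entities back as JSON. The Python sketch below shows the text `/message` endpoint; the token and API version date are placeholders, not credentials or values from this article.

```python
import urllib.parse


def build_wit_message_url(utterance, api_version="20180618"):
    """Build the GET URL for Wit.ai's /message endpoint.

    The 'v' parameter pins the API version by date; the value here is a
    placeholder assumption.
    """
    query = urllib.parse.urlencode({"v": api_version, "q": utterance})
    return "https://api.wit.ai/message?" + query


def parse_utterance(utterance, server_token):
    """Call Wit.ai and return its intent/entity JSON (requires a valid token)."""
    import json
    import urllib.request

    req = urllib.request.Request(
        build_wit_message_url(utterance),
        headers={"Authorization": "Bearer " + server_token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# The URL a smart home app might request for a voice command:
url = build_wit_message_url("turn on the living room lights")
```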

The video below demonstrates how Wit.ai speech recognition is integrated into the Nao robot in combination with the Choregraphe program, which allows developers to create animations, behaviors and dialogs. According to the video, Wit.ai enables the Nao robot to perform tasks such as walking, shaking hands and scheduling through voice commands.

The company claims in a blog post that the platform will remain open, which potentially indicates that Facebook is keen on widespread adoption.

Facebook today has the capability to automatically caption video ads through speech recognition. The video below explains that adding subtitles to video ads enables Facebook users to see the topic of the ad as they scroll down the newsfeed. Facebook advertisers can add the subtitles by going to Power Editor and choosing “generate automatically” as instructed.

Facebook also acquired Oculus, a virtual reality headset maker, for $2 billion in 2014. In March 2017, Oculus announced that it had integrated voice and speech recognition into its headset to enable users to easily navigate virtual reality. The application, available in English on Rift and Gear VR headsets, allows wearers to conduct voice searches from Oculus Home to navigate games, apps, and experiences.

The video below shows how the Oculus headset user speaks voice commands, starting with “Hey Oculus” and stating simple instructions such as “find”, “cancel”, “launch”, etc.

Facebook hired Yann LeCun from New York University in 2013 to lead the Facebook Artificial Intelligence Research group. At NYU, LeCun researched and taught machine learning, AI, data science, computer vision, robotics, computational neuroscience, and knowledge extraction from data for 15 years.

Concluding Thoughts

The $55-billion voice recognition industry has been forecast to grow at a rate of 11 percent from 2016 to 2024.  

The technology has found good use in other industries among the smaller and lesser-known firms, in the form of transcription applications. Currently in healthcare, medical professionals use speech to text transcription applications such as Dolbey to create electronic medical records for patients.

In the law enforcement and legal sectors, where accurate and quick documentation is a critical need, companies such as Nuance provide transcription applications; transcription is also used to document incident reports. In media, journalists use transcription applications such as Recordly to record and transcribe information in aid of more accurate news reports. In education, Sonix helps researchers transcribe their qualitative interviews.

Among the five leading technology companies offering speech and voice recognition capabilities — Google, Amazon, Microsoft, Apple, and Facebook — the similar capabilities revolve around scheduling, reminders, managing playlists, connecting with retailers, managing emails, making food orders, and online searches.

These are all offered on mobile and personal computers, and most also on their own branded home speakers: Amazon’s Alexa on the Echo, Apple’s Siri on the HomePod, Google Assistant on Google Home, and Microsoft’s Cortana on the Invoke. Only Facebook has diverged from this trend, offering speech recognition through the Oculus virtual reality headset and subtitles on video advertisements.

Although Apple was a trailblazer in this area, Siri has proved to be less smart than Amazon’s Alexa and Google Assistant, with limited features compared to the others. In terms of general knowledge, a study consisting of nearly 5,000 questions showed that Google Assistant is the smartest among the four applications.

However, in terms of skills, a separate report showed Alexa having the most skills at 25,785, followed by Google Assistant at 1,719 and Cortana at 235. Siri was not included in this report. The growing number of skills could be attributed to these companies offering business versions of their applications. Software development kits (SDKs) have been made available to developers, enabling startups and small businesses to build customized skills for their clients.

Here is a rundown of how we believe the companies are competing in the industry based on our research:

  • Google set up the Assistant Investments program to fund startups, with the aim of advancing speech and voice recognition technology.
  • Facebook hired an AI industry expert and acquired several speech recognition startups.
  • Microsoft partnered with Amazon to potentially strengthen the chances of Cortana’s survival.

 

Header image credit: Szifon
