Hey Car, I’m Talking to You

There is this idea that speaking to your car, having your car understand you, and having the car correctly do your bidding should just be. It’s an assumed “isness”, a current state, not just the future but the now. Many people are genuinely vexed that their vehicle is not KITT.

I have never seen Knight Rider. It is not part of my assumed future.

I am much more likely to watch the movie Her. But I get it. Voice seems like a natural input and output modality, especially when you need to be doing something else with your eyes and your hands, like accurately steering a 3,190 pound vehicle around that annoying Google autonomous SUV that insists on not speeding down El Camino Real despite the fact that the light is green and 35 miles per hour on a four lane road is a stupid law to follow at 3:30 pm when traffic is especially light for a Tuesday. But that may just be my problem.

I’ve read many studies and articles that discuss the problem with voice as a modality for the vehicle. These studies often state that the largest barrier is not voice itself or its lack of robustness as a technology (more on that in a minute), but the heavy cognitive load required to speak to your car. Ultimately, no matter how good the voice system or digital personal assistant, your brain cannot do too many things all at once.

Here is the rub. People do not believe that they can only drive and do nothing else.

The evidence is overwhelming. People say one thing, but act differently. People drink coffee while driving, change the radio station, take off their jacket, and speak on the phone. I’ve seen a man shave while driving and once I saw a woman eat a large bowl of noodles with chopsticks while driving. The majority of people actually think they are good and safe drivers and are perfectly capable of doing other things while driving, especially if the task does not require them to take their hands off the steering wheel or eyes off the road for an extended period of time. This is where the promise of voice-enabled assistance systems comes into play and where people expect voice recognition to be of help. Not just to be of help, but to actually be safer, to save them. And because people are crazy behind the wheel, yours truly included, those of us in the automotive industry owe it to the public to do as much as we possibly can to make driving safer.

Sadly, voice recognition in the vehicle has mostly failed to simply work. According to J.D. Power, it remains one of the most hated features, ahead of Bluetooth pairing of phones. Voice recognition systems fail to recognize the request, are often unresponsive and slow, and all too often do not accomplish the task requested.

I do believe that despite any cognitive load constraint, voice-enabled assistance systems, or the term I prefer, intelligent personal assistance systems (IPAs) should and can make driving safer for everyone. And I do believe we are at a point where the technology is finally getting good enough to make a difference, but it is not easy.

Let me break down how I view an IPA system and why it’s so damn hard. I’m sure many better-versed peoples can, should, and hopefully will correct any errors that follow.

IPAs are essentially a subset of artificial intelligence and should be an elegant and effective way for humans and machines, including vehicles, to interact. IPAs should increase the capability of people to approach a complex problem or situation, gain comprehension to suit the particular need, and derive solutions to the problem at hand. They should be helpful and involve meaningful delegation.

At a high level, an IPA is an integrated system composed of four primary parts:

  1. Speech recognizer
  2. Text to speech engine
  3. Dialogue manager
  4. Information retrieval system

Sophisticated IPAs also include Natural Language Understanding/Processing (NLU) or Conversational Systems. NLU systems make allowances for the way people actually speak: pauses, filler utterances such as ummms and ahhhs, and the variety of ways to express the same intent. NLUs combined with dialogue managers also track the context of the conversation.

Here is a rudimentary list of components:

Component: Function

  • Speech recognizer (Automatic Speech Recognition, ASR): Converts sound to text
  • Text to speech engine: Converts written text to audio sounds
  • Dialogue manager: Logic to keep track of context, state of conversation, and sentence structure
  • Information retrieval system: A process for obtaining information resources relevant to an information need from a collection of information resources
  • Natural language processing: Machine reading comprehension that handles occurrences of unknown and unexpected features in the input
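To make the hand-offs between these components a little more concrete, here is a minimal sketch in Python. Everything in it is a stand-in I made up for illustration: the function names, the tiny filler list, and the single "navigate" intent are assumptions, and the "audio" is just a transcript string rather than real sound. A production system would put real ASR, NLU, and TTS engines behind each step.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical filler words the NLU step ignores (illustrative only).
FILLERS = {"um", "umm", "uh", "ah", "ahh"}


@dataclass
class DialogueManager:
    """Keeps the running context of the conversation across turns."""
    context: Dict[str, str] = field(default_factory=dict)

    def update(self, intent: str, slots: Dict[str, str]) -> Dict[str, str]:
        # Merge new slots into what earlier turns established.
        self.context.update(slots)
        self.context["last_intent"] = intent
        return dict(self.context)


def recognize_speech(audio: str) -> str:
    """Stand-in for ASR: the 'audio' here is already a transcript."""
    return audio.lower()


def understand(text: str) -> Tuple[str, Dict[str, str]]:
    """Stand-in for NLU: drop fillers, then look for one known intent."""
    words = [w.strip(",.?!") for w in text.split()]
    words = [w for w in words if w and w not in FILLERS]
    if "navigate" in words and "to" in words:
        destination = " ".join(words[words.index("to") + 1:])
        return "navigate", {"destination": destination}
    return "unknown", {}


def retrieve(state: Dict[str, str]) -> str:
    """Stand-in for information retrieval: answer from the dialogue state."""
    if state.get("last_intent") == "navigate" and "destination" in state:
        return f"Route found to {state['destination']}"
    return "Sorry, I did not understand that"


def speak(text: str) -> str:
    """Stand-in for TTS: returns the text that would be synthesized."""
    return text


def handle_turn(audio: str, dm: DialogueManager) -> str:
    """One conversational turn: ASR -> NLU -> dialogue manager -> IR -> TTS."""
    text = recognize_speech(audio)
    intent, slots = understand(text)
    state = dm.update(intent, slots)
    return speak(retrieve(state))
```

Calling `handle_turn("Umm, navigate to, uh, El Camino Real", DialogueManager())` walks the utterance through every stage, with the fillers stripped before the intent is matched.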

The most popular and well-known consumer-facing IPAs are mobile applications or parts of mobile phone operating systems, including Apple’s Siri and Google Now.

There are several ways IPAs work, and the distinctions lie predominantly in where speech recognition and information retrieval happen: 1) entirely native on a device, 2) entirely off-board, or 3) a hybrid of the two.

Speed and responsiveness are of the essence. An IPA loses its efficacy if it takes too long. Speech recognition can be very resource heavy and if the information/solution sought resides on the Internet, a hybrid approach is ideal. IPAs can exist as a fully embedded system that retrieves information/solutions inside a closed system. With enough computing resources and no need to update the system for grammar, languages, commands, controls, etc., a fully embedded system can be effective.

Type of System: Pros and Cons

Entirely embedded (native on the device)
  Pros:
  • Response time likely fast with enough dedicated computing resources
  Cons:
  • Requires significant computing resources
  • Not easily updatable
  • Limited features and functionality

Entirely off-board
  Pros:
  • Ability to stay current – software and information
  • The collective learning potential
  Cons:
  • Reliance on communication network of any kind (cellular, WiFi, satellite) will introduce latency and times when the feature will not function

Hybrid
  Pros:
  • Abstracting computational power and content/resources can make the feature efficient, current, and future proof
  • Potentially lower power consumption
  Cons:
  • Reliance on communication network of any kind (cellular, WiFi, satellite) will introduce latency and times when the features relying on connectivity will not function
  • The performance of the system can be inconsistent (flexible and dynamic computing is challenging and nascent)
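One way to picture the hybrid trade-off is a rough sketch like the one below: a small on-board grammar answers the simple commands immediately, and anything else goes off-board with a deadline so a slow or missing network does not leave the driver waiting. The command table, the half-second timeout, and the two fake cloud functions are all assumptions of mine for illustration, not how any particular product works.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional, Tuple

# Hypothetical fixed grammar that lives on the device.
EMBEDDED_COMMANDS = {"call home": "CALL_HOME", "next track": "NEXT_TRACK"}


def embedded_recognize(utterance: str) -> Optional[str]:
    """On-board lookup: fast, but limited to the commands shipped with the car."""
    return EMBEDDED_COMMANDS.get(utterance.lower().strip())


def hybrid_recognize(
    utterance: str,
    cloud_fn: Callable[[str], str],
    timeout_s: float = 0.5,
) -> Tuple[Optional[str], str]:
    """Try the embedded grammar first, then the off-board service with a deadline."""
    local = embedded_recognize(utterance)
    if local is not None:
        return local, "embedded"

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud_fn, utterance)
    try:
        return future.result(timeout=timeout_s), "off-board"
    except (FutureTimeout, ConnectionError):
        # Network too slow or down: report failure instead of hanging.
        return None, "unavailable"
    finally:
        pool.shutdown(wait=False)


def fake_fast_cloud(utterance: str) -> str:
    """Simulated off-board service that answers immediately."""
    return "FIND_COFFEE"


def fake_slow_cloud(utterance: str) -> str:
    """Simulated off-board service on a bad connection."""
    time.sleep(0.2)
    return "FIND_COFFEE"
```

With these stand-ins, "Call home" never touches the network, "find coffee nearby" is answered off-board, and the same request against `fake_slow_cloud` with a 0.05 second timeout comes back as unavailable, which is exactly the inconsistency listed under the hybrid cons.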

While there are many obvious use cases for IPA technology to be deployed in the vehicle, more breakthroughs in speech and artificial intelligence technologies are needed and I think are on the way.

According to SRI International, a company involved heavily in developing IPA technologies:

…“future IPA conversations will be about more complex tasks with multiple steps and more nuances. This next wave of IPAs will be also able to maintain the context of the conversation for long periods of time, reason with clarity about what you discuss, provide answers to your questions, execute tasks for you, and all along the way learn from you and noticeably improve with use. The experience will be more personalized that what you experience with Siri today, and it will have greater depth. IPA’s will also be more proactive, constantly discovering things that you might care about and even starting conversations with you about what they find.” (emphasis added)

In my simpleton’s mind, the ultimate test for any IPA is simple:

  1. Did the system understand me
  2. Did it perform correctly
  3. And was it fast

Once the answer to these three questions is consistently yes for whoever is using the system, then you have a winner.
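If you wanted to keep score on those three questions, a back-of-the-envelope version might look like this sketch. The `Trial` record and the two second cutoff for "fast" are my own assumptions; pick whatever threshold your users actually tolerate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One spoken request and how the system handled it (hypothetical record)."""
    understood: bool           # 1. Did the system understand me?
    performed_correctly: bool  # 2. Did it perform correctly?
    latency_s: float           # 3. How long did it take, in seconds?


def passes(trial: Trial, max_latency_s: float = 2.0) -> bool:
    """A trial passes only if all three questions are answered yes."""
    return (
        trial.understood
        and trial.performed_correctly
        and trial.latency_s <= max_latency_s
    )


def pass_rate(trials: List[Trial], max_latency_s: float = 2.0) -> float:
    """Fraction of trials that pass all three checks (0.0 if there are none)."""
    if not trials:
        return 0.0
    return sum(passes(t, max_latency_s) for t in trials) / len(trials)
```

A system that understands every request but botches one task and dawdles on another would score well below 100 percent here, which is the point: all three have to be yes, consistently, for whoever is speaking.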

I’ve always felt there was one more hurdle.

Until a few months ago, you would have heard me say something like, IPAs fail because people feel awkward speaking to a machine. It’s just too strange, too foreign, and those people who don’t live in Silicon Valley are too self-conscious to be okay speaking with any machine on a regular basis.

Something has happened in the years since I helped bring NLU systems into vehicles (with Voice Box Technologies); the systems have gotten much, much better, more widespread, and I think are finally at, dare I say it, a Tipping Point* and being accepted and used by your average Joe and/or Jane.

I was visiting a dear high school friend in Washington, DC recently. My friend is brilliant, successful, beautiful, and could likely drink Winston Churchill under the table were he still alive. She widely uses technology, but is not “technological”. She doesn’t give a rat’s ass how it works; she just wants it to work and if it doesn’t, she won’t use it. As we were walking down the street, she took out her iPhone and began dictating a text to her wife. I was floored and shocked and amazed. I asked her how often she uses the voice rec feature on her phone and she said, “all the time.” All the time! No awkward, too self-conscious, can’t talk to machines hang-ups going on there! After that encounter, I started asking around and engaging in a small, personal ethnographic study. I noticed A LOT of people using the IPA feature on their phone. I am beginning to think the majority of people are willing to engage with machines using their voice. If anyone has a formal study, please let me know. I really hope I am not wrong about this one.

I’ve also noticed that there are a lot more companies in the business of offering voice recognition systems and selling into automotive companies. After years of VR-type companies and product lines being shed and shuttered, new startups are coming out of the woodwork and older companies who were on pause or life support are getting restarted and downright energized. We may not be stuck with just Nuance! Apologies to my Nuance peeps, it’s just, well, Nuance is so big, owns so much of the market, buys everything in sight, and yet VR in cars still sucks. Car companies need choice. The IPA market needs competition to foster innovation, to drive down prices, to become more widespread and better. I hope some of these smaller companies can survive and become huge successes. Being successful in the automotive industry takes stamina and stamina takes money and IPA can be a money sink given the complexity.

Anyhoo, I’m not putting on my economist hat at this particular moment. It just seems like the conditions are finally right – better technology, lower costs, greater customer acceptance, overwhelming need – for really great voice-enabled intelligent assistance systems to become widespread in vehicles. I am extremely hopeful we are finally approaching a time when VR actually makes driving a car safer and a better experience for everyone. It would be really great if the promise of voice recognition finally came to fruition.

* Credit here obviously goes to Malcolm Gladwell and yes, I am quoting a business book I’ve never read and will never read because given my limited time on this earth, I’m going to read Willa Cather’s My Antonia and reread Tolkien’s Lord of the Rings trilogy, and eventually make it through everything Joan Didion’s ever written. Yes, you folks that know me and have heard me say I don’t read business books can mock me as the irony is not lost and yes, I deserve it.