"Hi, I'm Simon! How may I help you today?"
"Er… we're calling about a lost bag on the United flight from London to New Orleans…"
"I'm sorry. I didn't understand that. Are you calling about a reservation? If so, please say YES."
"Simon is happy to help. Are you calling about lost baggage?"
"I understand you are calling about lost baggage. Please tell us where you are."
"New Orleans, Louisiana."
"I'm sorry, I didn't understand that. Please say the state clearly..."
This could have gone on forever, but after three such calls we had finally figured out how to beat the system. We would switch to loud, abusive Hindi, and after three failed attempts the United Airlines toll-free voice-response system would give up and transfer us to a human being. Amazingly, that person was sitting on the other side of the planet, back home in Gurgaon, and actually understood us and helped out!
That was 2000 AD. Fast forward to 2009.
"Engage Autopilot on Mach One point five, altitude six thousand," said Flt Lt Mathews. "Roger, autopilot on Mach One...," the cockpit voice response system repeated, and the RAF Eurofighter Typhoon climbed up and then levelled off six kilometres above sea level, cruising at one and a half times the speed of sound.
"Caperberry Bangalore," I said, touching a button on my phone. Six seconds later, a name, address and location map popped up on my phone screen, along with the number 080-2559-4567. One click, and it dialled the number. Half a minute later, I'd booked a table for two at this tony downtown tapas lounge I hadn't heard of until that day.
We were both doing the same thing. Okay, there were a few minor differences. The RAF fighter pilot was talking to a $90 million delta-wing multi-role aircraft, and I was in a Meru cab talking to a piece of free software from Google on a Rs 15,000 mobile handset. But we were both using voice recognition that actually worked, and that rapidly gave us usable information or let us command a system that responded.
LESS THAN MAINSTREAM
Speech recognition already has quite a few real-world uses where it works well. These are mostly situations with a small vocabulary of very well-defined, structured speech, as in the Eurofighter example (though very few aircraft have yet implemented voice assist for flight operations). Another area is training for jobs where clear speech and response are central, such as air traffic control. There, speech recognition and synthesis play a key role: a simulator that can hear and answer a trainee avoids the need for a full-time "actor" to converse with every one of them.
It might seem odd that the most obvious mainstream applications are exactly where speech recognition hasn't taken off yet. Take dictation and transcription.
You'd think one could simply speak to a computer and it would transcribe whatever you said into a neat little document, so that you never had to type anything. In practice, speech recognition, even decades after it was first developed, hasn't evolved to the point where normal (and widely varying) speech from different people with different accents can be transcribed reliably.
In fact, you more often hear the term 'voice recognition', which refers to a system trained to a particular speaker, as in most PC voice recognition software. Such a system works best with one voice, that of the speaker who trained it, just as the Eurofighter's system is trained by a particular pilot. Speech recognition is broader and describes systems that can recognise anyone's speech, such as a call centre system. These have improved a lot since my Simon experience nearly a decade ago, but they are still limited in vocabulary and fussy about pronunciation.
Voice synthesis, on the other hand, is thoroughly mainstream. Lots of systems convert text to speech, probably including your mobile phone (many have an option to read SMS messages and menus aloud). You probably use Acrobat Reader to view PDF files; click on View and then Read Out Loud to have the document read to you in a moronic monotone. The Kindle reads out the text of books you load onto it.
My favourite voice-command system is Google's Mobile App (get it from m.google.com on your data-enabled phone). It lets you speak your search terms, and it recognises them rather well if you speak clearly and there isn't much background noise. It also integrates Gmail, Maps, Picasa and more into one simple application on your home screen.
The author is chief editor and green evangelist at CyberMedia, publisher of 15 specialty titles and sites such as LD2.in. email@example.com, twitter.com/prasanto