Sunday, June 26, 2005

Voice Applications – A revolution in waiting.

Consider these interesting statistics: PC shipments in 2005 are expected to reach (just) 199 million, as compared to 625 million mobile phones for the same period. According to Gartner, there will be 2.5 billion mobile users in the world by 2009, far outstripping the reach of the personal computer. Interestingly enough, a mobile phone is a tool whose primary interface is voice, as against text in a PC. Hence it is quite natural to deduce that the man-machine interface of the future will be voice. (I do believe that the final destination will be ‘thoughts’, but that will probably take another 7-10 years.)

If ‘voice’ is the next wave, then why is it taking so much time to become pervasive? To me the reasons are both technological and social. On the technological side, the accuracy of ‘Speech Recognition’ engines has improved more slowly than expected. You might think that 99% accuracy was achieved by many products a few years back, but considering the variety and variations among people speaking a specific language, these accuracies were limited to the small percentage of people who speak a perfect version of the language (if anything like that exists). Essentially, the accuracy claims by vendors were no different from the fuel efficiencies promised by automobile vendors under ideal conditions.

On the social front, personal computers did not provide enough privacy for one to interact with one's system or network using voice technologies. Of course you do not want to talk to your banking application with the whole world hearing it. Neither do you want to disturb everyone around you when you attempt to browse the web. It is in this social context that privacy-friendly telephonic instruments (mobile and land line) assume greater significance in enabling voice to be the primary man-machine interface. The growth of speech technologies like Automatic Speech Recognition (ASR), Text-to-Speech converters (TTS) and Voice XML (VXML), and of telephony technologies like Interactive Voice Response (IVR), Computer Telephony Integration (CTI) and Voice over IP (VoIP), aids this transformation in a big way.

Take the case of a typical IVR solution. You want to find the balance in your savings account. You call your bank’s tele-banking number and the voice on the other side gives you root menu options like ‘press zero for savings account, press one for checking account…’ and so on and so forth. You have to go through a series of linear menus before you arrive at the required information. (Incidentally, this is not much different from the typical linear menus in your Windows operating system, except that with a voice interface you do not have the entire menu presented in one shot and you have to stretch your grey cells to remember your menu options.) You make a mistake in one selection, and boy, you need to go through the whole rigmarole again!
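
To make the cost of such menus concrete, here is a minimal Python sketch of a touch-tone menu tree. The menu contents, the `MENU` structure and the `navigate` function are all hypothetical and only illustrate the idea; they do not describe any real bank's system. The sketch assumes that a wrong key sends the caller back to the root menu.

```python
# Hypothetical menu tree for a touch-tone IVR; all option names are illustrative.
MENU = {
    "prompt": "Press zero for savings account, press one for checking account.",
    "options": {
        "0": {
            "prompt": "Press zero for balance, press one for recent transactions.",
            "options": {
                "0": {"result": "savings balance"},
                "1": {"result": "savings transactions"},
            },
        },
        "1": {
            "prompt": "Press zero for balance, press one for recent transactions.",
            "options": {
                "0": {"result": "checking balance"},
                "1": {"result": "checking transactions"},
            },
        },
    },
}


def navigate(keys):
    """Walk the menu tree one key press at a time.

    Returns (result, prompts_heard). An invalid key sends the caller back
    to the root menu (an assumption made for this sketch).
    """
    node = MENU
    prompts_heard = 0
    for key in keys:
        prompts_heard += 1  # the caller listens to the current menu's prompt
        next_node = node["options"].get(key)
        if next_node is None:
            node = MENU  # wrong key: start over from the root menu
            continue
        if "result" in next_node:
            return next_node["result"], prompts_heard
        node = next_node
    return None, prompts_heard
```

With this tree, the key sequence "00" reaches the savings balance after two prompts, while a single slip ("0900") means listening to four prompts for the same answer.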

Now consider the alternative scenario. You call your tele-banking number and the voice on the other side prompts, ‘Hello, how may I help you today?’ You may not even realize that it is a server armed with TTS technology answering your call. You make your requirement clear by answering, ‘I want to know the balance in my savings account’. ‘Would you please give me your savings account number?’, pat comes the next prompt. After an authentication step you get your balance read out to you. In essence you get a typical manual help-desk experience from this automated help desk. This is one of the new genres of applications spawned by voice technologies.

Adea recently migrated the ordering and customer self-help functions of a very large telecom company in North America to a voice-enabled system using some of the technologies discussed above. This enables its customers to interact with the company (ordering, recording issues etc.) through voice interfaces without going through elaborate menus, saving precious time and improving the overall experience.

One of the core technologies used in this solution was VoiceXML (VXML). VXML is an open standards approach to voice applications using a combination of Internet and speech technologies. It is an XML-based language. In functionality, a web browser and a VXML interpreter are similar: a web browser renders HTML documents visually, while a VXML interpreter renders VXML documents audibly. A VXML interpreter can be considered a telephone-based voice browser. VXML documents have web URIs and can be located on any web server. There is one more important difference, though. A web browser runs locally on your machine, whereas the VXML interpreter runs remotely, at the VXML hosting site. Basically, you use your telephone to access the VXML interpreter.
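
For a feel of what a VXML document looks like, here is a minimal Python sketch that builds a one-prompt VoiceXML 2.0 document with the standard library. The `vxml`, `form`, `block` and `prompt` elements and the namespace come from the VoiceXML 2.0 specification; the `build_greeting` function and the form id are hypothetical names chosen for illustration, and this is not the document used in the solution described above.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C VoiceXML 2.0 specification.
VXML_NS = "http://www.w3.org/2001/vxml"


def build_greeting(prompt_text):
    """Return a minimal VoiceXML 2.0 document that speaks one prompt.

    The function name and the form id are illustrative only.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "greeting"})
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    prompt.text = prompt_text  # a TTS engine would speak this text
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_greeting("Hello, how may I help you today?"))
```

A document like this would be served from a web server at some URI, and the VXML interpreter at the hosting site would fetch it and play the prompt to the caller.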

VXML is very different from traditional touch-tone IVR. While not mandatory, VXML applications support speech recognition by default. And as I have stated earlier, VXML user interfaces are not limited to a series of numeric key presses. Further, the VXML specification is maintained by the W3C, not by any specific vendor, and is not tied to any proprietary telephony hardware. It can support multiple speech recognition and text-to-speech engines from different vendors.

A complementary standard to VXML is SALT, Speech Application Language Tags (many experts consider it a competing one). The two have different technical goals. VXML focuses on the development of telephony-based speech applications. SALT focuses on adding speech and telephony to web-based applications and turning them into multimodal applications. Its primary goal is to support the multimodal interactions that become possible with the new converged devices (for example, speaking to a device might generate a list of information on its display). SALT currently lacks the broad support and maturity level of VXML, but is rapidly acquiring popularity.

After VoiceXML, the next important component of the solution is the Speech Engine, which comprises a Speech Recognition Engine and a Text-To-Speech (TTS) Engine. The Speech Recognition Engine enables the system to understand what a person is saying, and the TTS Engine enables the system to speak information fed to it.

From a user’s perspective, what is critical is the Voice User Interface (VUI), akin to the Graphical User Interface in your day-to-day applications. VUI design is a very involved process covering the design of dialogs (what the system speaks as prompts), the design of the application (including ‘grammars’, the sets of words or phrases that the speech recognition engine 'listens' for in each part of the voice application) and the design of information presentation (for example, the voice used for the prompts or by the TTS converter).
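
The idea of a grammar per dialog state can be sketched in a few lines of Python. This is a deliberately simplified illustration: it matches phrases against text that has already been recognized, whereas a real recognition engine uses the grammar while decoding the audio itself. The states, phrases and the `interpret` function are all hypothetical.

```python
# Hypothetical grammars: for each dialog state, the phrases the application
# listens for and the meaning each phrase maps to.
GRAMMARS = {
    "main": {
        "balance": ["balance", "how much"],
        "transfer": ["transfer", "move money"],
    },
    "account_type": {
        "savings": ["savings"],
        "checking": ["checking", "current"],
    },
}


def interpret(state, utterance):
    """Return the meaning whose phrase occurs in the utterance, or None.

    Only the grammar of the current dialog state is consulted, so the same
    words can be accepted in one part of the dialog and ignored in another.
    """
    text = utterance.lower()
    for meaning, phrases in GRAMMARS[state].items():
        if any(phrase in text for phrase in phrases):
            return meaning
    return None
```

Here the sentence ‘I want to know the balance in my savings account’ maps to "balance" in the "main" state, while the single word ‘savings’ is not understood there because it belongs to the "account_type" grammar.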

The application resulted not only in a better customer experience but also in reduced costs and improved productivity. I would say this is just the tip of the iceberg. Voice technologies can provide much more. For example, imagine a day when an application similar to Microsoft Money® runs on your mobile phone and you update, say, your expenses by talking to the phone while making a purchase. That is what you can call instantaneous data capture (at the time of a business event) with almost no effort… and mind you, it happens on a device which you always carry with you. That day is almost here. There are a number of visionary companies worldwide, like Adea, designing and developing unique voice-based applications today. Some of these could totally change our lives soon.

To me, voice technologies are still in a nascent phase. The major difference now is that the stage is set.