Startup Saturday: giving artificial intelligence a sensible voice
Shaila Apte aims at creating better speech recognition tools by converting emotional text to speech and vice versa
Updated: Nov 17, 2018 14:38 IST
“Bandra minus Worli sealink road” is what the Google Maps audio blares out. If you are new to the city, you will wonder what “Bandra minus Worli” means. What it means is that the technology converting the text to voice is not smart enough: it has evidently read the hyphen in “Bandra-Worli” as a minus sign. And as more and more of our lives come to be dominated by artificial intelligence (AI), and as more ‘Alexas’ enter our lives, the more we will need better speech recognition tools.
That is what 65-year-old Shaila Apte aims to provide with her new startup, Anubhooti. Her company aims to make machines talk properly and convert speech to text correctly. It sounds simple, but it is far from that. Shaila says, “I have done my PhD in electronics and am a professor teaching electronics at the Rajarshi Shahu College of Engineering, Pune. Two years back I thought of starting my own company.”
Shaila had worked for a company in Delhi that dealt in personalized voice synthesis, which converts a voice into your own. This can be used for telling stories in a parent’s voice. “I thought to myself why not I start my own company? I have enough experience and see a lot of possibilities in this voice business,” said Shaila.
So she set up Anubhooti, a proprietorship firm. Having been a professor all her life, she had goodwill that she could leverage. “Whatever human resource I needed was easily available to me either from my current students who interned happily with me or from past students who were also happy to be of help.”
With ₹15 lakh of her own savings, Shaila filed for five patents and has so far been granted one. The rest are pending. “Our work involved using algorithms that can do several things. Like we can recognize four emotions that machines generally cannot. We can reproduce voice with these four emotions: angry, happy, sad and neutral,” added Shaila.
Her algorithms can now also recognize words using a much smaller data set. Says Shaila, “Others (read Google) will have a large data set of words that will be read and understood to mean a particular thing. So it is data that ratifies a word. For example, you are doing a medical transcription and the doctor says ‘pneumonia’; now this word they have got after reading from a humongous library. My software uses algorithms to understand this from a much, much smaller database.”
There are other advantages too. “Our software does not need large space and can be saved on the client’s server; it does not need a cloud. This is a huge benefit for some people who find it a more secure way to store confidential information. Also, sometimes there can be tech problems with the cloud.”
The software can convert emotional text to speech and vice versa. It can also convert long stretches of speech, such as one or two hours, to text. “Currently existing software can convert for a few minutes, but we can do it for hours.” Anubhooti also has a voice effects generator that can be used in cartoon films.
While all this is great, what remains most important is getting to the market and selling. How does Shaila plan to do that? Again, it is her student base that has come to her aid. Says Shaila, “There are more than a thousand students that I have taught. Many of the earlier batch students are now in their fifties and holding top spots in different companies. Selling to these companies is simple for me since these are my students.”
Financially, Shaila still has to get out of the red. “My efforts are still focused on doing different projects. Like we are working on one where we can synthesize any voice into the voice of a celebrity. This can be used for dubbing, given that we can get emotion into the voice. We are also working on a project on whispered speech.”
Shaila is hopeful that her software will find varied applications in fields such as the media, defence, medical transcription and even speech pathology. She does not aim to raise any funds as of now. “So far, whatever projects we have got have helped raise finance to fund our next innovation. We have managed quite well so far,” she says, and she hopes to continue in this fashion.
Projects Undertaken By Anubhooti
Voice effects generation: the software system synthesizes voice effects such as the helium effect, hexafluoride effect, cave, echo, telephone and whispered voice effects. The software can be used for synthesizing the voice of a cartoon character. The idea is patented.
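To give a sense of what the simplest of these effects involves, here is a minimal Python sketch of an echo: a delayed, quieter copy of the signal is mixed back into the original. This is an illustration only and not Anubhooti's patented method; the function name `add_echo` and its parameters are hypothetical.

```python
def add_echo(samples, sample_rate, delay_seconds=0.25, decay=0.5):
    """Mix a delayed, attenuated copy of the signal back into itself.

    samples: list of audio sample values (floats)
    sample_rate: samples per second
    delay_seconds, decay: illustrative defaults, not values from the article
    """
    delay = int(round(delay_seconds * sample_rate))
    # Pad the output so the tail of the echo is not cut off.
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += decay * s
    return out


if __name__ == "__main__":
    # A single click followed by silence: the echo shows up two samples later.
    click = [1.0, 0.0, 0.0, 0.0]
    print(add_echo(click, sample_rate=4, delay_seconds=0.5, decay=0.5))
```

Effects such as cave or telephone would need more than this (several reflections, or filtering out low and high frequencies), so the sketch covers only the echo case.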
Synthesizing voice for a celebrity: the system uses already recorded speech of a celebrity and can synthesize any speech in that celebrity’s voice. Our product, an emotional, natural-sounding text-to-speech system with an Indian accent, is useful here. It uses limited storage and hence can reside on your laptop.
Personalized voice synthesis using phoneme concatenation method: We can synthesize speech in a person’s voice from recordings of about 250 words. The work is patented.
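The general idea of phoneme concatenation can be sketched in a few lines of Python: short recorded units are looked up by phoneme and joined end to end, with a brief crossfade so the joins are less audible. This is a generic illustration under assumed names (`synthesize`, `unit_bank`, `crossfade`), not the company's patented system, and the waveforms below are toy numbers rather than real recordings.

```python
def synthesize(phonemes, unit_bank, crossfade=2):
    """Join stored phoneme waveforms, linearly crossfading at each boundary.

    phonemes: sequence of phoneme labels, e.g. ["a", "b"]
    unit_bank: dict mapping each label to a list of samples cut from the
               speaker's recordings (hypothetical data structure)
    crossfade: number of overlapping samples at each join
    """
    out = []
    for label in phonemes:
        unit = unit_bank[label]  # raises KeyError if the unit was never recorded
        n = min(crossfade, len(out), len(unit))
        start = len(out) - n
        for k in range(n):
            w = (k + 1) / (n + 1)  # weight of the incoming unit rises across the overlap
            out[start + k] = (1 - w) * out[start + k] + w * unit[k]
        out.extend(unit[n:])
    return out


if __name__ == "__main__":
    bank = {"a": [1.0, 1.0, 1.0], "b": [0.0, 0.0, 0.0]}
    print(synthesize(["a", "b"], bank, crossfade=0))  # plain concatenation
    print(synthesize(["a", "b"], bank, crossfade=2))  # smoothed join
```

A real system would also have to cut the units out of the recorded words and match pitch and loudness across joins, which this sketch leaves out.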
Emotion detection from speech (Wakonnati): We have considered four emotions: happy, angry, sad and neutral. The system was tested on the Berlin database of emotional speech. The recognition accuracy is as follows: angry 99%, happy 94%, neutral 91% and sad 81%. This is probably the highest accuracy reported on the Berlin database. Others typically report accuracy by comparing the extreme emotions, such as happy and sad. Our system is rule-based, and we have extracted the features that are more emotion-specific. The work is patented.
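For readers who want a single figure, the four per-emotion accuracies quoted above can be combined into a simple unweighted average. The company does not report this number; the short Python sketch below just does the arithmetic, and the name `macro_average` is our own. An accuracy weighted by the number of test utterances per emotion could differ, since those counts are not given.

```python
# Per-emotion recognition accuracy (percent), as quoted in the article.
reported = {"angry": 99, "happy": 94, "neutral": 91, "sad": 81}


def macro_average(per_class):
    """Unweighted mean of per-class accuracies; every class counts equally."""
    return sum(per_class.values()) / len(per_class)


if __name__ == "__main__":
    print(macro_average(reported))  # (99 + 94 + 91 + 81) / 4 = 91.25
```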
Conversion of neutral speech to emotional speech (Wagwilasini): We take the neutral utterance from the user, or the system may use already stored sentences in neutral mode. The user is asked to select the emotional effects from the following: 1) start and end effect for the sentence 2) time gap adjustment 3) effect of question mark 4) effect of exclamation mark 5) effect of fast speech 6) sad emotion 7) happy emotion 8) angry emotion 9) normal speaking style 10) context-dependent adjustment 11) lifting of the pitch. This is already patented.
Context-dependent emotional speech synthesis (Wagwikasini): The system records context-dependent spoken words from the user and stores small parts of the words. The context of each word decides which small parts are concatenated to form another word. The system synthesizes different words and generates the sentence. The emotional effects listed under Wagwilasini are added to synthesize the emotions. Special marks in the text guide the system to add a particular emotional effect to some words in the sentence. The synthesized sentence is the emotional speech required by the user. The synthesis is personalized for each user’s recorded voice. The work is patented.
Other ongoing works include speaker authentication, speech disorder classification, handwritten character recognition, writer verification, image enhancement and processing, watermarking of speech and conversion of whispered speech to normal speech.