An insight into voice-controlled devices

Alexandra Misca
May 14, 2017 · 6 min read

Grab phone, keys, wallet, laptop bag. Get out of the door. Take your seat in the car. You whisper, half-asleep, ‘start navigation’, followed by a pause. You wait for the 7-inch screen to blink awake and fire up the navigation system, then continue: ‘take me to work’. Then it strikes you: this is how much technology has evolved. Looking at the bigger picture, and factoring in its applications in areas such as medical research, civil engineering or assisting people who live with disabilities, it is clear that modern technology has massive potential as a force for good. According to this timeline, we’ve come a long way from 1950 to the technology we are using today. Siri’s great-grandmother was a program produced by IBM which could understand only 16 English words. The question is not how we got here, but what other parts of this amazing universe still remain hidden.

This type of technology, known as speech recognition, is divided into two categories: dictation and interactive speech (it is also considered a natural user interface, specifically because it provides answers to questions asked verbally). First of all, consider that a Dragon can do your work, and not by spitting fire but by spitting out words. Dragon Professional may just be the smartest pet you will ever have. Just like a parrot, it is able to identify the phrases you regularly use. Not only is this impressive in and of itself, but it can also understand your accent, ignore the noise around you, and type everything you say into your documents, emails, and even sticky notes with incredible accuracy. It can learn to recognise your abbreviations, format phone numbers the way you like them, and add the likes of underlines or italics to your text. Nuance, the company behind it, goes so far as to provide a cloud-based store for your documents, which means your personal library can be accessed and edited by voice from within an application on PC, Android or iOS.
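To make the abbreviation and phone-number tricks a bit more concrete, here is a toy Python sketch of the kind of clean-up a dictation tool might apply after transcribing your speech. It has nothing to do with Dragon’s actual internals, and the abbreviations and formatting rule are invented purely for illustration.

```python
# A toy illustration (nothing to do with Dragon's real internals) of the kind of
# clean-up a dictation tool does after transcribing: expanding the abbreviations
# you have taught it and formatting phone numbers the way you like them.
import re

my_abbreviations = {"asap": "as soon as possible", "fyi": "for your information"}

def tidy_dictation(text: str) -> str:
    # Expand the user's personal abbreviations
    for short, long in my_abbreviations.items():
        text = re.sub(rf"\b{short}\b", long, text, flags=re.IGNORECASE)
    # Reformat any 10-digit number as (XXX) XXX-XXXX
    return re.sub(r"\b(\d{3})[ .-]?(\d{3})[ .-]?(\d{4})\b", r"(\1) \2-\3", text)

print(tidy_dictation("Call me back asap on 415 555 0199"))
# -> "Call me back as soon as possible on (415) 555-0199"
```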

Sounds like witchcraft, doesn’t it? It isn’t, and it doesn’t have a brain either: there isn’t an animal hidden somewhere behind the screen, just waiting to type up every word you say. However, the technology involved does mimic the brain in some ways, most notably the way our neurons interact with one another. The idea is that the program must respond to a question it has never been explicitly programmed to answer. It is given example questions and told whether its answers are correct; this is called training data. The more training data it gets, the better it becomes at answering similar questions, up to the point where it can do so on its own. What makes this so special is that from the moment you buy the program, it starts to be shaped by your own voice, every day, with every task you perform. The program learns by analysing the recordings you provide, connecting the words you say to its functions while adapting to your accent and preferences. You are continuously feeding it data and, eventually, it will reach the point where you can just say “sign the building contract and email it to John” and it will do so, all while you are driving, cooking, working on a different project or even just petting your cat.
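As a rough illustration of the training-data idea (and in no way Nuance’s real pipeline), the little sketch below maps example phrases to the correct action and then guesses the action for a phrase it has never seen before. Every phrase and label here is made up, and a real system would of course work on audio rather than on typed text.

```python
# A toy "training data" example: phrases paired with the correct action.
# All phrases and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

training_phrases = [
    "start navigation", "take me to work", "take me home",
    "sign the contract", "email it to John", "send this to John",
]
correct_actions = [
    "navigate", "navigate", "navigate",
    "sign_document", "send_email", "send_email",
]

# The model only sees word patterns; the labels above are how it is
# "told whether its answers are correct".
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(training_phrases, correct_actions)

# A phrase it has never seen before
print(model.predict(["email the building contract to John"]))  # -> ['send_email']
```

The more labelled phrases you add, the better the guesses get, which is exactly the point of training data.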

Interactive speech is the second way to use voice commands in technology. ‘Siri, tell me a joke’ or ‘Alexa, how’s the weather?’ are not just lines for lonely people; in fact, most nerds like myself find the answers pretty fascinating. Why? The amount of data used to train Siri is huge, and it is collected from Apple users everywhere in order to come up with the right response to each question. The fact that I can ask for a restaurant, or that Siri shows me the news from my area based on my location, is something that would have been considered sci-fi ten years ago. Many users have complained that the Amazon Echo cannot understand certain accents or commands, but since the main principle is similar to the Nuance software (the model is shaped by the combination of you and the application), it takes some time until it is really customised to your preferences. Now the story goes on, and from saying ‘take me to work’ you get to angrily scream ‘Alexa, what’s in my shopping cart on Amazon?’ or ‘Siri, call John’.
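For a flavour of what this interactive loop looks like in code, here is a very small sketch built on the third-party SpeechRecognition package for Python (with PyAudio for microphone access). It is only a stand-in: Siri and Alexa run far larger models in the cloud and personalise their answers with your location and history.

```python
# A simplified interactive-speech loop using the SpeechRecognition package
# (pip install SpeechRecognition pyaudio). Not how Siri or Alexa really work,
# just the same basic shape: listen, transcribe, answer.
import speech_recognition as sr

recognizer = sr.Recognizer()

# Record one question from the default microphone
with sr.Microphone() as source:
    print("Ask your question...")
    audio = recognizer.listen(source)

# Transcribe the recording using Google's free web speech API
question = recognizer.recognize_google(audio).lower()

# A hand-written "brain": match the transcribed question to a canned answer
if "weather" in question:
    print("It's sunny, as far as I can tell.")
elif "joke" in question:
    print("I would tell you a UDP joke, but you might not get it.")
else:
    print(f"Sorry, I don't know how to answer: {question}")
```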

Moving forward, these are not the only gadgets that use machine learning for voice patterns. A quick look at a few websites and a stroll through Kickstarter or Indiegogo reveals a lot of potential gadgets that could make our lives easier if we just… speak. You could literally scream at the morning alarm to shut up, ask the TV for another program, yell at a vacuum cleaner to clean the house, or manage the lighting and heating by talking to your house. This domain is growing, and those who benefit are not just tech nerds. In medicine, for example, voice-activated prostheses have been developed to help people cope with severe spinal cord injuries. Some argue that this could be a cheap option for those suffering from the loss of a limb.

Furthermore, dictation software and all of these gadgets would reduce the amount of physical effort involved in day-to-day tasks, and could provide a solution for those in physical pain.

The voice has become a hub, and it seems it is only a matter of time until we lose contact with the screen and become attached to a personal assistant trained specifically for us.

In spite of all these good outcomes, issues with this kind of technology are regularly reported, and it is still a long way from perfect. In January, it was reported that Amazon’s Alexa ordered dollhouses for people after hearing the request on TV. The technology also often returns no response to certain questions, simply because it cannot understand the accent or access the relevant information (some may argue this is partly because it uses Microsoft’s Bing as its search engine, hehe).

In my opinion, starting from a macroscopic point of view, every domain is influenced by technology. It is the link between everything, and only when you take a closer look do you get the fascination and the thrill of understanding how your Facebook feed is organised. Only when you know the technology behind how images are retrieved in a Google search for a car can you understand that you are the only one who recognises the car; to the software it is just a pattern. Only once you realise that what is your mother’s voice to you is just a set of wavelengths to the computer will you be capable of understanding that going deep into this domain could lead to a gold mine. Any human trait becomes an idea to be implemented, and it is only a matter of time until machines detect smells or emotions better than we do.
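If ‘a set of wavelengths’ sounds abstract, the short sketch below shows what your mother’s voice actually looks like to a program: an array of numbers, which a Fourier transform turns into the frequencies that make up the sound. The file name is hypothetical; any short mono WAV recording would do.

```python
# What a voice recording looks like to a computer: just numbers.
# "mothers_voice.wav" is a hypothetical short, mono WAV recording.
import numpy as np
from scipy.io import wavfile

sample_rate, samples = wavfile.read("mothers_voice.wav")  # samples: 1-D array of amplitudes

# A Fourier transform turns those samples into the strength of each frequency
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

# The three strongest frequency components, in hertz
print("Dominant frequencies (Hz):", freqs[np.argsort(spectrum)[-3:]])
```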

Trying to analyse the implications, I see two possible outcomes. One would see the technically inclined person having no privacy, with all their personal information being sold to corporations as training data; the other would be, well, the modern caveman. Now I assume you are desperately looking for a middle ground. There is none. You already allow some applications to access all your information, whether you know it or not. The only thought that could comfort you at night is that your lack of privacy is making this technology flourish. Moreover, it will always be a good solution in terms of accessibility.

What are we still waiting for, besides machines with feelings? Overcoming the language barrier. Not all of us speak English! Baidu, the ‘Google of China’, provides a service that achieves 96% recognition accuracy for Mandarin words after being trained on thousands of hours of Mandarin recordings. Personally, I am impatiently waiting for support for the most widely spoken languages, alongside intonation. Obviously, if this technology is rolled out worldwide without support for other languages, it could push many of them towards extinction, because all users would be required to speak English.

In conclusion, whether you are going for Dragon, Windows Speech Recognition, Cortana, Google Voice, TalkTyper, Tazti, or any other software for dictation or interaction, I suggest that you keep a close eye on your personal information. We all want to keep up with the latest discoveries and may think that privacy does not matter, but bear in mind that, as Edward Snowden put it, “Arguing that you don’t care about privacy because you have nothing to hide is no different than saying you don’t care about free speech because you have nothing to say”.

Originally published at medium.com on May 14, 2017.


Alexandra Misca

Product Designer at TravelPerk, previously at Booking.com and THG. All the shenanigans are here. The serious stuff is here: misca.info