The value of a smile: why are artificial intelligence systems learning to recognize human emotions?
The emotion recognition market is worth approximately $20-30 billion, but vendors do not yet know exactly how to approach it
A surge of public and scientific interest in emotion detection and recognition, along with a boom in technology solutions built on such data, occurred in the second half of the 2000s and the beginning of the 2010s. These processes peaked around 2015-2016, when two technology giants, Microsoft and Google, made their pilot projects available to ordinary users. The demo version of Microsoft's program (Project Oxford), for example, quickly became popular. The idea is simple: you upload a photo, and the program tries, with some degree of probability, to detect an emotion from the person's facial expressions, choosing from six basic emotions (contempt, disgust, fear, happiness, sadness, surprise) or a neutral state.
Another global player, Apple Inc., acquired Emotient Inc., an artificial intelligence startup that had developed technology to scan faces and read people's emotions. At the same time, interesting products from small companies appeared on the market. One example is Affectiva, which grew out of the affective computing laboratory at the Massachusetts Institute of Technology. Back in 2013, Forbes named the product developed by Affectiva one of the five most disruptive technologies. A similar project, Emotient, was created by graduates of the University of California, San Diego; it had raised $8 million from investors including Intel Capital.
Over time the algorithms have improved, but the question of the technology's practical benefit is still open. What is the use of tagging friends' emotions in a Facebook photo, or turning a smile into a personalized emoji? These technologies can be applied quite successfully to neuromarketing: Affectiva, for example, uses one of its products exclusively for market research. However, there is a significant problem. All these products, whether platform-based or cloud-based, usually detect a person's emotions from facial expressions alone. Roughly speaking, a set of landmark points is selected on the face and compared against a bank of photographs on which the emotion has already been labeled. And this is not enough.
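The landmark-matching idea described above can be sketched in a few lines. This is a deliberately minimal illustration, not any vendor's actual method: the reference bank, the landmark coordinates and the nearest-neighbor rule are all hypothetical stand-ins for what real systems do with far richer features and classifiers.

```python
import numpy as np

# Hypothetical bank of labeled reference faces: each entry is a flattened
# vector of (x, y) facial landmark coordinates with its emotion label.
reference_bank = {
    "happiness": np.array([0.2, 0.8, 0.5, 0.9, 0.8, 0.8]),
    "sadness":   np.array([0.2, 0.6, 0.5, 0.4, 0.8, 0.6]),
    "surprise":  np.array([0.2, 0.9, 0.5, 1.0, 0.8, 0.9]),
}

def classify(landmarks: np.ndarray) -> str:
    """Nearest-neighbor match of a landmark vector against the bank."""
    return min(reference_bank,
               key=lambda label: np.linalg.norm(landmarks - reference_bank[label]))

probe = np.array([0.2, 0.82, 0.5, 0.92, 0.8, 0.82])
print(classify(probe))  # closest to the "happiness" template here
```

Even in this toy form, the limitation the article points to is visible: the verdict rests entirely on one channel, the geometry of a single face image.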
When you watch a video without sound, you may clearly perceive one particular emotion. As soon as you turn the sound on, the picture can change: the perception of the emotion becomes completely different. A single channel of information imposes rather strict limitations, and as long as emotion recognition programs remain more or less "toys," mistakes are not critical. But if we want accurate, contactless detection of emotions, behavior patterns and physiological responses, we need additional sources of information about human experience. Take lying, for example: it is essentially a way of hiding emotions. If we understand which emotions a person is trying to hide from an interlocutor or a camera lens, we can tell when he is lying outright, dodging or holding something back. Imagine a top manager or a politician speaking at a press conference. He can recite from the podium with a completely stony face, but the intonation of his voice and the movements of his eyes or body will certainly give him away at some point.
Correct emotion recognition requires many channels, at least four, analyzed simultaneously. The first is the person's face. Facial expressions are recognized quite reliably: international competitions and hackathons take place regularly where teams from different countries present their solutions. Usually a 3D model of the head is built, and micro-movements of the facial muscles are tracked on it. The second channel is oculomotor activity. Tracking eye movements, saccades and fixations, yields a lot of important information, because visual perception is, in a sense, primary for a person. The third is body movement, including fine motor activity barely perceptible to the naked eye: hands, fingers, knees, feet and so on. The fourth channel is the voice, and here there is a huge field for work: you can recognize not only intonation but also the emotional coloring of speech, both what the person says and how.
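One simple way to combine several channels is late fusion: each channel's classifier emits a probability distribution over emotions, and the distributions are merged with per-channel weights. The sketch below is an assumed illustration of that scheme; the channel scores, weights and emotion set are invented, and in practice the weights would be learned rather than hand-set.

```python
import numpy as np

EMOTIONS = ["happiness", "sadness", "surprise", "neutral"]

# Hypothetical per-channel classifier outputs: probabilities over EMOTIONS.
channel_scores = {
    "face":  np.array([0.6, 0.1, 0.2, 0.1]),
    "eyes":  np.array([0.5, 0.2, 0.2, 0.1]),
    "body":  np.array([0.4, 0.3, 0.1, 0.2]),
    "voice": np.array([0.7, 0.1, 0.1, 0.1]),
}

# Assumed channel reliabilities; weights sum to 1.
weights = {"face": 0.4, "eyes": 0.2, "body": 0.1, "voice": 0.3}

def fuse(scores, weights):
    """Late fusion: weighted average of per-channel probability vectors."""
    combined = sum(weights[ch] * p for ch, p in scores.items())
    return EMOTIONS[int(np.argmax(combined))], combined

label, combined = fuse(channel_scores, weights)
print(label)  # the channels jointly point to "happiness"
```

The appeal of late fusion is that each channel can be developed and replaced independently; its cost, as the article notes, is that every added channel multiplies the variability the system must handle.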
In fact, there are other channels as well, physiology in particular. For example, by magnifying subtle motions in a video and tracking tiny oscillations of the head, you can measure the pulse and establish whether it quickens in a given situation. Expressions like "a shadow ran over his face" or "his face went white with horror" describe real, visible changes; they can be seen on video, but the recording must be of ideal quality. There are already technologies that track changes in the color of the pixels on a person's face, revealing how the blood vessels fill. Such experiments were conducted in 2012 by scientists at MIT, who showed in demo videos how to measure the pulse of, say, a sleeping child. But these are isolated cases, and such technologies are not yet in wide use.
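The core of video-based pulse estimation can be demonstrated without any video at all: average the color of the face region per frame, then find the dominant frequency of that signal within the physiological band. The sketch below runs on a synthetic stand-in for that per-frame signal (a 72 bpm oscillation plus noise); the real MIT work involved considerably more signal processing than this.

```python
import numpy as np

fps = 30.0                      # assumed video frame rate
t = np.arange(0, 10, 1 / fps)   # 10 seconds of frames

# Synthetic stand-in for the mean green-channel intensity of the face
# region in each frame: a 1.2 Hz (72 bpm) oscillation buried in noise.
rng = np.random.default_rng(0)
signal = 0.02 * np.sin(2 * np.pi * 1.2 * t) + 0.005 * rng.standard_normal(t.size)

def estimate_bpm(signal, fps, lo=0.7, hi=3.0):
    """Dominant frequency in the 0.7-3.0 Hz band (42-180 bpm), via FFT."""
    spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(signal.size, d=1 / fps)
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

print(round(estimate_bpm(signal, fps)))  # recovers ~72 bpm
```

The band limit matters: restricting the search to plausible heart rates is what keeps lighting flicker and slow head drift from being mistaken for a pulse.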
Using additional channels leads to inevitable difficulties: variability increases sharply. But this is the price of accuracy. To address the problem, one can use deep neural networks, a direction that has been developing only relatively recently but has lifted emotion recognition to a fundamentally different level. Then everything would seem simple: the neural network takes a picture in real time and compares it with a huge database of reference videos where the desired emotions have already been found and labeled, and the algorithm does the job well. But one circumstance must be taken into account: such datasets are few, and for some channels they sometimes do not exist at all. Creating a large dataset is expensive and time-consuming, so one even comes across research papers devoted specifically to working with small datasets: extracting the maximum benefit from a small sample.
To recognize emotions from images, you need an array of photos; to recognize them from video, a set of videos. There are not many scientifically well-founded datasets at our disposal, so specialists are often forced to work with the same samples, replenishing them in parallel and moving the whole industry forward. Creating a dataset is a time-consuming process that requires considerable resources. First, scenarios must be devised for the actors or other participants so that, in dialogue with the experimenter, they express a broad spectrum of emotions. The closer to the natural conditions of everyday communication, the better.
During a conversation emotions often change or even mix. The transition can be sharp or smooth, and the intensity of expression and other parameters change rapidly. After the material is filmed, an annotator, a specialist who manually marks the emotions present in the recording, begins to work with it. How many annotators are needed for high-quality markup? One or two? In fact, no fewer than ten. How the annotator marks emotions matters a great deal: he can simply indicate the time interval of a particular emotion or also record its intensity. Beyond the generally recognized basic emotions, with which everything is more or less clear, an additional classifier is needed to cover intermediate emotions and their combinations. Another aspect is the voice. When communication becomes more emotional, people often interrupt each other, cut in out of turn, or speak simultaneously, their voices overlapping. Recognizing speech in this case is harder: you first need to separate the two or three audio tracks, isolate the one you need, and only then work with it. This factor, by the way, must also be taken into account when writing the scenarios for filming.
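With ten or more annotators per recording, their labels still have to be consolidated into one ground truth, typically by majority vote, with low-agreement segments flagged for review. A minimal sketch of that step, with invented labels and an assumed 60% agreement threshold:

```python
from collections import Counter

# Hypothetical labels from ten annotators for one video segment.
labels = ["happiness", "happiness", "surprise", "happiness", "happiness",
          "neutral", "happiness", "happiness", "surprise", "happiness"]

def consolidate(labels, min_agreement=0.6):
    """Majority label, the fraction of annotators agreeing, and a pass flag."""
    label, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement, agreement >= min_agreement

label, agreement, accepted = consolidate(labels)
print(label, agreement, accepted)  # happiness 0.7 True
```

Segments where no label clears the threshold are exactly the mixed or transitional emotions the article describes, and they usually go back to the annotators rather than into the training set.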
So, the lab has created the dataset, the annotators have marked it up, and the neural network has learned to work with it. Is everything ready? Unfortunately, not quite. Classifiers sometimes have to be retrained, or even trained anew, to account for ethno-cultural and linguistic characteristics. True, in the modern world many things are becoming universal and the differences are smoothed out, though not completely. For example, an Indian who grew up in the US will express emotions much like native-born Americans, but there may be nuances. People inevitably adapt to their environment, adopting its gestures, manners, rules and norms of emotional expression. And these, of course, sometimes differ visibly. Try observing from the sidelines as Americans negotiate with Chinese businesspeople, and you will receive the most valuable material for comparative analysis.
Creating a system that approaches the human level of emotion recognition is an ambitious goal and a fascinating process. What is at stake? Among the industry leaders the most visible are corporations such as Google and Facebook, which earn money on data about human behavior online. They have the most complete information about which sites we visit, where we are, what we like and dislike, and who we communicate with. All this is converted into targeted "smart" advertising. But year by year the volumes of data grow. Sooner or later it will be possible to recognize a person by his gait, track his stress level and thoroughly understand his emotions. Systems for detecting and recognizing emotions, affective computing, are in demand in many industries: from robotics to biometrics, from digital medicine to intelligent transport, from the Internet of Things to the gaming industry, not to mention education and AR/VR.
Today both market giants and startups are in the race; they compete and at the same time complement each other. Still, emotion technologies based on neural networks and machine learning remain a new direction, and, whatever one may say, all such projects are experimental.
Estimates of the size of the emotion recognition market vary widely. Reasonably systematic reports appeared, in fact, only at the turn of 2016-2017. Markets & Markets, for example, believes the emotion detection market will grow from $6.72 billion in 2016 to $36.07 billion by 2021, at an annual growth rate of 39.9%. According to Orbis Research, the global emotion recognition market was estimated at $6.3 billion in 2016 and will reach $19.96 billion by 2022.
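The 39.9% annual growth figure is easy to sanity-check: it is simply the compound annual growth rate implied by the two Markets & Markets endpoints over the five years from 2016 to 2021.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two market-size estimates."""
    return (end / start) ** (1 / years) - 1

# Markets & Markets: $6.72B (2016) -> $36.07B (2021), five years.
print(round(100 * cagr(6.72, 36.07, 5), 1))  # ~39.9% per year
```

Applying the same formula to the Orbis Research figures ($6.3 billion in 2016 to $19.96 billion in 2022) gives a considerably more modest rate of roughly 21% per year, which underlines how far apart the two forecasts are.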
Challenges for the future
The problem of emotion recognition has occupied companies for a long time. What are the challenges in this area now? The boom in artificial intelligence (AI) and machine learning that began just a couple of years ago has reached this field too, and all the big players are trying to learn to use AI for emotion recognition. In January 2017, for example, Google announced plans to embed AI-based face and emotion recognition into smart boards for Raspberry Pi robotics. More recently, Microsoft announced that it had taught artificial intelligence to recognize the emotional coloring of text.
Machine learning can potentially solve the problem, but it requires large training samples, which are not always available, so new ways of obtaining "raw material" for training must be devised. In addition, as Affectiva rightly points out, using machine learning drives up costs, and one has to figure out how to reduce them without losing recognition quality.
However, the main challenge for scientists and developers now is to link all the elements of the system into a single whole.