OpenVINO Hackathon: Voice and Emotion Recognition on Raspberry Pi

On November 30 and December 1, the OpenVINO Hackathon took place in Nizhny Novgorod. Participants were asked to create a prototype of a product solution using the Intel OpenVINO toolkit. The organizers offered a list of suggested topics to guide the choice of a task, but the final decision was left to the teams. In addition, the use of models that are not included in the product was encouraged.


In this article, we will describe how we created our prototype, which ultimately won first place.

More than 10 teams participated in the hackathon. It was nice to see that some of them came from other regions. The venue was the “Kremlin on Pochaina” complex, decorated inside with old photographs of Nizhny Novgorod; quite the atmosphere! (As a reminder, Intel's central office is currently located in Nizhny Novgorod.) The participants were given 26 hours to write code, after which they had to present their solutions. A separate plus was the demo session, which made sure that everything planned had actually been implemented and not left as ideas on the slides. Merch, snacks, and food were all there too!

In addition, Intel optionally provided cameras, Raspberry Pi boards, and Neural Compute Stick 2 devices.

Choosing a task

One of the hardest parts of preparing for a free-topic hackathon is picking a problem. We immediately decided to come up with something that is not yet in the product, since the announcement said this was welcomed in every possible way.

Having analyzed the models included in the product in the current release, we came to the conclusion that most of them solve various computer vision problems. Moreover, it is very difficult to come up with a computer vision problem that cannot be solved with OpenVINO, and if one can be invented, it is hard to find pre-trained models for it in the public domain. So we decided to dig in another direction: speech processing and analytics. Recognizing emotions in speech seemed like an interesting task. It must be said that OpenVINO already has a model that determines a person's emotions from their face, but:

  • In theory, it is possible to build a combined algorithm that works on both sound and image, which should give an increase in accuracy.
  • Cameras usually have a narrow field of view; to cover a large area, more than one camera is required, while sound does not have this limitation.

We develop the idea further, taking the retail segment as a basis: you can measure customer satisfaction at store checkouts. If one of the customers is dissatisfied with the service and starts raising their voice, you can immediately call the administrator for help.
In this case, we also need to add speaker recognition by voice. This will allow us to distinguish store employees from customers and to produce analytics for each individual. On top of that, it will be possible to analyze the behavior of the store employees themselves and evaluate the atmosphere in the team. Sounds good!

We form the requirements for our solution:

  • Small target device
  • Real time work
  • Moderate price
  • Easy scalability

As a result, we select the Raspberry Pi 3 with an Intel NCS 2 as the target device.

Here it is important to note one feature of the NCS: it works best with standard CNN architectures, but if you need to run a model with custom layers on it, expect to do some low-level optimization.

All that is left is to get a microphone. A regular USB microphone would work, although it would not look great next to the RPi. But even here the solution literally “lies at hand”: to record voice, we decide to use the Voice Bonnet board from the Google AIY Voice Kit, which has a stereo microphone soldered onto it.

We download Raspbian from the AIY projects repository, write it to an SD card, and test that the microphone works using the following command (it records 5 seconds of audio and saves it to a file):

arecord -d 5 -r 16000 test.wav

Right away, I note that the microphone is very sensitive and picks up noise well. To fix this, go to alsamixer, select Capture devices, and lower the input level to 50-60%.

We rework the case a bit with a file and everything fits inside; you can even close the lid

Adding an indicator button

While taking the AIY Voice Kit apart, we remember that it includes an RGB button whose backlight can be controlled programmatically. We search for “Google AIY Led” and find the documentation: https://aiyprojects.readthedocs.io/en/latest/aiy.leds.html
Why not use this button to display the recognized emotion? We have only 7 classes, and the button has 8 colors, just enough!

We connect the button to the Voice Bonnet via GPIO and load the necessary libraries (they are already installed in the distribution from AIY projects):

from aiy.leds import Leds, Color
from aiy.leds import RgbLeds

Let's create a dict in which each emotion corresponds to a color as an RGB tuple, and an object of the aiy.leds.Leds class, through which we will update the color:

led_dict = {
    'neutral':   (255, 255, 255),
    'happy':     (0, 255, 0),
    'sad':       (0, 255, 255),
    'angry':     (255, 0, 0),
    'fearful':   (0, 0, 0),
    'disgusted': (255, 0, 255),
    'surprised': (255, 255, 0),
}
leds = Leds()

And finally, after each new emotion prediction, we update the color of the button accordingly (by key):

leds.update(Leds.rgb_on(led_dict.get(classes[prediction])))
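
As a small addition (a sketch of ours, not part of the original snippets), the same module can also switch the backlight off between utterances, assuming the Leds.rgb_off() helper from aiy.leds:

from aiy.leds import Leds

leds = Leds()
# Turn the RGB backlight off while nobody is speaking
leds.update(Leds.rgb_off())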

Button, burn!

Working with voice

We will use pyaudio to capture the stream from the microphone and webrtcvad to filter out noise and detect voice. In addition, we create a queue to which we asynchronously add voice fragments and from which we take them.

Since webrtcvad has a limitation on the size of the fragment it is fed (it must be 10, 20, or 30 ms long), and the emotion recognition model (as we will learn later) was trained on a 48 kHz dataset, we capture chunks of 48000 × 20 ms / 1000 × 1 (mono) = 960 samples. Webrtcvad returns True/False for each of these chunks, which corresponds to the presence or absence of voice in the chunk.
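
As a quick sanity check of that arithmetic (a tiny sketch, matching the constants used in the capture code below):

RATE = 48000        # Hz, the sampling rate the emotion model was trained on
FRAME_MS = 20       # webrtcvad accepts 10/20/30 ms frames
CHANNELS = 1        # mono
CHUNK = RATE * FRAME_MS // 1000 * CHANNELS
print(CHUNK)        # 960 samples per chunk (1920 bytes as 16-bit PCM)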

We implement the following logic:

  • We add to a list those chunks where voice is detected; if there is no voice, we increment a counter of empty chunks.
  • If the counter of empty chunks reaches 30 (600 ms), we look at the size of the list of accumulated chunks: if it is greater than 250, we add it to the queue; otherwise, we consider the recording too short to feed to the speaker identification model.
  • If the counter of empty chunks is still below 30 and the size of the list of accumulated chunks exceeds 300, we add the fragment to the queue anyway, to keep the prediction up to date (because emotions tend to change over time).

import queue

import numpy as np
import pyaudio
import webrtcvad

process = True  # global flag; set to False to stop the capture thread


def to_queue(frames):
    # Join the raw byte chunks and convert them into a single int16 sample array
    return np.frombuffer(b''.join(frames), dtype=np.int16)


framesQueue = queue.Queue()


def framesThreadBody():
    CHUNK = 960              # 20 ms at 48 kHz, mono
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 48000

    p = pyaudio.PyAudio()
    vad = webrtcvad.Vad()
    vad.set_mode(2)          # VAD aggressiveness (0..3)
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    false_counter = 0        # consecutive chunks without voice
    audio_frame = []         # accumulated chunks containing voice
    while process:
        data = stream.read(CHUNK)
        if not vad.is_speech(data, RATE):
            false_counter += 1
            # 30 empty chunks = 600 ms of silence: flush if enough voice accumulated
            if false_counter >= 30:
                if len(audio_frame) > 250:
                    framesQueue.put(to_queue(audio_frame))
                    audio_frame = []
                    false_counter = 0
        else:
            false_counter = 0
            audio_frame.append(data)
            # long continuous speech: send an intermediate fragment (emotions change over time)
            if len(audio_frame) > 300:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
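
For completeness, here is a minimal sketch (our illustration, not the original code) of how the main thread could start this capture thread and consume fragments from framesQueue:

import threading

# Start the capture thread and take voice fragments off the queue as they appear
capture_thread = threading.Thread(target=framesThreadBody, daemon=True)
capture_thread.start()

while process:
    fragment = framesQueue.get()   # blocks until the next voice fragment arrives
    # ...here the fragment is passed to the emotion and speaker models (see below)...
    print('got a fragment of', len(fragment), 'samples')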

Now it is time to look for pre-trained models in the public domain. We go to GitHub and Google, but keep in mind the limitation on the architectures we can use. This is a rather difficult part, because you have to test the models on your own input data and, in addition, convert them to the internal OpenVINO format, IR (Intermediate Representation). We tried about 5-7 different solutions from GitHub, and while the emotion recognition model worked right away, we had to spend more time on voice recognition, since it relies on more complex architectures.

In the end, we settle on one model for emotion recognition by voice and one for speaker identification.

Next, let's talk about model conversion, starting with a bit of theory. OpenVINO includes several modules:

  • Open Model Zoo, whose models can be used and included in your product
  • Model Optimizer, which converts models from various framework formats (TensorFlow, ONNX, etc.) into the Intermediate Representation format that we will work with further
  • Inference Engine, which runs models in IR format on Intel processors, Myriad chips, and Neural Compute Stick accelerators
  • The most efficient build of OpenCV (with Inference Engine support)
Each model in IR format is described by two files: .xml and .bin.
Models are converted to IR format via the Model Optimizer as follows:

python /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_model speaker.hdf5.pb --data_type=FP16 --input_shape [1,512,1000,1]

--data_type lets you select the data format the model will work with. FP32, FP16, and INT8 are supported. Choosing the optimal data type can give a good performance boost.
--input_shape specifies the dimensions of the input data. The ability to change it dynamically seems to be present in the C++ API, but we did not dig that far and simply fixed it for one of the models.
Next, let's try to load the converted model in IR format through the DNN module in OpenCV and run a forward pass on it.

import cv2 as cv
emotionsNet = cv.dnn.readNet('emotions_model.bin',
                             'emotions_model.xml')
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)

The last line redirects the calculations to the Neural Compute Stick. By default, calculations are performed on the processor, but in the case of the Raspberry Pi this will not work, so you need the stick.
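
Depending on the OpenCV build, it may also be necessary to explicitly select the Inference Engine backend before choosing the MYRIAD target; a minimal sketch of that variant (our addition, assuming the OpenVINO-enabled OpenCV):

import cv2 as cv

emotionsNet = cv.dnn.readNet('emotions_model.xml', 'emotions_model.bin')
# Route inference through the Inference Engine backend and onto the Myriad VPU (NCS 2)
emotionsNet.setPreferableBackend(cv.dnn.DNN_BACKEND_INFERENCE_ENGINE)
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)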

The logic is then as follows: we split our audio into windows of a fixed size (we use 0.4 s), convert each of these windows into MFCC features, and feed them to the network:

emotionsNet.setInput(MFCC_from_window)
result = emotionsNet.forward()

After that, we take the most common class across all windows. A simple solution, but at a hackathon you don't need to invent anything too clever unless you have spare time. We still have a lot of work to do, so we move on and deal with voice recognition. We need some kind of database in which spectrograms of pre-recorded voices are stored. Since there is not much time left, we solve this issue as best we can.
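
A rough sketch of this windowing and majority vote (our illustration; the actual feature extraction and the model's input layout may differ, and librosa is an assumption here):

import numpy as np
import librosa

RATE = 48000
WINDOW = int(0.4 * RATE)  # 0.4 s window, as described above

def predict_emotion(samples, net, classes):
    """Split a voice fragment into windows, classify each one, return the majority class."""
    votes = []
    for start in range(0, len(samples) - WINDOW + 1, WINDOW):
        window = samples[start:start + WINDOW].astype(np.float32) / 32768.0
        mfcc = librosa.feature.mfcc(y=window, sr=RATE)   # shape: (n_mfcc, frames)
        blob = mfcc[np.newaxis, np.newaxis, :, :]        # NCHW layout assumed by the net
        net.setInput(blob)
        votes.append(int(np.argmax(net.forward())))
    return classes[max(set(votes), key=votes.count)] if votes else None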

Namely, we create a script for recording a voice fragment (it works the same way as described above, except that on a keyboard interrupt it saves the recording to a file).

Let's try it:

python3 voice_db/record_voice.py test.wav

We record the voices of several people (in our case, three team members).
Next, for each recorded voice, we perform a fast Fourier transform, get a spectrogram, and save it as a numpy array (.npy):

for file in glob.glob("voice_db/*.wav"):
    spec = get_fft_spectrum(file)
    np.save(file[:-4] + '.npy', spec)

More details are in the file create_base.py.
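
For a rough idea of what such a function could look like (purely our illustration; the real implementation lives in create_base.py and its parameters will differ):

import numpy as np
import librosa

def get_fft_spectrum_sketch(path):
    # Illustrative only: load the recording and return a magnitude spectrogram
    samples, rate = librosa.load(path, sr=None)          # keep the file's own sample rate
    stft = librosa.stft(samples, n_fft=1024, hop_length=256)
    return np.abs(stft)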
As a result, when running the main script, we obtain embeddings from these spectrograms at the very beginning:

# srNet is the speaker recognition model, loaded the same way as emotionsNet
for file in glob.glob("voice_db/*.npy"):
    spec = np.load(file)
    spec = spec.astype('float32')
    spec_reshaped = spec.reshape(1, 1, spec.shape[0], spec.shape[1])
    srNet.setInput(spec_reshaped)
    pred = srNet.forward()
    emb = np.squeeze(pred)   # reference embedding for this enrolled speaker

After obtaining the embedding for the spoken segment, we can determine who it belongs to by taking the cosine distance from the fragment to all the voices in the database (the smaller, the more likely); for the demo we set the threshold to 0.3:

from scipy.spatial.distance import cdist
import pandas as pd

dist_list = cdist(emb.reshape(1, -1), enroll_embs, metric="cosine")
distances = pd.DataFrame(dist_list, columns=df.speaker)
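
A small sketch of how the 0.3 threshold can then be applied (our illustration; variable names follow the snippet above):

# Pick the enrolled speaker with the smallest cosine distance and apply the demo threshold
best_speaker = distances.iloc[0].idxmin()
if distances.iloc[0].min() < 0.3:
    print('recognized speaker:', best_speaker)
else:
    print('unknown speaker')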

In the end, I would like to note that the inference speed was fast and would have allowed adding 1-2 more models (it took 7 seconds for a 2.5-second sample). However, we did not have time to add new models and instead focused on writing a prototype of the web application.

Web application

An important point: we bring a router from home and set up our own local network; it helps to connect the device and the laptops over the network.

The backend is an end-to-end communication channel between the frontend and the Raspberry Pi, based on the WebSocket protocol (which runs over TCP).

The first step is to receive the processed information from the Raspberry, that is, predictions packed into JSON. Along the way they are saved to the database, so that statistics about a user's emotional background over a period can be generated later. The packet is then sent to the frontend, which subscribes to and receives packets from the WebSocket endpoint. The whole backend is written in Go; the choice fell on it because it is well suited for asynchronous tasks, which goroutines handle well.
When the endpoint is accessed, the user is registered and added to a structure, and then their messages are received. Both the user and the messages go into a common hub, from which messages are sent onward (to the subscribed frontend), and if a user (the Raspberry or the frontend) closes the connection, its subscription is cancelled and it is removed from the hub.
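
On the Raspberry Pi side, sending a prediction to such a backend can look roughly like this (our sketch; the websocket-client package, the URL, and the field names are assumptions for illustration):

import json
import websocket

# Open a WebSocket connection to the backend and push one prediction as JSON
ws = websocket.create_connection("ws://192.168.1.10:8080/ws")
ws.send(json.dumps({
    "speaker": "team_member_1",
    "emotion": "happy",
    "probabilities": [0.1, 0.7, 0.2],
}))
ws.close()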

Waiting for a connection from the backend

The frontend is a web application written in JavaScript using the React library to speed up and simplify development. Its purpose is to visualize the data produced by the algorithms running on the backend and directly on the Raspberry Pi. The page has section routing implemented with react-router, but the main page of interest is the main one, where a continuous stream of data arrives from the server in real time over WebSocket. The Raspberry Pi detects a voice, determines whether it belongs to a specific person from the enrolled database, and sends the probability list to the client. The client displays the latest data: the avatar of the person who most likely spoke into the microphone, and the emotion with which they pronounced the words.

Main page with updated predictions

Conclusion

We did not manage to finish everything as planned; we simply ran out of time, so the main hope was that everything would work at the demo. In the presentation, we talked about how everything works, which models we took, and what problems we ran into. Then came the demo part: the experts walked around the hall in random order and approached each team to look at the working prototype. They asked us questions too, and each of us answered in our own way; the web app was left running on the laptop, and everything really worked as expected.

I note that the total cost of our solution was about $150:

  • Raspberry Pi 3 ~ $35
  • Google AIY Voice Bonnet (a ReSpeaker board can be used instead) ~ $15
  • Intel NCS 2 ~ $100

How to improve:

  • Use enrollment from the client: ask them to read a text that we generate randomly
  • Add a few more models: for example, gender and age can also be determined by voice
  • Separate voices that sound at the same time (diarization)

Repository: https://github.com/vladimirwest/OpenEMO

Tired but happy

In conclusion, I would like to thank the organizers and the participants. Among the other teams' projects, we personally liked the solution for monitoring free parking spaces. For us it was an incredibly cool experience of product immersion and development. I hope that more and more interesting events will be held in the regions, including on AI topics.

Source: habr.com
