On November 30 - December 1, a hackathon was held in Nizhny Novgorod.
In this article we describe how we built the product prototype with which we eventually took first place.
More than 10 teams took part in the hackathon, and it was nice to see that some of them came from other regions. The venue was the "Kremlinsky on Pochain" complex, decorated inside with old photographs of Nizhny Novgorod to set the mood. (A reminder: Intel's central office is currently located in Nizhny Novgorod.) Participants had 26 hours to write code and then had to present their solution. A separate advantage was the demo session, which made sure that everything planned had actually been implemented and did not remain ideas on slides. Merch, snacks, food: everything was there.
In addition, Intel optionally provided cameras, Raspberry PI boards and Neural Compute Stick 2 devices.
Choosing a task
One of the hardest parts of preparing for a free-form hackathon is choosing the challenge. We decided right away to come up with something that was not yet in the product, since the announcement said this was highly welcome.
After some analysis, we arrived at two arguments in favor of working with sound:
- In theory, it is possible to build a combined algorithm that works on both sound and image, which should improve accuracy.
- Cameras usually have a narrow viewing angle, so covering a large area takes more than one camera; sound has no such limitation.
Let's develop the idea and take the retail segment as a basis. You can measure customer satisfaction at store checkouts. If a customer is unhappy with the service and starts raising their voice, you can call the administrator for help immediately.
In this case we also need speaker recognition by voice: it lets us tell store employees apart from customers and provide analytics for each individual. On top of that, it becomes possible to analyze the behavior of the store employees themselves and assess the atmosphere in the team. Sounds good!
We formulate the requirements for our solution:
- Small size of the target device
- Real-time operation
- Low price
- Easy scalability
As a result, we choose the Raspberry Pi 3 with an Intel NCS 2 as the target device.
Here it is important to note one feature of the NCS: it works best with standard CNN architectures, but if you need to run a model with custom layers on it, expect low-level optimization work.
Only one small thing is left: we need a microphone. A regular USB microphone would do, but it would not look good next to the RPI. Here too the solution is literally "lying nearby". To record voice, we decide to use the Voice Bonnet board from the AIY Voice Kit.
Download the Raspbian image (we used the AIY projects distribution) and record five seconds of audio to check that the microphone works:
arecord -d 5 -r 16000 test.wav
I should note right away that the microphone is very sensitive and picks up noise well. To fix this, open alsamixer, select the capture devices and lower the input signal level to 50-60%.
We modify the case with a file and everything fits; you can even close it with the lid.
Adding an indicator button
While taking the AIY Voice Kit apart, we remember that it has an RGB button whose backlight can be controlled from software. We search for "Google AIY Led" and find the documentation.
Why not use this button to display the recognized emotion? We have only 7 classes and the button has 8 colors, which is just enough!
We connect the button via GPIO to the Voice Bonnet and load the necessary libraries (they are already installed in the distribution from AIY projects):
from aiy.leds import Leds, Color
from aiy.leds import RgbLeds
Let's create a dict in which each emotion has a corresponding color as an RGB tuple, and an object of class aiy.leds.Leds through which we will update the color:
led_dict = {'neutral': (255, 255, 255), 'happy': (0, 255, 0), 'sad': (0, 255, 255), 'angry': (255, 0, 0), 'fearful': (0, 0, 0), 'disgusted': (255, 0, 255), 'surprised': (255, 255, 0)}
leds = Leds()
Finally, after each new emotion prediction we update the button color to match it (by key).
leds.update(Leds.rgb_on(led_dict.get(classes[prediction])))
Button, light up!
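One detail worth guarding against: `led_dict.get(...)` returns `None` for a label that is not in the table, and that value would then be passed to `Leds.rgb_on`. The sketch below shows a lookup with a fallback color. The helper name `color_for` and the choice of blue as the fallback are assumptions made for illustration (blue is the one basic color the table above leaves unused); they are not taken from the project code.

```python
# Color table copied from the article.
LED_COLORS = {
    'neutral': (255, 255, 255),
    'happy': (0, 255, 0),
    'sad': (0, 255, 255),
    'angry': (255, 0, 0),
    'fearful': (0, 0, 0),
    'disgusted': (255, 0, 255),
    'surprised': (255, 255, 0),
}

# Assumption: blue marks "unknown label".
DEFAULT_COLOR = (0, 0, 255)


def color_for(emotion):
    """Return the RGB tuple for an emotion label, or the fallback color."""
    return LED_COLORS.get(emotion, DEFAULT_COLOR)


print(color_for('angry'))
print(color_for('confused'))
```

On the device the result would be passed on in the same way as before, for example `leds.update(Leds.rgb_on(color_for(label)))`.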
Working with voice
We will use pyaudio to capture the stream from the microphone and webrtcvad to filter out noise and detect voice. In addition, we will create a queue to which voice fragments are added and from which they are removed asynchronously.
webrtcvad restricts the size of the fragment it accepts: it must be 10, 20 or 30 ms long. The emotion recognition model (as we will see later) was trained on a 48 kHz dataset, so we capture chunks of 48000 × 20 ms / 1000 × 1 (mono) = 960 samples, which is 1920 bytes at 16 bits per sample. For each chunk webrtcvad returns True or False, meaning voice is present or absent in that chunk.
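The arithmetic is easy to get wrong because PyAudio counts frames (samples) while webrtcvad receives raw bytes. A minimal sketch of the calculation follows; the helper name `vad_frame_size` is hypothetical, and the 2-byte sample width assumes the 16-bit format used in this article.

```python
def vad_frame_size(rate_hz, frame_ms, channels=1, sample_width=2):
    """Return (samples, bytes) for one VAD frame.

    webrtcvad only accepts frames of 10, 20 or 30 ms.
    """
    if frame_ms not in (10, 20, 30):
        raise ValueError("frame_ms must be 10, 20 or 30")
    samples = rate_hz * frame_ms // 1000 * channels
    return samples, samples * sample_width


samples, size_bytes = vad_frame_size(48000, 20)
print(samples, size_bytes)
```

For 48 kHz mono and 20 ms frames this gives 960 samples, the value to use for `frames_per_buffer`, and 1920 bytes per chunk.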
Let's implement the following logic:
- We add to a list the chunks that contain voice; if a chunk has no voice, we increment the counter of empty chunks.
- If the counter of empty chunks reaches 30 or more (600 ms), we look at the size of the list of accumulated chunks. If it is greater than 250, we add the list to the queue; if not, we consider the recording too short to feed to the model for speaker identification.
- If the counter of empty chunks is still below 30 and the list of accumulated chunks grows beyond 300, we add the fragment to the queue to get a more accurate prediction (because emotions tend to change over time).
import queue

import numpy as np
import pyaudio
import webrtcvad

process = True  # set to False to stop the capture loop
framesQueue = queue.Queue()

def to_queue(frames):
    # Join the raw 16-bit chunks into one int16 array for the models.
    return np.frombuffer(b''.join(frames), dtype=np.int16)

def framesThreadBody():
    CHUNK = 960  # samples per 20 ms frame at 48 kHz
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 48000
    p = pyaudio.PyAudio()
    vad = webrtcvad.Vad()
    vad.set_mode(2)
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    false_counter = 0  # consecutive chunks without voice
    audio_frame = []   # accumulated chunks with voice
    while process:
        data = stream.read(CHUNK)
        if vad.is_speech(data, RATE):
            false_counter = 0
            audio_frame.append(data)
            if len(audio_frame) > 300:
                # Long utterance: flush early, emotions change over time.
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
        else:
            false_counter += 1
            if false_counter >= 30:  # 600 ms of silence
                if len(audio_frame) > 250:
                    framesQueue.put(to_queue(audio_frame))
                # Shorter fragments are too short for the models and are dropped.
                audio_frame = []
                false_counter = 0
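The buffering rules above can be tried without a microphone by replaying a list of (chunk, is_speech) pairs. The sketch below is a simplified offline version: the function name `segment` and its parameter names are hypothetical, and dropping fragments that are too short once the silence limit is reached is an assumption about the intended behavior.

```python
def segment(chunks, max_silence=30, min_len=250, max_len=300):
    """Yield lists of voiced chunks, following the rules described above.

    `chunks` is an iterable of (data, is_speech) pairs.
    """
    silence = 0   # consecutive chunks without voice
    voiced = []   # accumulated chunks with voice
    for data, is_speech in chunks:
        if is_speech:
            silence = 0
            voiced.append(data)
            if len(voiced) > max_len:
                # Long utterance: emit it without waiting for a pause.
                yield voiced
                voiced = []
        else:
            silence += 1
            if silence >= max_silence:
                if len(voiced) > min_len:
                    yield voiced
                # Assumption: fragments that are too short are discarded.
                voiced = []
                silence = 0


# 260 voiced chunks followed by 600 ms of silence give one fragment.
stream = [(b'x', True)] * 260 + [(b'', False)] * 30
print([len(s) for s in segment(stream)])
```

Keeping the rules in a generator like this makes the thresholds easy to tune against recorded audio before running on the device.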
Now it is time to look for pre-trained models in the public domain, so we go to github and Google, keeping in mind that we are limited in which network architectures we can use. This is the hard part, because you have to test the models on your own input data and, on top of that, convert them to OpenVINO's internal format, IR (Intermediate Representation). We tried about 5-7 different solutions from github. The emotion recognition model worked right away, but voice recognition took much longer because those models use more complex architectures.
We settle on the following:
- Emotions from voice - https://github.com/alexmuhr/Voice_Emotion
It works as follows: the audio is cut into fragments of a fixed size, MFCC features are extracted for each fragment, and these are fed as input to a CNN.
- Voice recognition - https://github.com/linhdvu14/vggvox-speaker-identification
Here, instead of MFCC, we work with a spectrogram: after an FFT the signal is fed to a CNN, whose output is a vector representation of the voice.
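To make "a spectrogram after an FFT" concrete, here is a minimal NumPy sketch that frames a signal, applies a window and takes the magnitude of the FFT of each frame. It is an illustration only, not the preprocessing from the vggvox repository: the function name and the frame length and hop (400 and 160 samples, i.e. 25 ms and 10 ms at 16 kHz) are assumed values.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=400, hop=160):
    """Return a (frequency bins, frames) magnitude spectrogram."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real FFT of every frame; transpose so that rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


# One second of a 1 kHz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
spec = magnitude_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
print(spec.shape)
```

Each column of the result describes one short slice of time, which is the kind of 2-D input a CNN can treat like an image.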
Next we will talk about converting models, starting with some theory. OpenVINO includes several modules:
- Open Model Zoo, whose models can be used and included in your product
- Model Optimizer, which converts a model from various framework formats (Tensorflow, ONNX, etc.) into the Intermediate Representation format that we will work with from here on
- Inference Engine, which runs models in IR format on Intel processors, Myriad chips and Neural Compute Stick accelerators
- The most efficient version of OpenCV (with Inference Engine support)
Each model in IR format is described by two files: .xml and .bin.
Models are converted to IR format with Model Optimizer like this:
python /opt/intel/openvino/deployment_tools/model_optimizer/mo_tf.py --input_model speaker.hdf5.pb --data_type=FP16 --input_shape [1,512,1000,1]
--data_type lets you choose the data format the model will work with. FP32, FP16 and INT8 are supported. Choosing the right data type can give a good performance boost.
--input_shape specifies the dimensions of the input data. The ability to change it dynamically seems to exist in the C++ API, but we did not dig that far and simply fixed it for one of the models.
Next, let's load the already converted IR model into OpenCV through its DNN module and run a forward pass through it.
import cv2 as cv
emotionsNet = cv.dnn.readNet('emotions_model.bin', 'emotions_model.xml')
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)
The last line redirects the computation to the Neural Compute Stick. By default the computation runs on the processor, but on the Raspberry Pi that will not work, so the stick is required.
The logic that follows is simple: we split our audio into windows of a fixed size (0.4 s in our case), convert each window into MFCC features and feed them to the network:
emotionsNet.setInput(MFCC_from_window)
result = emotionsNet.forward()
Then we take the most frequent class across all windows. It is a simple solution, but at a hackathon there is no need to invent anything too sophisticated unless you have time to spare. We still have a lot of work to do, so let's move on to voice recognition. We need some kind of database that stores spectrograms of pre-recorded voices. Since little time is left, we solve this as best we can.
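The windowing and voting steps can be sketched in a few lines. The helper names `split_windows` and `majority_class` are hypothetical, and dropping the incomplete window at the end of the recording is an assumption; the 0.4 s window length comes from the text above.

```python
from collections import Counter


def split_windows(samples, rate, window_s=0.4):
    """Cut a sequence of samples into consecutive, non-overlapping windows.

    Assumption: a trailing window shorter than window_s is dropped.
    """
    size = int(round(rate * window_s))
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def majority_class(labels):
    """Return the label that was predicted for the most windows."""
    if not labels:
        raise ValueError("no predictions to vote on")
    return Counter(labels).most_common(1)[0][0]


windows = split_windows(list(range(48000)), rate=48000)
print(len(windows), len(windows[0]))
print(majority_class(['happy', 'angry', 'happy']))
```

In the real pipeline each window would be converted to MFCC features and passed through the network, and the per-window predictions would then be fed to the vote.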
Namely, we write a script that records a voice fragment (it works the same way as described above, except that on a keyboard interrupt it saves the voice to a file).
Let's try:
python3 voice_db/record_voice.py test.wav
We record the voices of several people (in our case, three team members).
Next, for each recorded voice we run a fast Fourier transform, obtain a spectrogram and save it as a numpy array (.npy):
for file in glob.glob("voice_db/*.wav"):
    spec = get_fft_spectrum(file)
    np.save(file[:-4] + '.npy', spec)
More details are in the file create_base.py.
As a result, when we run the main script, we compute embeddings from these spectrograms at the very start:
for file in glob.glob("voice_db/*.npy"):
    spec = np.load(file)
    spec = spec.astype('float32')
    spec_reshaped = spec.reshape(1, 1, spec.shape[0], spec.shape[1])
    srNet.setInput(spec_reshaped)
    pred = srNet.forward()
    emb = np.squeeze(pred)
After obtaining an embedding from a spoken fragment, we can determine whose voice it is by taking the cosine distance from the fragment to every voice in the database (the smaller the distance, the more likely the match). For the demo we set the threshold to 0.3:
dist_list = cdist(emb, enroll_embs, metric="cosine")
distances = pd.DataFrame(dist_list, columns = df.speaker)
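The matching step can be written out with NumPy alone, which makes the role of the threshold explicit. This is a sketch under assumptions: the function name `identify`, the dictionary layout of the enrolled voices and the rule "reject when the best distance exceeds the threshold" are illustrative, and the two-dimensional vectors and speaker names are made-up stand-ins for real embeddings.

```python
import numpy as np


def identify(emb, enrolled, threshold=0.3):
    """Return (name, distance) of the closest enrolled voice.

    `enrolled` maps a speaker name to an embedding vector. If even the
    closest voice is farther than `threshold`, the name is None.
    """
    names = list(enrolled)
    mat = np.stack([np.asarray(enrolled[n], dtype=np.float64) for n in names])
    emb = np.asarray(emb, dtype=np.float64)
    # Cosine distance = 1 - cosine similarity.
    sims = mat @ emb / (np.linalg.norm(mat, axis=1) * np.linalg.norm(emb))
    dists = 1.0 - sims
    best = int(np.argmin(dists))
    if dists[best] > threshold:
        return None, float(dists[best])
    return names[best], float(dists[best])


enrolled = {"anna": [1.0, 0.0], "boris": [0.0, 1.0]}  # hypothetical speakers
print(identify([0.9, 0.1], enrolled))
```

Returning `None` for an unknown speaker is one way to tell customers apart from enrolled employees, which is the use case described earlier.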
Finally, I would like to note that inference was fast enough to allow adding 1-2 more models (a 7-second sample took 2.5 s of inference). We no longer had time to add new models, so we focused on writing a prototype of the web application.
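The headroom can be expressed as a real-time factor: inference time divided by the duration of the audio. The sketch below assumes the figure of 2.5 is in seconds and covers everything run on one sample; the function name is hypothetical.

```python
def real_time_factor(inference_s, audio_s):
    """Ratio of processing time to audio duration; below 1.0 is real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return inference_s / audio_s


rtf = real_time_factor(2.5, 7.0)
spare_s = 7.0 - 2.5  # time left before the next 7-second sample is ready
print(f"real-time factor: {rtf:.2f}, spare time per sample: {spare_s:.1f} s")
```

A factor well below 1.0 is what leaves room for additional models on the same device.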
Web application
An important point: we bring a router from home and set up our own local network, which makes it easy to connect the device and the laptops over the network.
The backend is an end-to-end message channel between the front end and the Raspberry Pi, based on WebSocket (a protocol that runs over TCP and is established through an HTTP handshake).
The first stage is receiving processed information from the Raspberry Pi, that is, predictions packed in json. Midway through their journey they are saved to a database so that statistics on the user's emotional background over a period can be generated. The packet is then sent to the front end, which subscribes and receives packets from the websocket endpoint. The whole backend is written in golang, chosen because it is well suited to asynchronous tasks, which goroutines handle well.
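As an illustration of what "predictions packed in json" and the later statistics could look like, here is a small Python sketch. The field names (`speaker`, `emotion`, `ts`), the helper names and the sample values are all hypothetical; the article does not show the actual message schema, and the real backend is written in Go.

```python
import json
from collections import Counter


def pack_prediction(speaker, emotion, timestamp):
    """Serialize one prediction as a JSON string (hypothetical schema)."""
    return json.dumps({"speaker": speaker, "emotion": emotion, "ts": timestamp})


def emotion_stats(messages):
    """Return the share of each emotion among the stored messages."""
    counts = Counter(json.loads(m)["emotion"] for m in messages)
    total = sum(counts.values())
    return {emotion: n / total for emotion, n in counts.items()}


stored = [
    pack_prediction("anna", "happy", 0),
    pack_prediction("anna", "happy", 7),
    pack_prediction("anna", "angry", 14),
]
print(emotion_stats(stored))
```

Storing the raw messages and aggregating on demand keeps the device side simple: it only has to emit one small JSON object per recognized fragment.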
When the endpoint is accessed, the user is registered and stored in a structure, and then their message is received. Both the user and the message go into a common hub, from which messages are sent onward (to the subscribed front end). If a user closes the connection (the Raspberry Pi or the front end), their subscription is cancelled and they are removed from the hub.
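The hub described here is a small publish/subscribe structure. The real backend is written in Go with goroutines; the sketch below only illustrates the same pattern in Python, with one in-memory queue per subscriber standing in for a websocket connection. The class and method names are hypothetical.

```python
import queue


class Hub:
    """Keeps track of subscribers and fans each message out to all of them."""

    def __init__(self):
        self._subscribers = set()

    def register(self):
        """Add a subscriber and return the queue it will read from."""
        inbox = queue.Queue()
        self._subscribers.add(inbox)
        return inbox

    def unregister(self, inbox):
        """Remove a subscriber, for example when its connection closes."""
        self._subscribers.discard(inbox)

    def broadcast(self, message):
        """Deliver a message to every currently registered subscriber."""
        for inbox in self._subscribers:
            inbox.put_nowait(message)


hub = Hub()
front = hub.register()
hub.broadcast('{"emotion": "happy"}')
print(front.get_nowait())
hub.unregister(front)
```

Routing everything through one hub means the sender never needs to know how many front ends are connected at a given moment.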
Waiting for a connection from the back end
The front end is a web application written in JavaScript using the React library to speed up and simplify development. Its purpose is to visualize the data produced by the algorithms running on the back end and directly on the Raspberry Pi. The application has section routing implemented with react-router, but the page of main interest is the home page, which receives a continuous stream of data from the server in real time over WebSocket. The Raspberry Pi detects a voice, determines whether it belongs to a specific person from the registered database, and sends a list of probabilities to the client. The client displays the latest relevant data: the avatar of the person who most likely spoke into the microphone and the emotion with which they spoke.
Home page with updated predictions
Conclusion
We could not finish everything as planned because we simply ran out of time, so our main hope was that everything would work in the demo. In the presentation we explained how everything works, which models we took and which problems we ran into. Then came the demo part: the experts walked around the room in random order and approached each team to look at the working prototype. They asked us questions too, each of us answered for their own part, we left the web application open on the laptop, and everything really worked as expected.
Let me note that the total cost of our solution was $150:
- Raspberry Pi 3 ~ $35
- Google AIY Voice Bonnet (a ReSpeaker board is an alternative) ~ $15
- Intel NCS 2 ~ $100
How to improve:
- Add registration from the client: ask the user to read a randomly generated text
- Add a few more models: gender and age can be determined from the voice
- Separate voices that sound at the same time (diarization)
Repository:
https://github.com/vladimirwest/OpenEMO
Tired but happy
In conclusion, I would like to thank the organizers and the participants. Among the other teams' projects, we personally liked the solution for monitoring free parking spaces. For us it was an extremely cool experience of immersion in the product and in development. I hope more and more interesting events will be held in the regions, including ones on AI topics.
Source: www.habr.com