The hardest part of preparing for a free-form hackathon is choosing a challenge. We immediately decided to come up with something that was not yet in the product, since the announcement said that this was highly welcome.
Having analyzed the models included in the product in the current release, we concluded that most of them solve various computer vision problems. Moreover, it is very hard to come up with a computer vision problem that cannot be solved using OpenVINO, and even if one can be invented, it is hard to find pre-trained models in the public domain. We decided to dig in another direction: speech processing and analytics. Let's consider an interesting task, recognizing emotions from speech. It must be said that OpenVINO already has a model that determines a person's emotions from their face, but:
In theory, it is possible to build a combined algorithm that works on both sound and image, which should increase accuracy.
Cameras usually have a narrow viewing angle, so more than one camera is needed to cover a large area; sound has no such limitation.
Let's develop the idea, taking the retail segment as a basis. You can measure customer satisfaction at store checkouts. If a customer is dissatisfied with the service and starts raising their voice, you can immediately call the administrator for help.
In this case we need to add speaker recognition. This lets us distinguish store employees from customers and provide analytics for each individual. In addition, it becomes possible to analyze the behavior of the store employees themselves and to gauge the atmosphere in the team. Sounds good!
We formulate the requirements for our solution:
Small size of the target device
Real-time operation
Low price
Easy scalability
As a result, we chose a Raspberry Pi 3 with an Intel NCS 2 as the target device.
Here it is important to note one feature of the NCS: it works best with standard CNN architectures, but if you need to run a model with custom layers on it, expect to do low-level optimization.
There is just one small thing left: you need a microphone. A regular USB microphone will do, but it will not look good together with the RPi. Even here, though, the solution was literally lying nearby. To record voice, we decided to use the Voice Bonnet board from the Google AIY Voice Kit, which has a wired stereo microphone.
Download Raspbian from the AIY projects repository and write it to a flash drive, then test that the microphone works using the following command (it records 5 seconds of audio and saves it to a file):
arecord -d 5 -r 16000 test.wav
I should note right away that the microphone is very sensitive and picks up noise well. To fix this, open alsamixer, select the capture devices and reduce the input signal level to 50-60%.
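To double-check the test recording from a script rather than by ear, a small helper like the one below can read the file header. This is only a sketch using the standard `wave` module; the function name `wav_info` is ours and is not part of the AIY tooling.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, sample_width_bytes, duration_s)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        # Duration is the number of frames divided by frames per second.
        return rate, wav.getnchannels(), wav.getsampwidth(), frames / rate
```

For the file produced by the arecord command above, `wav_info("test.wav")` should report a 16000 Hz sample rate and a duration close to 5 seconds.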
We modify the case with a file and everything fits; you can even close it with the lid
Adding an indicator button
While taking the AIY Voice Kit apart, we remembered that it has an RGB button whose backlight can be controlled from software. We searched for "Google AIY Led" and found the documentation: https://aiyprojects.readthedocs.io/en/latest/aiy.leds.html
Why not use this button to display the recognized emotion? We have only 7 classes and the button has 8 colors, just enough!
We connect the button to the Voice Bonnet via GPIO and load the necessary libraries (they are already installed in the distribution from AIY projects)
from aiy.leds import Leds, Color
from aiy.leds import RgbLeds
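As an illustration of the idea, here is a minimal sketch of mapping recognized emotions to button colors. The class names and the particular colors are our assumptions (the text only says there are 7 classes), and `show_emotion` is a hypothetical helper. It relies on `Leds.update` and `Leds.rgb_on` from `aiy.leds`, which are only available on the device, so the import is done lazily.

```python
# Hypothetical class names: the article only states that there are 7 classes.
EMOTION_COLORS = {
    "neutral": (255, 255, 255),   # white
    "happy": (0, 255, 0),         # green
    "sad": (0, 0, 255),           # blue
    "angry": (255, 0, 0),         # red
    "fearful": (128, 0, 255),     # purple
    "disgusted": (255, 255, 0),   # yellow
    "surprised": (0, 255, 255),   # cyan
}
OFF = (0, 0, 0)  # used when the emotion is unknown


def color_for(emotion):
    """Look up the RGB colour for an emotion label, falling back to OFF."""
    return EMOTION_COLORS.get(emotion, OFF)


def show_emotion(leds, emotion):
    """Light the button; `leds` is an aiy.leds.Leds instance (device only)."""
    from aiy.leds import Leds  # lazy import so the mapping works off-device
    leds.update(Leds.rgb_on(color_for(emotion)))
```

On the Raspberry Pi this could be used as `with Leds() as leds: show_emotion(leds, "happy")`; off the device only `color_for` is usable.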
We add to a list the chunks that contain voice; if a chunk contains no voice, we increment the counter of empty chunks.
If the counter of empty chunks is >= 30 (600 ms), we look at the size of the list of accumulated chunks; if it is > 250, we add the fragment to the queue; if not, we consider the recording too short to feed to the model for speaker identification.
If the counter of empty chunks is still < 30 and the list of accumulated chunks exceeds 300, we add the fragment to the queue to get a more accurate prediction (because emotions tend to change over time).
import queue

import numpy as np
import pyaudio
import webrtcvad

process = True  # run flag for the capture loop; assumed to be managed elsewhere in the script


def to_queue(frames):
    # Join the raw byte chunks and view them as 16-bit samples.
    d = np.frombuffer(b''.join(frames), dtype=np.int16)
    return d


framesQueue = queue.Queue()


def framesThreadBody():
    CHUNK = 960  # 20 ms of audio at 48 kHz
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 48000

    p = pyaudio.PyAudio()
    vad = webrtcvad.Vad()
    vad.set_mode(2)
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)
    false_counter = 0  # consecutive chunks without voice
    audio_frame = []   # accumulated chunks with voice
    while process:
        data = stream.read(CHUNK)
        if not vad.is_speech(data, RATE):
            false_counter += 1
            if false_counter >= 30:
                if len(audio_frame) > 250:
                    framesQueue.put(to_queue(audio_frame))
                    audio_frame = []
                    false_counter = 0
        else:
            false_counter = 0
            audio_frame.append(data)
            if len(audio_frame) > 300:
                framesQueue.put(to_queue(audio_frame))
                audio_frame = []
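To make the buffering rule easier to test without a microphone, the same logic can be separated from the audio capture. The sketch below is our own restatement, not code from the project: `segment` and the constant names are hypothetical, and the speech/no-speech flag that webrtcvad would supply is passed in directly. It also spells out the arithmetic behind the thresholds.

```python
CHUNK = 960                        # samples per chunk, as in the capture code
RATE = 48000                       # samples per second
CHUNK_MS = 1000 * CHUNK // RATE    # 20 ms per chunk

SILENCE_CHUNKS = 30   # 30 * 20 ms = 600 ms of silence ends a fragment
MIN_CHUNKS = 250      # 250 * 20 ms = 5 s, minimum useful fragment
MAX_CHUNKS = 300      # 300 * 20 ms = 6 s, forced cut during long speech


def segment(chunks_with_flags):
    """Yield lists of voiced chunks according to the rules described above.

    `chunks_with_flags` is an iterable of (chunk, is_speech) pairs.
    """
    voiced, silent = [], 0
    for chunk, is_speech in chunks_with_flags:
        if is_speech:
            silent = 0
            voiced.append(chunk)
            if len(voiced) > MAX_CHUNKS:
                yield voiced
                voiced = []
        else:
            silent += 1
            if silent >= SILENCE_CHUNKS and len(voiced) > MIN_CHUNKS:
                yield voiced
                voiced, silent = [], 0
```

Feeding it 260 voiced chunks followed by 30 silent ones yields a single fragment of 260 chunks, while 100 voiced chunks followed by silence yield nothing, because the fragment is too short.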
It is time to look for pre-trained models in the public domain, on GitHub and Google, keeping in mind that we are limited in the architectures we can use. This is a rather difficult part, because you have to test the models on your own input data and, in addition, convert them to OpenVINO's internal format, IR (Intermediate Representation). We tried about 5-7 different solutions from GitHub. While the emotion recognition model worked right away, voice recognition took longer, because those models use more complex architectures.
We settled on the following:
Emotions from voice - https://github.com/alexmuhr/Voice_Emotion
It works on the following principle: the audio is cut into segments of a fixed size, for each segment we extract MFCC features and then feed them as input to a CNN
Voice recognition - https://github.com/linhdvu14/vggvox-speaker-identification
Here, instead of MFCC, we work with a spectrogram: after the FFT we feed the signal to a CNN, which outputs a vector representation of the voice.
Next we will talk about converting the models, starting with the theory. OpenVINO includes several modules:
Model Optimizer, which converts a model from various framework formats (Tensorflow, ONNX and so on) into the Intermediate Representation format that we will work with from here on.
Inference Engine, which runs models in IR format on Intel processors, Myriad chips and Neural Compute Stick accelerators
The most efficient build of OpenCV (with Inference Engine support)
Each model in IR format is described by two files: .xml and .bin.
Models are converted to IR format with Model Optimizer as follows:
--data_type selects the data format the model will work with. FP32, FP16 and INT8 are supported. Choosing the optimal data type can give a good performance boost. --input_shape specifies the dimensions of the input data. The ability to change it dynamically seems to be available in the C++ API, but we did not dig that far and simply fixed it for one of the models.
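For illustration, here is a small sketch that assembles such a Model Optimizer command line from Python. The script path, model file name and input shape are placeholders we made up, and `build_mo_command` is a hypothetical helper. Only the flags `--input_model`, `--data_type`, `--input_shape` and `--output_dir` are actual Model Optimizer options. FP16 is used as the default here on the assumption that the model targets the Neural Compute Stick.

```python
import shlex


def build_mo_command(mo_script, input_model, data_type="FP16",
                     input_shape=None, output_dir=None):
    """Assemble a Model Optimizer command line as a list of arguments."""
    cmd = ["python3", mo_script,
           "--input_model", input_model,
           "--data_type", data_type]
    if input_shape is not None:
        # Model Optimizer expects the shape as a bracketed list, e.g. [1,3,224,224].
        shape = "[" + ",".join(str(dim) for dim in input_shape) + "]"
        cmd += ["--input_shape", shape]
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


# Placeholder paths and shape, for illustration only.
cmd = build_mo_command("mo.py", "emotions_model.pb",
                       data_type="FP16", input_shape=(1, 1, 40, 40))
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run` as is. Building it as a list rather than a single string avoids shell quoting problems with the brackets in the shape.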
Next, let's try loading the already converted IR model into OpenCV through its DNN module and running a forward pass.
import cv2 as cv
emotionsNet = cv.dnn.readNet('emotions_model.bin',
                             'emotions_model.xml')
emotionsNet.setPreferableTarget(cv.dnn.DNN_TARGET_MYRIAD)
The last line here redirects the computation to the Neural Compute Stick. By default the computation runs on the processor, but on a Raspberry Pi that will not work, so you need the stick.
Next, the logic is as follows: we split the audio into windows of a fixed size (0.4 s in our case), convert each window into MFCC features, and feed them to the network:
emotionsNet.setInput(MFCC_from_window)
result = emotionsNet.forward()
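The windowing step can be sketched with NumPy as follows. This is an illustration under our own assumptions: the windows do not overlap, the incomplete tail is dropped, and the 48 kHz rate is taken from the capture code. `split_windows` is a hypothetical helper, and the MFCC computation itself is left out.

```python
import numpy as np


def split_windows(samples, rate, window_s=0.4):
    """Split a 1-D signal into consecutive non-overlapping windows.

    Trailing samples that do not fill a whole window are dropped.
    Returns an array of shape (n_windows, window_len).
    """
    window_len = int(round(rate * window_s))
    n_windows = len(samples) // window_len
    return np.reshape(samples[:n_windows * window_len],
                      (n_windows, window_len))


signal = np.zeros(48000 * 2, dtype=np.int16)   # 2 s of silence at 48 kHz
windows = split_windows(signal, rate=48000)
print(windows.shape)   # (5, 19200)
```

Each row of `windows` would then be converted to MFCC features and passed to `emotionsNet.setInput`.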
Next, we take the most frequent class across all windows. It is a simple solution, but for a hackathon you do not need to invent anything too elaborate unless you have time to spare. We still have a lot of work to do, so let's move on to voice recognition. We need some kind of database that stores spectrograms of pre-recorded voices. Since little time is left, we will solve this as best we can.
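The per-window vote mentioned at the start of the paragraph above can be sketched like this. It assumes the network returns one row of class scores per window; the function name `vote` and the sample scores are made up for the example.

```python
from collections import Counter

import numpy as np


def vote(window_scores):
    """Return the class index predicted most often across windows.

    `window_scores` has shape (n_windows, n_classes); each row holds the
    network output for one window.
    """
    per_window = np.argmax(window_scores, axis=1)
    return Counter(per_window.tolist()).most_common(1)[0][0]


scores = np.array([
    [0.1, 0.7, 0.2],   # window 1 -> class 1
    [0.2, 0.5, 0.3],   # window 2 -> class 1
    [0.6, 0.3, 0.1],   # window 3 -> class 0
])
print(vote(scores))   # 1
```

In case of a tie, `Counter.most_common` returns the class that was encountered first, which is good enough for a demo.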
To build the database, we write a script for recording a voice excerpt (it works the same way as described above, except that when interrupted from the keyboard it saves the voice to a file).
Let's try:
python3 voice_db/record_voice.py test.wav
We record the voices of several people (in our case, three team members)
Next, for each recorded voice we perform a fast Fourier transform, obtain a spectrogram and save it as a numpy array (.npy):
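import glob
import numpy as np
for file in glob.glob("voice_db/*.wav"):
    spec = get_fft_spectrum(file)  # spectrogram helper defined elsewhere in the project
    np.save(file[:-4] + '.npy', spec)  # voice_db/name.wav -> voice_db/name.npy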
See the file create_base.py for more details.
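To give an idea of what such a spectrogram looks like, here is a minimal NumPy sketch of a magnitude spectrogram. It is not the project's `get_fft_spectrum`: the frame length, hop size and Hann window are our assumptions, and the input must be at least one frame long.

```python
import numpy as np


def magnitude_spectrogram(samples, frame_len=512, hop=160):
    """Return |STFT| with shape (frame_len // 2 + 1, n_frames)."""
    samples = np.asarray(samples, dtype=np.float32)
    n_frames = 1 + (len(samples) - frame_len) // hop
    window = np.hanning(frame_len)
    # Cut the signal into overlapping frames and taper each one.
    frames = np.stack([samples[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One FFT per frame; transpose so that rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


rate = 16000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone, 1 s long
spec = magnitude_spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=1)))
print(spec.shape, peak_bin * rate / 512)     # the peak lands near 1000 Hz
```

The test tone is only a sanity check: the strongest frequency bin should correspond to the tone's frequency.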
As a result, when we run the main script, we get embeddings from these spectrograms at the very start:
for file in glob.glob("voice_db/*.npy"):
    spec = np.load(file)
    spec = spec.astype('float32')
    spec_reshaped = spec.reshape(1, 1, spec.shape[0], spec.shape[1])
    srNet.setInput(spec_reshaped)
    pred = srNet.forward()
    emb = np.squeeze(pred)
After obtaining the embedding of a spoken segment, we can determine whose voice it is by taking the cosine distance from the segment to every voice in the database (the smaller the distance, the more likely the match); for the demo we set the threshold to 0.3: