### S3 Note: Journal of PhD Life
~~day.month.year~~
year-month-day (ISO 8601)
Content:
- What is the question (Q), and what is the answer (A)
- what is your idea, how to implement
- what is the problem, what is your proposed solution
- what is X, what is Y
- what have you done today, what is your plan for tomorrow
2020-07-15:
- Extracting information from data
- Extracting Knowledge from Information
2020-07-14:
- Argument for mean and std features:
- While one may argue that a small number of HSF features may not be sufficient,
we show the opposite: using the mean and std of the LLDs works better than
using the large-size LLDs directly. Instead of increasing the feature size,
adding more training data (utterances) might improve the performance,
since it helps the model generalize.
About MFCC
(source:https://www.researchgate.net/post/Why_we_take_only_12-13_MFCC_coefficients_in_feature_extraction)
An intuition about the cepstral features can help to figure out what we should
look for when we use them in a speech-based system.
- As cepstral features are computed by taking the Fourier transform of the
warped logarithmic spectrum, they contain information about the rate changes
in the different spectrum bands. Cepstral features are favorable due to their
ability to separate the impact of source and filter in a speech signal. In
other words, in the cepstral domain, the influence of the vocal cords
(source) and the vocal tract (filter) in a signal can be separated since the
low-frequency excitation and the formant filtering of the vocal tract are
located in different regions in the cepstral domain.
- If a cepstral coefficient has a positive value, it represents a sonorant
sound since the majority of the spectral energy in sonorant sounds is
concentrated in the low-frequency regions.
- On the other hand, if a cepstral coefficient has a negative value, it
represents a fricative sound since most of the spectral energies in fricative
sounds are concentrated at high frequencies.
- The lower order coefficients contain most of the information about the
overall spectral shape of the source-filter transfer function.
- The zero-order coefficient indicates the average power of the input signal.
- The first-order coefficient represents the distribution of spectral energy
between low and high frequencies.
- Even though higher order coefficients represent increasing levels of spectral
details, depending on the sampling rate and estimation method, 12 to 20
cepstral coefficients are typically optimal for speech analysis. Selecting a
large number of cepstral coefficients results in more complexity in the
models. For example, if we intend to model a speech signal by a Gaussian
mixture model (GMM), if a large number of cepstral coefficients is used, we
typically need more data in order to accurately estimate the parameters of
the GMM.
2020-07-13:
Advantages of HSF (mean+std):
- Smaller size (46, 68, etc.)
- Lower computational load (hence faster computation time)
- The feature size is the same for all utterances (it does not depend on utterance length,
so zero-padding is not needed)
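A minimal sketch of the mean+std idea, assuming the LLDs of one utterance are stored as a `(frames, n_lld)` NumPy array. The function name `hsf_mean_std`, the random data and the choice of 23 LLDs per frame (which gives the size 46) are illustrative assumptions, not the actual feature set:

```python
import numpy as np

def hsf_mean_std(lld):
    """Collapse a (frames, n_lld) LLD matrix into one (2 * n_lld,) vector."""
    lld = np.asarray(lld, dtype=float)
    # Statistics are taken over the time (frame) axis.
    return np.concatenate([lld.mean(axis=0), lld.std(axis=0)])

# Two hypothetical utterances of different length, 23 LLDs per frame (assumed).
rng = np.random.default_rng(0)
short_utt = rng.normal(size=(120, 23))
long_utt = rng.normal(size=(900, 23))

print(hsf_mean_std(short_utt).shape)
print(hsf_mean_std(long_utt).shape)
```

Both utterances end up with the same feature size regardless of their length, so no zero-padding is involved.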
Start writing dissertation:
- Start from ch4: SER using acoustic features
- which window is used, and why?
- why are windowing and overlapping needed?
- Make a figure for frame-based processing
- Why choose a specific window size (25 ms) --> short context -->
add a reference.
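To make the frame-based processing concrete, here is a rough sketch of splitting a signal into overlapping 25 ms frames. The 10 ms hop, the Hamming window, the 16 kHz sampling rate and the function name `frame_signal` are assumptions for illustration only:

```python
import numpy as np

def frame_signal(x, sr, win_ms=25.0, hop_ms=10.0):
    """Split x into overlapping frames and apply a Hamming window to each."""
    win = int(round(sr * win_ms / 1000))   # samples per frame
    hop = int(round(sr * hop_ms / 1000))   # samples between frame starts
    n_frames = 1 + (len(x) - win) // hop   # assumes len(x) >= win
    # Index matrix: row i holds the sample indices of frame i.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)

sr = 16000
x = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s test tone
frames = frame_signal(x, sr)
print(frames.shape)
```

With these settings each frame has 400 samples and consecutive frames share 240 samples, so a sample near a frame edge (where the window attenuates it) is near the centre of a neighbouring frame.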
2020-06-12:
- Some successes happen by chance: chance that comes after hard work
2020-06-09:
Some synonyms:
- valence: pleasantness, evaluation, sentiment, polarity
- arousal: activation, intensity
- dominance: power, potency, control
2020-06-06:
- hearing is the process of discovering from sounds what is
present in the world and where it is (David Marr, 1982, in HMH).
- whither speech: hearing is the process of discovering from sounds what is
present in the world and where it is
2020-05-14:
- Information science is defined in the Online Dictionary for Library and
Information Science (ODLIS) (Reitz, s.a.) as “The systematic study and
analysis of the sources, development, collection, organization, dissemination,
evaluation, use, and management of information in all its forms, including the
channels (formal and informal) and technology used in its communication.”
- voice: is the sound produced by humans and other vertebrates using the lungs
and the vocal folds in the larynx, or voice box.
- speech: is voice in specific and decodable sounds produced by precisely
coordinated muscle actions in the head, neck, chest, and abdomen.
- Language is the expression of human communication through which knowledge,
belief, and behaviour can be experienced, explained, and shared.
2020-05-02:
- Back to markdown note in github
- Use VScode since 2020
2019-12-18:
This is the first journal that is purely in text format.
Some conventions:
- on the top is YYYY-MM-DD
- Use vim or gedit, either one; no other editor except on a public PC.
- max length: 80 lines
- file name YYYY-MM-DD
- save on journal dir: ~/Dropbox/journal
- turn on spell check,
- make the last line blank
2019-12-20
Fri Dec 20 14:38:17 JST 2019
- relearn VIM
- relearn numpy and matplotlib
- basic numpy: array, arange, linspace, zeros, ones, full, empty, identity
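A quick refresher on the NumPy constructors listed above; the shapes and values are arbitrary examples:

```python
import numpy as np

a = np.array([1, 2, 3])          # from a Python list
b = np.arange(0, 10, 2)          # start, stop (exclusive), step
c = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points, endpoints included
d = np.zeros((2, 3))             # all zeros
e = np.ones((2, 3))              # all ones
f = np.full((2, 2), 7)           # constant fill value
g = np.empty((2, 2))             # uninitialized memory, contents are arbitrary
h = np.identity(3)               # 3x3 identity matrix

for name, arr in zip("abcdefgh", (a, b, c, d, e, f, g, h)):
    print(name, arr.shape, arr.dtype)
```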
2019-12-21
- There is a difference between conveying logic and conveying emotion.
- Grice argued that logic in conversation is based only on what is said,
not on other factors (intention, etc.). For example, if someone says
"I am angry", he is transmitting (or expressing) anger even though
he looks happy.
- Emotion is different. A study by Busso et al. (IEMOCAP, 2008) reveals
that there is a difference between expressed emotion and perceived emotion.
This evidence was found when building an emotional database and asking
both the actors and the raters about the emotion.
- Progress on research:
- change data splitting for CSL paper:
- DNN training 6000:2000:2039 (train:dev:test)
- the DNN output/prediction is taken only for the dev set
- Those dev-set predictions are the input for the SVM
- The SVM predicts the remaining 2039 samples
- To-do: Split data into LOSO (Leave One Session Out)
- Never use dashes for (python) file naming (only for date?). Underscore is OK.
- The current result using an SVM (RBF kernel) is promising.
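The 6000:2000:2039 split noted in this entry can be sketched with plain index arrays. A sequential, unshuffled split is assumed here; a shuffled or session-based (LOSO) split would look different:

```python
import numpy as np

n_train, n_dev, n_test = 6000, 2000, 2039
n_total = n_train + n_dev + n_test  # 10039 utterances

# Assumption: utterances are taken in their stored order.
indices = np.arange(n_total)
train_idx, dev_idx, test_idx = np.split(indices, [n_train, n_train + n_dev])

# Stage 1: the DNN is trained on train_idx.
# Stage 2: the SVM is trained on the DNN predictions for dev_idx
#          and evaluated on test_idx.
print(len(train_idx), len(dev_idx), len(test_idx))
```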
2019-12-23:
- contribution of CSL paper:
- A multi-stage approach to exploit the advantage of each emotion
recognition modality, i.e., acoustic and text, by combining
each result from DNN to SVM.
- Proposed CSL title:
"Dimensional speech emotion recognition based on late fusion of
acoustic and text information: A two-stage process by using DNN and SVM"
- idea: "Recognizing Emotion and Naturalness in Speech using
DNN with Multitask Learning"
2019-12-24:
- The current architecture (data processing) of the two-stage SER makes it impossible
to perform multi-stage processing. To-do (next): adapt the AVEC approach so that it
can be processed with a multi-stage technique.
- Got initial result on two-stage SER:
1. Generally good performance
2. TER on MSP-I+N shows a low result. This can be caused by:
a. The transcription is made by ASR (and suffers from errors)
b. MSP is designed by controlling lexical content. Although improvisation
and natural interaction recordings are chosen, there is a possibility
that the actors are still influenced by the target sentences.
2019-12-25:
- Finished the initial experiment on IEMOCAP and MSP-I+N for the CSL journal
- Results look good!
- CSL uses the Harvard reference/citation style!
- Lessons learned from Who Wants to Be a Millionaire:
- Ask the Audience: the audience takes voting pads attached to their seats
and votes for the answer that they believe is correct. The computer tallies
the results and displays them as percentages to the contestant.
- 50:50: the game's computer eliminates two wrong answers from the current
question, leaving behind the correct answer and one incorrect answer.
From 2000, the selection of the two incorrect answers was random.
- Phone a Friend: the contestant is connected with a friend over a phone line
and is given 30 seconds to read the question and answers and
solicit assistance. The time begins as soon as the contestant starts reading the question.
2019-12-26:
- I have two sets A = [a, b, c] and B = ['1', '2', '3']. I want to find how many
pairs I can choose with one element from A and one from B, i.e., 9 (a1, a2, a3, b1, ..., c3).
I got the number 9 by multiplying 3 by 3.
So if I have three sets (say C = [x, y, z]), I will have 27 triples (3 x 3 x 3).
- Answer: 3C1 x 3C1 = 9
- add bahasa to this https://github.com/cptangry/wahy
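The pair-counting question in this entry can be checked by enumerating the Cartesian product directly. This is a small sketch using `itertools.product` from the standard library; the list names follow the question:

```python
from itertools import product

A = ["a", "b", "c"]
B = ["1", "2", "3"]
C = ["x", "y", "z"]

# One element from A and one from B: 3 * 3 combinations.
pairs = ["".join(p) for p in product(A, B)]
# One element from each of A, B and C: 3 * 3 * 3 combinations.
triples = list(product(A, B, C))

print(len(pairs), pairs)
print(len(triples))
```

Choosing one item from each set is a product of "3 choose 1" terms, which is why the counts are 9 and 27.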
---
2019-12-18:
- This may be the last note in this file
- keeping a journal as a markdown file is very heavy, especially when editing directly on GitHub
- I moved to a journal in plain text (~/Dropbox/journal/) using a text editor only.
---
2019-11-28:
- Conversation: a balance between speaking and listening.
- Pay attention to the conversation!
- 10 tips on conversation: don't multitask, don't pontificate, use open-ended questions, go with the flow,
if you don't know, say you don't know, don't equate your experience with theirs, try not to repeat yourself, stay out of the weeds (focus), listen, be brief.
2019-11-27:
- Ran speech + text for ASJ spring 2020 based on the ococosda (failed) code
- The non-MTL version runs, but MTL with different weights gives 0 CCC
- Moved to GitHub again after some months of writing logs in a notebook
---
2019-08-28:
- Q: What is the difference between word tokens and word types?
- A: In the sentence "Do in Rome as the Romans do", there are 7 tokens and 6 word types ("Do" and "do" count as one type when case is ignored).
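The token/type count for the example sentence can be verified in a few lines. Whitespace tokenization is assumed, and whether "Do" and "do" are the same type depends on case handling, so both counts are computed:

```python
sentence = "Do in Rome as the Romans do"

tokens = sentence.split()                         # every running word
types_case_sensitive = set(tokens)                # "Do" and "do" differ
types_case_insensitive = {t.lower() for t in tokens}

print(len(tokens))
print(len(types_case_sensitive))
print(len(types_case_insensitive))
```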
---
2019-08-27:
- Working on the APSIPA draft and re-running the experiment, I found that a longer window size (200 ms) yields a better result. Why? I don't know (need to ask Sensei?). But it is interesting to investigate: the impact of window size on feature extraction for speech emotion recognition.
- From the book by Eyben: use a larger window to capture mid- and long-term dynamics, and a short window to capture short-term dynamics.
- TODO: extract features using a 200 ms window on the IEMOCAP mono speech files (currently from stereo)
- Text processing using attention runs very slowly: about 34 million learnable parameters, only 10 iterations in 1 hour. No accuracy improvement so far (compared to LSTM).
---
2019-08-26:
- Minor research report accepted by advisor
- Forgot the lesson learned from the ASJ autumn computation: stack 2 RNNs with return_sequences=True and no dense layer after them; use Flatten instead.
---
2019-08-16:
- writing minor research report 3 (ANEW, SentiWordNet, VADER)
- things to do: implement median and Mika Method (Mining valence, arousal....) for ANEW and SentiWordNet.
---
2019-08-09:
- Linear vs logistic regression: in logistic regression, the outcome (dependent variable) has only a limited number of possible values. Logistic regression is used when the response variable is categorical in nature.
- Accomplishment: affect-based text emotion recognition using ANEW, VADER and SentiWordNet. Current results show that VADER gives the best CCC (for valence). It is interesting that text gives a better score on valence, while speech gives its worst CCC score on valence compared to arousal and dominance.
---
2019-08-06:
- weekly meeting report: as presented (see in progress dir)
- weekly meeting note: what do you want to write for journal?
- dimensional SER (speech + text) using recursive SVR
- SER based on selected region (using attentional?)
- today's accomplishment: MAE and MSE from emobank using affective dictionary (ANEW).
- Q: What's the difference between lemmatization and stemming?
- A: Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language. Example: (connected, connection) --> connect, (am, is, are) --> be.
---
2019-07-12:
- Writing draft for ASJ Autumn 2019
- Dimensional speech emotion recognition works with comparable CCC scores [0.306, 0.4, 0.11].
- If you are asked a simple question, answer with a simple answer.
- Change presentation style for lab meetings: the purpose is to investigate and analyze/discuss results, not
just to show them.
---
2019-07-02:
- working on multi-loss, multi-task learning, and multi modal fusion emotion recognition for AVEC2019.
- Flow: multi-loss+multi-task --> unimodal (early) --> bimodal (early) --> late --> combination of both.
---
2019-07-01:
- dimensional SER on IEMOCAP using the Keras functional API works; it is fairly good on MSE, but not on CCC.
- The current CCC for valence, arousal and dominance on test data is 0.03, 0.1, 0.05
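For reference, a small NumPy sketch of the concordance correlation coefficient (CCC), using its standard definition with population variance. The toy arrays are made up; they only illustrate how a prediction with a moderate MSE can still score a CCC of zero:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

y = np.array([1.0, 2.0, 3.0, 4.0])
shifted = y + 1.0              # follows the trend, constant offset
constant = np.full(4, 2.5)     # always predicts the mean

print(mse(y, shifted), ccc(y, shifted))
print(mse(y, constant), ccc(y, constant))
```

A model that always predicts the mean has a similar MSE to the shifted prediction, but its CCC is zero because it does not follow the labels at all.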
---
2019-06-21:
- philosophy: the ability to interrupt argument.
---
2019-06-19:
- Feature-Level (FL) fusion (also known as “early fusion”) combines features from different modalities before performing recognition.
- Decision-Level (DL) fusion (also known as “late fusion”) combines the predictions and their probabilities given by each unimodal model for the multimodal model to make the final decision.
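The two fusion styles can be contrasted in a few lines of NumPy. All arrays here are made-up placeholders, and averaging the probabilities is only one possible late-fusion rule (weighted sums or a second-stage classifier are alternatives):

```python
import numpy as np

rng = np.random.default_rng(0)
acoustic = rng.normal(size=(4, 46))    # hypothetical acoustic features
text = rng.normal(size=(4, 300))       # hypothetical text features

# Feature-level (early) fusion: concatenate features, then train one model.
early = np.concatenate([acoustic, text], axis=1)
print(early.shape)

# Decision-level (late) fusion: each modality has its own model; combine
# their class probabilities afterwards (here: a simple average).
p_acoustic = np.array([[0.7, 0.3], [0.4, 0.6]])
p_text = np.array([[0.6, 0.4], [0.1, 0.9]])
late = (p_acoustic + p_text) / 2
decision = late.argmax(axis=1)
print(decision)
```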
---
2019-06-18:
- Overall accuracy – each sentence across the dataset has an equal weight, AKA weighted accuracy. In implementation this is the default accuracy in Keras metrics.
- Class accuracy – the accuracy is first evaluated for each emotion and then averaged, AKA unweighted accuracy. In implementation, this class accuracy can be obtained by plotting the normalized confusion matrix and taking the average value along the diagonal.
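A short sketch of both accuracies on a deliberately imbalanced toy example; the labels are invented for illustration and the function names are not from any library:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: every sample counts equally."""
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """Class accuracy: per-class recall, averaged over classes."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# 8 samples of class 0, 2 samples of class 1; one class-1 sample is missed.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

print(weighted_accuracy(y_true, y_pred))
print(unweighted_accuracy(y_true, y_pred))
```

The per-class recalls are the diagonal of the row-normalized confusion matrix, so averaging them matches the description above.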
---
2017-10-09
to be answered:
- what is semantic primitive?
- what is prosodic feature?
- what is lexicon?
- spectral feature: features based on/extracted from spectrum
- normalization: normalize the waveform (divided by biggest amplitude)
- what is para and non-linguistic
- SVM classifier (vs Fuzzy??)
- idea: use DNN and DNN+Fuzzy for classification
- resume: all method need to be confirmed with other datasets
- Entering JAIST as research student.
---
2017-10-10
to study:
- statistical significance test
- idea: record emotional utterances freely from various speakers, find the similar words
- reverse the idea above: given an utterance, speak it with different emotions
---
todo:
- Blog about emotion recognition (indonesia:pengenalan emosi) by reading related reference.
- Investigate tdnn in iban
---
2017-10-11
Semantik (semantics)
se.man.tik /sèmantik/
n Ling the study of the meaning of words and sentences; knowledge of the intricacies and shifts of word meaning
n Ling the part of language structure that deals with the meaning of expressions or the meaning structure of an utterance
From wikipedia:
Semantic primes or semantic primitives are semantic concepts that are innately understood, but cannot be expressed in simpler terms. They represent words or phrases that are learned through practice, but cannot be defined concretely. For example, although the meaning of "touching" is readily understood, a dictionary might define "touch" as "to make contact" and "contact" as "touching", providing no information if neither of these words are understood.
alternative research theme:
- **Multi-language emotion recognition based on acoustic and non-acoustic feature**
- A study to construct affective speech translation
Fix: **Speech emotion recognition from acoustic and contextual feature**
to study: correlation study of emotion dimension from acoustic and text feature
---
2017-11-07
- It is almost impossible to develop speech recognition using matlab/gnu octave due to data size and computational load
- Alternatives: KALDI and tensorflow, study and blog about it Gus!
---
2017-11-10
- prosody (suprasegmental phonology): the patterns of stress and intonation in a language.
- suprasegmental: denoting a feature of an utterance other than the consonantal and vocalic components, for example (in English) stress and intonation.
- Segment: is "any discrete unit that can be identified, either physically or auditorily".
- low-rank matrix (e.g. rank-1: only one row independent): approximation is a minimization problem, in which the cost function measures the fit between a given matrix (the data) and an approximating matrix (the optimization variable), subject to a constraint that the approximating matrix has reduced rank.--> represent music
- sparse matrix or sparse array is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. --> represent what? speech? Yes, speech. (2019/07/30).
---
2017-11-24
Pre-processing >> remove low part energy
---
2017-12-04
text processing:
- input: sentence (from deep learning)
- output: total VAD in sentence from each word
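A rough sketch of the word-to-sentence VAD step. The tiny lexicon and its scores are invented placeholders (not real ANEW values), and averaging over the matched words is an assumed aggregation; a sum or median would be equally easy to swap in:

```python
# Hypothetical affective lexicon: word -> (valence, arousal, dominance).
lexicon = {
    "happy": (8.0, 6.0, 7.0),
    "day": (6.0, 4.0, 5.0),
}

def sentence_vad(sentence, lexicon):
    """Average the VAD scores of all words found in the lexicon."""
    scores = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    if not scores:
        return None  # no known affective word in the sentence
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))

print(sentence_vad("A happy day", lexicon))
print(sentence_vad("Nothing matches here", lexicon))
```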
---
2018-04-08
- Idea for thesis book:
1. Introduction
2. Speech emotion recognition: Dimensional Vs Categorical Approach
3. Deep learning based Speech emotion Recognition
4. Emotion recognition from Text
5. Combining acoustic and text feature
6. Conclusion and future works
- Starting PhD at JAIST, bismillah.
---
2018-04-26
Philosophy of Doctoral study: Acoustic and Text feature for SER
1. Humans recognize emotion not only from acoustic cues, but also from words
2. Text feature can be extracted from speech by using Speech Recognition/STT
3. Having more information tends to improve SER performance
---
2018-09-13
Research idea to be conducted:
- Do semantics contribute to perceived emotion recognition?
- A listening test to test the hypothesis
Listening test:
- Speech only --> emotion recognition
- Speech + transcription --> emotion recognition
---
2018-09-20
Mid-term presentation:
1. In what direction this study will proceed in the future,
2. How important this study is in this direction, and
3. How much contribution can be expected
---
2018-10-11
Course to be taken in term 2-1:
1. Data Analytics
2. Analysis of information science
---
2018-11-29
Zemi:
- Speaker dependent vs speaker independent
- Speaker dependent: the same speakers are used for training and dev
- Speaker independent: different speakers are used for training and dev
---
2018-12-12
a cepstral gain c0 is the logarithm of the modeling filter gain
logging kaldi output:
~/kaldi/egs/iban/s5 $ ./local/nnet3/run_tdnn.sh 2>&1 | tee run-tdnn.log
some solution of kaldi errors:
Error:
Iteration stops on run_tdnn.sh no memory
Solution:
You shouldn't really be running multiple jobs on a single GPU.
If you want to run that script on a machine that has just one GPU, one
way to do it is to set exclusive mode via
`sudo nvidia-smi -c 3`
and to the train.py script, change the option "--use-gpu=yes" to
"--use-gpu=wait"
which will cause it to run the GPU jobs sequentially, as each waits
till it can get exclusive use of the GPU.
Error:
"Refusing to split data for number of speakers"
Solution:
You didn't provide enough info, but in general, you cannot split the directory in more parts than the number of speakers is.
So if you called the decoding with -nj 30 and you have 25 speakers (you can count lines of the spk2utt file) this is the error you receive.
Show how many features extracted using mfcc:
~/kaldi-trunk/egs/start/s5/mfcc$ ../src/featbin/feat-to-dim ark:/home/k/kaldi-trunk/egs/start/s5/mfcc/raw_mfcc_train.1.ark ark,t:-
GMM (Gaussian mixture model): a mixture of several Gaussian distributions.
---
2018-12-14
- Speech is not only HOW it is being said but also what is being said.
- low-level feature (descriptor): extracted per frame.
- High level feature: extracted per utterance.
---
2018-12-17
- warning from python2:
/home/bagustris/.local/lib/python2.7/site-packages/scipy/signal/_arraytools.py:45: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
b = a[a_slice]
---
2018-12-18
- Idea: concurrent speech and emotion recognition
- Desc: Currently, speech recognition and emotion recognition are two separate research areas. Researchers build and improve the performance of speech recognition and emotion recognition independently, such as the works done by (\cite{}, \cite{}, and \cite{}). The idea is simple: both emotion and text (the output of speech recognition) can be extracted from speech by using the same features. Given two labels, transcription and emotion, two tasks can be done simultaneously: speech recognition and emotion recognition, by training acoustic features to map to both text and emotion labels.
Idea for speech emotion recognition from acoustic and text features:
1. train speech corpus with given transcription --> output: predicted VAD (3 values)
2. obtain VAD score from speech transcription --> output: predicted VAD (3 values)
3. Feed all 6 variables into DNN with actual VAD value
---
2018-12-20
- mora (モーラ): Unit in phonology that determine syllable weight
- Example: 日本、にほん、3 mora, but, にっぽん is 4 mora
- Morpheme: the smallest meaningful unit that a word can be divided into (this is in linguistics; in acoustics the smallest unit is the phoneme).
- Example: like --> 1 morpheme, but unlikely has 3 morphemes (un, like, ly)
- Find the difference between dynamic features and static features and their
relation to human perception.
- How about statistic feature?
- notch noise = v-shaped noise...?
---
2018-12-27
- Loss function = objective functions
- How to define custom loss function?
- Here in Keras, https://github.com/keras-team/keras/issues/369
- But I think loss="mse" is OK
- note: in avec baseline, there is already ccc_loss
- Dense and dropout layer:
The dense layer is a fully connected layer: all neurons in the layer are connected to those in the next layer. A dropout layer randomly sets the activations of some nodes to zero during training, which effectively drops their connections and helps prevent overfitting.
povey window: povey is a window I made to be similar to Hamming but to go to zero at the edges, it's pow((0.5 - 0.5*cos(n/N*2*pi)), 0.85).
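The formula quoted above can be evaluated directly and compared with a Hamming window. This sketch follows the expression exactly as written in the note (`n/N`); the exact normalisation used inside any particular toolkit is not checked here, and the window length of 400 samples is an arbitrary choice:

```python
import numpy as np

def povey_window(N):
    """pow(0.5 - 0.5*cos(n/N*2*pi), 0.85) for n = 0 .. N-1, as quoted."""
    n = np.arange(N)
    return (0.5 - 0.5 * np.cos(2 * np.pi * n / N)) ** 0.85

N = 400  # e.g. 25 ms at 16 kHz (assumed)
povey = povey_window(N)
hamming = np.hamming(N)

print(povey[0], hamming[0])   # edge values
print(povey[N // 2])          # centre value
```

The Povey window reaches zero at the edge, whereas the Hamming window stays at about 0.08 there, which is the difference the quote describes.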
---
2019-02-08
- Likelihood vs probability:
- Likelihood is the probability that an event that has already occurred would yield a specific outcome. Probability refers to the occurrence of future events, while likelihood refers to past events with known outcomes. Probability is used when describing a function of the outcome given a fixed parameter value; likelihood is used when describing a function of the parameter given a fixed outcome.
---
2019-02-15
- Idea: Provided dataset with speech and linguistic information,
- How do humans perceive emotion from emotional speech with and without linguistic information?
---
2019-02-17
- Idea for ICSigSys 2019: Voiced Spectrogram and CNN for SER
- idea: remove silence from speech.
- Finding: Many pieces of data contain only noise or silence, but are labeled as neutral or another emotion.
- Next idea: add a silence category, as it is an important cue for speech emotion recognition (??)
---
2019-03-06
- Idea for ASJ autumn 2019: Emotional speech recognition
- dataset: IEMOCAP
- tools: DeepSpeech
---
2019-04-04
- How to map emotion dimension to emotion category?
- One solution is by inputting emotion dimension to machine learning tool, such as GMM.
- Reda et al. tried this method and obtained a very large improvement, from 54% to 94% accuracy.
- Next, try deep learning methods.
- Also, try to learn confusion matrix.
---
2019-04-08
- The research paper below shows the evidence that music didn't improve creativity.
https://onlinelibrary.wiley.com/doi/epdf/10.1002/acp.3532
- What if we change the experiment set-up: listen to music first for 5-10 minutes, stop, then give the question.
- Intuition: While music did not contribute to improving creativity, it may contribute to mood and emotion. After becoming calm through listening, creativity may improve.
---
2019-04-09
Today,
- I implemented F0-based voiced segmentation for feature extraction using the YAAPT method with the `amfm_decompy` package (now running on my PC).
- Learned how to convert data from a tuple to a 1D array (using NumPy's `flatten()` array method), wrote a blog post about it.
- Obtained the signature from Akagi-sensei for the MSP-Improv database, and forwarded it to UT Dallas.
- Plan for tomorrow: run BLSTM on the features obtained today --> write result on WASPAA.
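A small illustration of the tuple-to-1D-array conversion mentioned in this entry. The nested tuple is a made-up example; note that `flatten()` is a method of NumPy arrays, so the tuple is converted with `np.asarray` first:

```python
import numpy as np

nested = ((1, 2, 3), (4, 5, 6))   # hypothetical tuple of equal-length tuples

arr = np.asarray(nested)          # 2D array of shape (2, 3)
flat = arr.flatten()              # 1D copy, row by row

print(arr.shape, flat.shape)
print(flat)
```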
---
2019-04-10
- Attended workshop: deeplearning for supercomputer cray XC40 (xc40 /work/$USER/handson/)
- Ran the obtained features (from F0) on the current BLSTM+attention model system, got a lower result. It may need to be processed per segment, not as a whole. Train on each voiced-segment feature, then use something like majority voting to decide.
- Prepare presentation for Lab meeting on Friday.
- Replace owncloud with nextcloud, now up to 300 GB.
---
2019-04-11
- made slide for tomorrow lab meeting presentation.
- ran the obtained features on the BLSTM+attention model; the highest accuracy was 52%, still lower than before.
- change window size from 20 ms to 0.1 s, 0.04, 0.08, etc. Find the best result.
- Email Prof. Busso, asking for the speech transcription.
---
2019-04-12
Today's lab meeting:
- Compared voiced and voiced+unvoiced part --> done?
- You study at the school of information science? What is science in your PhD?
Humans perceive emotion from speech. Speech contains several kinds of information, mainly vocal tone information and lexical/linguistic information. Humans can perceive emotion from speech alone. In some cases this is difficult, such as in a noisy environment. Given additional information, i.e. lexical information, it becomes easier for humans to recognize the emotion of the speaker. Can a computer do that?
- Information science is a field primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.
- Recognition of human emotion by computer is one area of information science, right Sensei?
- Text feature is feature from text data.
---
2019-04-13
- In linguistics, prosody is concerned with those elements of speech that are not individual phonetic segments but are properties of syllables and larger units of speech. These are linguistic functions such as intonation, tone, stress, and rhythm.
- Extract F0 from IEMOCAP, padded with other 34 features, run it on PC, still got lower result.
---
2019-04-15
- A voiced sound is a category of consonant sounds made while the vocal cords vibrate. All vowels in English are voiced; to feel this voicing, touch your throat and say AAAAH. ... That is voicing. Consonants can be either [voice/unvoice](/fig/460.png)
- Perform start-end silence removal on 5331 IEMOCAP utterances
---
2019-04-16
- Running experiment using features from trimmed voice, still got lower performance, 47%
- Extract eGeMAPS feature set from IEMOCAP data, expecting improvement on SER system as eGeMAPS is tailored for speech emotion recognition
- Running the extracted eGeMAPS features on the system, 447,672,324 parameters, it exceeds the GPU capability
- Next: extract eGeMAPS features from trimmed speech: 10, 15, 20 dB
- GPU error (out of mems): ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[872704,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
---
2019-04-17
- Computation crashed on yesterday's run using eGeMAPS features; need to reduce the size of the features
- ran trimmed data with start-end silence removal, got lower accuracy (???)
- Pickle all data in the IEMOCAP dataset except for speech
- New dataset: [meld](https://github.com/SenticNet/MELD) ??
- __CONCEPT__: (acoustic) features are extracted from speech, i.e. from the wav file when offline; it
does not make sense to extract features from .npy or .pickle, that is just a simplification. But if we can avoid it (converting wav to pickle/npy to save features), do so. Pickle and npy files still take up a lot of memory (MB/GB).
---
2019-04-18
- Got an accuracy improvement over the baseline IEMOCAP with 5531 utterances without start-end trim by adding more features (40 and 44), i.e. pitch (1) and formants (5). Reduced the number of neurons in the BLSTMwA (Bidirectional LSTM with Attention) system.
- Doing start-end silence removal with `[10, 20, 30, 40, 50]` dB. ~~For 10 dB, need to change window size (due to shorten length of signal), compensate it with extending max length of feature sequence to 150 (original: 100).~~
- Found that running on the GPU for this sequence data is **SLOWER** than on the CPU.
- Added dropout of 0.2 and 0.5 to the system, got higher accuracy. One simple way to detect overfitting is by comparing val_loss with loss: if val_loss is much higher, the model is overfitting (they should be close to each other). The usual cause is that the number of trainable parameters greatly exceeds the number of samples.
- Found a paper about "tensor fusion", a method to unite multimodal data. Read it!
---
2019-04-19
- Found the better number of features: 40 (+1 F0, +5 Formants)
- With dropout of 0.5, feature with 50 dB start-end silence removal perform better (55%)
---
2019-04-22
- Start-end silence removal doesn't give a significant improvement in SER accuracy; moving on to adding more features.
- replace LSTM with CuDNNLSTM to take advantage of the GPU
- Use early stopping before model.fit to shorten computation
- Now evaluating on 39, 40 and 44 features
- __**concept**__: Overfitting occurs when the number of trainable parameters is much larger than the number of samples; it is indicated by a validation loss much higher than the training loss.
- When to stop iterating (epochs)? When the validation loss no longer decreases.
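The stopping rule in the last bullet can be written as a small framework-independent function. This is only a sketch of the logic with an assumed `patience` of 3 epochs and invented loss values; in Keras the same behaviour is normally obtained with the `EarlyStopping` callback (its `monitor` and `patience` arguments) passed to `model.fit`:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # validation loss improved: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs

# Hypothetical validation losses: improvement stops after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.63, 0.65, 0.7]
print(early_stopping_epoch(val_losses))
```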
---
2019-04-23
- Need to model new feature that capture dynamics of speech sound if we want to improve SER performance
- Features to be tried: covarep, speechpy, with delta and delta-delta.
---
2019-04-26
- Building a model for dimensional text emotion recognition. Currently using only one output (valence), and the performance is still low. In terms of MSE (mean squared error), the lowest value was 0.522
---
2019-05-01
- Multiple-output VAD prediction works on IEMOCAP text. Changed the metric to MAPE (mean absolute percentage error); the lowest score is about 19%.
- Current results give arbitrary float values for the VAD dimensions; **need** to round them to .0 or .5 steps. <-- no need
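
For reference, the regression metrics used in these entries (MSE, MAE, MAPE) can be written out in NumPy. This is a minimal sketch; the function name is hypothetical, and the small epsilon guarding the MAPE division is an assumption to avoid dividing by zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred, eps=1e-7):
    """Return MSE, MAE and MAPE (in percent) averaged over all elements."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    # eps avoids division by zero when a label is exactly 0
    mape = float(100.0 * np.mean(np.abs(err) / np.maximum(np.abs(y_true), eps)))
    return {"mse": mse, "mae": mae, "mape": mape}

# Usage: one utterance with (valence, arousal, dominance) labels
print(regression_metrics([[2.0, 4.0, 5.0]], [[2.5, 3.0, 5.0]]))
```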
---
2019-05-09
Today's meeting with Shirai-sensei:
- Use input from affective dictionary for LSTM
- Concatenate output from sentiment with current word vector
- Try different affective dictionaries
---
2019-05-16
- Regression must use the `linear` activation function (on the output layer)
- Dimensional SER works with all 10039 utterances; current best MAPE: 21.86%
- Prepare (presentation training) for lab meeting tomorrow
- The output of an LSTM is:
  - (Batch size, units) - with `return_sequences=False`
  - (Batch size, time steps, units) - with `return_sequences=True`
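
The two output shapes can be illustrated without Keras. The sketch below uses a plain tanh recurrent cell in NumPy instead of a real LSTM (an assumption made to keep it short); only the shape behaviour of `return_sequences` is the point, and the function name and random weights are hypothetical.

```python
import numpy as np

def simple_rnn(x, units, return_sequences=False, seed=0):
    """Run a tanh recurrent cell over x of shape (batch, time_steps, features)."""
    rng = np.random.default_rng(seed)
    batch, time_steps, n_feat = x.shape
    W = rng.normal(scale=0.1, size=(n_feat, units))  # input weights
    U = rng.normal(scale=0.1, size=(units, units))   # recurrent weights
    b = np.zeros(units)
    h = np.zeros((batch, units))
    outputs = []
    for t in range(time_steps):
        h = np.tanh(x[:, t, :] @ W + h @ U + b)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # (batch, time_steps, units)
    return h                              # (batch, units): last step only

x = np.ones((4, 100, 40))  # 4 utterances, 100 frames, 40 features
print(simple_rnn(x, 16).shape)                         # (4, 16)
print(simple_rnn(x, 16, return_sequences=True).shape)  # (4, 100, 16)
```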
---
2019-05-17
- In math, the logit function is simply the logarithm of the odds: logit(x) = log(x / (1 - x)).
- In TensorFlow, `logits` is the name for the tensor that the softmax maps to probabilities (i.e. the input to softmax).
- End-to-end loss: minimize D1 (intra-personal) and maximize D2 (inter-personal), where D1 and D2 are distances between (audio) embeddings (in speaker verification; needs to be confirmed)
- Most MFCC extraction uses a 30 ms window; the resulting spectral shape will be the same for smaller windows. This may be why removing silence gives better performance.
- To capture the dynamics of emotion, delta and delta-delta features may work better.
- Why would removing silence improve SER performance? Intuition: silence is really low-level noise, which may come from hardware, electrical, or ambient sources. If it is included in speech emotion processing, the extracted features may be irrelevant because they come from that noise, not from speech. With this part removed, the features come only from speech, which is why the result is better.
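
The logit and softmax notes in this entry can be checked numerically. A minimal NumPy sketch (function names are my own): `logit` is the inverse of the sigmoid, and a two-class softmax over `[z, 0]` gives the same probability as `sigmoid(z)`.

```python
import numpy as np

def logit(p):
    """Log-odds: log(p / (1 - p)) for 0 < p < 1."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def softmax(logits):
    """Map a vector of logits to probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(logit(0.5))                 # log-odds of an even chance
print(sigmoid(logit(0.8)))        # recovers the probability
print(softmax([2.0, 1.0, 0.1]))   # probabilities over three classes
```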
---
2019-05-19
- **GRU** performs better and runs faster than LSTM.
- Hence, CNN vs RNN --> RNN, LSTM vs GRU --> GRU. Global attention vs local attention --> ...?
- idea: Obtain local attention from the waveform directly, and extract features only from the two or more regions with the highest attention.
- What is the difference between written text and spoken language (speech transcription)...?
- **Modern SNS and chat text, such as Twitter and Facebook statuses, is closer to spoken language (as in the concept of a "tweet") than to written text, so analyzing speech transcriptions rather than (formal) written text should be more useful for analyzing affect in that context.**
---
2019-05-25
- Evaluated word embedding methods on IEMOCAP text emotion recognition (word2vec, GloVe, fastText); so far GloVe gives the best result.
- In phonetics, rhythm is the sense of movement in speech, marked by the stress, timing, and quantity of syllables.
---
2019-06-03
- Research progress presented (text emotion recognition, categorical & dimensional, written & spoken text)
- Text emotion recognition works well on dimensional emotion; it is interpretable and easier to understand. Continue working on it.
- Combine acoustic and text feature for dimensional emotion recognition
---
2019-06-04
- re-run experiment on voice speech emotion recognition (ICSigsys 2019) for 0.1 threshold (using updated audiosegment)
- idea: study how the human brain processes multimodal signals, and implement it computationally
---
2019-06-05
RNN best practice:
- Most important parameters: units and number of layers
- Units (size): depends on the data:
  - text data < 1 MB --> < 300 units
  - text data 2 - 6 MB --> 300-600 units
  - text data > 7 MB --> > 700 units
- Layers: 2 or 3 (source: Karpathy)
- Monitoring loss:
  - Overfitting if: training loss << validation loss
  - Underfitting if: training loss >> validation loss
  - Just right if: training loss ~ validation loss
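
The loss-monitoring rule of thumb above can be written as a small helper. This sketch encodes the rule exactly as listed in these notes; the function name and the 10% relative `tolerance` for "close" are assumptions, not established thresholds.

```python
def diagnose_fit(train_loss, val_loss, tolerance=0.1):
    """Classify a (train_loss, val_loss) pair using the rule of thumb above.

    `tolerance` is the relative gap still treated as "close" (hypothetical).
    """
    gap = (val_loss - train_loss) / max(abs(train_loss), 1e-12)
    if gap > tolerance:
        return "overfitting"    # training loss << validation loss
    if gap < -tolerance:
        return "underfitting"   # training loss >> validation loss
    return "just right"         # training loss ~ validation loss

for train, val in [(0.2, 0.6), (0.6, 0.2), (0.5, 0.52)]:
    print(train, val, diagnose_fit(train, val))
```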
Problem with categorical emotion:
- Need balanced data
- Balancing the data removes some utterances, so some of the context between utterances is lost
---
2019-06-04:
- Idea: an auditory-based attention model for fusing acoustic and text features for speech emotion recognition. Attention is the main mechanism by which the human auditory system perceives sound. Through attention, a listener focuses on what they are interested in hearing and collects information from the sound, including emotion. For speech emotion, a listener might focus on both tonal and verbal information. If the tonal information matches the verbal information, then they believe the information obtained is correct.
- To combine the two kinds of information (verbal and tonal), two networks can be trained separately on the same labels. The acoustic network is the main (master/primary) one and the text network is the slave/secondary one. The acoustic system acts as the main system, while the secondary system is a supporting system that gives weights to the primary system. For categorical emotion, if the weight is above the threshold (say 0.5 in the range 0-1), then both systems agree on the same output/category. If not, the output is the main system's output weighted by the secondary system.
- For categorical (which is easier to devise), the output of the system is the main system weighted by the secondary system (multiplication) ---> multiplicative attention?
- Whether additive or multiplicative, besides via attention, it can also be implemented directly when combining the two modalities. Instead of concatenating, we can use add() or multiply(). But how should the input features be shaped/reshaped?
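
On the shaping question: concatenation accepts different feature sizes, while element-wise add/multiply needs both modalities in the same shape. One option (an assumption, not a tested design) is to project each modality to a common dimension first. The NumPy sketch below uses random projections as stand-ins for trainable Dense layers; all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 8
acoustic = rng.normal(size=(batch, 64))  # hypothetical acoustic embedding
text = rng.normal(size=(batch, 32))      # hypothetical text embedding

def project(x, out_dim, rng):
    """Random linear projection standing in for a trainable Dense layer."""
    W = rng.normal(scale=1.0 / np.sqrt(x.shape[1]), size=(x.shape[1], out_dim))
    return x @ W

# Concatenation: feature sizes may differ, output is 64 + 32 wide
concat = np.concatenate([acoustic, text], axis=1)

# Add / multiply: both modalities must first share the same shape
dim = 16
a = project(acoustic, dim, rng)
t = project(text, dim, rng)
added = a + t

# Multiplicative weighting: the text branch gates the acoustic (primary) branch
gate = 1.0 / (1.0 + np.exp(-t))  # squash to (0, 1) so it acts as a weight
multiplied = a * gate

print(concat.shape, added.shape, multiplied.shape)
```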
---
2019-06-08:
- As of 2016, a rough rule of thumb
is that a supervised deep learning algorithm will generally achieve acceptable
performance with around 5,000 labeled examples per category, and will match or
exceed human performance when trained with a dataset containing at least 10
million labeled examples. Working successfully with datasets smaller than this is
an important research area, focusing in particular on how we can take advantage
of large quantities of unlabeled examples, with unsupervised or semi-supervised
learning.
---
2019-06-12:
- Working on dimensional emotion recognition (for cocosda?); combining acoustic and text features shows only a small improvement over acoustic only or text only.
- Current architecture:
  - Acoustic: 2 stacked BLSTM layers
  - Text: 2 stacked LSTM layers
  - Combination: 2 Dense layers
- Current (best result):
  - [mse: 0.4523394735235917, mape: 19.156075267531484, mae: 0.5276844193596124]
- Need a more advanced strategy for combination: hfusion, attention, tensor fusion???
---
2019-06-14
- Current result (train/val loss plot) shows that the system is overfitting regardless of the complexity of the architecture (even with the smallest number of hyperparameters). Needs to be redesigned.
- As found previously, the more data the better. What if the data is limited?
- If we can't increase the number/size of the data, maybe the solution is to increase the number of input features.
- Let's implement it, and see if it works.
- to do: implement CCC (concordance correlation coefficient) on the current system
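
A minimal NumPy sketch of the CCC from the to-do above, assuming the standard definition (twice the covariance divided by the sum of the two variances plus the squared difference of the means). The function name is my own, and the result is undefined (nan) when both inputs are the same constant.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return float(2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2))

labels = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(ccc(labels, labels))        # perfect agreement
print(ccc(labels, labels + 1.0))  # same trend, shifted mean is penalized
```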
---
2019-06-17:
- Interspeech2019 --> rejected
- People usually use a 16-25 ms window size, especially when modeling with recurrent structures.
- to study (a must): WA vs UA, weighted accuracy vs unweighted accuracy; WAR vs UAR (unweighted average recall)
- Accuracies by themselves are a useless measure of goodness unless combined with precision values or averaged over Type I and II errors. What are those?
- Answer: see this: [https://en.wikipedia.org/wiki/Type_I_and_type_II_errors](https://en.wikipedia.org/wiki/Type_I_and_type_II_errors)
- It must never be concluded that, for example, 65.69% is "better" than 65.18%, or even that 68.83% is "better" than 63.86%, without providing a measure of significance for that statement
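
For the WA vs UA item in this entry, a minimal NumPy sketch under the convention common in SER papers (an assumption; some papers use the terms differently): WA is the overall accuracy, and UA/UAR is the mean of the per-class recalls, so every class counts equally regardless of its size. Function names are my own.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: overall accuracy; each class counts in proportion to its size."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_average_recall(y_true, y_pred):
    """UA / UAR: mean of per-class recall; each class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Imbalanced toy case: always predicting the majority class
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)
print(weighted_accuracy(y_true, y_pred))          # looks good
print(unweighted_average_recall(y_true, y_pred))  # exposes the missed class
```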