This repository has been archived by the owner on Jan 18, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproject-final.Rmd
266 lines (202 loc) · 15.7 KB
/
project-final.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
---
title: "project-final"
author: "Gabriele Chignoli"
date: "12/21/2017"
output: html_document
---
## Do formants depend on speakers ?
***
### 1st step
Values of vowel formants can be a great indicator in order to distinguish between different vocalic sounds, but when it comes to voice recognition can formants help us to *recognize* different speakers and if they do, which are the more relevant ? We will analyse the first three formants, their distributions and their variations in order to try to answer this preliminary question. *F~1~* and *f~2~* values mainly corresponding to the average values of *ideal vowels*, in Automatic Speaker Recognition systems the *f~3~* has in particular a significant role to identify, for exemple in comparaison pairs, if it is the same person who's speaking or two different ones.
The other great value we will take in account is the pitch, noted *f~0~*. In human language the pitch can play a heavy role in recogniziting the speaker native language, the mood of a person and in some cases it could be the element which to characterise a certain speaker.
So where do we start ? By loading the data and making some assumption on how we can manipulate them.
```{r loadDataset}
voyelles = read.csv("data.csv", header = TRUE, sep = "\t")
```
### Dataset
Our dataset is made of 27 records by 9 different speakers, 5 women and 4 men, french native speakers. Every participant read 55 words, reapeting the set 3 times. All the values have been extracted automatically with [Praat](http://www.fon.hum.uva.nl/praat/) and saved in the table from the previous chunk, we have 6 columns in total : speaker gender, vowel, pitch value, first formants, second formant and third formant.
```{r vocalicTrianglePlot}
ggplot2::ggplot(voyelles, ggplot2::aes(F2, F1)) +
ggplot2::geom_point(ggplot2::aes(col=VOWEL), size=3) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::scale_x_reverse() + ggplot2::scale_y_reverse() +
ggplot2::labs(y="F1", x="F2", title="Dataset vocalic triangle")
```
Not all french vowels are produced in the dataset but it will not have a great impact on our question, the vowels we consider are /a/ /u/ /i/ and we find in "@" all kind of schwa or *coup de glotte* (this explains the large distribution it has on the triangle).
In many cases /u/ and /i/ are closer to the values of /y/ but the original annotation doesn't contains any /y/.
### How to answer to the question
##### 1. F0 values
As we said in the introduction, the first element which can lead us to recognize if it is one speaker or a multitude of spekears to be analysed in a dataset would be the pitch or *f~0~* measure. We'll present different representations, all based on density. We won't look at the values of this measure, they will not tell us something in particular, but just looking at the shapes of density curves we will understand the great variation of our dataset.
```{r pithPlotGenders}
ggplot2::ggplot(voyelles, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(SPEAKER)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Pitch measurements", fill="SPEAKER")
```
+ The first one is based only only on gender variation, the two curves tell us one simple thing : the speaker gender has a great impact on the pitch measure (not a great discover). Men's pitch is lower than women's one, due to the physiological difference between men and women frequencies, cavities, this tells us that just in thanks to pitch measurement we would, in general, predict if it's a male voice or a female one.
```{r pitchPlotVowels}
ggplot2::ggplot(voyelles, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(VOWEL)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Pitch measurements", fill="VOWEL") + ggplot2::facet_wrap( ~ VOWEL)
```
+ We applied the density measurement taking the vowels as a variation factor. this time our curves differs from one another but we can't really see some important information from them. Considering the global realisation of a vowel knowing that the pitch is strongly influenced by gender lead us to a not really satisfying representation of values.
```{r pitchPlotVowelsGenders}
ggplot2::ggplot(voyelles, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(SPEAKER)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Pitch measurements", fill="VOWEL") + ggplot2::facet_wrap( ~ VOWEL)
```
+ In order to clarify the variation between vowel we add gender as a factor of variation in this new density representation. As we see, this time our curves directly derive from our first pitch curve, but as that one it can help us to describe the variation between genders not specifically between speakers.
##### 2. F3 values
The second indicator we will consider in order to identify different speakers will be the third formant values of that speaker. Here it is a global *f~3~* representation.
At first look we can't tell if a great variation is present in our data, there are in every vowels some areas with a greater concentration of values but we don't have an explicit explaination to what this could be. We can start making two hypothesis :
- all the areas of concentration are common to avery speaker so that for a single speaker we will have a plot pretty much similar to this one. In this case we would assume that the *f~3~* can't be a reliable indicator of variation ;
- the plot we see here is just the addition of all the *f~3~* belonging to the different speakers. This is the hypothesis we'll try to demonstrate in order to confirm the statements made in the introduction.
```{r f3Plot}
ggplot2::ggplot(voyelles, ggplot2::aes(VOWEL, F3)) +
ggplot2::geom_count(col="tomato3", show.legend=F) +
ggplot2::labs(y="F3", title="Dataset F3 values")
```
##### 3. Cavity lengths
Another factor we could use to predict the speaker variation can be the cavity length ? Maybe this measure can be and addiction to other more precise ones :
* because of physiological factors, the human cavity is almost the same for every being, this measure isn't really specific for a person ;
* it isn't always precide, it can vary because we just computed it from formants measurements, using the reverse formant calcutation formula
$$Fn = (2n-1)c/4l$$
where Fn is the sum of all formants or a specific formant l is the length and c the constant of sound speed (35000 m/s) so the length formula will be
$$l = (c/Fn)/4$$
```{r cavityLength, echo=FALSE}
cavityF1 <- c()
cavityF2 <- c()
cavityF3 <- c()
cavities <- c()
for (i in voyelles[["F1"]] ) {
l <- (35000/i)/4
cavityF1 <- append(cavityF1, l)
}
for (i in voyelles[["F2"]] ) {
l <- (35000/i)/4
cavityF2 <- append(cavityF2, l)
}
for (i in voyelles[["F3"]] ) {
l <- (35000/i)/4
cavityF3 <- append(cavityF3, l)
}
for (i in 1:1885) {
lm <- (cavityF1[i] + cavityF2[i] + cavityF3[i])/3
cavities <-(append(cavities, lm))
}
```
```{r cavityLengthPlot}
ggplot2::ggplot(data.frame(cavities), ggplot2::aes(seq_along(cavities),cavities)) +
ggplot2::geom_bar(colour="black", stat="identity") +
ggplot2::labs(y="length", x="count", title="Average cavity length")
```
If this factor was decisive to distinguish the speakers one another we would have seen a change in the measure every about ~165, that's not the case as we can assume from the plot above.
***
#### Single Speaker considerations
In order to realize our main idea to recognize between two or more different speakers we will now compare one pair of speakers for every gender, using the *f~3~* and *f~0~* measures.
```{r subset}
voyellesLocutrice1 <- voyelles[1:168,]
voyellesLocutrice2 <- voyelles[169:337,]
voyellesLocuteur1 <- voyelles[1497:1720,]
voyellesLocuteur2 <- voyelles[1719:1885,]
```
##### 1. Vocalic triangle
First looking at the vocalic triangle of both speakers we can see a slight difference in th vowel production. This first indicator can made a base to create a model for a speaker to be compared with all other speakers triangle.
In this first comparaison pair we can see a part of the elements which can explain the difference between two speakers. The two triangles aren't exactly the same, they have some different distribution for all the three vowels, in the first one we find the *@* annotation, not seen in the second one, this can be another element of reflection : the variety of sound production is a great factor of variation, we know that a speaker has a certain repertoire differently from another person who can have a smaller or bigger one.
```{r vocalicTriangleCompW}
ggplot2::ggplot(voyellesLocutrice1, ggplot2::aes(x=F2, y=F1)) +
ggplot2::geom_point(ggplot2::aes(col=VOWEL), size=3) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::scale_x_reverse() + ggplot2::scale_y_reverse() +
ggplot2::labs(y="F1", x="F2", title="Female speaker 1, vocalic triangle")
ggplot2::ggplot(voyellesLocutrice2, ggplot2::aes(x=F2, y=F1)) +
ggplot2::geom_point(ggplot2::aes(col=VOWEL), size=3) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::scale_x_reverse() + ggplot2::scale_y_reverse() +
ggplot2::labs(y="F1", x="F2", title="Female speaker 2, vocalic triangle")
```
##### 2. Pitch measurement
The first element which better explains the variation and help in the recognition is the pitch curve. We have seen the global pitch value at the beginning, we'll compare here, as for the vocalic triangle, the pitch curves of the same two female speakers.
As we see, the two plots have an evident different, the first concentres its values on 250 Hz while in the second one we have almost all the curve before 200 Hz. This tells us that *f~0~* measure is a more specific feature for speakers and it can really represent a fundamental distinction factor.
```{r pitchCompW}
pitchLocW1 <- ggplot2::ggplot(voyellesLocutrice1, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(SPEAKER)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Female speaker 1, pitch curve", fill="GENDER") +
ggplot2::xlim(100, 300)
pitchLocW2 <- ggplot2::ggplot(voyellesLocutrice2, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(SPEAKER)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Female speaker 2, pitch curve", fill="GENDER") +
ggplot2::xlim(100, 300)
gridExtra::grid.arrange(pitchLocW1, pitchLocW2, ncol=2)
```
##### 3. F3 comparaison
We now present the second element we have identified as a possible factor in speaker recognition, the *f~3~* values specific to the speaker. We stated our main hypothesis on this measure, telling that it could represent a fundamental value to be considered. We hoped to have two different representations in order to confirm our statement.
As we predict, the *f~3~* values from two speakers strongly differ one another.
* /a/ have a lower concentration for *speaker 2* ;
* /i/ is produced mostly around 3500 Hz and higher frequency for *speaker 2* while *speaker 1* have this vowel formant higher frequency around 3500 Hz and more realisations in lower values ;
* /u/ differ the most with a greater concentration of values for *speaker 2*, while in *speaker 1* we just see a distribution on a large scale.
```{r f3CompW}
f3LocW1 <- ggplot2::ggplot(voyellesLocutrice1, ggplot2::aes(VOWEL, F3)) +
ggplot2::geom_count(col="tomato3", show.legend=F) +
ggplot2::labs(y="F3", title="Female speaker 1, F3 values") +
ggplot2::ylim(2250, 3750)
f3LocW2 <- ggplot2::ggplot(voyellesLocutrice2, ggplot2::aes(VOWEL, F3)) +
ggplot2::geom_count(col="tomato3", show.legend=F) +
ggplot2::labs(y="F3", title="Female speaker 2, F3 values") +
ggplot2::ylim(2250, 3750)
gridExtra::grid.arrange(f3LocW1, f3LocW2, ncol=2)
```
##### 1. Vocalic triangle
We will take the same comparaison elements we have seen for female speakers in order to uderstand if the same assumptions can be made for men.
As we start from the vocalic triangles, we already note a more important difference both in scales :
* *f~1~* for the first speaker goes from almost 130 Hz to 1000, while in the second one it goes from 200 to about 800 ;
* *f~2~* has a scale from about 550 to 2500 for *speaker 1* and from ~850 to ~2350 in the second case.
We have here too a repertoire difference with one speaker producing a large quantity of *@*. We can assume the vocalic triangles we have presented have a greater characterisation for men than for women due to the important variations of frequency.
```{r vocalicTriangleCompM}
ggplot2::ggplot(voyellesLocuteur1, ggplot2::aes(x=F2, y=F1)) +
ggplot2::geom_point(ggplot2::aes(col=VOWEL), size=3) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::scale_x_reverse() + ggplot2::scale_y_reverse() +
ggplot2::labs(y="F1", x="F2", title="Male speaker 1, vocalic triangle")
ggplot2::ggplot(voyellesLocuteur2, ggplot2::aes(x=F2, y=F1)) +
ggplot2::geom_point(ggplot2::aes(col=VOWEL), size=3) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::scale_x_reverse() + ggplot2::scale_y_reverse() +
ggplot2::labs(y="F1", x="F2", title="Male speaker 2, vocalic triangle")
```
##### 2. Pitch measurement
In what concerns pitch, we see for men too a great difference, but sligthly less important than the women's one. The two pitch curves have both their values around 100 and 200, but conentrated in different areas. The difference is visible, but it will need a great precision in order to be really useful.
```{r pitchCompM}
pitchLocM1 <- ggplot2::ggplot(voyellesLocuteur1, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(SPEAKER)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Male speaker 1, pitch curve", fill="GENDER") +
ggplot2::xlim(0, 300)
pitchLocM2 <- ggplot2::ggplot(voyellesLocuteur2, ggplot2::aes(F0)) +
ggplot2::geom_density(ggplot2::aes(fill=as.factor(SPEAKER)), alpha=0.8) +
ggplot2::scale_colour_brewer(palette = "Spectral") +
ggplot2::labs(title="Male speaker 2, pitch curve", fill="GENDER") +
ggplot2::xlim(0, 300)
gridExtra::grid.arrange(pitchLocM1, pitchLocM2, ncol=2)
```
##### 3. F3 comparaison
Looking at the *f~3~* realisations we really can assume this is a characteristic feature for speakers. we see for /a/, /i/ and /u/ three fundamentally different distributions for the two men we considered here, values are arranged around the same frequencies but in different areas for the first two segments, while for the /u/ they strongly differ.
```{r f3CompM}
f3LocM1 <- ggplot2::ggplot(voyellesLocuteur1, ggplot2::aes(VOWEL, F3)) +
ggplot2::geom_count(col="tomato3", show.legend=F) +
ggplot2::labs(y="F3", title="Male speaker 1, F3 values") +
ggplot2::ylim(2000, 3750)
f3LocM2 <- ggplot2::ggplot(voyellesLocuteur2, ggplot2::aes(VOWEL, F3)) +
ggplot2::geom_count(col="tomato3", show.legend=F) +
ggplot2::labs(y="F3", title="Male speaker 2, F3 values") +
ggplot2::ylim(2000, 3750)
gridExtra::grid.arrange(f3LocM1, f3LocM2, ncol=2)
```
***
#### Conclusion
In order to understand how the Speaker Recognition works we have taken 3 variation factors : pitch, *f~3~* and the cavity length. For the last one we assumed it doesn't really represent a characteristic value, basically because of its inaccuracy, so we did not used it to study speakers one to one.
In a global representation we don't really managed to understand where the variation came from, if what we saw was just a greater representation of a common model or the sum of all speaker specific models. The analysis one-to-one has helped to clarify this last point, our hypothesis on *f~0~* and *f~3~* have been confirmed as they represent great characterisation factors for speakers, they remain two important features in a large set which ASR studies continue to develop and extend.