Special normalisation rules #37

amessina71 · 2019-03-08T13:21:44Z

How to deal with:

numbers
acronyms / symbols
website / email spellings (e.g. use "dot", "at")

I would try, as far as possible, to use letter-based normalised representations for all of the above, such as:

100 -> one hundred (cento, cent)
Hz -> (hertz), WHO -> double u aitch o (less sure about this one ...)
www.rai.it -> vu vu vu punto rai punto it, [email protected] -> pippo at pluto dot com

Of course, this would only be for the sake of comparison; no one would want such transcripts as a final product. We don't even need to output the normalised text, except in a debug session.
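A minimal sketch of this letter-based direction, assuming tiny hypothetical lookup tables (`NUMBERS`, `SYMBOLS`) and English separator words ("dot", "at") rather than the Italian "punto"; a real rule set would be far larger and language-specific:

```python
import re

# Hypothetical lookup tables, for illustration only.
NUMBERS = {"100": "one hundred"}
SYMBOLS = {"Hz": "hertz"}

def normalise(text: str) -> str:
    """Rewrite numbers, symbols, URLs and e-mail addresses as lowercase words."""
    words = []
    for token in text.split():
        if token in NUMBERS:
            words.append(NUMBERS[token])
        elif token in SYMBOLS:
            words.append(SYMBOLS[token])
        elif "@" in token or token.startswith("www."):
            # Spell out the separators: "@" becomes "at", "." becomes "dot".
            words.append(token.replace("@", " at ").replace(".", " dot "))
        else:
            words.append(token)
    # Collapse repeated whitespace and lowercase the result.
    return re.sub(r"\s+", " ", " ".join(words)).strip().lower()

print(normalise("www.rai.it"))  # www dot rai dot it
print(normalise("100 Hz"))      # one hundred hertz
```

The normalised string is only an intermediate form for scoring, so it does not need to be readable.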

MikeSmithEU · 2019-03-12T13:04:11Z

I would actually do the opposite in all cases, i.e. go from complexity to simplicity:

one hundred -> 100
hertz -> Hz, double u aitch o (not sure any stt actually outputs this)
vu vu vu punto ... (same as 2, not sure any stt actually outputs this)

This matters especially for "vu vu vu punto rai punto it": if it is transcribed wrongly, e.g. "vuvuvupunto raai punto it", it would impact WER heavily because 5 "words" are wrong, while IMHO this should count as only one "word", and so only one point of "penalty" on the WER score...
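The difference in penalty can be checked with a word-level edit distance. This is a rough sketch, assuming a plain Levenshtein distance over whitespace-separated tokens (the helper name `word_errors` is made up here, and the compact misrecognition "www.raai.it" is an illustrative assumption):

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Count word-level substitutions, insertions and deletions (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the processed reference words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

# Spelled-out form: one misrecognised address costs five word errors.
print(word_errors("vu vu vu punto rai punto it", "vuvuvupunto raai punto it"))  # 5

# Compact form: the same kind of mistake costs a single word error.
print(word_errors("www.rai.it", "www.raai.it"))  # 1
```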

EyalLavi · 2019-03-12T13:37:51Z

I'm concerned that specific normalisation discussions can send us down a very deep rabbit hole. As @amessina71 says, the thing to remember is that we are interested not in absolute WER (compared to the reference) but in relative WER (compared to other vendors). So as long as we apply the normalisation consistently, how we normalise is not so important. My preference would be to have a very small number of core normalisation rules that each user can add to. We also don't have to reinvent the wheel; there may be something we can use from these sources:
https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering
https://github.com/google/sparrowhawk
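One possible shape for "a small core that each user can add to", sketched under assumptions: the names `CORE_RULES` and `apply_rules` are hypothetical, and the three core rules shown (lowercase, strip punctuation, collapse whitespace) are only placeholders for whatever the core set ends up being:

```python
import re
from typing import Callable, Iterable, List

Rule = Callable[[str], str]

# Hypothetical core rules, applied to every vendor's output in the same order.
CORE_RULES: List[Rule] = [
    str.lower,                             # case-insensitive comparison
    lambda s: re.sub(r"[^\w\s]", " ", s),  # replace punctuation with spaces
    lambda s: " ".join(s.split()),         # collapse repeated whitespace
]

def apply_rules(text: str, extra_rules: Iterable[Rule] = ()) -> str:
    """Apply user-supplied rules first, then the shared core rules."""
    for rule in [*extra_rules, *CORE_RULES]:
        text = rule(text)
    return text

# A user-defined rule added on top of the core set.
spell_out_100: Rule = lambda s: re.sub(r"\b100\b", "one hundred", s)

print(apply_rules("There were 100 members attending."))
# there were 100 members attending
print(apply_rules("There were 100 members attending.", [spell_out_100]))
# there were one hundred members attending
```

Because every transcript passes through the same rule list, the relative ranking of vendors stays meaningful even if the chosen rules are arbitrary.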

amessina71 · 2019-03-19T13:06:46Z

@MikeSmithEU I actually meant the opposite too. Engines would certainly output "www.rai.it" or "100". The problem is that their outputs might differ slightly from one another: one can output "www.rai.it", the other "www dot rai dot it". How do we compare them? Likewise, one could output "100", the other "one hundred". So having a common denominator for these kinds of normalisations would mean that engine 1 saying "There were 100 members attending" and engine 2 saying "There were one hundred persons attending" would be considered equivalent.
