Special normalisation rules #37

Open
amessina71 opened this issue Mar 8, 2019 · 3 comments

Comments

@amessina71
Contributor

amessina71 commented Mar 8, 2019

How to deal with:

  1. numbers
  2. acronyms / symbols
  3. website / email spellings (e.g. use "dot", "at" )

I would try, as far as possible, to use letter-based normalised representations for all of the above, such as:

  1. 100 -> one hundred (cento, cent)
  2. Hz -> hertz, WHO -> double u aitch o (less sure about this one ...)
  3. www.rai.it -> vu vu vu punto rai punto it, pippo@pluto.com -> pippo at pluto dot com

Of course, this would only be for the sake of comparison; no one would want such transcripts as a final product. We don't even need to output the normalised text, except in a debug session.
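The letter-based direction proposed above can be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the lookup tables, the function names `spell_out_address` and `normalise`, and the address regex are all hypothetical, and the email address `pippo@pluto.com` is inferred from its spelled-out form in the example. A real rule set would be per-language and much larger.

```python
import re

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
NUMBER_WORDS = {"100": "one hundred"}
SYMBOL_WORDS = {"Hz": "hertz"}

# Rough pattern for an email address or a "www." host name (an assumption,
# not a complete URL/email grammar).
ADDRESS = re.compile(r"[\w.-]+@[\w.-]+\.\w+|www\.[\w.-]+")


def spell_out_address(token: str) -> str:
    """Spell out a URL or email address using 'at' and 'dot'."""
    return token.replace("@", " at ").replace(".", " dot ")


def normalise(text: str) -> str:
    """Map numbers, symbols and addresses to a letter-based form."""
    words = []
    for token in text.split():
        if token in NUMBER_WORDS:
            words.append(NUMBER_WORDS[token])
        elif token in SYMBOL_WORDS:
            words.append(SYMBOL_WORDS[token])
        elif ADDRESS.fullmatch(token):
            words.append(spell_out_address(token))
        else:
            words.append(token)
    # Lowercase and collapse any doubled spaces left by the replacements.
    return " ".join(" ".join(words).lower().split())


print(normalise("100 Hz"))           # one hundred hertz
print(normalise("pippo@pluto.com"))  # pippo at pluto dot com
print(normalise("www.rai.it"))       # www dot rai dot it
```

Tokens with trailing punctuation (for example an address at the end of a sentence) are not handled here and would need an extra rule.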

@MikeSmithEU
Contributor

I would actually do the opposite in all cases, i.e. go from complexity to simplicity.

  1. one hundred -> 100
  2. hertz -> Hz, double u aitch o -> WHO (not sure any STT actually outputs this)
  3. vu vu vu punto ... (same as 2, not sure any STT actually outputs this)

This matters especially for "vu vu vu punto rai punto it": if it is transcribed wrongly, e.g. as "vuvuvupunto raai punto it", it impacts WER heavily because it counts as 5 wrong "words", while IMHO it should count as only one "word", i.e. a one-point "penalty" on the WER score.
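The count of 5 can be checked with a word-level edit distance, which is the numerator of WER. The sketch below is a plain Levenshtein implementation written for this illustration, not the project's scoring code; the function name `word_errors` and the compact hypothesis `www.raai.it` are assumptions.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


# Spelled-out form: the same mistake is spread over several "words".
spelled_errors = word_errors(
    "vu vu vu punto rai punto it".split(),
    "vuvuvupunto raai punto it".split(),
)

# Compact form: the whole address is a single token (hypothetical hypothesis).
compact_errors = word_errors(["www.rai.it"], ["www.raai.it"])

print(spelled_errors)  # 5
print(compact_errors)  # 1
```

Inside a longer transcript these counts are added to the total number of errors, so the spelled-out form costs five errors where the compact form costs one.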

@EyalLavi
Contributor

I'm concerned that specific normalisation discussions can send us down a very deep rabbit hole. Like @amessina71 says, the thing to remember is that we are not interested in absolute WER (compared to a reference) but in relative WER (compared to other vendors). So as long as we apply the normalisation consistently, it's not so important how we normalise. My preference would be to have a very small number of core normalisation rules that each user can add to. We also don't have to reinvent the wheel. There may be something we can use from these sources: https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering, https://github.com/google/sparrowhawk
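One way to picture "a small set of core rules that each user can add to" is a list of text-to-text functions applied in order. This is only a sketch under assumptions: the names `CORE_RULES`, `build_normaliser` and `hundred_to_digits`, the choice of core rules, and the decision to run user rules before the core ones are all hypothetical.

```python
import re
from typing import Callable, Iterable

Rule = Callable[[str], str]


def lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    # Drop anything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


# Assumed minimal core; the thread does not fix which rules belong here.
CORE_RULES: tuple[Rule, ...] = (lowercase, strip_punctuation, collapse_whitespace)


def build_normaliser(extra_rules: Iterable[Rule] = ()) -> Rule:
    """Return a function that applies user rules first, then the core rules."""
    rules = (*extra_rules, *CORE_RULES)

    def normalise(text: str) -> str:
        for rule in rules:
            text = rule(text)
        return text

    return normalise


# Example of a user-supplied rule (hypothetical).
def hundred_to_digits(text: str) -> str:
    return re.sub(r"\bone hundred\b", "100", text, flags=re.IGNORECASE)


normalise = build_normaliser([hundred_to_digits])
print(normalise("There were One Hundred members."))  # there were 100 members
```

Because the same normaliser is applied to every vendor's output, the exact choice of rules matters less than applying them consistently.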

@amessina71
Contributor Author

amessina71 commented Mar 19, 2019

@MikeSmithEU I actually meant the opposite too. Engines would certainly output "www.rai.it" or "100". The problem is that their outputs might differ slightly from one another. One can output "www.rai.it", the other "www dot rai dot it". How do we compare them? Again, one could output "100", the other "one hundred". So, with a common denominator for these kinds of normalisations, engine 1 saying "There were 100 members attending" and engine 2 saying "There were one hundred members attending" would be considered equivalent.
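The "common denominator" idea does not depend on which direction the mapping goes, only on both outputs ending up in the same form. A minimal sketch, assuming a hypothetical `VARIANTS` table that maps each known variant to one canonical spelling, and using the same noun in both sentences so that only the number differs:

```python
import re

# Hypothetical table: each variant on the left is rewritten to the single
# canonical form on the right. The direction is an arbitrary choice here.
VARIANTS = {
    "one hundred": "100",
    "www dot rai dot it": "www.rai.it",
}


def to_common_form(text: str) -> str:
    """Lowercase, collapse whitespace, then rewrite known variants."""
    text = " ".join(text.lower().split())
    for variant, canonical in VARIANTS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text


engine_1 = "There were 100 members attending"
engine_2 = "There were one hundred members attending"

# Both outputs reduce to the same string, so they compare as equivalent.
print(to_common_form(engine_1) == to_common_form(engine_2))  # True
```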
