Special normalisation rules #37
Comments
I would actually do the opposite in all cases, i.e. go from complexity to simplicity.
Especially in the case of "vu vu vu punto rai punto it": if this is transcribed wrongly, e.g. as "vuvuvupunto raai punto it", it would impact WER heavily, as it has 5 "words" wrong, while imho this should count as only one "word", and a one-point "penalty" on the WER score...
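To make the arithmetic above concrete, here is a minimal sketch that counts word-level errors with a plain Levenshtein distance. The function name `word_errors` is made up for illustration, and the collapsed single-token strings in the second call ("www.rai.it" against "www.raai.it") are an assumption about what a normaliser might produce, not something taken from an actual engine.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference word
                cur[j - 1] + 1,          # insert a hypothesis word
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return prev[-1]


# Spelled-out form, as in the comment above: 7 reference words.
ref_spelled = "vu vu vu punto rai punto it".split()
hyp_spelled = "vuvuvupunto raai punto it".split()
errors_spelled = word_errors(ref_spelled, hyp_spelled)

# Hypothetical collapsed form: the whole address is a single token.
errors_collapsed = word_errors(["www.rai.it"], ["www.raai.it"])

print(errors_spelled, errors_collapsed)
```

With the spelled-out form the mistake costs five word errors; with the address treated as one token it costs one.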
I'm concerned that specific normalisation discussions can send us down a very deep rabbit hole. Like @amessina71 says, the thing to remember is that we are not interested in absolute WER (compared to the reference) but in relative WER (compared to other vendors). So as long as we apply the normalisation consistently, it's not so important how we normalise. My preference would be to have a very small number of core normalisation rules that each user can add to. We also don't have to reinvent the wheel. There may be something we can use from these sources: https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering
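One way to picture "a small core that each user can add to" is an ordered list of plain text-to-text functions. This is only a sketch under that assumption; `CORE_RULES`, `apply_rules` and `spell_out_dots` are hypothetical names, not part of any existing tool.

```python
import re
from typing import Callable, Sequence

Rule = Callable[[str], str]


def lowercase(text: str) -> str:
    return text.lower()


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


# Hypothetical minimal core; every transcript goes through these.
CORE_RULES = [lowercase, collapse_whitespace]


def apply_rules(text: str, rules: Sequence[Rule]) -> str:
    """Apply each rule in order, feeding the output of one into the next."""
    for rule in rules:
        text = rule(text)
    return text


# A user-supplied rule, added on top of the core.
def spell_out_dots(text: str) -> str:
    # Only dots between word characters, e.g. inside "www.rai.it".
    return re.sub(r"(?<=\w)\.(?=\w)", " dot ", text)


my_rules = CORE_RULES + [spell_out_dots]
print(apply_rules("  WWW.Rai.IT ", my_rules))
```

As long as the same rule list is applied to the reference and to every vendor's output, the comparison stays consistent, whichever rules are chosen.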
@MikeSmithEU I actually meant the opposite too. Engines would certainly output "www.rai.it" or "100". The problem is that there might be slight differences between them. One can output "www.rai.it", the other "www dot rai dot it". How do we compare them? Again, one could output "100", the other "one hundred". So a common denominator for these kinds of normalisations would mean that, with engine 1 saying "There were 100 members attending" and engine 2 saying "There were one hundred persons attending", the "100" and the "one hundred" would be considered equivalent.
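A minimal sketch of such a common denominator, here reducing everything to a letter-based form. The lookup table `NUMBER_WORDS` is a deliberately tiny, hypothetical stand-in for a real number-to-words step, and `to_common_form` is an illustrative name.

```python
import re

# Hypothetical one-entry table; a real rule set would need a proper
# number-to-words conversion for the target language.
NUMBER_WORDS = {"100": "one hundred"}


def to_common_form(text):
    text = text.lower()
    # "www.rai.it" -> "www dot rai dot it"
    text = re.sub(r"(?<=\w)\.(?=\w)", " dot ", text)
    # "100" -> "one hundred"; unknown tokens pass through unchanged.
    tokens = [NUMBER_WORDS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


engine_1 = to_common_form("There were 100 members attending")
engine_2 = to_common_form("There were one hundred members attending")
print(engine_1 == engine_2)

print(to_common_form("www.rai.it") == to_common_form("www dot rai dot it"))
```

The same idea works in the other direction (spelled-out forms collapsed to "100" and "www.rai.it"); what matters for the comparison is that both engines' outputs pass through the same mapping.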
How to deal with:
I would go for trying as much as possible to have letter-based normalised representations of all the above such as:
Of course this would be for the sake of comparison; no one would really want such transcripts as a final product... we don't even need to output the normalised text, except in a debug session.