cp1250 is not detected #18
Comments
Am I right to assume that cp1250 is the same as Windows-1250? The result for windows-1250 is confidence = 0.8075649621709267. The problem is that the latin1 prober (windows-1252) is winning. The two encodings have really similar tables, and since there are many more latin1 characters in the text (compared to the more distinctive ă Ă â Â î Î ş Ş ţ Ţ chars), that's probably why its confidence is higher and it wins in the end. Charset detection is purely based on heuristics: these encodings have a statistical model based on character frequency, so it's never going to be 100% accurate. I wonder how OpenSubtitles does it. Maybe the browser sends that information when posting the file?
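To make the "really similar tables" point concrete, here is a minimal sketch in Python. It uses only the standard-library codecs, not jschardet's prober code, and compares how the two code pages decode each byte from 0x80 to 0xFF. The function name `compare_code_pages` is made up for illustration.

```python
def compare_code_pages(a="cp1250", b="cp1252"):
    """Split the bytes 0x80-0xFF into those both code pages decode to the
    same character and those they decode differently."""
    same, different = [], []
    for value in range(0x80, 0x100):
        raw = bytes([value])
        # errors="replace" turns unassigned bytes into U+FFFD instead of raising,
        # so a byte that is unassigned in both code pages lands in `same`.
        char_a = raw.decode(a, errors="replace")
        char_b = raw.decode(b, errors="replace")
        bucket = same if char_a == char_b else different
        bucket.append((value, char_a, char_b))
    return same, different


same, different = compare_code_pages()
print(len(same), "bytes decode identically;", len(different), "differ")
for value, in_1250, in_1252 in different:
    print(f"0x{value:02X}: cp1250={in_1250!r} cp1252={in_1252!r}")
```

Of the Romanian characters listed, â Â î Î sit at the same byte values in both code pages, so only ă Ă ş Ş ţ Ţ can tell the two apart.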
Yes, you are correct.
Nope, that info is not sent by the browser, as the subtitle files are never encoded correctly. Here is some info about how OpenSubtitles detects/converts encodings:
Here is another similar problem in the Czech language (windows-1250).
Wouldn't it be better to use dictionaries when multiple encodings return high confidence levels?
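A rough sketch of what such a dictionary tie-break could look like, written in Python with the standard-library codecs. Nothing here is part of jschardet: `pick_by_dictionary` and the tiny `ROMANIAN_WORDS` list are hypothetical and only illustrate the idea of scoring each candidate decoding against known words.

```python
def pick_by_dictionary(raw, candidates, known_words):
    """Return the candidate encoding whose decoding yields the most known words.
    On a tie, the earliest candidate in the list wins."""
    def score(encoding):
        text = raw.decode(encoding, errors="replace").lower()
        return sum(1 for word in text.split() if word in known_words)
    return max(candidates, key=score)


# Hypothetical word list; a real one would need thousands of entries.
ROMANIAN_WORDS = {"şi", "în", "ţară", "să", "că"}

sample = "şi în ţară".encode("cp1250")
best = pick_by_dictionary(sample, ["cp1252", "cp1250"], ROMANIAN_WORDS)
print(best)  # prints cp1250
```

Read as cp1252, the sample becomes "ºi în þarã", so only "în" matches the word list, while the cp1250 reading matches all three words.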
Heuristics should be based on which characters in the file are readable/writable, not just on character probability, because the differences between European encodings are small. Polish diacritic characters are:
which are shown incorrectly in CP-1252 (when originally written as CP-1250):
It is clear that some of the above characters, read as CP-1252, are not word characters, so the text does not fulfil the requirement (= the probability is low) that it contain only word characters in this encoding.
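The word-character idea can be sketched in Python as follows. This is an illustration under assumptions, not how jschardet works: `letter_ratio` is a made-up helper, the sample is a common Polish test phrase rather than the characters from the comment, and "word character" is approximated with `str.isalpha`.

```python
def letter_ratio(raw, encoding):
    """Fraction of non-ASCII characters that are letters when `raw`
    is decoded as `encoding` (1.0 if there are no non-ASCII characters)."""
    text = raw.decode(encoding, errors="replace")
    non_ascii = [ch for ch in text if ord(ch) > 127]
    if not non_ascii:
        return 1.0
    return sum(ch.isalpha() for ch in non_ascii) / len(non_ascii)


# Polish text written as CP-1250, then scored under both readings.
sample = "Zażółć gęślą jaźń".encode("cp1250")
for encoding in ("cp1250", "cp1252"):
    print(encoding, letter_ratio(sample, encoding))
```

Read as CP-1252, ż, ł and ą turn into ¿, ³ and ¹, which are not letters, so the ratio drops below 1.0. For Romanian text the check is weaker: ş read as CP-1252 becomes º, which Unicode classifies as a letter, so `str.isalpha` accepts it.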
It does not work with Romanian subtitle files. OpenSubtitles detects these files as "cp1250", while jschardet detects the encoding as "windows-1252".
Wrong characters: ã þ º
Correct Romanian special characters: ă Ă â Â î Î ş Ş ţ Ţ
Test file: http://dl.opensubtitles.org/en/download/file/1954820326.srt
I've tested with many more files, though. If I use iconv-lite with "cp1250" (instead of "windows-1252" as detected), it converts the file to "utf8" correctly.
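The symptom can be reproduced without the subtitle file. The sketch below uses Python's standard-library codecs in place of iconv-lite. Both helper names are made up for illustration, and the sample strings are assumptions, not content from the test file.

```python
def read_with_wrong_code_page(text, written_as="cp1250", read_as="cp1252"):
    """Encode `text` in one code page and decode the bytes in another,
    which is what happens when the detector picks the wrong encoding."""
    return text.encode(written_as).decode(read_as, errors="replace")


def convert_to_utf8(raw, source_encoding="cp1250"):
    """Decode bytes with the correct source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


print(read_with_wrong_code_page("ă ţ ş"))    # prints ã þ º
print(read_with_wrong_code_page("â Â î Î"))  # unchanged: same bytes in both code pages

raw = "ţară".encode("cp1250")
print(convert_to_utf8(raw).decode("utf-8"))  # prints ţară
```

The first call yields exactly the wrong characters reported above, and decoding with "cp1250" before re-encoding gives correct UTF-8.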