Commit

Support Windows-1252 and make UTF-16 detection stricter
Fixes #797
Wilfred committed Jan 4, 2025
1 parent 5a06f3d commit fadd0f2
Showing 7 changed files with 27 additions and 1 deletion.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

### Parsing

File detection now supports Windows-1252 encoded text (an extension of
ISO-8859-1), and is stricter about UTF-16 detection.

Updated to the latest tree-sitter parser for Make and YAML.

## 0.62 (released 20th December 2024)
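
To illustrate the changelog entry above: a character such as `ß` is stored as the single byte `0xDF` in Windows-1252, and that byte on its own is not valid UTF-8. The sketch below uses only the standard library and decodes the ISO-8859-1 subset, where each byte maps to the code point of the same value. The helper name `decode_latin1` is hypothetical and is not part of this commit; unlike a real Windows-1252 decoder, it does not remap the bytes 0x80 to 0x9F.

```rust
/// Hypothetical helper (not from the commit): decode bytes as ISO-8859-1,
/// where every byte maps to the Unicode code point with the same value.
/// Windows-1252 differs in the 0x80..=0x9F range, which is NOT remapped here.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Returns true if the bytes cannot be read as UTF-8, which is the
/// situation where a single-byte fallback becomes relevant.
fn is_invalid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_err()
}
```

For example, the bytes `[b'M', b'u', 0xDF]` fail UTF-8 validation, while `decode_latin1` turns them into `Muß`.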
10 changes: 10 additions & 0 deletions Cargo.lock

1 change: 1 addition & 0 deletions Cargo.toml
@@ -97,6 +97,7 @@ tree-sitter-lua = "0.2.0"
tree-sitter-xml = "0.7.0"
tree-sitter-make = "1.1.1"
tree-sitter-yaml = "0.7.0"
encoding_rs = "0.8.35"

[dev-dependencies]
# assert_cmd 2.0.10 requires predicates 3.
3 changes: 3 additions & 0 deletions sample_files/compare.expected
@@ -298,6 +298,9 @@ ca98b4d14fc21e0f04cf24aeb3d2526c -
sample_files/whitespace_1.tsx sample_files/whitespace_2.tsx
ac8b1a89ac26333f2d4e9433b2ca3958 -

sample_files/windows_1251_1.txt sample_files/windows_1251_2.txt
d41d8cd98f00b204e9800998ecf8427e -

sample_files/xml_1.xml sample_files/xml_2.xml
e629cbd2e721fd249c7ce1626f17e953 -

1 change: 1 addition & 0 deletions sample_files/windows_1251_1.txt
@@ -0,0 +1 @@
Muß können: löst muß daß Heißt löscht führen für muß ähnlich
1 change: 1 addition & 0 deletions sample_files/windows_1251_2.txt
@@ -0,0 +1 @@
Muß können: löst muß daß Heißt löscht führen für muß ähmlich
9 changes: 8 additions & 1 deletion src/files.rs
@@ -215,14 +215,21 @@ pub(crate) fn guess_content(bytes: &[u8]) -> ProbableFileKind {
         .take(5000)
         .filter(|c| *c == std::char::REPLACEMENT_CHARACTER || *c == '\0')
         .count();
-    if num_utf16_invalid <= 5 {
+    if num_utf16_invalid <= 1 {
         info!(
             "Input file is mostly valid UTF-16 (invalid characters: {})",
             num_utf16_invalid
         );
         return ProbableFileKind::Text(utf16_string);
     }
 
+    // If the input bytes are valid Windows-1252 (an extension of
+    // ISO-8859-1 aka Latin 1), treat them as such.
+    let (latin1_str, _encoding, saw_malformed) = encoding_rs::WINDOWS_1252.decode(bytes);
+    if !saw_malformed {
+        return ProbableFileKind::Text(latin1_str.to_string());
+    }
+
     ProbableFileKind::Binary
 }
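
The UTF-16 check in this hunk can be tried in isolation with the standard library alone. The following is a rough sketch, not the project's code: the helper name `count_utf16_invalid` is hypothetical, and the way bytes are paired into `u16` code units (little-endian, trailing odd byte dropped) is an assumption, since that part of `guess_content` is not shown in the diff. Only the `take(5000)`, the filter, and the count are taken from the hunk.

```rust
/// Hypothetical helper (not from the commit): decode bytes as UTF-16 and
/// count the characters that suggest the input is not really UTF-16.
fn count_utf16_invalid(bytes: &[u8]) -> usize {
    // Assumption: little-endian code units; a trailing odd byte is ignored.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();

    // Unpaired surrogates become U+FFFD during lossy decoding.
    String::from_utf16_lossy(&units)
        .chars()
        .take(5000)
        .filter(|c| *c == std::char::REPLACEMENT_CHARACTER || *c == '\0')
        .count()
}

/// Applies the stricter threshold introduced by this commit (<= 1).
fn looks_like_utf16(bytes: &[u8]) -> bool {
    count_utf16_invalid(bytes) <= 1
}
```

With this sketch, well-formed UTF-16LE text yields a count of zero, while each unpaired surrogate or NUL code unit adds one to the count. Lowering the threshold from 5 to 1 means far fewer such characters are tolerated before the input stops being treated as UTF-16.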

