Wrong names in Windows? #250

Open
przemoc opened this issue Jan 8, 2024 · 5 comments

Comments

@przemoc

przemoc commented Jan 8, 2024

Disclaimer: I haven't played with Java for ~18 years, so maybe I'm doing something wrong.

PS D:\git\github.com\gunnarmorling\1brc> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.3.10
PSEdition                      Core
GitCommitId                    7.3.10
OS                             Microsoft Windows 10.0.22621
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0
PS D:\git\github.com\gunnarmorling\1brc> scoop install temurin21-jdk maven
...
PS D:\git\github.com\gunnarmorling\1brc> mvn clean verify
...

PS D:\git\github.com\gunnarmorling\1brc> java --version
openjdk 21.0.1 2023-10-17 LTS
OpenJDK Runtime Environment Temurin-21.0.1+12 (build 21.0.1+12-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (build 21.0.1+12-LTS, mixed mode, sharing)
PS D:\git\github.com\gunnarmorling\1brc> chcp
Active code page: 437
PS D:\git\github.com\gunnarmorling\1brc> chcp 65001
Active code page: 65001
PS D:\git\github.com\gunnarmorling\1brc> $OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001


PS D:\git\github.com\gunnarmorling\1brc> $Env:JAVA_TOOL_OPTIONS = "-Dfile.encoding=UTF8"
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CreateMeasurements 1000000000
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Wrote 50,000,000 measurements in 17124 ms
Wrote 100,000,000 measurements in 34366 ms
Wrote 150,000,000 measurements in 51380 ms
Wrote 200,000,000 measurements in 68445 ms
Wrote 250,000,000 measurements in 85397 ms
Wrote 300,000,000 measurements in 102491 ms
Wrote 350,000,000 measurements in 119489 ms
Wrote 400,000,000 measurements in 136484 ms
Wrote 450,000,000 measurements in 153494 ms
Wrote 500,000,000 measurements in 170461 ms
Wrote 550,000,000 measurements in 187471 ms
Wrote 600,000,000 measurements in 205101 ms
Wrote 650,000,000 measurements in 222205 ms
Wrote 700,000,000 measurements in 239340 ms
Wrote 750,000,000 measurements in 256477 ms
Wrote 800,000,000 measurements in 273675 ms
Wrote 850,000,000 measurements in 290896 ms
Wrote 900,000,000 measurements in 307993 ms
Wrote 950,000,000 measurements in 325116 ms
Created file with 1,000,000,000 measurements in 342196 ms
PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result.txt
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
PS D:\git\github.com\gunnarmorling\1brc> ug -m1 -o "Ab.ch." src/main/java/dev/morling/onebrc/CreateMeasurements.java measurements.txt result.txt
src/main/java/dev/morling/onebrc/CreateMeasurements.java
    80: Abéché

measurements.txt
   527: Abéché

result.txt
     1: AbΘchΘ

PS D:\git\github.com\gunnarmorling\1brc> ug --hexdump -m1 -o "Ab.ch." src/main/java/dev/morling/onebrc/CreateMeasurements.java measurements.txt result.txt
src/main/java/dev/morling/onebrc/CreateMeasurements.java
    80:
00000cc0  -- -- -- -- -- -- -- --  -- -- -- -- 41 62 c3 a9  |------------Ab..|
00000cd0  63 68 c3 a9 -- -- -- --  -- -- -- -- -- -- -- --  |ch..------------|

measurements.txt
   527:
00001c10  41 62 c3 a9 63 68 c3 a9  -- -- -- -- -- -- -- --  |Ab..ch..--------|

result.txt
     1:
00000030  41 62 ce 98 63 68 ce 98  -- -- -- -- -- -- -- --  |Ab..ch..--------|

I changed the code page to 65001 (UTF-8) and set JAVA_TOOL_OPTIONS to -Dfile.encoding=UTF8, hoping it would improve the situation, but it changed nothing (I originally tested without those steps).

Can someone explain why there is ce 98 for é instead of c3 a9?
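
One reading that is consistent with these bytes (an assumption, not something confirmed in this thread): the JVM wrote é to the redirected stdout as the single windows-1252 byte e9, PowerShell decoded that byte with code page 437, where e9 is Θ, and then wrote the file as UTF-8, in which Θ is ce 98. A minimal Java sketch of that round trip follows; the class and method names are hypothetical, and the IBM437 charset is assumed to be available in the JDK in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeSketch {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode with the producer's charset, misread the bytes with the
    // console's charset, then write the misread text as UTF-8.
    static String roundTripHex(String text, String producer, String console) {
        byte[] emitted = text.getBytes(Charset.forName(producer));
        String misread = new String(emitted, Charset.forName(console));
        return hex(misread.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // \u00e9 is é; the escape keeps this source file encoding-independent.
        System.out.println(roundTripHex("Ab\u00e9ch\u00e9", "windows-1252", "IBM437"));
        // prints: 41 62 ce 98 63 68 ce 98
    }
}
```

The printed bytes match the result.txt hexdump above, which is why the windows-1252 then code page 437 then UTF-8 chain looks plausible.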

@00gh

00gh commented Jan 9, 2024

It seems the "> result.txt" redirect is influenced by PowerShell. You should try cmd.exe instead.

Good luck.

More information and suggestions regarding Code Page issues with PowerShell on Stack Overflow: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window

@przemoc
Author

przemoc commented Jan 9, 2024

For cmd it's even worse:

D:\git\github.com\gunnarmorling\1brc>chcp
Active code page: 437

D:\git\github.com\gunnarmorling\1brc>java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-cmd.txt

D:\git\github.com\gunnarmorling\1brc>chcp 65001
Active code page: 65001

D:\git\github.com\gunnarmorling\1brc>java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-cmd-65001.txt

D:\git\github.com\gunnarmorling\1brc>ug --hexdump -m1 -o "Ab.ch." result-cmd.txt result-cmd-65001.txt
result-cmd.txt
     1:
00000030  41 62 e9 63 68 e9 -- --  -- -- -- -- -- -- -- --  |Ab.ch.----------|

result-cmd-65001.txt
     1:
00000030  41 62 e9 63 68 e9 -- --  -- -- -- -- -- -- -- --  |Ab.ch.----------|

No c3 a9 in sight, only a single byte (e9), which makes ugrep think it is a binary file.
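
The single byte e9 is é in windows-1252, which suggests (as an assumption) that the JVM encodes redirected stdout with the native ANSI code page regardless of what chcp reports. One way to take the default out of the picture is to print through a PrintStream bound to an explicit charset. The sketch below uses hypothetical names and an in-memory sink so the bytes can be inspected; it is not the project's code. JDK 19 and later also document a `stdout.encoding` system property, but whether `-Dstdout.encoding=UTF-8` resolves this particular setup is untested here.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetSketch {
    // Print text through a PrintStream bound to the given charset
    // and return the raw bytes that reached the sink.
    static byte[] printWith(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charset)) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String name = "Ab\u00e9ch\u00e9"; // Abéché
        // windows-1252 uses one byte per é, UTF-8 uses two (c3 a9).
        System.out.println(printWith(name, Charset.forName("windows-1252")).length); // 6
        System.out.println(printWith(name, StandardCharsets.UTF_8).length);          // 8
        // In a real program the sink would be stdout itself, for example:
        // new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8)
    }
}
```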

But thanks to your SO link I realized I should have looked in PS at [console]::InputEncoding and [console]::OutputEncoding, not $OutputEncoding as I did before.

PS D:\git\github.com\gunnarmorling\1brc> [console]::InputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001


PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding

IsSingleByte      : True
EncodingName      : OEM United States
WebName           : ibm437
HeaderName        : ibm437
BodyName          : ibm437
Preamble          :
WindowsCodePage   :
IsBrowserDisplay  :
IsBrowserSave     :
IsMailNewsDisplay :
IsMailNewsSave    :
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : False
CodePage          : 437


which showed that the console's output encoding is not UTF-8, but whether it is the source of the problem remains to be seen.

I didn't try turning on:

  • Beta: Use Unicode UTF-8 for worldwide language support

in control intl.cpl yet, but I tried the following:

PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
PS D:\git\github.com\gunnarmorling\1brc> [console]::OutputEncoding

Preamble          :
BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : False
CodePage          : 65001

Retrying the test gives a new flavour of failure:

PS D:\git\github.com\gunnarmorling\1brc> java --class-path target/average-1.0.0-SNAPSHOT.jar dev.morling.onebrc.CalculateAverage >result-oe-utf8.txt
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
PS D:\git\github.com\gunnarmorling\1brc> ug --hexdump -m1 -o "Ab.ch." result-oe-utf8.txt
     1:
00000030  41 62 ef bf bd 63 68 ef  bf bd -- -- -- -- -- --  |Ab...ch...------|

Instead of c3 a9, we got ef bf bd...
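
The bytes ef bf bd are the UTF-8 encoding of U+FFFD, the replacement character. That would fit the following reading (an assumption, not confirmed in this thread): the JVM still emits the lone byte e9, and a decoder that now expects UTF-8 treats it as malformed input and substitutes U+FFFD. A minimal Java sketch with hypothetical names, using Java's own UTF-8 decoder as a stand-in for the decoding step:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharSketch {
    // Format bytes as lowercase hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Decode raw bytes as UTF-8 (malformed input becomes U+FFFD),
    // then encode the resulting text as UTF-8 again.
    static String decodeThenEncodeHex(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return hex(decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "Abéché" as windows-1252 bytes, as seen in the cmd.exe hexdump.
        byte[] raw = {0x41, 0x62, (byte) 0xe9, 0x63, 0x68, (byte) 0xe9};
        System.out.println(decodeThenEncodeHex(raw));
        // prints: 41 62 ef bf bd 63 68 ef bf bd
    }
}
```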

@Spiderpig86
Contributor

@przemoc Were you able to get it resolved? I am running into the same issues you described above. Git bash does not work properly either for me and I don't want to use WSL for this.

@ddimtirov
Contributor

I did my first 2 days on PowerShell - nothing special around file generation, except that Java recognizes that it is on Windows and outputs CRLF line endings.

It turns out this breaks many submissions that assume single-byte line endings. While it is fixable with Java args, in the end I checked out the repo under WSL and used IDEA remoting with a WSL backend, which worked quite decently.
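
As an illustration of the line-ending point, a parser can tolerate both LF and CRLF by dropping a trailing carriage return before splitting. The sketch below is hypothetical and is not taken from any submission; it assumes records of the form `station;temperature` that have already been split on '\n'.

```java
public class LineEndingSketch {
    // Parse one "station;temperature" record, tolerating a trailing '\r'
    // left over from a CRLF line ending.
    static String[] parseRecord(String line) {
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1);
        }
        int sep = line.indexOf(';');
        return new String[] { line.substring(0, sep), line.substring(sep + 1) };
    }

    public static void main(String[] args) {
        // The same record with LF-style and CRLF-style remnants.
        String[] unix = parseRecord("Ab\u00e9ch\u00e9;12.3");
        String[] windows = parseRecord("Ab\u00e9ch\u00e9;12.3\r");
        System.out.println(unix[1].equals(windows[1])); // true
        // On Windows, System.lineSeparator() is "\r\n".
        System.out.println(System.lineSeparator().length());
    }
}
```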

@przemoc
Author

przemoc commented Feb 23, 2024

> @przemoc Were you able to get it resolved? I am running into the same issues you described above. Git bash does not work properly either for me and I don't want to use WSL for this.

No, I didn't spend more time on this and didn't get it resolved.
