Problems encoding days and months in UTF-8 #414

Open
jennybc opened this issue Mar 8, 2022 · 2 comments
Labels: bug (an unexpected problem or unintended behavior), encoding


jennybc commented Mar 8, 2022

Manually transferring from tidyverse/readr#1383


I was trying to create and use a locale for the Walloon dialect, following the example of days and months in the Maori language (see the locale vignette). It was an exercise for a coding club event.

However, I ran into suspicious behavior (see the reprex below): the dates "vén 3 djun 2016" and "vén 1 djulete 2016" fail to parse, while "vén 1 avri 2016" parses fine. This happens with a file saved in UTF-8 encoding. The same file saved as Latin-1 ("iso8859-1") parses correctly, provided the locale uses encoding = "iso8859-1".
Input files:
example_walloon_latin1.txt
example_walloon_utf8.txt

My session info follows the reprex.
Any kind of help is very welcome! Thanks a lot.

library(readr)
#> Warning: package 'readr' was built under R version 4.1.2

days_walloon <- c("londi", "mårdi", "mierkidi", "djudi", "vénrdi",
                  "semdi", "dimegne")
months_walloon <- c("djanvî","fevrî", "måss", "avri", "may", "djun", "djulete",
                    "awousse","setimbre", "octôbe", "nôvimbe", "decimbe")
days_abbr_walloon <- c("lon", "mår", "mie", "dju", "vén", "sem", "dim")
months_abbr_walloon <- c("djan","fev", "mås", "avr", "may", "djun", "djul",
                         "awou","set", "oct", "nôv", "dec")

walloon_utf8 <- locale(date_names(
  day= days_walloon,
  mon = months_walloon,
  day_ab = days_abbr_walloon,
  mon_ab = months_abbr_walloon),
  encoding = "UTF-8",
  decimal_mark = ".",
  grouping_mark = "'",
  date_format = "%a %d %B %Y"
)

walloon_latin1 <- walloon_utf8
walloon_latin1$encoding <- "iso8859-1"

# parsing problems
df_utf8 <-
  read_csv(
    "example_walloon_utf8.txt",
    col_types = cols(
      Date = col_date(format = walloon_utf8$date_format)),
    locale = walloon_utf8
)
#> Warning: One or more parsing issues, see `problems()` for details
df_utf8
#> # A tibble: 4 x 1
#>   Date      
#>   <date>    
#> 1 2016-04-01
#> 2 NA        
#> 3 2016-07-02
#> 4 NA
problems(df_utf8)
#> # A tibble: 2 x 5
#>     row   col expected              actual             file                     
#>   <int> <int> <chr>                 <chr>              <chr>                    
#> 1     3     1 date like %a %d %B %Y vén 3 djun 2016    C:/Users/damiano_oldoni/~
#> 2     5     1 date like %a %d %B %Y vén 1 djulete 2016 C:/Users/damiano_oldoni/~


# everything works fine
df_latin1 <-
  read_csv(
    "example_walloon_latin1.txt",
    col_types = cols(
      Date = col_date(format = walloon_latin1$date_format)),
    locale = walloon_latin1
)

df_latin1
#> # A tibble: 4 x 1
#>   Date      
#>   <date>    
#> 1 2016-04-01
#> 2 2016-06-03
#> 3 2016-07-02
#> 4 2016-07-01

Created on 2022-03-07 by the reprex package (v2.0.1)
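
To see how the two input files differ at the byte level, it may help to compare the bytes behind the same date string in each encoding. The following is a minimal base R sketch; the variable names are illustrative, and it only inspects bytes rather than diagnosing the parser itself.

```r
# "\u00e9" is the letter é, written as an escape so that this script does not
# depend on the encoding of its own source file.
x_utf8 <- enc2utf8("v\u00e9n 3 djun 2016")

# Re-encode the same text as Latin-1 (iso8859-1).
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# In UTF-8, é takes two bytes (c3 a9); in Latin-1 it takes one byte (e9).
bytes_utf8 <- charToRaw(x_utf8)
bytes_latin1 <- charToRaw(x_latin1)

print(bytes_utf8)
print(bytes_latin1)

# The UTF-8 version is therefore one byte longer than the Latin-1 version.
length(bytes_utf8) - length(bytes_latin1)
```

Any step that measures or compares the day and month names by bytes rather than by characters would see the two files differently, which is one possible place to look.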

Session info
> devtools::session_info()
- Session info ------------------------------------------------------------------------------------------
 setting  value
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64 (build 19042)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  Dutch_Belgium.1252
 ctype    Dutch_Belgium.1252
 tz       Europe/Paris
 date     2022-03-07
 rstudio  2021.09.0+351 Ghost Orchid (desktop)
 pandoc   2.14.1 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)

- Packages ----------------------------------------------------------------------------------------------
 package       * version    date (UTC) lib source
 arrow         * 7.0.0      2022-02-10 [1] CRAN (R 4.1.2)
 assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
 backports       1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
 bit             4.0.4      2020-08-04 [1] CRAN (R 4.1.0)
 bit64           4.0.5      2020-08-30 [1] CRAN (R 4.1.0)
 brio            1.1.3      2021-11-30 [1] CRAN (R 4.1.2)
 cachem          1.0.6      2021-08-19 [1] CRAN (R 4.1.1)
 callr           3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
 cellranger      1.1.0      2016-07-27 [1] CRAN (R 4.1.0)
 cli             3.2.0      2022-02-14 [1] CRAN (R 4.1.2)
 clipr           0.8.0      2022-02-22 [1] CRAN (R 4.1.2)
 crayon          1.5.0      2022-02-14 [1] CRAN (R 4.1.2)
 DBI             1.1.2      2021-12-20 [1] CRAN (R 4.1.2)
 desc            1.4.0.9000 2021-12-14 [1] https://ropensci.r-universe.dev (R 4.1.2)
 devtools        2.4.3      2021-11-30 [1] CRAN (R 4.1.1)
 digest          0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
 dplyr           1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
 ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
 evaluate        0.15       2022-02-18 [1] CRAN (R 4.1.1)
 fansi           1.0.2      2022-01-14 [1] CRAN (R 4.1.2)
 fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
 fortunes        1.5-4      2016-12-29 [1] CRAN (R 4.1.0)
 fs              1.5.2      2021-12-08 [1] CRAN (R 4.1.2)
 gargle          1.2.0      2021-07-02 [1] CRAN (R 4.1.0)
 generics        0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
 glue            1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
 googledrive     2.0.0      2021-07-08 [1] CRAN (R 4.1.0)
 googlesheets4 * 1.0.0      2021-07-21 [1] CRAN (R 4.1.0)
 highr           0.9        2021-04-16 [1] CRAN (R 4.1.0)
 hms             1.1.1      2021-09-26 [1] CRAN (R 4.1.0)
 htmltools       0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
 janitor         2.1.0      2021-01-05 [1] CRAN (R 4.1.1)
 knitr           1.37       2021-12-16 [1] CRAN (R 4.1.2)
 lifecycle       1.0.1      2021-09-24 [1] CRAN (R 4.1.0)
 lubridate       1.8.0      2021-10-07 [1] CRAN (R 4.1.2)
 magrittr        2.0.2      2022-01-26 [1] CRAN (R 4.1.2)
 memoise         2.0.1      2021-11-26 [1] CRAN (R 4.1.2)
 pillar          1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild        1.3.1      2021-12-20 [1] CRAN (R 4.1.2)
 pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
 pkgload         1.2.4      2021-11-30 [1] CRAN (R 4.1.2)
 prettyunits     1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
 processx        3.5.2      2021-04-30 [1] CRAN (R 4.1.0)
 ps              1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
 purrr           0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
 R.cache         0.15.0     2021-04-30 [1] CRAN (R 4.1.1)
 R.methodsS3     1.8.1      2020-08-26 [1] CRAN (R 4.1.0)
 R.oo            1.24.0     2020-08-26 [1] CRAN (R 4.1.0)
 R.utils         2.11.0     2021-09-26 [1] CRAN (R 4.1.1)
 R6              2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
 Rcpp            1.0.8      2022-01-13 [1] CRAN (R 4.1.2)
 readr         * 2.1.2      2022-01-30 [1] CRAN (R 4.1.2)
 readxl        * 1.3.1      2019-03-13 [1] CRAN (R 4.1.0)
 remotes         2.4.2      2021-11-30 [1] CRAN (R 4.1.2)
 reprex        * 2.0.1      2021-08-05 [1] CRAN (R 4.1.1)
 rlang           1.0.1      2022-02-03 [1] CRAN (R 4.1.2)
 rmarkdown       2.11       2021-09-14 [1] CRAN (R 4.1.1)
 rprojroot       2.0.2      2020-11-15 [1] CRAN (R 4.1.0)
 rstudioapi      0.13       2020-11-12 [1] CRAN (R 4.1.0)
 sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
 snakecase       0.11.0     2019-05-25 [1] CRAN (R 4.1.1)
 stringi         1.7.6      2021-11-29 [1] CRAN (R 4.1.2)
 stringr         1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
 styler          1.6.2      2021-09-23 [1] CRAN (R 4.1.1)
 testthat        3.1.2      2022-01-20 [1] CRAN (R 4.1.2)
 tibble          3.1.6      2021-11-07 [1] CRAN (R 4.1.2)
 tidyselect      1.1.2      2022-02-21 [1] CRAN (R 4.1.2)
 tzdb            0.2.0      2021-10-27 [1] CRAN (R 4.1.1)
 usethis         2.1.5      2021-12-09 [1] CRAN (R 4.1.2)
 utf8            1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
 vctrs           0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
 vroom           1.5.7      2021-11-30 [1] CRAN (R 4.1.2)
 withr           2.4.3      2021-11-30 [1] CRAN (R 4.1.2)
 xfun            0.29       2021-12-14 [1] CRAN (R 4.1.2)
 yaml            2.3.5      2022-02-21 [1] CRAN (R 4.1.2)

 [1] C:/R/library
 [2] C:/R/R-4.1.1/library

jennybc commented Mar 8, 2022

I took a quick look at this and, if it is a bug, it's a bug in vroom. The first UTF-8 example runs fine with 1st edition readr.

library(readr)

days_walloon <- c("londi", "mårdi", "mierkidi", "djudi", "vénrdi",
                  "semdi", "dimegne")
months_walloon <- c("djanvî","fevrî", "måss", "avri", "may", "djun", "djulete",
                    "awousse","setimbre", "octôbe", "nôvimbe", "decimbe")
days_abbr_walloon <- c("lon", "mår", "mie", "dju", "vén", "sem", "dim")
months_abbr_walloon <- c("djan","fev", "mås", "avr", "may", "djun", "djul",
                         "awou","set", "oct", "nôv", "dec")

walloon_utf8 <- locale(
  date_names(
    day= days_walloon,
    mon = months_walloon,
    day_ab = days_abbr_walloon,
    mon_ab = months_abbr_walloon
  ),
  encoding = "UTF-8",
  decimal_mark = ".",
  grouping_mark = "'",
  date_format = "%a %d %B %Y"
)

local_edition(1)
df_utf8 <-
  read_csv(
    "example_walloon_utf8.txt",
    col_types = cols(
      Date = col_date(format = walloon_utf8$date_format)),
    locale = walloon_utf8
)
df_utf8
#> # A tibble: 4 × 1
#>   Date      
#>   <date>    
#> 1 2016-04-01
#> 2 2016-06-03
#> 3 2016-07-02
#> 4 2016-07-01

Created on 2022-03-08 by the reprex package (v2.0.1.9000)
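
If falling back to the first edition is acceptable as a stopgap, the switch can be limited to a single call with with_edition() instead of local_edition(). This is a minimal sketch, not a fix. The stand-in file is an assumption reconstructed from the outputs above (the third row, "sem 2 djulete 2016", is a guess, since the attachment's contents are not shown here), and the date names are copied from the reprex with non-ASCII letters written as escapes.

```r
library(readr)

# Date names copied from the reprex; \u escapes keep the script independent
# of its own file encoding.
walloon_utf8 <- locale(
  date_names(
    day = c("londi", "m\u00e5rdi", "mierkidi", "djudi", "v\u00e9nrdi",
            "semdi", "dimegne"),
    mon = c("djanv\u00ee", "fevr\u00ee", "m\u00e5ss", "avri", "may", "djun",
            "djulete", "awousse", "setimbre", "oct\u00f4be", "n\u00f4vimbe",
            "decimbe"),
    day_ab = c("lon", "m\u00e5r", "mie", "dju", "v\u00e9n", "sem", "dim"),
    mon_ab = c("djan", "fev", "m\u00e5s", "avr", "may", "djun", "djul",
               "awou", "set", "oct", "n\u00f4v", "dec")
  ),
  encoding = "UTF-8",
  date_format = "%a %d %B %Y"
)

# Hypothetical stand-in for example_walloon_utf8.txt, written as UTF-8 bytes.
path <- tempfile(fileext = ".txt")
rows <- enc2utf8(c("Date", "v\u00e9n 1 avri 2016", "v\u00e9n 3 djun 2016",
                   "sem 2 djulete 2016", "v\u00e9n 1 djulete 2016"))
writeLines(rows, path, useBytes = TRUE)

# Only this call runs under edition 1; the rest of the session is unaffected.
df_utf8 <- with_edition(
  1,
  read_csv(
    path,
    col_types = cols(Date = col_date(format = walloon_utf8$date_format)),
    locale = walloon_utf8
  )
)
df_utf8
```

Running the same read_csv() call without with_edition() on the same file gives a side-by-side comparison of the two editions.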

@sbearrows sbearrows self-assigned this Mar 23, 2022
@sbearrows sbearrows added the bug an unexpected problem or unintended behavior label Mar 23, 2022
@ror-at-ebp

I think I might be hitting a similar bug, but in the opposite direction: my example works in UTF-8 and fails in latin1 (or Windows-1252). In my case, readr/vroom cannot parse abbreviated month names containing an umlaut ("Mär"), and with Windows-1252 encoding it sometimes failed to parse other months as well.

I described the issue here.
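
When comparing cases like these, it can be worth confirming which encoding a file really uses before choosing the encoding passed to locale(). The sketch below uses only base R and checks whether a file's bytes are valid UTF-8. The stand-in file and its single line are assumptions made up for illustration; "ä" is the same byte (e4) in latin1 and Windows-1252.

```r
# Hypothetical stand-in file containing "Mär" written as latin1 bytes.
path <- tempfile(fileext = ".txt")
line_latin1 <- iconv(enc2utf8("1 M\u00e4r 2016"), from = "UTF-8", to = "latin1")
writeLines(line_latin1, path, useBytes = TRUE)

# Read the file as raw bytes so nothing is re-encoded on the way in.
bytes <- readBin(path, what = "raw", n = file.size(path))

# validUTF8() looks directly at the bytes: the lone e4 byte for "ä" is not a
# valid UTF-8 sequence, so this is FALSE for the stand-in file.
is_utf8 <- validUTF8(rawToChar(bytes))
is_utf8
```

A FALSE result suggests the file needs a non-UTF-8 encoding in locale(). A TRUE result does not prove the file is UTF-8, since pure ASCII content also passes.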
