Problems encoding days and months in UTF-8 #414

jennybc · 2022-03-08T18:39:19Z

Manually transferring from tidyverse/readr#1383

I was trying to create and use a locale for the Walloon dialect, following the example of Maori day and month names in the locale vignette. It was an exercise for a coding club event.

However, I ran into suspicious behavior (see reprex below): with a UTF-8 encoded file, the dates "vén 3 djun 2016" and "vén 1 djulete 2016" are not parsed, while "vén 1 avri 2016" is parsed fine. The same file saved in latin-1 ("iso8859-1") is parsed correctly when read with encoding = "iso8859-1".
Input files:
example_walloon_latin1.txt
example_walloon_utf8.txt

My session info follows the reprex.
Any kind of help is very welcome! Thanks a lot.
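Since the only difference between the two files is the encoding, it may help to look at how an accented abbreviation such as "vén" is stored in each one. The snippet below is a minimal base R sketch and is not part of the original reprex; the variable names `x_utf8` and `x_latin1` are made up for illustration. It only shows the byte-level difference and does not claim to explain where the parser goes wrong.

```r
# Hypothetical diagnostic, not from the original report.
# "\u00e9" is the escape for "é", so this does not depend on the
# encoding of the script file itself.
x_utf8 <- enc2utf8("v\u00e9n")
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

# How R has marked each string
Encoding(x_utf8)
Encoding(x_latin1)

# Raw bytes: "é" takes two bytes in UTF-8 and one byte in latin-1
charToRaw(x_utf8)    # 76 c3 a9 6e
charToRaw(x_latin1)  # 76 e9 6e

# Byte length and character length differ only for the UTF-8 string
nchar(x_utf8, type = "bytes")
nchar(x_utf8, type = "chars")
nchar(x_latin1, type = "bytes")
```

If a parser compared or advanced by bytes where it should use characters (an assumption, not something confirmed here), only names containing multi-byte characters would be affected.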

library(readr)
#> Warning: package 'readr' was built under R version 4.1.2

days_walloon <- c("londi", "mårdi", "mierkidi", "djudi", "vénrdi",
                  "semdi", "dimegne")
months_walloon <- c("djanvî","fevrî", "måss", "avri", "may", "djun", "djulete",
                    "awousse","setimbre", "octôbe", "nôvimbe", "decimbe")
days_abbr_walloon <- c("lon", "mår", "mie", "dju", "vén", "sem", "dim")
months_abbr_walloon <- c("djan","fev", "mås", "avr", "may", "djun", "djul",
                         "awou","set", "oct", "nôv", "dec")

walloon_utf8 <- locale(
  date_names(
    day = days_walloon,
    mon = months_walloon,
    day_ab = days_abbr_walloon,
    mon_ab = months_abbr_walloon
  ),
  encoding = "UTF-8",
  decimal_mark = ".",
  grouping_mark = "'",
  date_format = "%a %d %B %Y"
)

walloon_latin1 <- walloon_utf8
walloon_latin1$encoding <- "iso8859-1"

# parsing problems
df_utf8 <-
  read_csv(
    "example_walloon_utf8.txt",
    col_types = cols(
      Date = col_date(format = walloon_utf8$date_format)),
    locale = walloon_utf8
)
#> Warning: One or more parsing issues, see `problems()` for details
df_utf8
#> # A tibble: 4 x 1
#>   Date      
#>   <date>    
#> 1 2016-04-01
#> 2 NA        
#> 3 2016-07-02
#> 4 NA
problems(df_utf8)
#> # A tibble: 2 x 5
#>     row   col expected              actual             file                     
#>   <int> <int> <chr>                 <chr>              <chr>                    
#> 1     3     1 date like %a %d %B %Y vén 3 djun 2016    C:/Users/damiano_oldoni/~
#> 2     5     1 date like %a %d %B %Y vén 1 djulete 2016 C:/Users/damiano_oldoni/~


# everything works fine
df_latin1 <-
  read_csv(
    "example_walloon_latin1.txt",
    col_types = cols(
      Date = col_date(format = walloon_latin1$date_format)),
    locale = walloon_latin1
)

df_latin1
#> # A tibble: 4 x 1
#>   Date      
#>   <date>    
#> 1 2016-04-01
#> 2 2016-06-03
#> 3 2016-07-02
#> 4 2016-07-01

Created on 2022-03-07 by the reprex package (v2.0.1)

Session info

> devtools::session_info()
- Session info ------------------------------------------------------------------------------------------
 setting  value
 version  R version 4.1.1 (2021-08-10)
 os       Windows 10 x64 (build 19042)
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  Dutch_Belgium.1252
 ctype    Dutch_Belgium.1252
 tz       Europe/Paris
 date     2022-03-07
 rstudio  2021.09.0+351 Ghost Orchid (desktop)
 pandoc   2.14.1 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)

- Packages ----------------------------------------------------------------------------------------------
 package       * version    date (UTC) lib source
 arrow         * 7.0.0      2022-02-10 [1] CRAN (R 4.1.2)
 assertthat      0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
 backports       1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
 bit             4.0.4      2020-08-04 [1] CRAN (R 4.1.0)
 bit64           4.0.5      2020-08-30 [1] CRAN (R 4.1.0)
 brio            1.1.3      2021-11-30 [1] CRAN (R 4.1.2)
 cachem          1.0.6      2021-08-19 [1] CRAN (R 4.1.1)
 callr           3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
 cellranger      1.1.0      2016-07-27 [1] CRAN (R 4.1.0)
 cli             3.2.0      2022-02-14 [1] CRAN (R 4.1.2)
 clipr           0.8.0      2022-02-22 [1] CRAN (R 4.1.2)
 crayon          1.5.0      2022-02-14 [1] CRAN (R 4.1.2)
 DBI             1.1.2      2021-12-20 [1] CRAN (R 4.1.2)
 desc            1.4.0.9000 2021-12-14 [1] https://ropensci.r-universe.dev (R 4.1.2)
 devtools        2.4.3      2021-11-30 [1] CRAN (R 4.1.1)
 digest          0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
 dplyr           1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
 ellipsis        0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
 evaluate        0.15       2022-02-18 [1] CRAN (R 4.1.1)
 fansi           1.0.2      2022-01-14 [1] CRAN (R 4.1.2)
 fastmap         1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
 fortunes        1.5-4      2016-12-29 [1] CRAN (R 4.1.0)
 fs              1.5.2      2021-12-08 [1] CRAN (R 4.1.2)
 gargle          1.2.0      2021-07-02 [1] CRAN (R 4.1.0)
 generics        0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
 glue            1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
 googledrive     2.0.0      2021-07-08 [1] CRAN (R 4.1.0)
 googlesheets4 * 1.0.0      2021-07-21 [1] CRAN (R 4.1.0)
 highr           0.9        2021-04-16 [1] CRAN (R 4.1.0)
 hms             1.1.1      2021-09-26 [1] CRAN (R 4.1.0)
 htmltools       0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
 janitor         2.1.0      2021-01-05 [1] CRAN (R 4.1.1)
 knitr           1.37       2021-12-16 [1] CRAN (R 4.1.2)
 lifecycle       1.0.1      2021-09-24 [1] CRAN (R 4.1.0)
 lubridate       1.8.0      2021-10-07 [1] CRAN (R 4.1.2)
 magrittr        2.0.2      2022-01-26 [1] CRAN (R 4.1.2)
 memoise         2.0.1      2021-11-26 [1] CRAN (R 4.1.2)
 pillar          1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild        1.3.1      2021-12-20 [1] CRAN (R 4.1.2)
 pkgconfig       2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
 pkgload         1.2.4      2021-11-30 [1] CRAN (R 4.1.2)
 prettyunits     1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
 processx        3.5.2      2021-04-30 [1] CRAN (R 4.1.0)
 ps              1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
 purrr           0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
 R.cache         0.15.0     2021-04-30 [1] CRAN (R 4.1.1)
 R.methodsS3     1.8.1      2020-08-26 [1] CRAN (R 4.1.0)
 R.oo            1.24.0     2020-08-26 [1] CRAN (R 4.1.0)
 R.utils         2.11.0     2021-09-26 [1] CRAN (R 4.1.1)
 R6              2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
 Rcpp            1.0.8      2022-01-13 [1] CRAN (R 4.1.2)
 readr         * 2.1.2      2022-01-30 [1] CRAN (R 4.1.2)
 readxl        * 1.3.1      2019-03-13 [1] CRAN (R 4.1.0)
 remotes         2.4.2      2021-11-30 [1] CRAN (R 4.1.2)
 reprex        * 2.0.1      2021-08-05 [1] CRAN (R 4.1.1)
 rlang           1.0.1      2022-02-03 [1] CRAN (R 4.1.2)
 rmarkdown       2.11       2021-09-14 [1] CRAN (R 4.1.1)
 rprojroot       2.0.2      2020-11-15 [1] CRAN (R 4.1.0)
 rstudioapi      0.13       2020-11-12 [1] CRAN (R 4.1.0)
 sessioninfo     1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
 snakecase       0.11.0     2019-05-25 [1] CRAN (R 4.1.1)
 stringi         1.7.6      2021-11-29 [1] CRAN (R 4.1.2)
 stringr         1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
 styler          1.6.2      2021-09-23 [1] CRAN (R 4.1.1)
 testthat        3.1.2      2022-01-20 [1] CRAN (R 4.1.2)
 tibble          3.1.6      2021-11-07 [1] CRAN (R 4.1.2)
 tidyselect      1.1.2      2022-02-21 [1] CRAN (R 4.1.2)
 tzdb            0.2.0      2021-10-27 [1] CRAN (R 4.1.1)
 usethis         2.1.5      2021-12-09 [1] CRAN (R 4.1.2)
 utf8            1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
 vctrs           0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
 vroom           1.5.7      2021-11-30 [1] CRAN (R 4.1.2)
 withr           2.4.3      2021-11-30 [1] CRAN (R 4.1.2)
 xfun            0.29       2021-12-14 [1] CRAN (R 4.1.2)
 yaml            2.3.5      2022-02-21 [1] CRAN (R 4.1.2)

 [1] C:/R/library
 [2] C:/R/R-4.1.1/library


jennybc · 2022-03-08T18:39:29Z

I took a quick look at this and, if it is a bug, it's a bug in vroom. The first UTF-8 example runs fine with 1st edition readr.

library(readr)

days_walloon <- c("londi", "mårdi", "mierkidi", "djudi", "vénrdi",
                  "semdi", "dimegne")
months_walloon <- c("djanvî","fevrî", "måss", "avri", "may", "djun", "djulete",
                    "awousse","setimbre", "octôbe", "nôvimbe", "decimbe")
days_abbr_walloon <- c("lon", "mår", "mie", "dju", "vén", "sem", "dim")
months_abbr_walloon <- c("djan","fev", "mås", "avr", "may", "djun", "djul",
                         "awou","set", "oct", "nôv", "dec")

walloon_utf8 <- locale(
  date_names(
    day = days_walloon,
    mon = months_walloon,
    day_ab = days_abbr_walloon,
    mon_ab = months_abbr_walloon
  ),
  encoding = "UTF-8",
  decimal_mark = ".",
  grouping_mark = "'",
  date_format = "%a %d %B %Y"
)

local_edition(1)
df_utf8 <-
  read_csv(
    "example_walloon_utf8.txt",
    col_types = cols(
      Date = col_date(format = walloon_utf8$date_format)),
    locale = walloon_utf8
)
df_utf8
#> # A tibble: 4 × 1
#>   Date      
#>   <date>    
#> 1 2016-04-01
#> 2 2016-06-03
#> 3 2016-07-02
#> 4 2016-07-01

Created on 2022-03-08 by the reprex package (v2.0.1.9000)

ror-at-ebp · 2024-10-16T07:17:11Z

I think I'm hitting a similar bug, except that my example works in UTF-8 and fails in latin1 (or Windows-1252). In my case, readr/vroom cannot parse abbreviated month names containing an umlaut ("Mär"); with Windows-1252 encoding it sometimes failed to parse other months as well.

I described the issue here.

sbearrows self-assigned this Mar 23, 2022

sbearrows added the "bug" (an unexpected problem or unintended behavior) label Mar 23, 2022

sbearrows added the "encoding" label Apr 7, 2022
