Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

st_write significantly slower when writing logical (boolean) field types #1689

Closed
wkmor1 opened this issue Jun 8, 2021 · 6 comments
Closed

Comments

@wkmor1
Copy link

wkmor1 commented Jun 8, 2021

Have noticed that when there are logical data type columns in an sf object writing out to a file is much slower. The following illustrates the issue:

f <- function(x, n, fmt = ".csv") {
  x <- rep(x, n)
  x <- data.frame(0, 0, x)
  x <- sf::st_as_sf(x, coords = 1:2)
  sf::st_write(x, tempfile(fileext = fmt), quiet = TRUE)
}

microbenchmark::microbenchmark(
  f(FALSE, 10000),
  f(0L, 10000),
  f(0, 10000),
  f("FALSE", 10000),
  f(FALSE, 10000, ".gpkg"),
  f(0L, 10000, ".gpkg"),
  f(0, 10000, ".gpkg"),
  f("FALSE", 10000, ".gpkg"),
  times = 10
)
#> Unit: milliseconds
#>                        expr       min        lq      mean    median        uq
#>             f(FALSE, 10000) 189.83119 204.43109 307.18853 225.08582 256.84633
#>                f(0L, 10000)  35.12176  35.90689  43.35830  38.03695  42.48434
#>                 f(0, 10000)  37.55803  39.72591  47.20795  41.14088  43.24004
#>           f("FALSE", 10000)  34.27908  35.90282  38.39523  37.67921  40.45558
#>    f(FALSE, 10000, ".gpkg") 284.02960 300.72006 350.00660 337.46137 368.82369
#>       f(0L, 10000, ".gpkg") 126.78843 133.17075 154.76982 137.61813 161.53709
#>        f(0, 10000, ".gpkg") 129.27981 129.50055 140.01243 133.58174 140.45955
#>  f("FALSE", 10000, ".gpkg") 125.95240 130.73475 146.92906 135.61447 157.52841
#>         max neval
#>  1025.06738    10
#>    67.04742    10
#>   104.12921    10
#>    44.11557    10
#>   486.99444    10
#>   244.68378    10
#>   173.41739    10
#>   201.76884    10

Created on 2021-06-08 by the reprex package (v2.0.0)

Session info
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6           pillar_1.6.1         compiler_4.1.0      
#>  [4] highr_0.9            R.methodsS3_1.8.1    R.utils_2.10.1      
#>  [7] class_7.3-19         tools_4.1.0          digest_0.6.27       
#> [10] evaluate_0.14        lifecycle_1.0.0      tibble_3.1.2        
#> [13] R.cache_0.15.0       pkgconfig_2.0.3      rlang_0.4.11        
#> [16] reprex_2.0.0         DBI_1.1.1            microbenchmark_1.4-7
#> [19] yaml_2.2.1           xfun_0.23            e1071_1.7-7         
#> [22] dplyr_1.0.6          withr_2.4.2          styler_1.4.1        
#> [25] stringr_1.4.0        knitr_1.33           generics_0.1.0      
#> [28] fs_1.5.0             vctrs_0.3.8          tidyselect_1.1.1    
#> [31] grid_4.1.0           classInt_0.4-3       glue_1.4.2          
#> [34] R6_2.5.0             sf_0.9-8             fansi_0.5.0         
#> [37] rmarkdown_2.8        purrr_0.3.4          magrittr_2.0.1      
#> [40] units_0.7-1          backports_1.2.1      ellipsis_0.3.2      
#> [43] htmltools_0.5.1.1    assertthat_0.2.1     utf8_1.2.1          
#> [46] KernSmooth_2.23-20   stringi_1.6.2        proxy_0.4-25        
#> [49] crayon_1.4.1         R.oo_1.24.0
@johnbaums
Copy link

johnbaums commented Jun 8, 2021

Interesting 🤔 ... Can confirm on MacOS

library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1

f <- function(x, n, fmt = ".csv") {
  x <- rep(x, n)
  x <- data.frame(0, 0, x)
  x <- sf::st_as_sf(x, coords = 1:2)
  sf::st_write(x, tempfile(fileext = fmt), quiet = TRUE)
}

microbenchmark::microbenchmark(
  f(FALSE, 10000),
  f(0L, 10000),
  f(0, 10000),
  f("FALSE", 10000),
  f(FALSE, 10000, ".gpkg"),
  f(0L, 10000, ".gpkg"),
  f(0, 10000, ".gpkg"),
  f("FALSE", 10000, ".gpkg"),
  times = 10)
#> Unit: milliseconds
#>                        expr       min        lq      mean    median        uq
#>             f(FALSE, 10000) 209.85522 230.28042 238.60764 234.83916 241.14995
#>                f(0L, 10000)  32.27804  34.97311  36.81165  35.90451  38.60695
#>                 f(0, 10000)  35.26579  36.09508  36.74345  36.49578  37.87945
#>           f("FALSE", 10000)  34.76497  35.47865  37.41611  36.41996  38.75253
#>    f(FALSE, 10000, ".gpkg") 273.42011 282.26102 300.65752 290.11968 312.02942
#>       f(0L, 10000, ".gpkg")  92.12086  93.14257  97.89846  98.11679  99.52682
#>        f(0, 10000, ".gpkg")  94.32044  97.65801  99.13713  98.62017 100.35575
#>  f("FALSE", 10000, ".gpkg")  94.62713  96.33537  98.80034  97.97429  99.84456
#>        max neval
#>  279.97131    10
#>   41.73219    10
#>   38.51427    10
#>   43.07141    10
#>  378.14181    10
#>  107.44184    10
#>  105.80861    10
#>  105.31229    10

Created on 2021-06-08 by the reprex package (v2.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS Mojave 10.14.3        
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2021-06-08                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version date       lib source        
#>  assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
#>  backports        1.2.1   2020-12-09 [1] CRAN (R 4.0.2)
#>  class            7.3-17  2020-04-26 [1] CRAN (R 4.0.2)
#>  classInt         0.4-3   2020-04-07 [1] CRAN (R 4.0.0)
#>  cli              2.5.0   2021-04-26 [1] CRAN (R 4.0.2)
#>  crayon           1.4.1   2021-02-08 [1] CRAN (R 4.0.2)
#>  DBI              1.1.0   2019-12-15 [1] CRAN (R 4.0.0)
#>  digest           0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#>  dplyr            1.0.6   2021-05-05 [1] CRAN (R 4.0.2)
#>  e1071            1.7-4   2020-10-14 [1] CRAN (R 4.0.2)
#>  ellipsis         0.3.2   2021-04-29 [1] CRAN (R 4.0.2)
#>  evaluate         0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi            0.4.2   2021-01-15 [1] CRAN (R 4.0.2)
#>  fs               1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics         0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
#>  glue             1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr            0.9     2021-04-16 [1] CRAN (R 4.0.2)
#>  htmltools        0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
#>  KernSmooth       2.23-17 2020-04-26 [1] CRAN (R 4.0.2)
#>  knitr            1.33    2021-04-24 [1] CRAN (R 4.0.2)
#>  lifecycle        1.0.0   2021-02-15 [1] CRAN (R 4.0.2)
#>  magrittr         2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
#>  microbenchmark * 1.4-7   2019-09-24 [1] CRAN (R 4.0.2)
#>  pillar           1.6.1   2021-05-16 [1] CRAN (R 4.0.2)
#>  pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
#>  purrr            0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  R6               2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#>  Rcpp             1.0.6   2021-01-15 [1] CRAN (R 4.0.2)
#>  reprex           2.0.0   2021-04-02 [1] CRAN (R 4.0.2)
#>  rlang            0.4.11  2021-04-30 [1] CRAN (R 4.0.2)
#>  rmarkdown        2.5     2020-10-21 [1] CRAN (R 4.0.2)
#>  sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  sf             * 0.9-6   2020-09-13 [1] CRAN (R 4.0.2)
#>  stringi          1.6.2   2021-05-17 [1] CRAN (R 4.0.2)
#>  stringr          1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
#>  styler           1.4.1   2021-03-30 [1] CRAN (R 4.0.2)
#>  tibble           3.1.2   2021-05-16 [1] CRAN (R 4.0.2)
#>  tidyselect       1.1.1   2021-04-30 [1] CRAN (R 4.0.2)
#>  units            0.6-7   2020-06-13 [1] CRAN (R 4.0.0)
#>  utf8             1.2.1   2021-03-12 [1] CRAN (R 4.0.2)
#>  vctrs            0.3.8   2021-04-29 [1] CRAN (R 4.0.2)
#>  withr            2.4.2   2021-04-18 [1] CRAN (R 4.0.2)
#>  xfun             0.23    2021-05-15 [1] CRAN (R 4.0.2)
#>  yaml             2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

@tim-salabim
Copy link
Member

Can also confirm. Flatgeobuf helps speed in general, but also slower for bool

library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1

f <- function(x, n, fmt = ".csv") {
  x <- rep(x, n)
  x <- data.frame(0, 0, x)
  x <- sf::st_as_sf(x, coords = 1:2)
  sf::st_write(x, tempfile(fileext = fmt), quiet = TRUE)
}

microbenchmark::microbenchmark(
  f(FALSE, 10000),
  f(0L, 10000),
  f(0, 10000),
  f("FALSE", 10000),
  f(FALSE, 10000, ".fgb"),
  f(0L, 10000, ".fgb"),
  f(0, 10000, ".fgb"),
  f("FALSE", 10000, ".fgb"),
  times = 10)
#> Unit: milliseconds
#>                       expr       min        lq      mean    median        uq
#>            f(FALSE, 10000) 116.03899 120.08824 142.52775 132.83730 146.39599
#>               f(0L, 10000)  23.38568  24.20939  25.40193  26.00254  26.17420
#>                f(0, 10000)  25.33234  25.73252  26.94360  26.05844  28.04113
#>          f("FALSE", 10000)  21.54567  21.88413  22.14330  22.15599  22.50748
#>    f(FALSE, 10000, ".fgb") 124.45799 129.56546 143.20141 140.17218 153.84233
#>       f(0L, 10000, ".fgb")  31.12163  31.95204  32.87096  33.04909  33.58276
#>        f(0, 10000, ".fgb")  30.96207  31.16706  32.97509  31.75533  34.07206
#>  f("FALSE", 10000, ".fgb")  31.68097  32.49478  33.56320  33.04091  34.89044
#>        max neval
#>  234.55547    10
#>   27.09925    10
#>   31.91724    10
#>   22.56170    10
#>  186.41495    10
#>   35.00678    10
#>   39.57030    10
#>   36.27695    10

Created on 2021-06-08 by the reprex package (v2.0.0)

@antuki
Copy link

antuki commented Sep 24, 2021

The same here. We had to transform the dummy variables to integers. It seems that transforming them into characters is slow too to export. Strange fact !

@edzer
Copy link
Member

edzer commented Sep 24, 2021

It's treated differently from the others by GDAL: https://trac.osgeo.org/gdal/wiki/rfc50_ogr_field_subtype

@kadyb
Copy link
Contributor

kadyb commented Feb 7, 2023

This should now be fixed with 558c693 (see also #1409).

@edzer
Copy link
Member

edzer commented Feb 7, 2023

Thanks!

@edzer edzer closed this as completed Feb 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants