bind_rows_sf for combining lists of sf objects to single dataframe #798
Note that …

For completeness, I have successfully used …
Ok, so the data.table approach is something else (as expected, I guess):

> nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
> set.seed(1234)
> nc_list_1k <- nc %>%
+ dplyr::sample_n(size = 1000, replace = TRUE) %>%
+ mutate(sample_id = paste0("sample_", row_number())) %>%
+ split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_1k),
+ do.call(what = sf:::rbind.sf,
+ args = nc_list_1k),
+ purrr::reduce(.x = nc_list_1k,
+ .f = sf:::rbind.sf),
+ sf::st_as_sf(data.table::rbindlist(nc_list_1k)),
+ times = 25L)
Unit: milliseconds
expr min lq mean median uq max neval cld
mapedit:::combine_list_of_sf(nc_list_1k) 1175.212699 1190.186592 1219.459763 1201.873386 1234.360720 1440.05542 25 c
do.call(what = sf:::rbind.sf, args = nc_list_1k) 894.177444 912.931392 928.037925 925.873504 938.368144 992.35103 25 b
purrr::reduce(.x = nc_list_1k, .f = sf:::rbind.sf) 8783.179875 8981.547539 9215.896749 9182.200163 9371.552815 9919.67273 25 d
sf::st_as_sf(data.table::rbindlist(nc_list_1k)) 4.802647 5.438064 5.904802 5.678197 5.819243 10.58275 25 a
> set.seed(1234)
> nc_list_10k <- nc %>%
+ dplyr::sample_n(size = 10000, replace = TRUE) %>%
+ mutate(sample_id = paste0("sample_", row_number())) %>%
+ split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_10k),
+ do.call(what = sf:::rbind.sf,
+ args = nc_list_10k),
+ sf::st_as_sf(data.table::rbindlist(nc_list_10k)),
+ times = 25L)
Unit: milliseconds
expr min lq mean median uq max neval cld
mapedit:::combine_list_of_sf(nc_list_10k) 12212.8645 12531.08956 13102.34104 12778.99254 13164.2622 15267.2328 25 b
do.call(what = sf:::rbind.sf, args = nc_list_10k) 41520.9927 42003.92886 43894.39777 42703.18971 45595.1675 52469.4150 25 c
sf::st_as_sf(data.table::rbindlist(nc_list_10k)) 61.6442 63.85123 72.41966 66.14747 70.3542 112.2899 25 a

I think, for me personally, it is pretty clear how I will do such an operation in the future :-).
… promoting data.table, I like it! I still remember @mattdowle presenting his package at the EARL conference in London a few years ago, slightly confused about all the fuss that was made back then about the pipe operator, saying almost exasperatedly "but we have been doing chaining in data.table for years". Don't get me wrong, I also like dplyr; it's just a nice anecdote I wanted to share.
That's great! The … On the original mapedit issue there were some good questions on … One other thought from the new-ish …

@tim-salabim, as an aside: I've run across several helpful tweets from you on blending …
I will. I've just revisited a long-intended implementation for building quadtrees using Rcpp and data.table (after a gentle reminder): https://twitter.com/TimSalabim3/status/943194334333227008 Hopefully I can come up with a general function for this in the not too distant future... It's such a fun problem!
I was recently dealing with State Plane Coordinate Systems (SPCSes) and ran into this issue. I had WGS84-projected data and wanted to get the area, in meters, of each of a few million polygons spread throughout the state of Washington. My options were:

…
I went with 1 due to time constraints. I would have gone with 2 if the split/apply/combine elements were handled automatically in …

As for the right way to handle this, my intuition is that we should operate on each … We might also want to allow a named vector argument of …
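The per-zone workflow described above could be sketched roughly as follows. This is a minimal sketch, not the commenter's actual code: `zone_epsg` is a hypothetical column mapping each feature to the EPSG code of its local projected CRS (e.g. the relevant State Plane zone); `st_transform` and `st_area` are real sf functions.

```r
library(sf)

# Sketch: compute planar areas by projecting each group of features to its
# own local CRS, then recombine. 'zone_epsg' is a hypothetical per-feature
# column holding an EPSG code (an assumption for illustration).
area_by_local_crs <- function(polys) {
  chunks <- split(polys, polys$zone_epsg)
  chunks <- lapply(chunks, function(chunk) {
    # project this group to its local CRS only for the area computation
    projected <- st_transform(chunk, crs = chunk$zone_epsg[1])
    chunk$area_m2 <- as.numeric(st_area(projected))
    chunk  # returned in the original CRS, so all chunks stay compatible
  })
  do.call(rbind, chunks)  # recombination: slow for many chunks, as this thread discusses
}
```

The final `do.call(rbind, ...)` is exactly the recombination step this issue is about, so with many zones one of the faster combining strategies benchmarked in this thread would be preferable.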
It would help if you could provide a minimal reproducible data example, along with what you've tried.
I know this is a necro comment; however, I just implemented the data.table 'melting' of a list of sf objects into a single sf object, and while it works and is fast, the newly created object's extent is set to the extent of the first element of the original list, without (to my knowledge) a simple way to reassign it. This behavior does not occur when using do.call(rbind, ...).
If anyone rocks up here wondering if … For the sake of completeness, here's an example and the error we currently get (with dplyr 0.8.3).

# dataframe of US states with coordinates
states <- data.frame(name = state.name,
x = state.center$x,
y = state.center$y,
this = "this",
that = "that")
# create sf subsets, with different columns
library(sf)
states1 <- st_as_sf(states[1:10,1:4], coords = c("x", "y"))
states2 <- st_as_sf(states[21:30,c(1:3,5)], coords = c("x", "y"))
# with dplyr
library(dplyr)
bind_rows(states1, states2)

Which results in the following error and warnings:
…
Good news: dplyr v1 is coming, and if you install the most recent version from GitHub now...
Just hit this after my computer's CPU ran all night trying to bind together 100k linestrings broken up into 1000 chunks of 100 rows each. I re-ran Tim's benchmark to check and yes, …

library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
set.seed(1234)
nc_list_10 <- nc %>%
dplyr::sample_n(size = 10, replace = TRUE) %>%
mutate(sample_id = paste0("sample_", row_number())) %>%
split(.$sample_id)
bench::mark(max_iterations = 2, check = FALSE,
rbind = do.call(what = rbind, args = nc_list_10),
bind_rows = do.call(what = bind_rows, args = nc_list_10),
rbindlist = sf::st_as_sf(data.table::rbindlist(nc_list_10)) # incorrect bb
)
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rbind 3.19ms 3.27ms 306. 185.02KB 0
#> 2 bind_rows 3.98ms 4.05ms 247. 689.34KB 0
#> 3 rbindlist 236.51µs 264.04µs 3787. 3.26MB 0

Created on 2023-02-15 with reprex v2.0.2

Thoughts:
There may be new approaches, e.g. using the … There may be recent developments on this; tempted to open a new issue on …
It would be interesting to know whether …
@Robinlovelace, could you also include unlist2d from collapse? Like this: …

Edit: In this answer on SO, some magic is done with base …
Thanks @kadyb, will give it a spin!
Update: you're right @kadyb, the collapse::unlist2d approach is the fastest:

library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidyverse)
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
set.seed(1234)
nc_list_10 <- nc %>%
dplyr::sample_n(size = 10, replace = TRUE) %>%
mutate(sample_id = paste0("sample_", row_number())) %>%
split(.$sample_id)
bench::mark(max_iterations = 2, check = FALSE,
rbind = (res_rbind <<- do.call(what = rbind, args = nc_list_10)),
bind_rows = (res_bind_rows <<- do.call(what = bind_rows, args = nc_list_10)),
rbindlist = (res_rbindlist <<- sf::st_as_sf(data.table::rbindlist(nc_list_10))), # incorrect bb
res_unlist2d = (res_unlist2d <<- st_as_sf(collapse::unlist2d(nc_list_10, idcols = FALSE, recursive = FALSE)))
# ,
# unlist = {res_unlist2d <<- lapply(1:n, function(x) nc_list_10)
# list2DF(lapply(setNames(seq_along(.[[1]]), names(.[[1]])), \(i)
# unlist(lapply(., `[[`, i), FALSE, FALSE)))}
)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rbind 8.54ms 9.67ms 103. 185.02KB 0
#> 2 bind_rows 13.84ms 13.84ms 72.3 689.34KB 72.3
#> 3 rbindlist 548.26µs 582.35µs 1717. 3.26MB 0
#> 4 res_unlist2d 253.51µs 277.65µs 3602. 1016.23KB 0
waldo::compare(res_rbind, res_bind_rows)
#> ✔ No differences
waldo::compare(res_rbind, res_rbindlist) # bbox is wrong
#> `class(old)`: "sf" "tbl_df" "tbl" "data.frame"
#> `class(new)`: "sf" "data.table" "data.frame"
#>
#> `names(old)[12:16]`: "BIR79" "SID79" "NWBIR79" "geometry" "sample_id"
#> `names(new)[12:16]`: "BIR79" "SID79" "NWBIR79" "sample_id" "geometry"
#>
#> `attr(old$geometry, 'bbox')`: "((-82.96275,33.94867),(-75.77316,36.55716))"
#> `attr(new$geometry, 'bbox')`: "((-77.33221,35.81418),(-76.69376,36.24098))"
waldo::compare(res_rbind, res_unlist2d) # bbox is wrong
#> `class(old)`: "sf" "tbl_df" "tbl" "data.frame"
#> `class(new)`: "sf" "data.frame"
#>
#> `names(old)[12:16]`: "BIR79" "SID79" "NWBIR79" "geometry" "sample_id"
#> `names(new)[12:16]`: "BIR79" "SID79" "NWBIR79" "sample_id" "geometry"
#>
#> `attr(old$geometry, 'bbox')`: "((-82.96275,33.94867),(-75.77316,36.55716))"
#> `attr(new$geometry, 'bbox')`: "((-77.33221,35.81418),(-76.69376,36.24098))"
waldo::compare(res_rbindlist, res_unlist2d) # bbox is wrong
#> `class(old)`: "sf" "data.table" "data.frame"
#> `class(new)`: "sf" "data.frame"

Created on 2023-02-16 with reprex v2.0.2

Remaining questions:
One other thought: …
## collapse
f1 = function(x) {
  if (length(x) == 0) stop("Empty list")
  geom_name = attr(x[[1]], "sf_column")
  x = collapse::unlist2d(x, idcols = FALSE, recursive = FALSE)
  x[[geom_name]] = st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  st_as_sf(x)
}

## data.table
f2 = function(x) {
  if (length(x) == 0) stop("Empty list")
  geom_name = attr(x[[1]], "sf_column")
  x = data.table::rbindlist(x, use.names = FALSE)
  x[[geom_name]] = st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  st_as_sf(x)
}

Will it be okay?

Edit: If these functions work correctly, someone could now compare their performance on a real (big) dataset and we will see which is the better choice.
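A quick usage sketch for f1 (assuming the sf and collapse packages are installed; f1 is repeated from the comment above so the block is self-contained), checking that the recomputed bbox matches the do.call(rbind, ...) reference:

```r
library(sf)

# f1 repeated from the comment above, so this sketch runs on its own
f1 <- function(x) {
  if (length(x) == 0) stop("Empty list")
  geom_name <- attr(x[[1]], "sf_column")
  x <- collapse::unlist2d(x, idcols = FALSE, recursive = FALSE)
  x[[geom_name]] <- st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  st_as_sf(x)
}

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
nc_list <- split(nc, nc$CNTY_ID)

res <- f1(nc_list)
ref <- do.call(rbind, nc_list)

# The recomputed bbox should match the rbind reference,
# unlike a bare rbindlist() result:
all.equal(as.numeric(st_bbox(res)), as.numeric(st_bbox(ref)))
```

The same check applies to f2; only the middle line (the row-binding backend) differs between the two functions.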
I'd take the default from …
@Robinlovelace, there's no out-of-the-box solution in …
These functions are impressively fast. Many thanks Krzysztof and Edzer, I really hope everyone using …
Sorry for being late to the party here... it seems like there's a good workaround! Neither geos nor wk can help here: combining those vectors has all the same safety/speed tradeoffs that exist in sf (e.g., checking CRS equivalence). Neither geos nor wk bothers with the precomputed bounding box (since it's very cheap to compute and rarely used compared to operations that invalidate it), but that doesn't seem like it's a consideration here.
So Robin, maybe just for completeness you should also include CRS and column checks, but perhaps leave them as an optional choice because they are quite slow operations.

check = function(x) {
  ref_crs = st_crs(x[[1]])
  ref_colname = colnames(x[[1]])
  for (i in seq_along(x)) {
    if (isFALSE(st_crs(x[[i]]) == ref_crs)) stop("Different CRS")
    if (isFALSE(all(colnames(x[[i]]) == ref_colname))) stop("Different columns")
  }
}

Edit: I posted the prepared function as a gist: fastrbindsf
Filing this issue after brief discussion on #r-spatial/mapedit/issues/46 and a bit more research.
Background: split/map/combine patterns are very useful for sf objects too, and I couldn't find a fast and officially-recommended way of recombining a list of sf objects to a single data frame.
The three approaches I've found in other issues/repos were:

1. purrr::reduce(list_of_sf, sf:::rbind.sf), which is the slowest (it creates, for N elements, N-1 intermediate objects that are progressively rbind-ed together);
2. do.call(what = sf:::rbind.sf, args = list_of_sf), which is fast for small lists but slower past a few thousand elements/moderate data size (details below); and
3. mapedit:::combine_list_of_sf(list_of_sf), which scales quite well by using dplyr::bind_rows under the hood and handling geometry/attrs separately.

I profiled the three approaches in this gist, confirming the above.
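For context, the split/map/combine pattern that motivates this issue can be sketched as follows (a minimal sketch using sf's bundled nc dataset; st_buffer is just a stand-in for any per-group operation):

```r
library(sf)

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

list_of_sf <- split(nc, nc$NAME)                         # split into one sf per county
list_of_sf <- lapply(list_of_sf, st_buffer, dist = 0.1)  # map: any per-group operation
combined   <- do.call(rbind, list_of_sf)                 # combine: the slow step at scale
```

With 100 counties this is instant; the problem appears when the list grows to thousands of elements, which is what the benchmarks below measure.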
Performance over 1,000 list elements:
Performance over 10,000 list elements (dropping the purrr::reduce strategy to run in finite time).
@edzer's comments in the other issue: …
As always, thanks for sf!