Suggestion for sf vignette #179
Hi Grant, thanks for this, it's a good suggestion and I can add this to the vignette as I find time. In general I wonder why
Thanks also for promoting collapse. I am eager to tell those Stata people that collapse offers a lot of the panel data and complex aggregation stuff they are missing in R, yet I somewhat lack the appropriate means of communication.
Regarding your data.table example, I'd just note that
qDT(ncl)[, .(mean_births = mean(births),
             geometry = st_union(geometry)),
         by = county] |> copyMostAttrib(ncl) |> setRownames()
#> Simple feature collection with 100 features and 2 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -79.54099 ymin: 35.83699 xmax: -79.23799 ymax: 36.24614
#> geographic CRS: NAD27
#> First 10 features:
#> county mean_births geometry
#> 1 Alamance 5219.5 MULTIPOLYGON (((-79.24619 3...
#> 2 Alexander 1508.0 MULTIPOLYGON (((-81.10889 3...
#> 3 Alleghany 514.5 MULTIPOLYGON (((-81.23989 3...
#> 4 Anson 1722.5 MULTIPOLYGON (((-79.91995 3...
#> 5 Ashe 1227.5 MULTIPOLYGON (((-81.47276 3...
#> 6 Avery 879.0 MULTIPOLYGON (((-81.94135 3...
#> 7 Beaufort 2800.5 MULTIPOLYGON (((-77.10377 3...
#> 8 Bertie 1470.0 MULTIPOLYGON (((-76.78307 3...
#> 9 Bladen 1917.0 MULTIPOLYGON (((-78.2615 34...
#> 10 Brunswick 2418.0 MULTIPOLYGON (((-78.65572 3...
Created on 2021-07-27 by the reprex package (v0.3.0)
The reason for this is that
Another thing I recognized while seeing this is that using
Final FYI: for the new release, especially with
Yeah, it's interesting. The underlying source code is here and it looks to be doing a bunch of sensible things (e.g. checking for commonality before doing any computation). I'd have to do some profiling to see where the bottleneck is exactly, but will have to wait until I have time!
Always happy to promote good open-source software like collapse. A few people have told me how much they like it / how impressed they are by it. Hopefully you're seeing a continual uptick in usage!
Apart from spatial geometries, the case that I use most often is probably nested data frames/tables. "Often" being a relative concept, b/c I don't think it makes sense for general regression work. But nesting is pretty nice for running Monte Carlo simulations as per this example. At any rate, I think we run into the problem you were describing for this use case. A not-very-sensible example:
library(dplyr, warn.conflicts = FALSE)
library(collapse, warn.conflicts = FALSE)
## Pure dplyr example
mtcars |>
  nest_by(am) |>
  summarise(sum_table = list(summary(data)))
## Passing qsu within summarise works too
mtcars |>
  nest_by(am) |>
  summarise(sum_table = list(qsu(data)))
## But fsummarise doesn't
mtcars |>
  nest_by(am) |>
  fsummarise(sum_table = list(summary(data)))
## Same
mtcars |>
  nest_by(am) |>
  fsummarise(sum_table = list(qsu(data)))
Ah, good to know. Thanks for the tips! Happy to leave this open until you decide on the vignette. (Or let me know if you want me to add it in a PR.)
Thanks a lot for those examples. It makes sense to me to use nesting in those cases. I will see what can be done about this in collapse. For now
Oh, very nice. I'm not sure if you saw the follow-up post where I call
So the fastest way to run an lm in base R, to my knowledge, is using the Cholesky decomposition:
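A minimal sketch of a Cholesky-based least-squares fit in base R, not necessarily the snippet originally posted here. The choice of `mtcars` and of the regressors `wt` and `hp` is an assumption made purely for illustration. The idea is to solve the normal equations by factorising X'X instead of calling `lm()`, which skips the formula and model-frame overhead:

```r
# Illustrative data (assumption): regress mpg on an intercept, wt and hp
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg

# Normal equations: (X'X) beta = X'y
XtX <- crossprod(X)      # X'X
Xty <- crossprod(X, y)   # X'y

# chol() returns the upper-triangular factor R with R'R = X'X;
# chol2inv() turns that factor into (X'X)^{-1}
beta <- chol2inv(chol(XtX)) %*% Xty

# Compare with the coefficients reported by lm()
cbind(cholesky = drop(beta), lm = coef(lm(mpg ~ wt + hp, data = mtcars)))
```

This only returns coefficients, and it assumes X has full column rank; `chol()` fails on a rank-deficient design, whereas `lm()` handles that case through its pivoted QR.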
You may be interested -- we have recently found that
library(sf)
nc = read_sf(system.file("shape/nc.shp", package = "sf"))
# Build a list of 500 copies of the same sf object to row-bind
nc_list = vector("list", 500)
for (i in seq_along(nc_list)) nc_list[[i]] = nc
fastrbindsf = function(x) {
  geom_name = attr(x[[1]], "sf_column")
  # Row-bind the list of data frames without recursing into list columns
  x = collapse::unlist2d(x, idcols = FALSE, recursive = FALSE)
  # Restore the geometry column as an sfc and recompute its bounding box
  x[[geom_name]] = st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  x = st_as_sf(x)
  return(x)
}
t = bench::mark(
  check = FALSE,
  rbind = do.call(what = rbind, args = nc_list),
  collapse = fastrbindsf(nc_list)
)
t[, 1:5]
#> expression min median `itr/sec` mem_alloc
#> 1 rbind 9.28s 9.28s 0.108 1.19GB
#> 2 collapse 35.24ms 39.2ms 22.8 8.26MB
Hi Sebastian,
Congrats on the new release. I'm digging the new sf integration. Speaking of which... In the sf vignette you write: "It needs to be noted here that typically most of the time in aggregation is consumed by st_union so that the speed of collapse does not really become visible on most datasets."
There's one exception to this rule that you may want to highlight. Namely, in panel data where you want to aggregate over time values of the same geometry. In such cases, you can simply use the first geometric observation per group (since it's repeated) and this will make it much quicker than st_union(). It's a trick that I sometimes use when manipulating sf objects as data.tables and appears to carry over nicely here. Example:
Created on 2021-07-26 by the reprex package (v2.0.0)
PS. Just to show that it's generating valid geometries:
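The first-geometry trick from the opening post can be sketched roughly as follows. This is not the original reprex: the two-period `panel` object, built here from the `BIR74` and `BIR79` columns of the `nc` shapefile that ships with sf, and the name `agg` are assumptions for illustration only. Because each county's geometry is repeated across periods, taking the first row per county gives the same shape that `st_union()` would, without the union cost:

```r
library(sf)
library(collapse)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
county <- as.character(nc$NAME)

# Hypothetical two-period panel: every county geometry appears twice
panel <- rbind(
  st_sf(county = county, births = nc$BIR74, geometry = st_geometry(nc)),
  st_sf(county = county, births = nc$BIR79, geometry = st_geometry(nc))
)

# Keep the first row per county; the geometry column stays attached
agg <- panel[!duplicated(panel$county), "county"]

# Grouped means via collapse: a numeric vector named by county
means <- fmean(panel$births, g = panel$county)

# Match the means back to the retained rows by county name
agg$mean_births <- unname(means[agg$county])

agg
```

The shortcut is only valid when the geometry really is identical within each group; if a unit's boundary changes over time, `st_union()` is still the right tool.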