Suggestion for sf vignette #179

Closed
grantmcdermott opened this issue Jul 26, 2021 · 6 comments
grantmcdermott commented Jul 26, 2021

Hi Sebastian,

Congrats on the new release. I'm digging the new sf integration. Speaking of which...

In the sf vignette you write: "It needs to be noted here that typically most of the time in aggregation is consumed by st_union so that the speed of collapse does not really become visible on most datasets."

There's one exception to this rule that you may want to highlight: panel data, where you aggregate over repeated time observations of the same geometry. In such cases you can simply take the first geometric observation per group (since it is repeated), which is much quicker than st_union(). It's a trick I sometimes use when manipulating sf objects as data.tables, and it appears to carry over nicely here. Example:

library(sf)
library(tidyverse)
library(data.table)
library(collapse)
library(microbenchmark)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

## First create a (very!) short panel of our NC data, recording births in each
## county in 1974 and 1979
ncl =
    nc |>
    fselect(county = NAME, BIR74, BIR79) |> 
    gather(year, births, BIR74, BIR79) |> 
    ftransform(year = as.integer(gsub("BIR", "19", year))) |> 
    roworder(county, year)

ncl
#> Simple feature collection with 200 features and 3 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> First 10 features:
#>       county year births                       geometry
#> 1   Alamance 1974   4672 MULTIPOLYGON (((-79.24619 3...
#> 2   Alamance 1979   5767 MULTIPOLYGON (((-79.24619 3...
#> 3  Alexander 1974   1333 MULTIPOLYGON (((-81.10889 3...
#> 4  Alexander 1979   1683 MULTIPOLYGON (((-81.10889 3...
#> 5  Alleghany 1974    487 MULTIPOLYGON (((-81.23989 3...
#> 6  Alleghany 1979    542 MULTIPOLYGON (((-81.23989 3...
#> 7      Anson 1974   1570 MULTIPOLYGON (((-79.91995 3...
#> 8      Anson 1979   1875 MULTIPOLYGON (((-79.91995 3...
#> 9       Ashe 1974   1091 MULTIPOLYGON (((-81.47276 3...
#> 10      Ashe 1979   1364 MULTIPOLYGON (((-81.47276 3...

## data.table version
ncl_dt = as.data.table(ncl)

## Benchmarks

microbenchmark(
    dplyr = ncl |> 
        group_by(county, year) |> 
        summarise(mean_births = mean(births)),
    
    datatable_union = ncl_dt[,
                             .(mean_births = mean(births), 
                               geometry = st_union(geometry)),
                             by = .(county, year)],
    
    datatable_first = ncl_dt[,
                             .(mean_births = mean(births), 
                               geometry = data.table::first(geometry)),
                             by = .(county, year)],
    
    collapse_union = ncl |> 
        fgroup_by(county, year) |> 
        fsummarise(mean_births = fmean(births),
                   geometry = st_union(geometry)),
    
    collapse_first = ncl |> 
        fgroup_by(county, year) |> 
        fsummarise(mean_births = fmean(births),
                   geometry = ffirst(geometry)),
    
    times = 2
)
#> Unit: microseconds
#>             expr        min         lq       mean     median         uq         max neval cld
#>            dplyr 749877.901 749877.901 752414.346 752414.346 754950.790  754950.790     2   c
#>  datatable_union 704897.821 704897.821 708573.421 708573.421 712249.022  712249.022     2  b 
#>  datatable_first   3789.810   3789.810   6139.954   6139.954   8490.099    8490.099     2 a  
#>   collapse_union 724663.886 724663.886 739028.508 739028.508 753393.130  753393.130     2  bc
#>   collapse_first    154.334    154.334    198.569    198.569    242.804     242.804     2 a

Created on 2021-07-26 by the reprex package (v2.0.0)

PS. Just to show that it's generating valid geometries:

ncl |> 
    fgroup_by(county, year) |> 
    fsummarise(mean_births = fmean(births),
               geometry = ffirst(geometry)) |> 
    ggplot(aes(fill = mean_births)) + 
    geom_sf() + 
    scale_fill_viridis_c() +
    theme_minimal()


SebKrantz commented Jul 27, 2021

Hi Grant,

Thanks for this, it's a good suggestion and I can add it to the vignette as I find time. In general I wonder why st_union is so slow: in the 0.7 seconds it takes here to union 100 relatively simple geometries, collapse or data.table can crunch a few million observations. But I know little about spatial computations.

Thanks also for promoting collapse. I am eager to tell those Stata people that collapse offers a lot of the panel-data and complex aggregation functionality they are missing in R, yet I somewhat lack the appropriate means of communication.

Regarding your data.table example, I'd just note that qDT is now a fully-fledged, efficient everything-to-data.table converter (i.e. data.tables created with qDT now support reference semantics via :=) that can substitute for as.data.table. The example I give in the vignette can also be applied here, with the exception that it needs to be followed by setRownames():

qDT(ncl)[, .(mean_births = mean(births), 
             geometry = st_union(geometry)),
         by = county] |> copyMostAttrib(ncl) |> setRownames()
#> Simple feature collection with 100 features and 2 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -79.54099 ymin: 35.83699 xmax: -79.23799 ymax: 36.24614
#> geographic CRS: NAD27
#> First 10 features:
#>       county mean_births                       geometry
#> 1   Alamance      5219.5 MULTIPOLYGON (((-79.24619 3...
#> 2  Alexander      1508.0 MULTIPOLYGON (((-81.10889 3...
#> 3  Alleghany       514.5 MULTIPOLYGON (((-81.23989 3...
#> 4      Anson      1722.5 MULTIPOLYGON (((-79.91995 3...
#> 5       Ashe      1227.5 MULTIPOLYGON (((-81.47276 3...
#> 6      Avery       879.0 MULTIPOLYGON (((-81.94135 3...
#> 7   Beaufort      2800.5 MULTIPOLYGON (((-77.10377 3...
#> 8     Bertie      1470.0 MULTIPOLYGON (((-76.78307 3...
#> 9     Bladen      1917.0 MULTIPOLYGON (((-78.2615 34...
#> 10 Brunswick      2418.0 MULTIPOLYGON (((-78.65572 3...

Created on 2021-07-27 by the reprex package (v0.3.0)

The reason for this is that copyMostAttrib, just like mostattributes<-, only excludes "names", "dim" and "dimnames", and thus also copies "row.names", which are reduced upon aggregation. Thinking about this, I might modify copyMostAttrib (which at the moment is a plain export from the C API) to also handle "row.names".

Another thing I recognized while seeing this is that using fsummarise with list columns will only work if these columns have a class attribute, which fortunately is the case for sf geometries. The reason is that all Fast Statistical Functions have a hidden *.list method which dispatches to the data frame method, so that something like fmode(unclass(mtcars)) does not give an error but returns the same as fmode(mtcars). With plain list columns this could be problematic (at least for fsummarise; collap with parallel = FALSE always applies functions to subsets of the frame), because in that case we want the default method of these functions to apply. So I might have to get rid of that hidden method for ffirst, flast and fnobs, which would however introduce an inconsistency in the Fast Statistical Functions. Anyway, for sf it is unproblematic. What other list-column structures do you use? And do they have classes?
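
(For illustration, a minimal sketch of that dispatch, not part of the original comment:)

library(collapse)

## The hidden *.list method sends the plain (unclassed) list to the data
## frame method, so both calls return per-column modes instead of the
## second one erroring.
fmode(mtcars)
fmode(unclass(mtcars))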

Final FYI: for the new release, especially with collap or when directly applying ffirst and flast to a frame, it makes sense to use na.rm = FALSE whenever more than one column is aggregated. With na.rm = FALSE these functions determine the corresponding rows in a first pass, and then process the remaining columns at the speed of a data frame subset (significantly faster than data.table).
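
(A minimal sketch of that, using the built-in mtcars data purely for illustration and not part of the original comment:)

library(collapse)

## Aggregate all columns of mtcars by cyl, taking the first value per group.
## With na.rm = FALSE the first row of each group is located in one pass and
## the remaining columns are then extracted as a single subset.
ffirst(mtcars, g = mtcars$cyl, na.rm = FALSE)

## The same via collap(), which passes na.rm = FALSE on to ffirst:
collap(mtcars, ~ cyl, ffirst, na.rm = FALSE)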

@grantmcdermott (Author)

> In general I wonder why st_union is so slow.

Yeah, it's interesting. The underlying source code is here and it looks to be doing a bunch of sensible things (e.g. checking for commonality before doing any computation). I'd have to do some profiling to see where the bottleneck is exactly, but will have to wait until I have time!
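
(As a purely illustrative sketch, not from the original comment: one way to start that profiling with base R's Rprof, reusing the ncl panel from the benchmark above.)

## Assumes ncl and the libraries loaded in the benchmark above.
Rprof(tmp <- tempfile())
ncl |>
    fgroup_by(county, year) |>
    fsummarise(geometry = st_union(geometry))
Rprof(NULL)
head(summaryRprof(tmp)$by.total)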

> Thanks also for promoting collapse.

Always happy to promote good open-source software like collapse. A few people have told me how much they like it / how impressed they are by it. Hopefully you're seeing a continual uptick in usage!

> What other list-column structures do you use? And do they have classes?

Apart from spatial geometries, the case I use most often is probably nested data frames/tables. "Often" being a relative concept, because I don't think it makes sense for general regression work. But nesting is pretty nice for running Monte Carlo simulations, as per this example. At any rate, I think we run into the problem you were describing for this use case. A not-very-sensible example:

library(dplyr, warn.conflicts = FALSE)
library(collapse, warn.conflicts = FALSE)

## Pure dplyr example 
mtcars |>
  nest_by(am) |>
  summarise(sum_table = list(summary(data)))

## Passing qsu within summarise works too
mtcars |>
  nest_by(am) |>
  summarise(sum_table = list(qsu(data)))

## But fsummarise doesn't
mtcars |>
  nest_by(am) |>
  fsummarise(sum_table = list(summary(data)))

## Same
mtcars |>
  nest_by(am) |>
  fsummarise(sum_table = list(qsu(data)))

> Final FYI: for the new release, especially with collap or when directly applying ffirst and flast to a frame, it makes sense to use na.rm = FALSE whenever more than one column is aggregated.

Ah, good to know. Thanks for the tips!

Happy to leave this open until you decide on the vignette. (Or let me know if you want me to add it in a PR.)


SebKrantz commented Jul 27, 2021

Thanks a lot for those examples. It makes sense to me to use nesting in those cases. I will see what can be done about this in collapse; for now fsummarise is a very simple function. collapse has a few list-processing functions, which I discuss a bit in the collapse and data.table vignette, but I guess for simple nesting data.table is just more efficient.

Regarding the sf vignette, I can add it in 1-2 weeks, but you can also do a PR if you like (what is key is to have an example without tidyverse, e.g. using melt).

I have also just read your blog post about interactions. It's a nice use case, and I'll put the paper on my reading list for the next time I do something similar. As it is in my nature to play around, I have also tried to beat the implementation; I think especially the fitting can still be optimized. Here's my take:

library(data.table)
library(collapse)

gen_data = function(sims = 1) {
  
  ## Total time periods in the panel = 500
  tt = 500
  
  sim = rep(rep(1:sims, each = 10), times = 2) ## Repeat twice b/c we have two countries
  
  ## x1 covariates
  x1_A = 1 + rnorm(tt*sims, 0, 1)
  x1_B = 1/4 + rnorm(tt*sims, 0, 1)
  
  ## Add second, nested x2 covariates for each country
  x2_A = 1 + x1_A + rnorm(tt*sims, 0, 1)
  x2_B = 1 + x1_B + rnorm(tt*sims, 0, 1)
  
  ## Outcomes (notice different slope coefs for x2_A and x2_B)
  y_A = x1_A + 1*x2_A + rnorm(tt*sims, 0, 1)
  y_B = x1_B + 2*x2_B + rnorm(tt*sims, 0, 1)
  
  ## Combine in a data table (basically just an enhanced data frame)
  dat = 
    qDT(list(
      sim = rep(sim, each = 50),
      id = structure(c(alloc(1L, length(x1_A)), alloc(2L, length(x1_B))), 
                     levels = c("A", "B"), class = "factor"),
      x1 = c(x1_A, x1_B),
      x2 = c(x2_A, x2_B),
      y = c(y_A, y_B)
    ))
  

  ## Demeaned covariates (grouped by country and simulation)
  settransform(dat,
    fwithin(list(x1_dmean = x1, x2_dmean = x2), list(sim, id), na.rm = FALSE)
  )

  ## Optionally set order in terms of sims
  setorder(dat, sim)

  return(dat)
}

set.seed(123)

## Generate our large dataset of 20k simulations
d = gen_data(2e4)

## Optional: Set key for better collapse performance
setkey(d, sim)

## Collapse into a nested data.table (1 row per simulation), with matrix list columns
d = d[, 
      .(Y = list(y),
        X_level = list(cbind(intercept = 1, x1 = x1, x2 = x2, 'x1:x2' = x1*x2, id = id)),
        X_dmean = list(cbind(intercept = 1, x1 = x1_dmean, x2 = x2_dmean, 'x1:x2' = x1_dmean*x2_dmean, id = id))),
      by = sim]

## Run our simulation
names(d)[3:4] <- c("Xl", "Xd")
tic = Sys.time()
sims =  d[, .(level = flm(Y[[1]], Xl[[1]], method = "chol")['x1:x2', ],   # or method = "eigen"
              demean = flm(Y[[1]], Xd[[1]], method = "chol")['x1:x2', ]), # or method = "eigen" 
          by = sim]
Sys.time() - tic

@grantmcdermott (Author)

> As it is in my nature to play around, I have also tried to beat the implementation. I think especially the fitting can still be optimized. Here's my take:

Oh, very nice. I'm not sure if you saw the follow-up post where I call .lm.fit directly (which I know flm defaults to). This also leads to a large speed-up, but I think your code is probably the neatest. Maybe another vignette ;-)

@SebKrantz (Owner)

So the fastest way to run an lm in base R, to my knowledge, is using the Cholesky decomposition: chol2inv(chol(crossprod(X))) %*% crossprod(X, y), which is what flm(y, X, method = "chol") does. Another very fast implementation is RcppEigen::fastLmPure(X, y, method = 3L), which is what flm(y, X, method = "eigen") does. These two can be more than 3x faster than .lm.fit. But in most practical cases it doesn't matter.

Let me close the issue; we have gone far beyond it. I will try to add the example this weekend, otherwise next weekend. Thanks again for flagging it, and it gave me some good impulses for further development.
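
(For illustration, a minimal sketch of the Cholesky computation above checked against .lm.fit on simulated data; the data and object names are made up for this sketch and are not from the original comment:)

library(collapse)

set.seed(1)
X = cbind(intercept = 1, x1 = rnorm(1e5), x2 = rnorm(1e5))
y = drop(X %*% c(1, 2, 3)) + rnorm(1e5)

## Cholesky solution: the computation underlying flm(y, X, method = "chol")
beta_chol = drop(chol2inv(chol(crossprod(X))) %*% crossprod(X, y))

## Base R QR solution via .lm.fit for comparison
beta_qr = .lm.fit(X, y)$coefficients

all.equal(beta_chol, beta_qr, check.attributes = FALSE)

## Or directly:
flm(y, X, method = "chol")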


kadyb commented Apr 5, 2023

You may be interested: we have recently found that collapse::unlist2d() is much more effective for merging vector datasets than rbind() (or even data.table::rbindlist()). Part of the speedup is due to the lack of CRS checking, but it is still fast. Check out the issue r-spatial/sf#798 too.

library(sf)
nc = read_sf(system.file("shape/nc.shp", package = "sf"))
nc_list = vector("list", 500)
for (i in seq_along(nc_list)) nc_list[[i]] = nc

fastrbindsf = function(x) {
  geom_name = attr(x[[1]], "sf_column")
  x = collapse::unlist2d(x, idcols = FALSE, recursive = FALSE)
  x[[geom_name]] = st_sfc(x[[geom_name]], recompute_bbox = TRUE)
  x = st_as_sf(x)
  return(x)
}

t = bench::mark(
  check = FALSE,
  rbind = do.call(what = rbind, args = nc_list),
  collapse = fastrbindsf(nc_list)
)
t[, 1:5]
#>   expression      min   median `itr/sec` mem_alloc
#> 1 rbind         9.28s    9.28s     0.108    1.19GB
#> 2 collapse    35.24ms   39.2ms    22.8      8.26MB
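
(A possible follow-up, an assumption on my part rather than part of the benchmark: since the fast combine skips the CRS checks, one may want to re-assign the CRS from the first input afterwards.)

## Assumption / illustration only: ensure the combined object carries the CRS
## of the first list element.
nc_fast = fastrbindsf(nc_list)
st_crs(nc_fast) = st_crs(nc_list[[1]])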
