-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Writing a large gpkg file taking forever #1409
Comments
Have you tried with layer creation option |
No. Will try now and aim to put in a PR documenting that feature if it works. Many thanks for fast reply! |
I gave it a go on a smaller dataset (61k vs ~6m rows) and the spatial index seemed to make it a bit faster. Assuming the impact of that option increases with dataset size that could solve it (gave up trying the other day): remotes::install_cran("pct")
#> Skipping install of 'pct' from a cran remote, the SHA1 (0.4.0) has not changed since last install.
#> Use `force = TRUE` to force installation
remotes::install_github("r-spatial/sf")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'sf' from a github remote, the SHA1 (2ca6483f) has not changed since last install.
#> Use `force = TRUE` to force installation
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
l = pct::get_pct_routes_fast(region = "london")
# test writing both ways
f = function(x) file.path(tempdir(), paste0(x, ".gpkg"))
f("l1")
#> [1] "/tmp/RtmpTuVzyM/l1.gpkg"
system.time(
st_write(l, f("l1"))
)
#> Writing layer `l1' to data source `/tmp/RtmpTuVzyM/l1.gpkg' using driver `GPKG'
#> Writing 61051 features with 141 fields and geometry type Line String.
#> user system elapsed
#> 8.268 0.369 8.686
system.time(
st_write(l, f("l2"), layer_options = "SPATIAL_INDEX=NO")
)
#> Writing layer `l2' to data source `/tmp/RtmpTuVzyM/l2.gpkg' using driver `GPKG'
#> options: SPATIAL_INDEX=NO
#> Writing 61051 features with 141 fields and geometry type Line String.
#> user system elapsed
#> 7.722 0.314 8.038 Created on 2020-05-30 by the reprex package (v0.3.0) |
Update: building on the previous example I explored the impact of the layer option on different sized datasets, no clear finding: bench::press(
n = c(10, 100, 1000, 10000),
layer_options = c("", "SPATIAL_INDEX=NO"),
{
bench::mark(
time_unit = "ms",
sf = st_write(l[1:n, ], f(paste0(n, layer_options, runif(1))), layer_options = layer_options)
)
}
)
|
Trying on the full dataset, which takes over a minute to load as an .Rds file:
mm_subset = mm_roads_uk[1:100000, ]
bench::press(
n = c(10, 100, 1000, 100000),
layer_options = c("", "SPATIAL_INDEX=NO"),
{
bench::mark(
time_unit = "ms",
sf = write_sf(mm_subset[1:n, ], f(paste0(n, layer_options, runif(1))), layer_options = layer_options)
)
}
) Waiting for results... |
Seems that the relative speed-up associated with
|
Final benchmark on 10% sample: t1 = system.time({
write_sf(mm_roads_uk[1:500000, ], "/tmp/test1.gpkg")
})
t2 = system.time({
write_sf(mm_roads_uk[1:500000, ], "/tmp/test2.gpkg", layer_options = "SPATIAL_INDEX=NO")
})
t3 = system.time({
saveRDS(mm_roads_uk[1:500000, ], "/tmp/test3.Rds")
}) I get:
So writing to .Rds is about 70 and 50 times faster than writing to .gpkg with and without the spatial index from R on my computer. I will try out writing this same 10% sample with QGIS as a test. Tempted to try .shp as an output and upgrade to GDAL 3.1.0 for FlatGeobuff outputs. |
Minor update on this: I left it running over the weekend and 33.5 hours later the file still hasn't finished writing. The output file is still growing in size, currently it is:
bytes. A few minutes later it is If you'd like any further info on this let me know. I'm not sure if this issue is specific to the dataset I have which is has many variables and |
I can confirm this issue. Also other filteypes are affected (e.g. geojson). I tried to explore the issue a little bit and noticed, that the problem (in my case) was writing library("sf")
library("dplyr")
nc <-
st_read(system.file("shapes/sids.shp", package = "spData")[1], quiet =
TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
nc <-
st_transform(nc, crs = 3395)
testgrid <-
st_make_grid(nc, cellsize = 1000)
starttime <- Sys.time()
st_write(testgrid, "testgrid.gpkg")
endtime <- Sys.time()
starttime - endtime
# add column with dummy
testgrid <-
testgrid %>%
st_as_sf() %>%
mutate(dummy = 1:length(testgrid))
testgrid$dummy <- ifelse(testgrid$dummy < 100, 1, 0)
starttime <- Sys.time()
st_write(testgrid, dsn = "testgrid2.gpkg", driver = "GPKG")
endtime <- Sys.time()
starttime - endtime
#
testgrid$logical <- 1:length(testgrid)
testgrid$logical <- ifelse(testgrid$logical < 100, T, F)
starttime <- Sys.time()
st_write(testgrid, "testgrid3.gpkg") # hangs forever
endtime <- Sys.time()
starttime - endtime
|
Whatever is causing this is in the C(++?) code. I just did some R profiling and 98% of the time in my tests was in the CPL_write_ogr function, which is Test code attached: Usage:
returns a data frame of timings, number of rows, and
feed into ggplot if you want to plot it and see the difference.... If I knew how to profile C++ code within R I'd go deeper... |
> times
user.self sys.self elapsed user.child sys.child n logical
1 0.756 0.029 0.792 0 0 10000 FALSE
2 0.261 0.001 0.263 0 0 20000 FALSE
3 0.386 0.016 0.401 0 0 30000 FALSE
4 0.524 0.001 0.524 0 0 40000 FALSE
5 0.409 0.268 0.675 0 0 10000 TRUE
6 1.308 1.051 2.360 0 0 20000 TRUE
7 2.706 2.740 5.450 0 0 30000 TRUE
8 4.682 5.092 9.779 0 0 40000 TRUE with > times
user.self sys.self elapsed user.child sys.child n logical
1 0.735 0.025 0.761 0 0 10000 FALSE
2 0.225 0.008 0.233 0 0 20000 FALSE
3 0.352 0.004 0.356 0 0 30000 FALSE
4 0.483 0.000 0.483 0 0 40000 FALSE
5 0.354 0.288 0.642 0 0 10000 TRUE
6 1.133 1.171 2.315 0 0 20000 TRUE
7 2.601 2.812 5.413 0 0 30000 TRUE
8 4.626 5.257 9.888 0 0 40000 TRUE |
Out of curiosity, I also checked library("sf")
library("terra")
n = 50000
df = data.frame(x = runif(n), y = runif(n), z = logical(n))
sf = st_as_sf(df, coords = c("x", "y"))
terra = vect(df, geom = c("x", "y"))
## with logical column
system.time( write_sf(sf, "test.gpkg") ) #> 3.30
system.time( writeVector(terra, "test.gpkg", overwrite = TRUE) ) #> 0.65
## without logical column
system.time( write_sf(sf[, -1], "test.gpkg") ) #> 0.77
system.time( writeVector(terra[, -1], "test.gpkg", overwrite = TRUE) ) #> 0.66 |
> times
user.self sys.self elapsed user.child sys.child n logical
1 0.703 0.030 0.733 0 0 10000 FALSE
2 0.226 0.000 0.226 0 0 20000 FALSE
3 0.340 0.000 0.340 0 0 30000 FALSE
4 0.450 0.005 0.454 0 0 40000 FALSE
5 0.116 0.000 0.116 0 0 10000 TRUE
6 0.223 0.000 0.223 0 0 20000 TRUE
7 0.333 0.000 0.333 0 0 30000 TRUE
8 0.445 0.002 0.447 0 0 40000 TRUE |
@kadyb thanks! @rhijmans library("sf")
# Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library("terra")
# terra 1.7.3
n = 3
df = data.frame(x = runif(n), y = runif(n), z = c(TRUE, FALSE, NA))
sf = st_as_sf(df, coords = c("x", "y"))
terra = vect(df, geom = c("x", "y"))
## with logical column
system.time( write_sf(sf, "test.gpkg") ) #> 3.30
# writing: substituting ENGCRS["Undefined Cartesian SRS with unknown unit"] for missing CRS
# user system elapsed
# 0.022 0.001 0.123
read_sf("test.gpkg")
# Simple feature collection with 3 features and 1 field
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 0.659066 ymin: 0.009806918 xmax: 0.8063622 ymax: 0.7639054
# Projected CRS: Undefined Cartesian SRS with unknown unit
# # A tibble: 3 × 2
# z geom
# <lgl> <POINT>
# 1 TRUE (0.8063622 0.009806918)
# 2 FALSE (0.659066 0.7639054)
# 3 NA (0.6713299 0.5175844)
system.time( writeVector(terra, "test.gpkg", overwrite = TRUE) ) #> 0.65
# user system elapsed
# 0.011 0.000 0.011
# Warning messages:
# 1: In x@ptr$write(filename, layer, filetype, insert[1], overwrite[1], :
# GDAL Message 6: dataset test.gpkg does not support layer creation option ENCODING
# 2: In x@ptr$write(filename, layer, filetype, insert[1], overwrite[1], :
# GDAL Message 1: Only 0 or 1 should be passed for a OFSTBoolean subtype. Considering this non-zero value as 1.
read_sf("test.gpkg")
# Simple feature collection with 3 features and 1 field
# Geometry type: POINT
# Dimension: XY
# Bounding box: xmin: 0.659066 ymin: 0.009806918 xmax: 0.8063622 ymax: 0.7639054
# Geodetic CRS: Undefined geographic SRS
# # A tibble: 3 × 2
# z geom
# <lgl> <POINT [°]>
# 1 TRUE (0.8063622 0.009806918)
# 2 FALSE (0.659066 0.7639054)
# 3 TRUE (0.6713299 0.5175844) |
For the record this still seems to be taking forever compared with Context: #2142 |
But did just complete, around 5x slower than RDS but workable:
|
I had a similar problem. Having a POINT layer coerced from a terra raster. At the first moment I was writing only the geometry column and it was ok. But, when I added a column to be populated in QGIS, the writing time was way more slower. I created the column like this:
I solved removing the distance column. |
What was the class of the distance column and what version of |
I can tell from experience that despite logical (as indicated in my previous comment) also "NAs" create issues for me. I guess the problem could be that the driver does not know or is slow in converting and writing R specific data types into the above mentioned geospatial formats. |
To make any progress, we need data points and reprexes, not experience. I can only see this: library(sf)
# Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
n = 100000
df = data.frame(x = runif(n), y = runif(n), z= rnorm(n))
sf = st_as_sf(df, coords = c("x", "y"), crs = 4326)
system.time(write_sf(sf, "x.gpkg"))
# user system elapsed
# 1.294 0.125 1.625
sf$z[1] = NA
system.time(write_sf(sf, "x.gpkg"))
# user system elapsed
# 1.294 0.118 1.642
sf$z = rep(NA_real_, n)
system.time(write_sf(sf, "x.gpkg"))
# user system elapsed
# 1.229 0.156 1.608 |
@Robinlovelace, the class is logical. I can't provide the raster here, but it generated a sf object with 2.240.799 features. If sf object:
Session info:
|
@RikFerreira, this is fixed in version 1.0.10, so the update should fix this problem. |
Thak you for your feedback! |
I'm trying to write a large (
300~600 MB as .Rds) file to disk. It saved in about 5 minutes in the .Rds format and took around 10 minutes to read in from a load of compressed .gml file using this mini package developed for the job: https://github.com/ITSLeeds/mastermaprHas been running for over an hour now and am wondering if it will ever finish! I know this is likely to be an issue upstream with GDAL but I'm raising the issue here in case others have had similar issues and in case it's of use. It's related to wider question of which geographic file format to save data as.
This is my set-up:
Created on 2020-05-28 by the reprex package (v0.3.0)
Session info
The text was updated successfully, but these errors were encountered: