StatCan Table being Pulled Too Large -- Any Work Around? #567

Open
jackhere-lab opened this issue Nov 1, 2024 · 3 comments
Comments

@jackhere-lab

I want to download the CANSIM table 12-10-0128-01 using the stats_can library. However, I am getting an HDF5 error. I can load smaller tables without issues. Is there a workaround for this?

@ianepreston
Owner

Hi @jackhere-lab,
It's difficult for me to help troubleshoot without the actual error message and some information about your environment. I haven't encountered this error personally; can you provide more details?

@jackhere-lab
Author

I am running the code in Azure Databricks. Here is the full error message:

HDF5ExtError: HDF5 error back trace

File "H5D.c", line 1371, in H5Dwrite
can't synchronously write data
File "H5D.c", line 1317, in H5D__write_api_common
can't write data
File "H5VLcallback.c", line 2282, in H5VL_dataset_write_direct
dataset write failed
File "H5VLcallback.c", line 2237, in H5VL__dataset_write
dataset write failed
File "H5VLnative_dataset.c", line 420, in H5VL__native_dataset_write
can't write data
File "H5Dio.c", line 824, in H5D__write
can't write data
File "H5Dchunk.c", line 3295, in H5D__chunk_write
unable to read raw data chunk
File "H5Dchunk.c", line 4626, in H5D__chunk_lock
unable to preempt chunk(s) from cache
File "H5Dchunk.c", line 4286, in H5D__chunk_cache_prune
unable to preempt one or more raw data cache entry
File "H5Dchunk.c", line 4138, in H5D__chunk_cache_evict
cannot flush indexed storage buffer
File "H5Dchunk.c", line 4061, in H5D__chunk_flush_entry
unable to write raw data to file
File "H5Fio.c", line 179, in H5F_shared_block_write
write through page buffer failed
File "H5PB.c", line 992, in H5PB_write
write through metadata accumulator failed
File "H5Faccum.c", line 821, in H5F__accum_write
file write failed
File "H5FDint.c", line 318, in H5FD_write
driver write request failed
File "H5FDsec2.c", line 808, in H5FD__sec2_write
file write failed: time = Tue Nov 12 19:22:38 2024
, filename = '/Workspace/Users/######@###ex.com/stats_can.h5', file descriptor = 90, errno = 27, error message = 'File too large', buf = 0x78770d0, total write size = 259856, bytes this sub-write = 259856, bytes actually written = 18446744073709551615, offset = 0

End of HDF5 error back trace

Problems appending the records.
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/stats_can/sc.py:335, in table_from_h5(table, h5file, path)
334 with pd.HDFStore(h5, "r") as store:
--> 335 df = pd.read_hdf(store, key=table)
336 except (KeyError, OSError):
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/tables/tableextension.pyx:542, in tables.tableextension.Table._append_records()

@ianepreston
Owner

Gotcha. I've been using the library in Databricks recently, and the HDF5 backend the library uses for retention and updating really doesn't play nicely with it. To be honest, the whole idea of using HDF5 to store things, or handling table retention in the library at all, was a mistake on my part. An upcoming release is going to rip all that out and focus on just retrieving data from the API and getting it into a dataframe, leaving storage and updating to other tools.

I'd recommend just using stats_can.sc.download_tables and stats_can.sc.zip_table_to_dataframe with the path set to a DBFS mount or Unity Catalog volume. That will download the zipped CSV and then extract it into a pandas dataframe. From there you can take over with Spark, write the dataframe out, or do whatever else you need; a sketch of that flow is below.
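A minimal sketch of that flow, assuming download_tables takes a list of table numbers and both functions accept a `path` keyword (check the docs for your installed version); the volume path and output table name are placeholders:

```python
# Minimal sketch: download the full-table zip to a Unity Catalog volume
# (or DBFS mount) and load it into pandas. Signatures are assumptions --
# verify against the stats_can docs for your installed version.
from stats_can import sc

path = "/Volumes/my_catalog/my_schema/statcan"  # placeholder volume path

sc.download_tables(["12-10-0128-01"], path=path)            # fetch the zipped CSV
df = sc.zip_table_to_dataframe("12-10-0128-01", path=path)  # extract into pandas

# Hand off to Spark if you want to persist it as a table
# (`spark` is predefined in Databricks notebooks)
sdf = spark.createDataFrame(df)
sdf.write.mode("overwrite").saveAsTable("my_catalog.my_schema.table_12_10_0128_01")
```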

You could even do just the download_tables part and then unzip it and read it in with Spark directly if you want a more Databricks-native way to do things; a rough sketch follows.
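Something like this, with the caveat that the archive and CSV names are assumptions based on StatCan's full-table downloads (they're keyed by product ID); check what download_tables actually writes in your environment:

```python
# Rough sketch of the Spark-native route: download the zip, extract it,
# and let Spark read the CSV directly. File names are assumptions.
import zipfile
from stats_can import sc

path = "/Volumes/my_catalog/my_schema/statcan"  # placeholder volume path

sc.download_tables(["12-10-0128-01"], path=path)

with zipfile.ZipFile(f"{path}/12100128-eng.zip") as zf:  # assumed archive name
    zf.extractall(path)

# `spark` is predefined in Databricks notebooks
df = spark.read.csv(f"{path}/12100128.csv", header=True, inferSchema=True)
```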

Hope that helps
