Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray #3673
Conversation
Force-pushed the branch from cac7d74 to c50232e.
pygmt/src/read.py (Outdated)
```python
case "dataset":
    return lib.virtualfile_to_dataset(
        vfname=voutfile,
        column_names=column_names,
        header=header,
        dtype=dtype,
        index_col=index_col,
    )
case "grid" | "image":
    raster = lib.virtualfile_to_raster(vfname=voutfile, kind=kind)
```
Debating on whether we should have a low-level clib `read` that reads into a GMT virtualfile, and a high-level `read` that wraps around that to do both read + convert virtualfile to a pandas.DataFrame or xarray.DataArray.
We can have a low-level `Session.read` method, but as far as I can see, currently it would be used only in the high-level `read` function, so it seems unnecessary. We can refactor and add the low-level `Session.read` method when we need it in the future.
Yep, ok with just making a high-level `read` method for now.
Actually, we already have the `Session.read_data` method, which is almost the same as the low-level `read` method you proposed.

Lines 1112 to 1123 in 99a6340:
```python
def read_data(
    self,
    infile: str,
    kind: Literal["dataset", "grid", "image"],
    family: str | None = None,
    geometry: str | None = None,
    mode: str = "GMT_READ_NORMAL",
    region: Sequence[float] | None = None,
    data=None,
):
    """
    Read a data file into a GMT data container.
```
Fun fact: the `gmtread` module calls the `gmt_copy` function, which is a wrapper of the `GMT_Read_Data` function (https://github.com/GenericMappingTools/gmt/blob/9a8769f905c2b55cf62ed57cd0c21e40c00b3560/src/gmt_api.c#L1294).
```
load_dataarray
read
```
The `load_dataarray` function was put under the `pygmt.io` namespace. Should we consider putting `read` under `pygmt.io` too? (Thinking about whether we need a low-level `pygmt.clib.read` and a high-level `pygmt.io.read` in my other comment.)
Yes, that sounds good. I have two questions:

1. Should we place the `read` source code in `pygmt/io.py`, or restructure `io.py` into a directory and put it in `pygmt/io/read.py` instead?
2. Should we deprecate the `load_dataarray` function in favor of the new `read` function?

I'm expecting to have a `write` function that writes a pandas.DataFrame/xarray.DataArray into a tabular/netCDF file.
GMT.jl also wraps the `read` module (xref: https://www.generic-mapping-tools.org/GMTjl_doc/documentation/utilities/gmtread/). The differences are:

- It uses the name `gmtread`, which I think is better since `read` is a little too general.
- It returns custom data types like `GMTVector` and `GMTGrid`. [This doesn't work in PyGMT]
- It guesses the data kind based on the file extensions. [Perhaps we can also do a similar guess?]
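For illustration, an extension-based guess could look like the sketch below. The suffix-to-kind mapping and the `guess_kind` helper are hypothetical, not existing PyGMT behavior:

```python
from pathlib import PurePath

# Illustrative mapping from file suffix to data kind; a real implementation
# would need to follow GMT's own file-recognition rules.
_KIND_BY_SUFFIX = {
    ".nc": "grid",
    ".grd": "grid",
    ".tif": "image",
    ".png": "image",
    ".csv": "dataset",
    ".txt": "dataset",
}


def guess_kind(file: str) -> str:
    """Guess the data kind from the file extension, defaulting to 'dataset'."""
    return _KIND_BY_SUFFIX.get(PurePath(file).suffix.lower(), "dataset")
```

Unknown extensions fall back to `"dataset"` here; an explicit `kind` argument would still be needed to override a wrong guess.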
> 1. Should we place the `read` source code in `pygmt/io.py`, or restructure `io.py` into a directory and put it in `pygmt/io/read.py` instead?

I think making the `io` directory sounds good, especially if you're planning on making a `write` function in the future.
> 2. Should we deprecate the `load_dataarray` function in favor of the new `read` function?

No, let's keep `load_dataarray` for now. Something I'm contemplating is making an xarray BackendEntrypoint that uses GMT `read`, so that users can then do `pygmt.io.load_dataarray(..., engine="gmtread")` or something like that. The `load_dataarray` function would then use this new `gmtread` backend engine by default instead of `netcdf4`.
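A minimal sketch of what such a backend could look like. Everything here (the class name, the `"gmtread"` engine name, and the use of the proposed `gmtread` function) is an assumption for illustration, not existing PyGMT code:

```python
# Hypothetical xarray backend wrapping GMT's read module. The class name,
# engine name, and pygmt.gmtread helper are assumptions, not a real API.
from xarray.backends import BackendEntrypoint


class GMTReadBackendEntrypoint(BackendEntrypoint):
    description = "Open grids/images via GMT's read module (hypothetical)"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Assumed helper: the gmtread function proposed in this PR.
        from pygmt import gmtread

        grid = gmtread(filename_or_obj, kind="grid")
        return grid.to_dataset(name=grid.name or "z")

    def guess_can_open(self, filename_or_obj):
        # Only claim file types GMT is known to read as grids/images.
        return str(filename_or_obj).lower().endswith((".nc", ".grd", ".tif"))
```

The class would then be registered under the `xarray.backends` entry-point group so that `xr.open_dataset(path, engine="gmtread")` resolves to it.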
```python
@pytest.mark.skipif(not _HAS_NETCDF4, reason="netCDF4 is not installed.")
def test_io_gmtread_grid():
    """
    Test that reading a grid returns an xr.DataArray and the grid is the same as the one
    loaded via xarray.load_dataarray.
    """
    grid = gmtread("@static_earth_relief.nc", kind="grid")
    assert isinstance(grid, xr.DataArray)
    expected_grid = xr.load_dataarray(which("@static_earth_relief.nc", download="a"))
    assert np.allclose(grid, expected_grid)
```
Should we also have a similar test for `kind="image"`, comparing against `rioxarray.open_rasterio`?
Done in a6c4ee7.

When I tried to add a test for reading datasets, I realized that the DataFrame returned by `load_sample_data` is not ideal:

```python
In [1]: from pygmt.datasets import load_sample_data

In [2]: data = load_sample_data("hotspots")

In [3]: data.dtypes
Out[3]:
longitude      float64
latitude       float64
symbol_size    float64
place_name      object
dtype: object
```

The last column `place_name` should be string dtype, rather than `object`. We also have similar issues for other sample datasets.

We have three options:

1. Do nothing and keep them unchanged.
2. Fix and use appropriate dtypes.
3. Use the new `gmtread` function instead of `pd.read_csv` in `_load_xxx` functions.

I'm inclined to option 3.
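For comparison, option 2 in pandas terms would just mean passing the dtype up front. The inline CSV below is made up for illustration, not the actual hotspots sample file:

```python
import io

import pandas as pd

# Illustrative stand-in for the hotspots sample file.
csv = io.StringIO(
    "longitude,latitude,symbol_size,place_name\n"
    "-155.5,19.6,0.5,Hawaii\n"
    "-18.3,65.0,0.5,Iceland\n"
)
# Force the pandas "string" dtype for the text column instead of "object".
data = pd.read_csv(csv, dtype={"place_name": "string"})
```

Option 3 would make this unnecessary by letting GMT report the column types itself.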
> 3. Use the new `gmtread` function instead of `pd.read_csv` in `_load_xxx` functions.
>
> I'm inclined to option 3.

Agree with this. We should also add dtype-related checks for the tabular dataset tests in `pygmt/tests/test_datasets_samples.py`.
```
column_names
    A list of column names.
header
    Row number containing column names. ``header=None`` means not to parse the
    column names from table header. Ignored if the row number is larger than the
    number of headers in the table.
dtype
    Data type. Can be a single type for all columns or a dictionary mapping
    column names to types.
index_col
    Column to set as index.
```
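These parameters mirror the corresponding `pandas.read_csv` options, so their semantics can be sketched in pandas alone (the inline table here is made up for illustration):

```python
import io

import pandas as pd

# Made-up whitespace-separated table standing in for a GMT dataset.
table = io.StringIO("station lon lat\nA 10.0 20.0\nB 11.5 21.5\n")
df = pd.read_csv(
    table,
    sep=" ",
    header=0,                                    # row 0 holds the column names
    dtype={"lon": "float64", "lat": "float64"},  # per-column dtype mapping
    index_col="station",                         # set "station" as the index
)
```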
Should we indicate in the docstring that these params are only used for `kind="dataset"`?
At line 31:

> For datasets, keyword arguments `column_names`, `header`, `dtype`, and `index_col` are supported.
```python
def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
    region: Sequence[float] | str | None = None,
    header: int | None = None,
    column_names: pd.Index | None = None,
    dtype: type | Mapping[Any, type] | None = None,
    index_col: str | int | None = None,
) -> pd.DataFrame | xr.DataArray:
```
On second thought, I'm wondering whether we should make `gmtread` a private function for internal use only for now; the fact that it can read either tabular or grid/image files seems like a lot of magic.
It seems the `gmtread` function is no longer needed if PR #3919 is implemented, right?
Yes, it's not needed for grids/images, but we could still use `gmtread` for tabular datasets? Though let's think about #3673 (comment).
```python
def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
```
Does GMT `read` also handle 'cube'?
Yes (xref: https://github.com/GenericMappingTools/gmt/blob/9a8769f905c2b55cf62ed57cd0c21e40c00b3560/src/gmtread.c#L75-L81), but we need to wait for #3150, which may have upstream bugs.
Description of proposed changes

This PR adds the `pygmt.read` function to read any recognized data file (currently dataset, grid, or image) into a pandas.DataFrame/xarray.DataArray object. The new `read` function can replace most `load_dataarray`/`xr.open_dataarray`/`xr.load_dataarray` calls.

Related to #3643 (comment).
Preview: https://pygmt-dev--3673.org.readthedocs.build/en/3673/api/generated/pygmt.read.html
Reminders

- Run `make format` and `make check` to make sure the code follows the style guide.
- Add the new function to `doc/api/index.rst`.
Slash Commands

You can write slash commands (`/command`) in the first line of a comment to perform specific operations. The supported slash command is:

- `/format`: automatically format and lint the code