
Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray #3673


Draft
wants to merge 40 commits into base: main

Conversation


@seisman seisman commented Dec 4, 2024

Description of proposed changes

This PR adds the pygmt.read function to read any recognized data files (currently dataset, grid, or image) into a pandas.DataFrame/xarray.DataArray object.

The new read function can replace most load_dataarray/xr.open_dataarray/xr.load_dataarray calls.

Related to #3643 (comment).

Preview: https://pygmt-dev--3673.org.readthedocs.build/en/3673/api/generated/pygmt.read.html

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. The supported slash command is:

  • /format: automatically format and lint the code

@seisman seisman added the feature (Brand new feature) and needs review (This PR has higher priority and needs review.) labels, then removed the needs review label Dec 4, 2024
@seisman seisman force-pushed the feature/read branch 2 times, most recently from cac7d74 to c50232e Compare December 4, 2024 10:18
@seisman seisman marked this pull request as draft December 5, 2024 03:23
@seisman seisman added this to the 0.14.0 milestone Dec 9, 2024
@seisman seisman marked this pull request as ready for review December 9, 2024 09:47
@seisman seisman added the needs review This PR has higher priority and needs review. label Dec 9, 2024

seisman commented Dec 9, 2024

Now, the load_dataarray function is used in pygmt/src/grdcut.py only (related to #3115).

xr.open_dataarray is used in test_accessors.py.

@seisman seisman mentioned this pull request Dec 19, 2024
49 tasks
@seisman seisman mentioned this pull request Dec 19, 2024
3 tasks
Comment on lines 102 to 111
case "dataset":
    return lib.virtualfile_to_dataset(
        vfname=voutfile,
        column_names=column_names,
        header=header,
        dtype=dtype,
        index_col=index_col,
    )
case "grid" | "image":
    raster = lib.virtualfile_to_raster(vfname=voutfile, kind=kind)
@weiji14 weiji14 (Member) commented Dec 19, 2024

Debating on whether we should have a low-level clib read that reads into a GMT virtualfile, and a high-level read that wraps around that to do both read + convert virtualfile to a pandas.DataFrame or xarray.DataArray.

Member Author:

We can have a low-level Session.read method, but as far as I can see, it would currently be used only in the high-level read function, so it seems unnecessary. We can refactor and add the low-level Session.read method when we need it in the future.

Member:

Yep, ok with just making a high-level read method for now.

Member Author:

Actually, we already have the Session.read_data method which is almost the same as the low-level read method you proposed.

pygmt/pygmt/clib/session.py

Lines 1112 to 1123 in 99a6340

def read_data(
    self,
    infile: str,
    kind: Literal["dataset", "grid", "image"],
    family: str | None = None,
    geometry: str | None = None,
    mode: str = "GMT_READ_NORMAL",
    region: Sequence[float] | None = None,
    data=None,
):
    """
    Read a data file into a GMT data container.

The fun fact is that gmtread calls the gmt_copy function, which is a wrapper around the GMT_Read_Data function (https://github.com/GenericMappingTools/gmt/blob/9a8769f905c2b55cf62ed57cd0c21e40c00b3560/src/gmt_api.c#L1294)

Comment on lines 174 to +175
load_dataarray
read
Member:

The load_dataarray function was put under the pygmt.io namespace. Should we consider putting read under pygmt.io too? (Thinking about whether we need a low-level pygmt.clib.read and high-level pygmt.io.read in my other comment).

@seisman seisman (Member Author) commented Apr 16, 2025

Yes, that sounds good. I have two questions:

  1. Should we place the read source code in pygmt/io.py, or restructure io.py into a directory and put it in pygmt/io/read.py instead?
  2. Should we deprecate the load_dataarray function in favor of the new read function?

I'm expecting to add a write function that writes a pandas.DataFrame/xarray.DataArray into a tabular/netCDF file.

GMT.jl also wraps the read module (xref: https://www.generic-mapping-tools.org/GMTjl_doc/documentation/utilities/gmtread/). The differences are:

  1. It uses the name gmtread, which I think is better since read is a little too general.
  2. It returns custom data types like GMTVector and GMTGrid. [This doesn't work in PyGMT]
  3. It guesses the data kind from the file extension. [Perhaps we can do a similar guess?]
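For context, the extension-based guess in item 3 could be sketched roughly like this. This is a purely illustrative sketch, not code from the PR or GMT.jl; the suffix-to-kind mapping and the guess_kind name are hypothetical:

```python
from pathlib import PurePath

# Hypothetical suffix table for illustration only; a real implementation
# would need to cover GMT's full range of recognized formats.
_KIND_BY_SUFFIX = {
    ".nc": "grid",
    ".grd": "grid",
    ".tif": "image",
    ".tiff": "image",
    ".png": "image",
    ".txt": "dataset",
    ".csv": "dataset",
}


def guess_kind(file: str) -> str:
    """Guess 'grid', 'image', or 'dataset' from the file suffix."""
    # Fall back to "dataset" when the suffix is unknown.
    return _KIND_BY_SUFFIX.get(PurePath(file).suffix.lower(), "dataset")
```

An explicit kind argument could still override the guess, mirroring how GMT.jl lets users force a type when the extension is ambiguous.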

@weiji14 weiji14 (Member) commented Apr 16, 2025

  1. Should we place the read source code in pygmt/io.py, or restructure io.py into a directory and put it in pygmt/io/read.py instead?

I think making the io directory sounds good, especially if you're planning on making a write function in the future.

Should we deprecate the load_dataarray function in favor of the new read function?

No, let's keep load_dataarray for now. Something I'm contemplating is to make an xarray BackendEntrypoint that uses GMT read, so that users can then do pygmt.io.load_dataarray(..., engine="gmtread") or something like that. The load_dataarray function would use this new gmtread backend engine by default instead of netcdf4.
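The BackendEntrypoint idea might look roughly like this, a sketch under the assumption of xarray's documented custom-backend API. The class name, description, and extension check are hypothetical; a real implementation would delegate to GMT's read internally:

```python
from xarray.backends import BackendEntrypoint


class GMTReadBackendEntrypoint(BackendEntrypoint):
    """Hypothetical xarray backend delegating to GMT's read module."""

    description = "Read grids/images via GMT (sketch only)"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real implementation would call the PyGMT read machinery here
        # and return the result as an xarray.Dataset.
        raise NotImplementedError("sketch only")

    def guess_can_open(self, filename_or_obj):
        # Claim common grid extensions so xarray can auto-select this
        # backend; illustrative suffixes only.
        return str(filename_or_obj).endswith((".nc", ".grd"))
```

Registered under xarray's backend entry-point group, such a class would let users write something like xr.open_dataset(..., engine="gmtread").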

@seisman seisman removed this from the 0.14.0 milestone Dec 20, 2024
@seisman seisman removed the needs review This PR has higher priority and needs review. label Dec 20, 2024
Comment on lines +15 to +24
@pytest.mark.skipif(not _HAS_NETCDF4, reason="netCDF4 is not installed.")
def test_io_gmtread_grid():
    """
    Test that reading a grid returns an xr.DataArray and the grid is the same as the one
    loaded via xarray.load_dataarray.
    """
    grid = gmtread("@static_earth_relief.nc", kind="grid")
    assert isinstance(grid, xr.DataArray)
    expected_grid = xr.load_dataarray(which("@static_earth_relief.nc", download="a"))
    assert np.allclose(grid, expected_grid)
Member:

Should we also have a similar test for kind="image", comparing against rioxarray.open_rasterio?

Member Author:

Done in a6c4ee7.

When I tried to add a test for reading datasets, I realized that the DataFrame returned by load_sample_data is not ideal:

In [1]: from pygmt.datasets import load_sample_data

In [2]: data = load_sample_data("hotspots")

In [3]: data.dtypes
Out[3]: 
longitude      float64
latitude       float64
symbol_size    float64
place_name      object
dtype: object

The last column, place_name, should have the string dtype rather than object. Other sample datasets have similar issues.

We have three options:

  1. Do nothing and keep them unchanged
  2. Fix and use appropriate dtypes
  3. Use the new gmtread function instead of pd.read_csv in _load_xxx functions.

I'm inclined to option 3.
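For context, the object-vs-string distinction here is generic pandas behavior, not PyGMT-specific. A minimal illustration (the column is built with an explicit object dtype to mirror the sample-data situation):

```python
import pandas as pd

# A text column stored with the catch-all ``object`` dtype, as in the
# hotspots sample dataset above.
df = pd.DataFrame(
    {"place_name": pd.Series(["Hawaii", "Iceland"], dtype=object)}
)
print(df["place_name"].dtype)  # object

# Option 2 in the list above amounts to requesting proper dtypes
# explicitly, e.g. via astype or pd.read_csv's dtype argument.
fixed = df.astype({"place_name": "string"})
print(fixed["place_name"].dtype)  # string
```

Option 3 would instead sidestep pd.read_csv entirely and let the new read path assign dtypes when building the DataFrame.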

Member:

3. Use the new gmtread function instead of pd.read_csv in _load_xxx functions.

I'm inclined to option 3.

Agree with this. We should also add dtype-related checks for the tabular dataset tests in pygmt/tests/test_datasets_samples.py.

Comment on lines +43 to +53
column_names
    A list of column names.
header
    Row number containing column names. ``header=None`` means the column names
    are not parsed from the table header. Ignored if the row number is larger
    than the number of headers in the table.
dtype
    Data type. Can be a single type for all columns or a dictionary mapping
    column names to types.
index_col
    Column to set as index.
Member:

Should we indicate in the docstring that these params are only used for kind="dataset"?

Member Author:

At line 31:

For datasets, keyword arguments column_names, header, dtype, and
index_col are supported.

Comment on lines +16 to +24
def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
    region: Sequence[float] | str | None = None,
    header: int | None = None,
    column_names: pd.Index | None = None,
    dtype: type | Mapping[Any, type] | None = None,
    index_col: str | int | None = None,
) -> pd.DataFrame | xr.DataArray:
Member:

On second thought, I'm wondering whether we should make gmtread a private function for internal use only for now; the fact that it can read either tabular or grid/image files seems like a lot of magic.

Member Author:

It seems the gmtread function is no longer needed if PR #3919 is implemented, right?

Member:

Yes, not needed for grids/images, but we could still use gmtread for tabular datasets? Though let's think about #3673 (comment).


def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
Member:

Does GMT read also handle 'cube'?


@seisman seisman marked this pull request as draft April 17, 2025 02:47
@seisman seisman changed the title Add pygmt.read to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray Apr 17, 2025
Labels
feature Brand new feature