
Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray #3673


Draft
wants to merge 40 commits into base: main

Conversation


@seisman seisman commented Dec 4, 2024

Description of proposed changes

This PR adds the pygmt.read function to read any recognized data files (currently dataset, grid, or image) into a pandas.DataFrame/xarray.DataArray object.

The new read function can replace most load_dataarray/xr.open_dataarray/xr.load_dataarray calls.

Related to #3643 (comment).

Preview: https://pygmt-dev--3673.org.readthedocs.build/en/3673/api/generated/pygmt.read.html

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. The supported slash command is:

  • /format: automatically format and lint the code

@seisman seisman added the feature (Brand new feature) and needs review (This PR has higher priority and needs review.) labels, then removed the needs review label Dec 4, 2024
@seisman seisman force-pushed the feature/read branch 2 times, most recently from cac7d74 to c50232e Compare December 4, 2024 10:18
@seisman seisman marked this pull request as draft December 5, 2024 03:23
@seisman seisman added this to the 0.14.0 milestone Dec 9, 2024
@seisman seisman marked this pull request as ready for review December 9, 2024 09:47
@seisman seisman added the needs review This PR has higher priority and needs review. label Dec 9, 2024

seisman commented Dec 9, 2024

Now, the load_dataarray function is used in pygmt/src/grdcut.py only (related to #3115).

xr.open_dataarray is used in test_accessors.py.

@seisman seisman mentioned this pull request Dec 19, 2024
49 tasks
@seisman seisman mentioned this pull request Dec 19, 2024
3 tasks
Comment on lines 102 to 111
case "dataset":
    return lib.virtualfile_to_dataset(
        vfname=voutfile,
        column_names=column_names,
        header=header,
        dtype=dtype,
        index_col=index_col,
    )
case "grid" | "image":
    raster = lib.virtualfile_to_raster(vfname=voutfile, kind=kind)
@weiji14 weiji14 (Member) commented Dec 19, 2024

Debating on whether we should have a low-level clib read that reads into a GMT virtualfile, and a high-level read that wraps around that to do both read + convert virtualfile to a pandas.DataFrame or xarray.DataArray.

Member Author:

We can have a low-level Session.read method, but as far as I can see, it would currently be used only in the high-level read function, so it seems unnecessary. We can refactor and add the low-level Session.read method when we need it in the future.

Member:

Yep, ok with just making a high-level read method for now.

Member Author:

Actually, we already have the Session.read_data method which is almost the same as the low-level read method you proposed.

pygmt/pygmt/clib/session.py

Lines 1112 to 1123 in 99a6340

def read_data(
    self,
    infile: str,
    kind: Literal["dataset", "grid", "image"],
    family: str | None = None,
    geometry: str | None = None,
    mode: str = "GMT_READ_NORMAL",
    region: Sequence[float] | None = None,
    data=None,
):
    """
    Read a data file into a GMT data container.

The fun fact is that gmtread calls the gmt_copy function, which is a wrapper around the GMT_Read_Data function (https://github.com/GenericMappingTools/gmt/blob/9a8769f905c2b55cf62ed57cd0c21e40c00b3560/src/gmt_api.c#L1294)

Comment on lines 174 to +175
load_dataarray
read
Member:

The load_dataarray function was put under the pygmt.io namespace. Should we consider putting read under pygmt.io too? (Thinking about whether we need a low-level pygmt.clib.read and high-level pygmt.io.read in my other comment).

@seisman seisman (Member Author) commented Apr 16, 2025

Yes, that sounds good. I have two questions:

  1. Should we place the read source code in pygmt/io.py, or restructure io.py into a directory and put it in pygmt/io/read.py instead?
  2. Should we deprecate the load_dataarray function in favor of the new read function?

I'm expecting to add a write function that writes a pandas.DataFrame/xarray.DataArray into a tabular/netCDF file.

GMT.jl also wraps the read module (xref: https://www.generic-mapping-tools.org/GMTjl_doc/documentation/utilities/gmtread/). The differences are:

  1. It uses the name gmtread, which I think is better since read is a little too general.
  2. It returns custom data types like GMTVector and GMTGrid. [This doesn't work in PyGMT]
  3. It guesses the data kind from the file extension. [Perhaps we can do a similar guess?]
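For context, the extension-based guess in item 3 could be sketched roughly like this. This is a purely illustrative sketch, not code from the PR or GMT.jl; the suffix-to-kind mapping and the guess_kind name are hypothetical:

```python
from pathlib import PurePath

# Hypothetical suffix table for illustration only; a real implementation
# would need to cover GMT's full range of recognized formats.
_KIND_BY_SUFFIX = {
    ".nc": "grid",
    ".grd": "grid",
    ".tif": "image",
    ".tiff": "image",
    ".png": "image",
    ".txt": "dataset",
    ".csv": "dataset",
}


def guess_kind(file: str) -> str:
    """Guess 'grid', 'image', or 'dataset' from the file suffix."""
    # Fall back to "dataset" when the suffix is unknown.
    return _KIND_BY_SUFFIX.get(PurePath(file).suffix.lower(), "dataset")
```

An explicit kind argument could still override the guess, mirroring how GMT.jl lets users force a type when the extension is ambiguous.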

@weiji14 weiji14 (Member) commented Apr 16, 2025

  1. Should we place the read source code in pygmt/io.py, or restructure io.py into a directory and put it in pygmt/io/read.py instead?

I think making the io directory sounds good, especially if you're planning on making a write function in the future.

Should we deprecate the load_dataarray function in favor of the new read function?

No, let's keep load_dataarray for now. Something I'm contemplating is to make an xarray BackendEntrypoint that uses GMT read, so that users can then do pygmt.io.load_dataarray(..., engine="gmtread") or something like that. The load_dataarray function would use this new gmtread backend engine by default instead of netcdf4.
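The BackendEntrypoint idea might look roughly like this, a sketch under the assumption of xarray's documented custom-backend API. The class name, description, and extension check are hypothetical; a real implementation would delegate to GMT's read internally:

```python
from xarray.backends import BackendEntrypoint


class GMTReadBackendEntrypoint(BackendEntrypoint):
    """Hypothetical xarray backend delegating to GMT's read module."""

    description = "Read grids/images via GMT (sketch only)"

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # A real implementation would call the PyGMT read machinery here
        # and return the result as an xarray.Dataset.
        raise NotImplementedError("sketch only")

    def guess_can_open(self, filename_or_obj):
        # Claim common grid extensions so xarray can auto-select this
        # backend; illustrative suffixes only.
        return str(filename_or_obj).endswith((".nc", ".grd"))
```

Registered under xarray's backend entry-point group, such a class would let users write something like xr.open_dataset(..., engine="gmtread").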

@seisman seisman removed this from the 0.14.0 milestone Dec 20, 2024
@seisman seisman removed the needs review This PR has higher priority and needs review. label Dec 20, 2024
Comment on lines +15 to +24
@pytest.mark.skipif(not _HAS_NETCDF4, reason="netCDF4 is not installed.")
def test_io_gmtread_grid():
    """
    Test that reading a grid returns an xr.DataArray and the grid is the same as the one
    loaded via xarray.load_dataarray.
    """
    grid = gmtread("@static_earth_relief.nc", kind="grid")
    assert isinstance(grid, xr.DataArray)
    expected_grid = xr.load_dataarray(which("@static_earth_relief.nc", download="a"))
    assert np.allclose(grid, expected_grid)
Member:

Should we also have a similar test for kind="image", comparing against rioxarray.open_rasterio?

Member Author:

Done in a6c4ee7.

When I tried to add a test for reading datasets, I realized that the DataFrame returned by load_sample_data is not ideal:

In [1]: from pygmt.datasets import load_sample_data

In [2]: data = load_sample_data("hotspots")

In [3]: data.dtypes
Out[3]: 
longitude      float64
latitude       float64
symbol_size    float64
place_name      object
dtype: object

The last column, place_name, should have the string dtype rather than object. Other sample datasets have similar issues.

We have three options:

  1. Do nothing and keep them unchanged
  2. Fix and use appropriate dtypes
  3. Use the new gmtread function instead of pd.read_csv in _load_xxx functions.

I'm inclined to option 3.
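For context, the object-vs-string distinction here is generic pandas behavior, not PyGMT-specific. A minimal illustration (the column is built with an explicit object dtype to mirror the sample-data situation):

```python
import pandas as pd

# A text column stored with the catch-all ``object`` dtype, as in the
# hotspots sample dataset above.
df = pd.DataFrame(
    {"place_name": pd.Series(["Hawaii", "Iceland"], dtype=object)}
)
print(df["place_name"].dtype)  # object

# Option 2 in the list above amounts to requesting proper dtypes
# explicitly, e.g. via astype or pd.read_csv's dtype argument.
fixed = df.astype({"place_name": "string"})
print(fixed["place_name"].dtype)  # string
```

Option 3 would instead sidestep pd.read_csv entirely and let the new read path assign dtypes when building the DataFrame.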

Member:

3. Use the new gmtread function instead of pd.read_csv in _load_xxx functions.

I'm inclined to option 3.

Agree with this. We should also add dtype-related checks for the tabular dataset tests in pygmt/tests/test_datasets_samples.py.

Comment on lines +43 to +53
column_names
    A list of column names.
header
    Row number containing column names. ``header=None`` means the column names
    are not parsed from the table header. Ignored if the row number is larger
    than the number of headers in the table.
dtype
    Data type. Can be a single type for all columns or a dictionary mapping
    column names to types.
index_col
    Column to set as index.
Member:

Should we indicate in the docstring that these params are only used for kind="dataset"?

Member Author:

At line 31:

For datasets, keyword arguments column_names, header, dtype, and
index_col are supported.

Comment on lines +16 to +24
def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
    region: Sequence[float] | str | None = None,
    header: int | None = None,
    column_names: pd.Index | None = None,
    dtype: type | Mapping[Any, type] | None = None,
    index_col: str | int | None = None,
) -> pd.DataFrame | xr.DataArray:
Member:

On second thought, I'm wondering whether we should make gmtread a private function for internal use only for now; the fact that it can read either tabular or grid/image files seems like a lot of magic.

Member Author:

It seems the gmtread function is no longer needed if PR #3919 is implemented, right?

Member:

Yes, not needed for grids/images, but we could still use gmtread for tabular datasets? Though let's think about #3673 (comment).


def gmtread(
    file: str | PurePath,
    kind: Literal["dataset", "grid", "image"],
Member:

Does GMT read also handle 'cube'?


@seisman seisman marked this pull request as draft April 17, 2025 02:47
@seisman seisman changed the title Add pygmt.read to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray Add pygmt.gmtread to read a dataset/grid/image into pandas.DataFrame/xarray.DataArray Apr 17, 2025
Labels
feature Brand new feature