WIP: Add return_table helper function #1336

Closed · wants to merge 15 commits
4 changes: 3 additions & 1 deletion examples/gallery/images/track_sampling.py
@@ -20,7 +20,9 @@
grid = pygmt.datasets.load_earth_relief()
points = pygmt.datasets.load_ocean_ridge_points()
# Sample the bathymetry along the world's ocean ridges at specified track points
track = pygmt.grdtrack(points=points, grid=grid, newcolname="bathymetry")
track = pygmt.grdtrack(
    points=points, grid=grid, df_columns=["longitude", "latitude", "bathymetry"]
)

fig = pygmt.Figure()
# Plot the earth relief grid on Cylindrical Stereographic projection, masking land areas
1 change: 1 addition & 0 deletions pygmt/helpers/__init__.py
@@ -15,4 +15,5 @@
    dummy_context,
    is_nonstr_iter,
    launch_external_viewer,
    return_table,
)
55 changes: 55 additions & 0 deletions pygmt/helpers/utils.py
@@ -10,6 +10,9 @@
from collections.abc import Iterable
from contextlib import contextmanager

import geopandas as gpd
Member:
This `import geopandas` line shouldn't be here at the top level. I'd suggest importing geopandas in the return_table function itself if you need it, and only under `elif data_format == "geopandas"`.

Suggested change: remove the top-level `import geopandas as gpd` line.
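As a rough illustration of the lazy-import approach, the conversion could live in a small helper that only imports geopandas when that output format is actually requested; the helper name and the GeoDataFrame construction below are assumptions for this sketch, not something the diff implements:

```python
def _table_to_geodataframe(data_array, df_columns):
    """
    Hypothetical conversion step: build a GeoDataFrame from the parsed array.

    geopandas is imported inside the function so it stays an optional
    dependency and is only loaded when this output format is requested.
    """
    import geopandas as gpd
    import pandas as pd

    frame = pd.DataFrame(data_array, columns=df_columns)
    # Assume the first two columns are x/y coordinates (an assumption made
    # for this sketch, not something specified in the PR).
    return gpd.GeoDataFrame(
        frame, geometry=gpd.points_from_xy(frame.iloc[:, 0], frame.iloc[:, 1])
    )
```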

import numpy as np
import pandas as pd
import xarray as xr
from pygmt.exceptions import GMTInvalidInput

@@ -267,3 +270,55 @@ def args_in_kwargs(args, kwargs):
        If one of the required arguments is in ``kwargs``.
    """
    return any(arg in kwargs for arg in args)


def return_table(result, data_format, format_parameter, df_columns):
Member:
Looks promising! A few comments:

  1. I think you will need to use this with one of the existing functions so that we can test it out in this PR. In my opinion, grdtrack would be a good option.
  2. My preference would be to add a format_options parameter for return_table. This could take a list, with default values including numpy, pandas, and str. This way, if it doesn't make sense for a particular function to return string values, for example, that option could simply be left out of that function's documentation/implementation.
  3. table-like options for input are now str or numpy.ndarray or pandas.DataFrame or xarray.Dataset or geopandas.GeoDataFrame. It would be nice if all of these were output options too.
  4. I would prefer a more descriptive argument for the requested data format than a|d|s. Something like numpy|pandas|str seems more readable.
  5. I think df_columns needs to be optional.
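As a rough sketch of how points 2, 4, and 5 could fit together, something like the following could work; the parameter names output_format and format_options are hypothetical, not part of this PR:

```python
import numpy as np
import pandas as pd

from pygmt.exceptions import GMTInvalidInput


def return_table(
    result,
    output_format="numpy",
    format_options=("numpy", "pandas", "str"),
    df_columns=None,
):
    """Hypothetical variant with descriptive format names and optional columns."""
    if output_format not in format_options:
        raise GMTInvalidInput(
            f"Must specify output_format as one of: {', '.join(format_options)}."
        )
    if output_format == "str":
        return result
    # Assumes purely numeric, whitespace-separated table text.
    rows = [
        [float(value) for value in line.split()]
        for line in result.strip().splitlines()
        if line.strip()
    ]
    data_array = np.array(rows)
    if output_format == "numpy":
        return data_array
    # output_format == "pandas"; df_columns may be None, in which case
    # pandas falls back to integer column labels.
    return pd.DataFrame(data_array, columns=df_columns)
```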

Contributor Author:

> 3. table-like options for input are now `str or numpy.ndarray or pandas.DataFrame or xarray.Dataset or geopandas.GeoDataFrame`. It would be nice if all of these were output options too.

Sounds good; I'll have to get smarter on the last two but I don't see why it should be a problem.

> 4. I would prefer a more descriptive argument for the requested data format than **a**|**d**|**s**. Something like `numpy`|`pandas`|`str` seems more readable.

I like the idea of keeping it short, especially when there is a default option (I anticipate it being a numpy array) and the strings are not also the same word as Python modules or variable types. But I understand how the single letters could be confusing.

> 5. I think `df_columns` needs to be optional.

Since this is a helper function, I envisioned that the argument for df_columns would be set up in the GMT function that is using it, such as using ["x", "y", "z"] when calling this function inside of grd2xyz.
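As a concrete but hypothetical illustration of that pattern, a wrapper such as grd2xyz could fix the column names itself when calling the helper; nothing below is implemented in this PR:

```python
from pygmt.helpers import return_table


def _grd2xyz_output(result_data, data_format="d"):
    """
    Hypothetical wrapper step: the calling module supplies df_columns, so
    the user never has to pass column names for grd2xyz output.
    """
    return return_table(
        result=result_data,
        data_format=data_format,
        format_parameter="data_format",
        df_columns=["x", "y", "z"],
    )
```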

Contributor Author (@willschlitzer, Jun 25, 2021):

> 3. table-like options for input are now `str or numpy.ndarray or pandas.DataFrame or xarray.Dataset or geopandas.GeoDataFrame`. It would be nice if all of these were output options too.

Added in c8cef5e

Member:

> I would prefer a more descriptive argument for the requested data format than a|d|s. Something like numpy|pandas|str seems more readable.

> I like the idea of keeping it short, especially when there is a default option (I anticipate it being a numpy array) and the strings are not also the same word as Python modules or variable types. But I understand how the single letters could be confusing.

I'll have to agree with Meghan that long descriptive names like numpy|pandas|str are preferable 🙂

r"""
Take the table output from the GMT API and return it as either a string,
array, or DataFrame.

Parameters
----------
result : str
The table returned from the GMT API as a string.
data_format : str
A single-letter string that specifies requested data format of the
table.
**a** : numpy array
**d** : pandas DataFrame
**s** : string
**x** : xarray DataArray
format_parameter : str
The name of the parameter used to specify the data format in the
pygmt function. This name is used when raising the GMTInvalidInput
error to ensure module-specific parameters are consistent with the
error raised.
df_columns : list
The column names of the returned pandas DataFrame.
"""

if data_format == "s":
return result
data_list = []
for string_entry in result.strip().split("\n"):
float_entry = []
string_list = string_entry.strip().split()
for i in string_list:
try:
float_entry.append(float(i))
except ValueError:
continue
if float_entry != []:
data_list.append(float_entry)
data_array = np.array(data_list)
if data_format == "a":
result = data_array
elif data_format == "x":
result = xr.DataArray(data_array)
elif data_format == "d":
result = pd.DataFrame(data_array, columns=df_columns)
else:
raise GMTInvalidInput(
f"""Must specify {format_parameter} as either a, d, s, or x."""
)
return result
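For reference, a quick usage sketch of the helper as written above; the input string is made up for illustration:

```python
from pygmt.helpers import return_table

# Two rows of tab-separated numeric output, as grdtrack would produce.
sample = "-110.9536\t-42.2489\t-2790.4884\n-32.2971\t37.4118\t-1939.7482\n"

frame = return_table(
    result=sample,
    data_format="d",
    format_parameter="data_format",
    df_columns=["longitude", "latitude", "bathymetry"],
)
print(frame.shape)  # (2, 3)
```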
25 changes: 16 additions & 9 deletions pygmt/src/grdtrack.py
@@ -10,6 +10,7 @@
    data_kind,
    fmt_docstring,
    kwargs_to_strings,
    return_table,
    use_alias,
)

@@ -33,7 +34,14 @@
    n="interpolation",
)
@kwargs_to_strings(R="sequence", S="sequence")
def grdtrack(points, grid, newcolname=None, outfile=None, **kwargs):
def grdtrack(
    points,
    grid,
    data_format="d",
    df_columns=["longitude", "latitude", "z-value"],
Member:
Not sure if it's a good idea to have default column names, especially for the x (longitude) and y (latitude) columns since someone passing in a pandas.DataFrame table with existing column names would have their column names overridden by this default.
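One way to address this, sketched here purely as an assumption (not something implemented in this PR): default df_columns to None and fall back to generic names only when the input carries none of its own.

```python
import pandas as pd


def _resolve_df_columns(points, df_columns=None, newcol="z-value"):
    """
    Hypothetical helper: prefer explicit df_columns, then the input
    DataFrame's own column names plus a name for the sampled column,
    and only then a generic default.
    """
    if df_columns is not None:
        return df_columns
    if isinstance(points, pd.DataFrame):
        return points.columns.to_list() + [newcol]
    return ["longitude", "latitude", newcol]
```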

    outfile=None,
    **kwargs
):
    r"""
    Sample grids at specified (x,y) locations.

@@ -248,9 +256,6 @@ def grdtrack(points, grid, newcolname=None, outfile=None, **kwargs):
        - None if ``outfile`` is set (track output will be stored in file set
          by ``outfile``)
    """
    if data_kind(points) == "matrix" and newcolname is None:
        raise GMTInvalidInput("Please pass in a str to 'newcolname'")

    with GMTTempFile(suffix=".csv") as tmpfile:
        with Session() as lib:
            # Choose how data will be passed into the module
@@ -272,11 +277,13 @@ def grdtrack(points, grid, newcolname=None, outfile=None, **kwargs):

        # Read temporary csv output to a pandas table
        if outfile == tmpfile.name:  # if user did not set outfile, return pd.DataFrame
            try:
                column_names = points.columns.to_list() + [newcolname]
                result = pd.read_csv(tmpfile.name, sep="\t", names=column_names)
            except AttributeError:  # 'str' object has no attribute 'columns'
                result = pd.read_csv(tmpfile.name, sep="\t", header=None, comment=">")
            result_data = tmpfile.read()
            result = return_table(
                result=result_data,
                data_format=data_format,
                format_parameter="data_format",
                df_columns=df_columns,
            )
        elif outfile != tmpfile.name:  # return None if outfile set, output in outfile
            result = None

72 changes: 59 additions & 13 deletions pygmt/tests/test_grdtrack.py
@@ -3,9 +3,11 @@
"""
import os

import numpy as np
import numpy.testing as npt
import pandas as pd
import pytest
import xarray as xr
from pygmt import grdtrack, which
from pygmt.datasets import load_earth_relief, load_ocean_ridge_points
from pygmt.exceptions import GMTInvalidInput
@@ -54,7 +56,12 @@ def test_grdtrack_input_dataframe_and_dataarray(dataarray, dataframe):
    Run grdtrack by passing in a pandas.DataFrame and xarray.DataArray as
    inputs.
    """
    output = grdtrack(points=dataframe, grid=dataarray, newcolname="bathymetry")
    output = grdtrack(
        points=dataframe,
        grid=dataarray,
        data_format="d",
        df_columns=["longitude", "latitude", "bathymetry"],
    )
    assert isinstance(output, pd.DataFrame)
    assert output.columns.to_list() == ["longitude", "latitude", "bathymetry"]
    npt.assert_allclose(output.iloc[0], [-110.9536, -42.2489, -2790.488422])
@@ -84,7 +91,12 @@ def test_grdtrack_input_dataframe_and_ncfile(dataframe, ncfile):
    Run grdtrack by passing in a pandas.DataFrame and netcdf file as inputs.
    """

    output = grdtrack(points=dataframe, grid=ncfile, newcolname="bathymetry")
    output = grdtrack(
        points=dataframe,
        grid=ncfile,
        data_format="d",
        df_columns=["longitude", "latitude", "bathymetry"],
    )
    assert isinstance(output, pd.DataFrame)
    assert output.columns.to_list() == ["longitude", "latitude", "bathymetry"]
    npt.assert_allclose(output.iloc[0], [-32.2971, 37.4118, -1939.748245])
@@ -118,7 +130,7 @@ def test_grdtrack_wrong_kind_of_points_input(dataarray, dataframe):

    assert data_kind(invalid_points) == "grid"
    with pytest.raises(GMTInvalidInput):
        grdtrack(points=invalid_points, grid=dataarray, newcolname="bathymetry")
        grdtrack(points=invalid_points, grid=dataarray)


def test_grdtrack_wrong_kind_of_grid_input(dataarray, dataframe):
@@ -130,22 +142,56 @@ def test_grdtrack_wrong_kind_of_grid_input(dataarray, dataframe):

    assert data_kind(invalid_grid) == "matrix"
    with pytest.raises(GMTInvalidInput):
        grdtrack(points=dataframe, grid=invalid_grid, newcolname="bathymetry")
        grdtrack(points=dataframe, grid=invalid_grid)


def test_grdtrack_without_newcolname_setting(dataarray, dataframe):
def test_grdtrack_without_outfile_setting(csvfile, ncfile):
"""
Run grdtrack by not passing in newcolname parameter setting.
Run grdtrack by not passing in outfile parameter setting.
"""
with pytest.raises(GMTInvalidInput):
grdtrack(points=dataframe, grid=dataarray)
output = grdtrack(points=csvfile, grid=ncfile, data_format="a")
npt.assert_allclose(output[0], [-32.2971, 37.4118, -1939.748245])
return output


def test_grdtrack_without_outfile_setting(csvfile, ncfile):
def test_grdtrack_output_types(dataarray):
"""
Run grdtrack by not passing in outfile parameter setting.
Tests output formats for grdtrack.
"""
output = grdtrack(points=csvfile, grid=ncfile)
npt.assert_allclose(output.iloc[0], [-32.2971, 37.4118, -1939.748245])
dataframe = load_ocean_ridge_points()
result_string = grdtrack(
points=dataframe,
grid=dataarray,
data_format="s",
)
assert isinstance(result_string, str)
result_array = grdtrack(
points=dataframe,
grid=dataarray,
data_format="a",
)
assert isinstance(result_array, np.ndarray)
result_df = grdtrack(
points=dataframe,
grid=dataarray,
data_format="d",
)
assert isinstance(result_df, pd.DataFrame)
result_xarray = grdtrack(
points=dataframe,
grid=dataarray,
data_format="x",
)
assert isinstance(result_xarray, xr.core.dataarray.DataArray)

    return output
def test_grdtrack_output_type_fail(dataarray):
"""
Tests that grdtrack fails with an invalid output type specified.
"""
dataframe = load_ocean_ridge_points()
with pytest.raises(GMTInvalidInput):
result_string = grdtrack(
points=dataframe,
grid=dataarray,
data_format="w",
)
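As a side note, a parametrized variant of this failure test could cover several invalid values at once; a sketch only, not part of this PR, and it assumes the same dataarray fixture used by the tests above:

```python
import pytest
from pygmt import grdtrack
from pygmt.datasets import load_ocean_ridge_points
from pygmt.exceptions import GMTInvalidInput


@pytest.mark.parametrize("bad_format", ["w", "q", ""])
def test_grdtrack_output_type_fail_parametrized(dataarray, bad_format):
    """Hypothetical parametrized check that invalid formats are rejected."""
    dataframe = load_ocean_ridge_points()
    with pytest.raises(GMTInvalidInput):
        grdtrack(points=dataframe, grid=dataarray, data_format=bad_format)
```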