Wrap GMT's standard data type GMT_DATASET for table inputs #2729

Merged: 124 commits, Mar 7, 2024
Commits
713ad0a
Add data types GMT_DATASEGMENT, GMT_DATATABLE and GMT_DATASET
seisman Mar 15, 2023
9c580f9
Updates
seisman Mar 15, 2023
3ce3341
Fix formatting
seisman Mar 16, 2023
b655f88
Finally, a working version
seisman Oct 9, 2023
221efec
Fix the code structure
seisman Oct 9, 2023
6f4a651
Simplify the two options for grd2xyz
seisman Oct 9, 2023
9a1dc0c
fix
seisman Oct 9, 2023
1843b35
Add docstrings to pygmt/datatypes.py
seisman Oct 9, 2023
bd52024
Improve the docstrings
seisman Oct 9, 2023
19eb3aa
Get rid of temporary files from grdtrack
seisman Oct 9, 2023
8546048
Revert "Get rid of temporary files from grdtrack"
seisman Oct 10, 2023
76944a8
pygmt.grdtrack: Support consistent table-like outputs
seisman Oct 9, 2023
bac13bd
Update to virtualfile_to_data which can also be used for grids
seisman Oct 10, 2023
d59092b
Merge branch 'datatypes/gmtdataset' into tempfile/grdtrack
seisman Oct 10, 2023
ba6d94e
Add read_virtualfile_to_data to simplify the logic
seisman Oct 10, 2023
9cd1502
Merge branch 'datatypes/gmtdataset' into tempfile/grdtrack
seisman Oct 10, 2023
589a9dd
Fix a typo
seisman Oct 10, 2023
ff7459c
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 10, 2023
7b485b0
Merge branch 'datatypes/gmtdataset' into tempfile/grdtrack
seisman Oct 10, 2023
47a79e1
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 11, 2023
a06a5b4
Merge read_virtualfile and read_virtualfile_to_data into a single fun…
seisman Oct 11, 2023
7e59970
Merge branch 'datatypes/gmtdataset' into tempfile/grdtrack
seisman Oct 11, 2023
f5a9d44
Refactor the virtualfile_to_data function to support writing to a rea…
seisman Oct 11, 2023
276b6f7
Merge branch 'datatypes/gmtdataset' into tempfile/grdtrack
seisman Oct 11, 2023
5470495
Simplify the codes following the gmtdataset changes
seisman Oct 11, 2023
555bfe3
Move gmtdataset_to_vectors as a method of the GMT_DATASET class
seisman Oct 11, 2023
40510f5
Merge branch 'datatypes/gmtdataset' into tempfile/grdtrack
seisman Oct 11, 2023
6d26c10
Update after gmtdataset changes
seisman Oct 11, 2023
2205cbb
Add Pythonic objects
seisman Oct 11, 2023
4e67ebf
Add more notes about GMT.jl implementation
seisman Oct 11, 2023
4707b87
Refactor the codes using nested classes
seisman Oct 12, 2023
013f4bb
Add more examples
seisman Oct 12, 2023
19fd164
Deal with text column
seisman Oct 12, 2023
0772711
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 13, 2023
0b8733d
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 25, 2023
87f4cf0
Merge branch 'tempfile/grdtrack' into datatypes/gmtdataset
seisman Oct 25, 2023
3cff015
Fix linting issues
seisman Oct 25, 2023
9843eca
Standarize virtual file names
seisman Oct 25, 2023
9af418a
Improve the GMT_DATASET doctest
seisman Oct 25, 2023
7160a4f
Fix a typo
seisman Oct 25, 2023
6e07b95
Improve the doctest for GMT_DATASET.to_vectors()
seisman Oct 25, 2023
7da3654
Add more comments to the GMT_DATASET.to_vectors function
seisman Oct 25, 2023
afecfc0
Remove the GMT_DATASET.to_vectors_v2 method
seisman Oct 25, 2023
89efc18
Remove the GMT_DATASET.to_pydata method to focus on the GMT dataset s…
seisman Oct 25, 2023
c7a0982
Fix linting issues
seisman Oct 25, 2023
3c46aca
Remove two blank lines
seisman Oct 25, 2023
78c3959
Let pandas deal with the conversion to numpy
seisman Oct 25, 2023
10d0c1e
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 26, 2023
c86208c
Disable some pylint warnings
seisman Oct 26, 2023
caa9d10
Let pandas deal with 1-D arrays directly
seisman Oct 26, 2023
d37efcf
Fix linting issues
seisman Oct 27, 2023
5c3f1b1
Add a return_table function
seisman Oct 27, 2023
cd11f98
Add the validate_output_type function to check the output_type parameter
seisman Oct 27, 2023
4a42c4d
Use pd.DataFrame.from_dict to construct the DataFrame object
seisman Oct 28, 2023
de31384
Refactor grdvolume.py
seisman Oct 28, 2023
333c5eb
Refactor select.py
seisman Oct 28, 2023
4e14d30
Merge branch 'validators/output_type' into datatypes/gmtdataset
seisman Oct 28, 2023
061355f
Use validate_output_type
seisman Oct 28, 2023
f36448f
Refactor blockm*
seisman Oct 28, 2023
1ca2b76
Refactor filter1d
seisman Oct 28, 2023
2760a21
Fix a bug in filter1d test
seisman Oct 28, 2023
e99cd1e
Refactor project.py
seisman Oct 28, 2023
4ff0a15
Refactor the table part of grdhisteq
seisman Oct 28, 2023
1bd46ad
Refactor the table part of triangulate
seisman Oct 28, 2023
48f94cf
Fix a bug in grdhisteq
seisman Oct 29, 2023
49625df
Fix grdhisteq
seisman Oct 29, 2023
a3c37fa
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 30, 2023
473dc7b
Fix merge errors
seisman Oct 30, 2023
df48ada
grdtrack: Use validate_output_table_type
seisman Oct 30, 2023
fd95fa1
Merge branch 'main' into datatypes/gmtdataset
seisman Oct 31, 2023
232973a
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 1, 2023
3b5e4f8
Formatting
seisman Nov 1, 2023
0f1deec
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 8, 2023
c1a95b4
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 10, 2023
9f701b7
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 13, 2023
3d5147d
Consistently use column_names
seisman Nov 13, 2023
b9454ce
Always convert text data to string dtype
seisman Nov 14, 2023
8742467
Change the to_vectors method to to_dataframe which returns a pd.DataF…
seisman Nov 14, 2023
2a2b607
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 16, 2023
2ca768f
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 18, 2023
b7e0823
Requires pandas>=1.2.0
seisman Nov 18, 2023
e9de4bb
Fix formatting
seisman Nov 18, 2023
b09b13a
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 21, 2023
2cba0c7
Remove pylint directives
seisman Nov 21, 2023
99fc619
Merge branch 'main' into datatypes/gmtdataset
seisman Nov 26, 2023
2baa7af
Merge branch 'main' into datatypes/gmtdataset
seisman Dec 4, 2023
1754e9a
Minor fix
seisman Dec 4, 2023
af96106
Merge branch 'main' into datatypes/gmtdataset
seisman Dec 19, 2023
d7d0164
Merge branch 'main' into datatypes/gmtdataset
seisman Dec 25, 2023
ccb9f47
Merge branch 'main' into datatypes/gmtdataset
seisman Jan 1, 2024
96643ad
Merge branch 'main' into datatypes/gmtdataset
seisman Jan 2, 2024
6ddab7f
Merge branch 'main' into datatypes/gmtdataset
seisman Jan 7, 2024
d70a7c4
Rewrap and add type hints
seisman Jan 7, 2024
a73927e
Temporarily enable benchmarks
seisman Jan 7, 2024
cad0bbe
Merge branch 'main' into datatypes/gmtdataset
seisman Feb 19, 2024
79c499d
Move dataset definition into pygmt/datatypes/dataset.py
seisman Feb 19, 2024
c4d47db
Remove unused imports
seisman Feb 19, 2024
1fdb9f9
Merge branch 'main' into datatypes/gmtdataset
seisman Feb 20, 2024
76a09f0
Fix open_virtual_file to open_virtualfile
seisman Feb 20, 2024
9a035ef
Improve docstrings
seisman Feb 20, 2024
828c9c1
Add doctests for virtualfile_to_data
seisman Feb 20, 2024
4a5eea3
isort
seisman Feb 20, 2024
a878635
clib.Session: Add the virtualfile_to_data method for creating virtual…
seisman Feb 20, 2024
c72701e
Improve docstrings
seisman Feb 20, 2024
761aff4
Improve the return_table function
seisman Feb 20, 2024
279eeb4
column_names default to None
seisman Feb 21, 2024
e933497
Update select
seisman Feb 21, 2024
d1f3150
Update blockm
seisman Feb 21, 2024
77739f8
Update grdtrack
seisman Feb 21, 2024
4815c59
Merge branch 'virtualfile_to_data' into datatypes/gmtdataset
seisman Feb 21, 2024
2b1d565
Remove the old codes
seisman Feb 21, 2024
9ddca59
Merge branch 'main' into datatypes/gmtdataset
seisman Feb 28, 2024
a21b2df
Revert non-related changes and focus on the _GMT_DATASET class
seisman Feb 28, 2024
180f3ec
Minor fixes and improvements
seisman Feb 28, 2024
84b3bf4
Merge branch 'main' into datatypes/gmtdataset
seisman Feb 29, 2024
6e71924
Merge branch 'main' into datatypes/gmtdataset
seisman Mar 1, 2024
08be754
Merge branch 'main' into datatypes/gmtdataset
seisman Mar 2, 2024
cc5c4ec
Move the object->str conversion of text column to the to_dataframe me…
seisman Mar 2, 2024
f654c93
Merge branch 'main' into datatypes/gmtdataset
seisman Mar 5, 2024
e5698a5
Merge branch 'main' into datatypes/gmtdataset
seisman Mar 6, 2024
661af0b
Apply suggestions from code review
seisman Mar 6, 2024
394a054
Update pygmt/datatypes/dataset.py
seisman Mar 6, 2024
63ed85f
Merge branch 'main' into datatypes/gmtdataset
seisman Mar 7, 2024
17d2b4c
Remove a blank line
seisman Mar 7, 2024
206 changes: 205 additions & 1 deletion pygmt/datatypes/dataset.py
@@ -3,7 +3,211 @@
"""

import ctypes as ctp
from typing import ClassVar

import numpy as np
import pandas as pd


class _GMT_DATASET(ctp.Structure):  # noqa: N801
    """
    GMT dataset structure for holding multiple tables (files).

    This class is only meant for internal use by PyGMT and is not exposed to users.
    See the GMT source code gmt_resources.h for the original C struct definitions.

    Examples
    --------
    >>> from pygmt.helpers import GMTTempFile
    >>> from pygmt.clib import Session
    >>>
    >>> with GMTTempFile(suffix=".txt") as tmpfile:
    ...     # Prepare the sample data file
    ...     with open(tmpfile.name, mode="w") as fp:
    ...         print(">", file=fp)
    ...         print("1.0 2.0 3.0 TEXT1 TEXT23", file=fp)
    ...         print("4.0 5.0 6.0 TEXT4 TEXT567", file=fp)
    ...         print(">", file=fp)
    ...         print("7.0 8.0 9.0 TEXT8 TEXT90", file=fp)
    ...         print("10.0 11.0 12.0 TEXT123 TEXT456789", file=fp)
    ...     # Read the data file
    ...     with Session() as lib:
    ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
    ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
    ...             # The dataset
    ...             ds = lib.read_virtualfile(vouttbl, kind="dataset").contents
    ...             print(ds.n_tables, ds.n_columns, ds.n_segments)
    ...             print(ds.min[: ds.n_columns], ds.max[: ds.n_columns])
    ...             # The table
    ...             tbl = ds.table[0].contents
    ...             print(tbl.n_columns, tbl.n_segments, tbl.n_records)
    ...             print(tbl.min[: tbl.n_columns], tbl.max[: tbl.n_columns])
    ...             for i in range(tbl.n_segments):
    ...                 seg = tbl.segment[i].contents
    ...                 for j in range(seg.n_columns):
    ...                     print(seg.data[j][: seg.n_rows])
    ...                 print(seg.text[: seg.n_rows])
    1 3 2
    [1.0, 2.0, 3.0] [10.0, 11.0, 12.0]
    3 2 4
    [1.0, 2.0, 3.0] [10.0, 11.0, 12.0]
    [1.0, 4.0]
    [2.0, 5.0]
    [3.0, 6.0]
    [b'TEXT1 TEXT23', b'TEXT4 TEXT567']
    [7.0, 10.0]
    [8.0, 11.0]
    [9.0, 12.0]
    [b'TEXT8 TEXT90', b'TEXT123 TEXT456789']
    """

    class _GMT_DATATABLE(ctp.Structure):  # noqa: N801
        """
        GMT datatable structure for holding a table with multiple segments.
        """

        class _GMT_DATASEGMENT(ctp.Structure):  # noqa: N801
            """
            GMT datasegment structure for holding a segment with multiple columns.
            """

            _fields_: ClassVar = [
                # Number of rows/records in this segment
                ("n_rows", ctp.c_uint64),
                # Number of fields in each record
                ("n_columns", ctp.c_uint64),
                # Minimum coordinate for each column
                ("min", ctp.POINTER(ctp.c_double)),
                # Maximum coordinate for each column
                ("max", ctp.POINTER(ctp.c_double)),
                # Data x, y, and possibly other columns
                ("data", ctp.POINTER(ctp.POINTER(ctp.c_double))),
                # Label string (if applicable)
                ("label", ctp.c_char_p),
                # Segment header (if applicable)
                ("header", ctp.c_char_p),
                # Text beyond the data
                ("text", ctp.POINTER(ctp.c_char_p)),
                # Book-keeping variables "hidden" from the API
                ("hidden", ctp.c_void_p),
            ]

        _fields_: ClassVar = [
            # Number of file header records (0 if no header)
            ("n_headers", ctp.c_uint),
            # Number of columns (fields) in each record
            ("n_columns", ctp.c_uint64),
            # Number of segments in the array
            ("n_segments", ctp.c_uint64),
            # Total number of data records across all segments
            ("n_records", ctp.c_uint64),
            # Minimum coordinate for each column
            ("min", ctp.POINTER(ctp.c_double)),
            # Maximum coordinate for each column
            ("max", ctp.POINTER(ctp.c_double)),
            # Array with all file header records, if any
            ("header", ctp.POINTER(ctp.c_char_p)),
            # Pointer to array of segments
            ("segment", ctp.POINTER(ctp.POINTER(_GMT_DATASEGMENT))),
            # Book-keeping variables "hidden" from the API
            ("hidden", ctp.c_void_p),
        ]

    _fields_: ClassVar = [
        # The total number of tables (files) contained
        ("n_tables", ctp.c_uint64),
        # The number of data columns
        ("n_columns", ctp.c_uint64),
        # The total number of segments across all tables
        ("n_segments", ctp.c_uint64),
        # The total number of data records across all tables
        ("n_records", ctp.c_uint64),
        # Minimum coordinate for each column
        ("min", ctp.POINTER(ctp.c_double)),
        # Maximum coordinate for each column
        ("max", ctp.POINTER(ctp.c_double)),
        # Pointer to array of tables
        ("table", ctp.POINTER(ctp.POINTER(_GMT_DATATABLE))),
        # The datatype (numerical, text, or mixed) of this dataset
        ("type", ctp.c_int32),
        # The geometry of this dataset
        ("geometry", ctp.c_int32),
        # To store a referencing system string in PROJ.4 format
        ("ProjRefPROJ4", ctp.c_char_p),
        # To store a referencing system string in WKT format
        ("ProjRefWKT", ctp.c_char_p),
        # To store a referencing system EPSG code
        ("ProjRefEPSG", ctp.c_int),
        # Book-keeping variables "hidden" from the API
        ("hidden", ctp.c_void_p),
    ]

    def to_dataframe(self) -> pd.DataFrame:
        """
        Convert a _GMT_DATASET object to a :class:`pandas.DataFrame` object.

        Currently, the number of columns in all segments of all tables is assumed to
        be the same. The same column in all segments of all tables is concatenated.
        The trailing text column is also concatenated as a single string column.

        Returns
        -------
        df
            A :class:`pandas.DataFrame` object.

        Examples
        --------
        >>> from pygmt.helpers import GMTTempFile
        >>> from pygmt.clib import Session
        >>>
        >>> with GMTTempFile(suffix=".txt") as tmpfile:
        ...     # Prepare the sample data file
        ...     with open(tmpfile.name, mode="w") as fp:
        ...         print(">", file=fp)
        ...         print("1.0 2.0 3.0 TEXT1 TEXT23", file=fp)
        ...         print("4.0 5.0 6.0 TEXT4 TEXT567", file=fp)
        ...         print(">", file=fp)
        ...         print("7.0 8.0 9.0 TEXT8 TEXT90", file=fp)
        ...         print("10.0 11.0 12.0 TEXT123 TEXT456789", file=fp)
        ...     with Session() as lib:
        ...         with lib.virtualfile_out(kind="dataset") as vouttbl:
        ...             lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
        ...             ds = lib.read_virtualfile(vouttbl, kind="dataset")
        ...             df = ds.contents.to_dataframe()
        >>> df
              0     1     2                   3
        0   1.0   2.0   3.0        TEXT1 TEXT23
        1   4.0   5.0   6.0       TEXT4 TEXT567
        2   7.0   8.0   9.0        TEXT8 TEXT90
        3  10.0  11.0  12.0  TEXT123 TEXT456789
        >>> df.dtypes.to_list()
        [dtype('float64'), dtype('float64'), dtype('float64'), string[python]]
        """
        # Deal with numeric columns
        vectors = []
        for icol in range(self.n_columns):
            colvector = []
            for itbl in range(self.n_tables):
                dtbl = self.table[itbl].contents
                for iseg in range(dtbl.n_segments):
                    dseg = dtbl.segment[iseg].contents
                    colvector.append(
                        np.ctypeslib.as_array(dseg.data[icol], shape=(dseg.n_rows,))
                    )
            vectors.append(pd.Series(data=np.concatenate(colvector)))

        # Deal with trailing text column
        textvector = []
        for itbl in range(self.n_tables):
            dtbl = self.table[itbl].contents
            for iseg in range(dtbl.n_segments):
                dseg = dtbl.segment[iseg].contents
                if dseg.text:
                    textvector.extend(dseg.text[: dseg.n_rows])
        if textvector:
            vectors.append(
                pd.Series(data=np.char.decode(textvector), dtype=pd.StringDtype())
            )

        df = pd.concat(objs=vectors, axis=1)
        return df
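
Note for readers: the pointer handling in to_dataframe can be tried without GMT installed. The sketch below is not part of this PR. It uses only the standard-library ctypes module; the Segment class, make_segment, and columns_from_segments are hypothetical names invented for illustration, and Segment is a trimmed-down stand-in rather than the real _GMT_DATASEGMENT layout. It shows how slicing a ctypes pointer with an explicit length (as in seg.data[j][: seg.n_rows]) yields Python values, and how the same column is concatenated across segments, with plain lists standing in for the NumPy and pandas objects used in the PR.

```python
import ctypes as ctp


class Segment(ctp.Structure):
    """Hypothetical, trimmed-down stand-in for a GMT data segment."""

    _fields_ = [
        ("n_rows", ctp.c_uint64),
        ("n_columns", ctp.c_uint64),
        ("data", ctp.POINTER(ctp.POINTER(ctp.c_double))),
        ("text", ctp.POINTER(ctp.c_char_p)),
    ]


def make_segment(columns, texts):
    """Build a Segment from Python lists.

    Returns the segment plus the buffers backing its pointers; the caller
    should hold on to the buffers for as long as the segment is in use.
    """
    n_rows, n_columns = len(texts), len(columns)
    col_arrays = [(ctp.c_double * n_rows)(*col) for col in columns]
    col_ptrs = (ctp.POINTER(ctp.c_double) * n_columns)(
        *[ctp.cast(arr, ctp.POINTER(ctp.c_double)) for arr in col_arrays]
    )
    encoded = [t.encode() for t in texts]
    text_array = (ctp.c_char_p * n_rows)(*encoded)
    seg = Segment(
        n_rows,
        n_columns,
        ctp.cast(col_ptrs, ctp.POINTER(ctp.POINTER(ctp.c_double))),
        ctp.cast(text_array, ctp.POINTER(ctp.c_char_p)),
    )
    return seg, (col_arrays, col_ptrs, encoded, text_array)


def columns_from_segments(segments):
    """Concatenate each column across segments, then the trailing text."""
    columns = []
    for icol in range(segments[0].n_columns):
        values = []
        for seg in segments:
            # A ctypes pointer has no length, so the slice bound is required.
            values.extend(seg.data[icol][: seg.n_rows])
        columns.append(values)

    text = []
    for seg in segments:
        if seg.text:  # a NULL pointer is falsy
            text.extend(raw.decode() for raw in seg.text[: seg.n_rows])
    if text:
        columns.append(text)
    return columns


seg1, keep1 = make_segment(
    [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]], ["TEXT1 TEXT23", "TEXT4 TEXT567"]
)
seg2, keep2 = make_segment(
    [[7.0, 10.0], [8.0, 11.0], [9.0, 12.0]], ["TEXT8 TEXT90", "TEXT123 TEXT456789"]
)
columns = columns_from_segments([seg1, seg2])
print(columns[0])  # [1.0, 4.0, 7.0, 10.0]
print(columns[3][0])  # TEXT1 TEXT23
```

The PR itself wraps each numeric column with np.ctypeslib.as_array rather than slicing into a list; the list-based version here only illustrates the traversal order.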