docs: 📝 pseudo code and docstring for write_resource_parquet() #816

Open
wants to merge 7 commits into base: main
22 changes: 15 additions & 7 deletions docs/design/interface/functions.qmd
@@ -101,7 +101,7 @@ more details.

## Data resource functions

### {{< var done >}}`create_resource_structure(path)`
### {{< var done >}} `create_resource_structure(path)`

See the help documentation with `help(create_resource_structure)` for
more details.
@@ -127,13 +127,21 @@ flowchart
function --> out
```

### {{< var wip >}} `write_resource_parquet(raw_files, path)`
### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`
Member Author:

I'm unsure of the naming here. And I'm unsure if it should output a DataFrame and have another function write_resource_parquet() that does the writing.

Contributor:

Hmm, if that DataFrame output is used somewhere else, have 2 functions, otherwise have one function that does the writing as well?
build or create sounds okay to me.
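
For reference, a rough sketch of the two-function split being discussed; all names and signatures here are hypothetical, not settled API:

```python
from pathlib import Path

from polars import DataFrame


# Hypothetical: build and check the merged data without writing it.
def build_resource_data(
    raw_files_path: list[Path], resource_properties: "ResourceProperties"
) -> DataFrame: ...


# Hypothetical: write an already-built DataFrame to the resource's data.parquet.
def write_resource_parquet(
    data: DataFrame, resource_properties: "ResourceProperties"
) -> Path: ...
```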


This function takes the files provided by `raw_files` and merges them
into a `data.parquet` file provided by `path`. Use
`path_resource_data()` to provide the correct path location for `path`
and `path_resource_raw_files()` for the `raw_files` argument. Outputs
the path object of the created file.
See the help documentation with `help(build_resource_parquet)` for more
details.

```{mermaid}
flowchart
in_raw_files_path[/raw_files_path/]
in_properties[/resource_properties/]
function("build_resource_parquet()")
out[("./resources/{id}/data.parquet")]
in_raw_files_path --> function
in_properties --> function
function --> out
```
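
For orientation, here is a minimal usage sketch based on the description and diagram above; the ID arguments and the way `resource_properties` is obtained are placeholders, not settled API:

```python
import seedcase_sprout.core as sp

# Placeholder: however the resource's properties are loaded or created.
resource_properties = ...

sp.build_resource_parquet(
    raw_files_path=sp.path_resource_raw_files(1, 1),
    resource_properties=resource_properties,
)
```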

### {{< var wip >}} `edit_resource_properties(path, properties)`

73 changes: 73 additions & 0 deletions docs/design/interface/pseudocode/build_resource_parquet.py
@@ -0,0 +1,73 @@
# ruff: noqa
from pathlib import Path

import polars
from polars import DataFrame

# `ResourceProperties` and the `check_*` helpers are assumed to be Sprout-internal
# objects that don't exist yet; they are placeholders in this pseudocode.


def build_resource_parquet(
    raw_files_path: list[Path], resource_properties: ResourceProperties
) -> Path:
    """Merge all raw resource file(s) and write into a Parquet file.

    This function takes the file(s) provided by `raw_files_path` and merges them into
    a `data.parquet` file. The Parquet file will be stored at the path found in
    `ResourceProperties.path`. While Sprout generally assumes
    that the files stored in the `resources/raw/` folder have already been
    verified and validated, this function does some quick verification checks
    of the data after reading it into Python from the raw file(s) by comparing
    with the current properties given by the `resource_properties`. All data in the
Comment on lines +9 to +13
Member Author:
Suggested change
While Sprout generally assumes
that the files stored in the `resources/raw/` folder have already been
verified and validated, this function does some quick verification checks
of the data after reading it into Python from the raw file(s) by comparing
with the current properties given by the `resource_properties`. All data in the
While Sprout generally assumes
that the files stored in the `resources/raw/` folder are already correctly
structured and tidy, it still runs checks to ensure the data are correct
by comparing to the properties. All data in the

    `resources/raw/` folder will be merged into one single data object and then
    written back to the Parquet file. The Parquet file will be overwritten.
    If there are any duplicate observation units in the data, only the most recent
    observation unit will be kept. This way, if there are any errors or mistakes
    in older raw files that has been corrected in later files, the mistake can still
Contributor:
Suggested change
in older raw files that has been corrected in later files, the mistake can still
in older raw files that have been corrected in later files, the mistake can still

    be kept, but won't impact the data that will actually be used.

    Examples:

        ``` python
        import seedcase_sprout.core as sp

        sp.write_resource_parquet(
            raw_files_path=sp.path_resources_raw_files(1, 1),
            parquet_path=sp.path_resource_data(1, 1),
            properties_path=sp.path_package_properties(1, 1),
        )
Comment on lines +27 to +31
Contributor:
Does this need to be updated?

        ```

    Args:
        raw_files_path: A list of paths for all the raw files, most commonly stored
            in the `.csv.gz` format. Use `path_resource_raw_files()` to help provide
            the correct paths to the raw files.
        resource_properties: The `ResourceProperties` object that contains the
            properties of the resource you want to create the Parquet file for.

    Returns:
        Outputs the path object of the created Parquet file.
    """
    # Not sure if this is the correct way to verify multiple files.
    [check_is_file(path) for path in raw_files_path]
    check_resource_properties(resource_properties)

    data = read_raw_files(raw_files_path)
    data = drop_duplicate_obs_units(data)

    # This could stay as one function or be split into several smaller check functions.
    check_data(data, resource_properties)

    return write_parquet(data, resource_properties.path)


def write_parquet(data: DataFrame, path: Path) -> Path:
    # A real implementation would likely write the data with Polars, e.g.
    # `data.write_parquet(path)`; kept as a stub in this pseudocode.
    return path


def read_raw_files(paths: list[Path]) -> DataFrame:
    # Can read gzip files.
    data_list = [polars.read_csv(path) for path in paths]
    # Merge them all together.
    data = polars.concat(data_list)
    return data


def drop_duplicate_obs_units(data: DataFrame) -> DataFrame:
    # Drop duplicates based on the observation unit, keeping only the most
    # recent one. This allows older raw files to contain potentially wrong
    # data that was corrected in the most recent file.
    # In Polars this would be `unique()` (not the Pandas-style `drop_duplicates()`);
    # the observation-unit columns would go in `subset`, which isn't known here.
    return data.unique(keep="last")