Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the RoutesDataClass as a generic data format #204

Open
wants to merge 63 commits into
base: main
Choose a base branch
from

Conversation

chenyangkang
Copy link

@chenyangkang chenyangkang commented Jan 18, 2025

  • Add Routes, BirdFlowRoutes, BirdFlowIntervals Make a unified data format (data class) for tracking & banding data #202
  • Add interval validation for data classes.
  • Add conversion functions and helper functions for conversion among the classes. (resolves as_routes() - function to convert movement data to BirdFlowRoutes object. #158)
  • Add generic S3 methods for the three classes:
    • print
    • plot or plot_routes?
  • Add initial tests for the three data classes
  • Add complete tests for the three data classes
  • Update all components in BirdFlowR regarding the generic data format. Use the new data classes to replace them:
    • route in route.R
    • format_trajectory in route.R
    • calc_predictive_distance_metric.R
    • interval_log_likelihood.R
    • make_tracks in function.R of BirdFlowPipeline package.
    • preprocess_with_recovery, preprocess_sort_band_date, preprocess_calc_distance_days in function.R of BirdFlowPipeline package.
    • tracking_data.R in BirdFlowPipeline package.
    • amwo_track_info_new.R in BirdFlowPipeline
    • More to add

Minor changes:

yc85_illinois_edu and others added 30 commits December 10, 2024 23:13
…necessary. If bf not provided, use birdflow_crs in the environment. This will make data loading easier without loading the model.
…necessary. If bf not provided, use birdflow_crs in the environment. This will make data loading easier without loading the model.
…timestep 1 can cause inconsistency. So fix the random state.
@chenyangkang
Copy link
Author

Seems to be a floating-point precision errors induced by %% operation. I changed it in another way using %/% and the problem seems to be solved. Very strange, might depend on system version.

@chenyangkang
Copy link
Author

@ethanplunkett Hi Ethan do you want to review the current changes before I go further to edit other functions? Especially about Routes, BirdFlowRoutes, BirdFlowIntervals (in RoutesDataClass.R), as_BirdFlowRoutes, as_BirdFlowIntervals (in RoutesDataFunction.R).

I still implemented the transformation function as_*, and kept the class functions open to dataframes that are already in the right format (like the dataframe generated by format_trajectory in the route function).

@ethanplunkett
Copy link
Contributor

ethanplunkett commented Jan 20, 2025

@chenyangkangyes. I want to review before it's merged. I have come down with something nasty though so am unlikely to get to it for a few days. I'm currently running a 102 fever.

Overall it's great to see you getting so much done!

I do have a couple thoughts based on what the code looked like Saturday morning when O last looked it over. Disregard anything that you've already addressed.

  1. Ideally we'd have no new Notes in the package check. There were a couple that predate this update that you are unlikely to address.

  2. When you align to birdflow while converting to 'BirdFlowRoutes' use 'snap_to-birdflow()' as that checks validity against the dynamic mask using the date and time simultaneously. It looked like your code was checking them independently which would allow locations that are excluded by the dynamic mask as long as they are in the static mask. I would expose the 'aggregate' argument too and add a 'random' option to mimic the default behavior of the new function.

  3. We need to make sure 'plot_routes()' works with the modified routes. It's a bit of a mess so and the mess is all mine so let me know if you want me to clean it up.

@chenyangkang
Copy link
Author

chenyangkang commented Jan 20, 2025

@ethanplunkett Sorry to hear that! And thanks for the quick feedback. I will check the first two points.

Regarding the plot_routes function, it seems to be working fine with the new class -- The data structure is not changed. Not sure if some hidden issues exist about that "crossing boundary" thing. I attached it and let me know if it doesn't look right. Also I'm thinking how about transforming (renaming) the plot_routes to a generic plot function on the BirdFlowRoutes and Routes class.

Hope you get better soon and we can discuss later!

image

@ethanplunkett
Copy link
Contributor

ethanplunkett commented Jan 20, 2025 via email

Copy link
Contributor

@ethanplunkett ethanplunkett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenyangkang I left lots of specific comments suggesting changes but overall this is great!

Some overarching comments:

  • Stylistically I prefer more succinct names. I don't think that matters enough to change it for the private functions and internal code, but it is worth shortening the public facing function and argument names.
  • I don't remember deciding whether we wanted to base these classes on data frames or lists. I do remember saying data frames might be cool. I like that they can then be treated like data frames but as you discovered manipulating them can lose the attributes, a problem people are less likely to bump up against with a list based object. Let's talk about this in person; I don't have a strong feeling either way but want to make sure we are making an informed decision.

NAMESPACE Outdated
export(sparsify)
export(species)
export(species_info)
export(truncate_birdflow)
export(validate_BirdFlow)
export(validate_BirdFlowIntervals_birdflow_intervals)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should just be validate_BirdFlowIntervals and the following two exports should be validate_Routes and validate_BirdFlowRoutes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... I see there's also a validate_BirdFLowRoutes_geom etc. I still think the public validation functions shouldn't have the almost repeated text (_birdflow_route_df, route_df).


#' @rdname RouteDataClass
#' @export
Routes <- function(route_df, species = NULL, metadata = NULL, source = NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change route_df to data or route_data or just df .

routes |>
dplyr::group_by(route_type) |>
dplyr::summarize(
unique_route_count = dplyr::n_distinct(route_id),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the notes about undefined variables you need to use .data$route_id here.

@@ -60,7 +60,7 @@
#' space index corresponding to x and y coordinates or row and column indices.}
#' }
#'
#' \item{`latlon_to_xy(lat, lon, bf)`}{Returns a two column matrix of the x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was right as it was bf is an argument to latlon_to_xy()

@@ -33,7 +33,7 @@ if (FALSE) {
#' @param radius A point is considered between two locations if it is within
#' `radius` meters (along a great circle) of the great circle line between the
#' locations. `radius` defaults to half the cell size (`mean(res(bf))/2`).
#' @param n_direction The number of (equally spaced) directional bins to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

R/route.R Outdated
points <- points |>
dplyr::group_by(route_id) |>
dplyr::mutate(
date = as.Date(unlist(purrr::accumulate(date, ~ ifelse(.y < .x, .y + lubridate::years(1), .y))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently synthetic routes are allowed to be backwards in time. This assumes forward. We should either change route() to exclude backwards routes or clean things up here. purr is iterating through each pair or adjacent rows so is likely to be slow. It might make sense to fix this in format_trajectory().

# check the features required by direct initiation of BirdFlowIntervals class
for (name in get_target_columns_BirdFlowIntervals(type='input')){
if (!name %in% colnames(birdflow_interval_df)){
stop(sprintf(glue::glue("'{name}' is not found in the input dataframe.")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop dependency on glue and use paste() or sprintf()


#' @rdname target_columns
#' @export
get_target_columns_Routes <- function(type='input'){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These names feel clunky. I'd be tempted not to export them as I don't think they are for end users, but if we do want to export them I think they should be shorter. get_routes_cols() get_birdflowroutes_cols() ? There's probably something better.

@@ -13,7 +13,22 @@ test_that("preprocess_species runs on test dataset", {
expect_no_error(a <- preprocess_species("example_data", hdf5 = FALSE))
expect_no_error(validate_BirdFlow(a, allow_incomplete = TRUE))
expect_error(validate_BirdFlow(a))
expect_true(all((ext(a)[, ] %% xres(a)) < 1e-9)) # Test if origin is at 0, 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is making sure the cells align with the origin and I think is important to keep. It has nothing to do with your code.

I have two thoughts:

  1. I could add code to preprocess species to check if the extent and cell size numbers are close to integers and round them if they are. It wouldn't help when the cell size is floating point (say some sort of spherical coordinates) but would help with all practical applications we've had to date.
  2. We could lower the tolerance on the check to allow the test to pass as is. I can't see the failed test output anymore but if I recall the difference was very small but not quite as small as this test requires.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will leave the current modification as it is -- To suppress the error. I'm not deleting the test -- just using another logic to validate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed the added lines!

Yours is definitely better.

Your test passes if you are within the tolerance either above or below the exact number, whereas mine only passes if you are above but within the tolerance.
What's weird is when it was failing before it was for a value that was slightly too big given the tolerance so, I think, should have failed either test. Definitely leave yours though, it's both theoretically and empirically working better.

Thanks!

@ethanplunkett
Copy link
Contributor

ethanplunkett commented Jan 22, 2025

@ethanplunkett Hi Ethan, could you please help review this bug at your convenience?

(About the failed test due to slight discrepancies in cell alignment, that is unrelated to this PR.)
I made a comment here:
https://github.com/birdflow-science/BirdFlowR/pull/204/files#r1925521860
I'll try to fix it in another branch. It's unrelated to your code.

@ethanplunkett
Copy link
Contributor

RE: plot_routes() that 31 week stay looks suspicious to me.

@chenyangkang
Copy link
Author

RE: plot_routes() that 31 week stay looks suspicious to me.

Oh I see, the elapsed_stay is calculated by row_number int he plot_routes function -- Will change it. It should represent 'days' now.

@ethanplunkett
Copy link
Contributor

Looks like you are close!

This is great.

I have one more comment.

Can you add to the top of the news the way this breaks existing code. Something along the lines of:

Breaking Changes:

  1. Old route objects will no longer plot with the updated plot_routes()

  2. Anyone who is using plot_routes() to plot data not generated with route() will have to update their objects. [describe how]

@chenyangkang
Copy link
Author

@ethanplunkett Done! Could you please help review that next week at your convenience? Ready to be merged.

Still need to do several things:

  1. Go through the historical functions and update the code using the new classes.
  2. Add a plot function for class Routes so that routes in non-BirdFlow coordinates can also be plotted. I think we can use a default CRS and spatiotemporal cell size as argument for that function.

-- These will go to other PRs. I think the current changes are all for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

as_routes() - function to convert movement data to BirdFlowRoutes object.
2 participants