Make a unified data format (data class) for tracking & banding data #202

Open
chenyangkang opened this issue Jan 6, 2025 · 15 comments

@chenyangkang

chenyangkang commented Jan 6, 2025

Considering using R6 to make a data class.

Two data classes:

  1. The first class (say TrackCollection) stores each track (whether tracking data or a capture-recovery pair) as an object (say Track). It can be used for general storage or plotting.
  2. The second class (say PairedTrackCollection) inherits from the first one, but has a process function that is called upon initialization and processes the data into "pairs" (say PairedTrack). This object can be used for evaluation.

Implementing it in BirdFlowR instead of BirdFlowPipeline would help end users test the models with custom data.

@chenyangkang chenyangkang self-assigned this Jan 6, 2025
@ethanplunkett
Contributor

Dave and I talked about this a bunch. There's a related route object generated by the route() function that maybe should also be covered by this class. Also, my preference would be to stick with S3 classes unless there is a really compelling reason to use R6.

@ethanplunkett
Contributor

Note also that the interval log likelihood function has observations and intervals as inputs that I think correspond somewhat with the two types of objects you are thinking about. Although, for that function they are just data frames and the intervals reference but don't contain the observations.

@chenyangkang
Author

Thanks for the suggestions. I will cover the route() function. I will try to use S3 for consistency. I was thinking about R6 as it's most similar to the OOP system that I know (and for my brain).

And yes, I saw the track_info object. I think it would be beneficial to make it a class shared across BirdFlowR and BirdFlowPipeline. We only need to use the tracks in "BirdFlow grids", and all the preprocessing of banding & tracking data should happen when the object is initialized, which would reduce the redundant preprocessing code spread across multiple scripts.

Now my idea is to use only one class, not two. This class will store preprocessed tracks. And a function "process_to_pairs" will add paired data to this class -- a little redundant in terms of data storage but this class can be passed into all functions.

@ethanplunkett
Contributor

R6 handles object-oriented programming in a way that feels very foreign to R. Hadley Wickham's take on it in his introduction to a chapter on R6 is relevant here:

If you’ve learned OOP in another programming language, it’s likely that R6 will feel very natural, and you’ll be inclined to prefer it over S3. Resist the temptation to follow the path of least resistance: in most cases R6 will lead you to non-idiomatic R code. We’ll come back to this theme in Section 16.3.

S3 does support inheritance so you could still have a second class that inherits from the first produced by your stand alone process_to_pairs(). The only reason to do this is if you wanted to dispatch to different functions depending on whether there are pairs present.
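
As a rough illustration of the S3 inheritance described above, here is a minimal sketch. The constructor `new_track_collection()`, the generic `summary_line()`, and the column names are hypothetical and only for illustration; `process_to_pairs()` is the name proposed in this thread, but its body here is an assumption (it simply pairs consecutive rows of the same bird and assumes rows are already sorted by bird and date).

```r
# Hypothetical constructor: a plain list tagged with an S3 class.
new_track_collection <- function(tracks) {
  structure(list(tracks = tracks, pairs = NULL), class = "TrackCollection")
}

# Standalone function that adds pairs of consecutive locations and
# prepends a subclass so that methods can dispatch on it.
# Assumes rows are sorted by bird_id and then by date.
process_to_pairs <- function(x) {
  stopifnot(inherits(x, "TrackCollection"))
  tr <- x$tracks
  n <- nrow(tr)
  same_bird <- tr$bird_id[-n] == tr$bird_id[-1]
  x$pairs <- data.frame(from = which(same_bird), to = which(same_bird) + 1L)
  class(x) <- c("PairedTrackCollection", class(x))
  x
}

# Hypothetical generic with one method per class.
summary_line <- function(x, ...) UseMethod("summary_line")

summary_line.TrackCollection <- function(x, ...) {
  sprintf("%d locations", nrow(x$tracks))
}

summary_line.PairedTrackCollection <- function(x, ...) {
  # NextMethod() falls through to the TrackCollection method.
  paste0(NextMethod(), sprintf(", %d pairs", nrow(x$pairs)))
}

tracks <- data.frame(
  bird_id = c(1, 1, 1, 2, 2),
  lon = c(-72.5, -74.0, -80.1, -70.3, -71.0),
  lat = c(42.4, 40.1, 35.2, 44.0, 43.1)
)
tc <- new_track_collection(tracks)
paired <- process_to_pairs(tc)

summary_line(tc)      # dispatches to the TrackCollection method
summary_line(paired)  # dispatches to the PairedTrackCollection method
```

The subclass only earns its keep if some function needs to behave differently when pairs are present; otherwise a single class with an optional `pairs` component would do.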

I'm not sure what "BirdFlow grids" refers to in your comment above, but be aware that aligning tracks with a BirdFlow model is model specific. Even if two models have the same extent, CRS, and cell size, the location indices won't be consistent because they will have different masks. Thus location indices, row and column indices, CRS (and projected x and y coordinates), and even timesteps (if it's not a full year model) can all be inconsistent between models. Additionally, depending on the eBird year the model is trained on, the week of the year that a date aligns with may change. If we want to include any space or time information other than latitude, longitude, and a date-time, we should also be sure to include the spatial and possibly temporal references: the $geom and $dates components of a BirdFlow model.

Before you implement the class I think we should hammer out its structure and make sure it's flexible enough to handle banding, tracking, Motus, and synthetic tracks (from route()).

For reference, the BirdFlowRoutes class is a data frame with some extra attributes for convenience. I didn't anticipate that it would ever encompass non-synthetic bird movement information, but I know track and banding data have been converted to BirdFlowRoutes objects for use in the plot_routes() function.

BirdFlowRoutes objects are at their core data frames with the following columns:

  • x, y: Projected coordinates in the BirdFlow model's CRS.
  • route_id: Integer from 1 to the number of unique routes in the object.
  • timestep: Timestep in the BirdFlow model associated with each location
  • date: Date associated with that timestep
  • i: Location index for each location in the BirdFlow model
  • stay_id: A unique ID for each location visited
  • stay_len: How long the bird was at that location.

An example of what that looks like:

x        y        route_id  timestep  date        i    stay_id  stay_len
 225000  -825000  1         2         2021-01-11  300  1        3
 225000  -825000  1         3         2021-01-18  300  1        3
 225000  -825000  1         4         2021-01-25  300  1        3
-525000  -225000  1         5         2021-02-01  227  2        1
-675000  -225000  1         6         2021-02-08  226  3        2
-675000  -225000  1         7         2021-02-15  226  3        2

Note that each timestep has a row even if the bird didn't move. A mistake I made is that the date is always the nominal date from the model, even though a route can cross into a second year, so the transition from December to January goes backwards in time:

x        y        route_id  timestep  date        i    stay_id  stay_len
-225000  -375000  1         51        2021-12-21  246  1        7
-225000  -375000  1         52        2021-12-28  246  1        7
-225000  -375000  1         1         2021-01-04  246  1        7

I regret this and would like to fix this by incrementing the year in this situation. That would simplify the plotting code and make it work better with non-synthetic routes.
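
A minimal sketch of what incrementing the year could look like, assuming rows are in chronological order within each route_id and that a drop in timestep marks the December to January wrap. The function name `fix_route_years()` is hypothetical and is not part of BirdFlowR.

```r
# Hypothetical helper: bump the year each time the timestep wraps within
# a route. Assumes rows are in chronological order within each route_id.
fix_route_years <- function(routes) {
  for (id in unique(routes$route_id)) {
    sel <- which(routes$route_id == id)
    # A negative step in timestep marks a December -> January wrap;
    # the cumulative count is the number of years to add.
    wraps <- cumsum(c(0L, diff(routes$timestep[sel]) < 0))
    year <- as.integer(format(routes$date[sel], "%Y")) + wraps
    routes$date[sel] <- as.Date(
      paste0(year, format(routes$date[sel], "-%m-%d"))
    )
  }
  routes
}

# The three rows from the example above (only the relevant columns).
routes <- data.frame(
  route_id = c(1, 1, 1),
  timestep = c(51, 52, 1),
  date = as.Date(c("2021-12-21", "2021-12-28", "2021-01-04"))
)
fixed <- fix_route_years(routes)
fixed$date
```

With this change the dates within a route increase monotonically, which is what the plotting code and non-synthetic routes would expect.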

The BirdFlowRoutes objects also include as attributes three components of the parent BirdFlow model: geom, metadata, and species, which are used when plotting if the bf argument isn't supplied. The object inherits from a data frame so it functions as a data frame in almost every way. The primary exception is that it has its own print() method.

@ethanplunkett
Contributor

ethanplunkett commented Jan 8, 2025

Related existing data formats in BirdFlowR are the inputs to interval_log_likelihood() and BirdFlowRoutes mentioned above.

A possible structure for this object is a list with components:

  • locations data frame. possibly use observations as the name but that is not a good name for synthetic routes
    • id Unique id for each row in locations.
    • bird_id Unique ID for each bird in locations. Do we need additional ID columns for some of these data sets?
    • lat, lon latitude and longitude in EPSG:4326
    • date_time Date and time in UTC - is there a convention in any of these datasets with regard to time? What time should we use if there is only a date in the source data? We could just call this date and allow it to be either a date or a date-time object.
    • stay_id - optional sequential integer within each ID indicating unique locations. is this necessary?
    • stay_len - optional stay length Necessary? Units?
  • intervals - data frame defining pairs of locations to be used for evaluation
    • from - locations id for start of movement
    • to - locations id for end of movement
  • species - similar to a BirdFlow object species this is a list of species information derived from ebirdst we could add additional information if necessary. Omit the quality items.
  • source - list with information on the source of the data. We need to flesh this out
    • citation
    • contact
    • limits (limits to use, distribution)
    • type one of "motus", "bands", "tracks", "synthetic". Would we want to break tracks down more: "gps", "light", "light-pressure", others?

We could optionally include the geom component of a BirdFlow model either as a list item or attribute - primarily for use with plot_routes() probably not though.

I've written a lot for one comment but I realize that we probably also want to allow an aligned version of the object that has been snapped to a BirdFlow model's cells and timesteps. In this case we'd definitely want to add the geom component and would probably add i, x, y, timestep, n, error, and message to the first table - see snap_to_birdflow for descriptions of these columns.
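
To make the proposed (unaligned) structure concrete, here is a small sketch that builds the list by hand. All values are made up for illustration, the species entry is a placeholder rather than real ebirdst output, and the two checks at the end are only an assumption about what a validator might want to verify.

```r
# Locations table: one row per observed (or synthetic) location.
locations <- data.frame(
  id = 1:4,
  bird_id = c("A", "A", "A", "B"),
  lat = c(42.4, 40.1, 35.2, 44.0),
  lon = c(-72.5, -74.0, -80.1, -70.3),
  date_time = as.POSIXct(
    c("2021-09-01 12:00", "2021-09-08 12:00",
      "2021-09-15 12:00", "2021-09-01 12:00"),
    tz = "UTC"
  )
)

# Intervals reference rows of `locations` by id rather than copying them.
intervals <- data.frame(from = c(1L, 2L), to = c(2L, 3L))

obj <- list(
  locations = locations,
  intervals = intervals,
  species = list(common_name = "American Woodcock"),  # placeholder value
  source = list(
    citation = NA_character_,
    contact = NA_character_,
    limits = NA_character_,
    type = "tracks"
  )
)

# Illustrative consistency checks (an assumption, not an agreed design):
# every interval endpoint must exist, and both ends must be the same bird.
valid_refs <- all(c(obj$intervals$from, obj$intervals$to) %in% obj$locations$id)
from_bird <- obj$locations$bird_id[match(obj$intervals$from, obj$locations$id)]
to_bird <- obj$locations$bird_id[match(obj$intervals$to, obj$locations$id)]
same_bird <- all(from_bird == to_bird)
```

Keeping intervals as references avoids duplicating coordinates, at the cost of a lookup whenever the paired locations are needed.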

@ethanplunkett
Contributor

ethanplunkett commented Jan 8, 2025

Issue #158 is related to this and will be resolved if we include aligned routes.

@chenyangkang
Author

chenyangkang commented Jan 8, 2025

Thanks for the clarification!

Thus location indices, row and column indices, CRS (and projected x and y coordinates), and even timesteps (if it's not a full year model) can all be inconsistent between models.

I understand it, but will at least the CRS be consistent across models? @ethanplunkett

@ethanplunkett
Contributor

No the CRS isn't fixed. It's an input to preprocess_species() which actually will default to a different custom projection for each species. In our big runs we've used that argument to input a common projection that works well for the Americas.

@chenyangkang
Author

chenyangkang commented Jan 9, 2025

@ethanplunkett Hey Ethan, based on the need and existing structures, I propose the following structure. Nothing settled and happy to discuss.

  • check_valid.R (I suggest putting all checking functions or is_valid function in this script)

    • Generic function: check_TrackDataClass (the method will be called to check input data and output object during instance initiation)
      • check_TrackDataClass.Track
      • check_TrackDataClass.BirdFlowTrack
      • check_TrackDataClass.TrackCollection
      • check_TrackDataClass.BirdFlowTrackCollection
      • check_TrackDataClass.PairedBirdFlowTrackCollection
  • TrackDataClass.R

    • Track
      • class: Track
      • structure:
        • data: instance of data.frame: track_lon, track_lat, track_id, track_date, track_info
        • track_type: string, one of 'banding', 'tracking', 'motus'
        • species: input one of common_name, scientific_name, ebird_code
        • source (I’m not sure if we need a source for loaded tracking/banding data as it will only be for internal use)
      • Method:
        • print
        • to_BirdFlowTrack(track, bf, masked_only=TRUE): return a BirdFlowTrack object. Argument: masked_only — whether to include only points that fall within the dynamic mask.
    • BirdFlowTrack (to save a track projected to BirdFlow spatiotemporal coordinate systems; “aligned”)
      • Inherit from Track, add class name BirdFlowTrack
      • structure:
        • data: In addition to Track class: track_x, track_y, track_i, timestep, stay_id, stay_len
        • Additional track_type: 'synthetic'
        • species: Crosswalk the custom input of species name in Track object into BirdFlow naming.
      • Method:
        • print
    • TrackCollection
      • class: TrackCollection
      • structure:
        • data: a list of Track objects
      • Method:
        • print
        • to_df: summarize all tracks into a single data.frame.
        • reset_index: reindexing all the tracks with new sorted track_id. Return a new TrackCollection object.
        • check_input_Track_list: check that every input track is a Track object.
        • to_BirdFlowTrackCollection(track_collection, bf, masked_only=TRUE): return a BirdFlowTrackCollection object. Argument: masked_only — whether to include only points that fall within the dynamic mask.
    • BirdFlowTrackCollection (formerly BirdFlowRoutes, save all tracks projected to BirdFlow spatiotemporal coordinate system)
      • Inherit from TrackCollection, add class name BirdFlowTrackCollection
      • structure:
        • data: a list of BirdFlowTrack objects
        • species
        • geom (due to the size of ‘geom’, it is only available at collection level)
        • dates (similarly, only available at collection level)
      • Method:
        • print
        • to_df: summarize all tracks into a single data.frame.
        • reset_index: reindexing all the tracks with new sorted track_id. Return a new BirdFlowTrackCollection object.
        • check_input_BirdFlowTrack_list: check that every input track is a BirdFlowTrack object.
        • to_PairedBirdFlowTrackCollection: return a PairedBirdFlowTrackCollection object.
    • PairedBirdFlowTrackCollection
      • class: PairedBirdFlowTrackCollection
      • structure:
        • data: a dataframe with columns track_lon1, track_lon2, track_lat1, track_lat2, track_x1, track_x2, track_y1, track_y2, track_i1, track_i2, track_date1, track_date2, track_timestep1, track_timestep2, track_id, track_info.
        • species
        • geom
        • dates
      • Method:
        • print
        • to_df: summarize all tracks into a single data.frame.
        • reset_index.

——

BirdFlowTrack inherits from Track,
BirdFlowTrackCollection inherits from TrackCollection.

Track and TrackCollection are for raw data. BirdFlowTrack, BirdFlowTrackCollection, and PairedBirdFlowTrackCollection are in BirdFlow spatiotemporal coords.

——

Possible workflow:

  1. Raw tracking/banding -> Track -> TrackCollection -> BirdFlowTrackCollection (for evaluation of track stats).
  2. BF model generated routes -> BirdFlowTrack -> BirdFlowTrackCollection (for plotting).
  3. BF model generated routes -> BirdFlowTrack -> BirdFlowTrackCollection -> PairedBirdFlowTrackCollection (for evaluation of predictive distance metrics and log_likelihood).

The goal of these workflows is to make sure that every function in BirdFlowR & BirdFlowPipeline has a suitable data class as input, so that functions only need a class check rather than within-function data checks. For interval_log_likelihood(), I think taking a PairedBirdFlowTrackCollection as input would be more concise than referencing a separate interval data frame, would save a bunch of data checking, and could be shared with other functions like calc_distance_metric (which I wrote).

@ethanplunkett
Contributor

ethanplunkett commented Jan 9, 2025

Some initial reactions.

  • I think we are on the right track.
  • I like the distinction between a Track (unaligned) and a BirdFlowTrack (aligned). Specifically that the BirdFlow prefix implies that it's been aligned.
  • I might prefer "Route" to "Track" as a generic term for both synthetic and observed movements especially given that one type of observational data is "tracking". I don't feel super strongly about this. Let's bounce it off the group.
  • It feels like too many classes. My gut is to drop the "Collection" components and instead allow the data frames in the BirdFlowTrack and Track objects to include multiple tracks - possibly adding an "s" to their names. This is in part because I can't imagine I'd do anything other than combine the constituent data frames from the collection object's list items into one data frame. This also avoids the issue of only including the geom component at the collection level. With the BirdFlowTrack as defined above not including the geom component, they don't really stand alone. This would also allow us to let the Tracks and BirdFlowTracks classes inherit from data.frame, which might be elegant.
  • I think we need a different name than PairedTrackCollection as that implies to me that it is a pair of track collections not paired locations that each encode a single movement. Brainstorming: "Intervals", "Movements", "EvaluationMovements", "PairedLocations", ... ?
  • I want to push on keeping the data source and any usage restrictions in the object. This class isn't just for internal use even if our data is, and it's easy to lose this information, especially as a project evolves and people move on. We can include it in the class definition and allow it to be NA for cases where it doesn't make sense.
  • We should think through what's necessary to plot an unaligned track with the code currently in plot_routes(). The stay length might be the trickiest part - but we could just omit it for unaligned data. plot_routes() also uses the common name so maybe we should require that. We can define a plot method that dispatches to plot_routes() for all these classes.
  • My plan is to replace the current output of route() with this class which slightly simplifies the workflows you outlined.

@chenyangkang
Author

chenyangkang commented Jan 9, 2025

@ethanplunkett Thanks! I will incorporate your suggestions. For the Tracks/Routes naming I'm happy to change it any time - let's discuss at the next meeting. I will look into the plot_routes function to see what data it needs. My initial thought is to set some default values, or add an alternative plotting method in plot_routes, for when some parameters are not provided.

The updated plan:

  • check_valid.R (I suggest putting all checking functions or is_valid function in this script)

    • Generic function: check_TrackDataClass (the method will be called to check input data and output object during instance initiation)
      • check_TrackDataClass.Tracks
      • check_TrackDataClass.BirdFlowTracks
      • check_TrackDataClass.BirdFlowIntervals
  • TrackDataClass.R

    • Tracks
      • class: Tracks, data.frame
      • structure:
        • data: instance of data.frame: track_lon, track_lat, track_id, track_date, track_info
        • track_type: string, one of 'banding', 'tracking', 'motus'
        • species: input one of common_name, scientific_name, ebird_code
        • source: citation, contact, limits (limits to use, distribution)
      • Method:
        • print
        • to_BirdFlowTracks(track, bf, masked_only=TRUE): return a BirdFlowTracks object. Argument: masked_only — whether to include only points that fall within the dynamic mask.
        • reset_index: reindexing all the tracks with new sorted track_id. Return a new Tracks object.
    • BirdFlowTracks (to save a track projected to BirdFlow spatiotemporal coordinate systems; “aligned”)
      • Inherit from Tracks, add class name BirdFlowTracks
      • structure:
        • data: In addition to Tracks class: track_x, track_y, track_i, timestep, stay_id, stay_len
        • Additional track_type: 'synthetic'
        • species: Crosswalk the custom input of species name in Tracks object into BirdFlow naming.
        • geom
        • dates
      • Method:
        • print
        • reset_index (inherit from Tracks)
        • to_BirdFlowIntervals: return a BirdFlowIntervals object.
    • BirdFlowIntervals
      • class: BirdFlowIntervals
      • structure:
        • data: a dataframe with columns track_lon1, track_lon2, track_lat1, track_lat2, track_x1, track_x2, track_y1, track_y2, track_i1, track_i2, track_date1, track_date2, track_timestep1, track_timestep2, track_id, track_info.
        • species
        • geom
        • dates
      • Method:
        • print
        • reset_index.

Possible workflow:

  1. Raw tracking/banding -> Tracks -> BirdFlowTracks
  2. BF model generated routes -> BirdFlowTracks
  3. BF model generated routes -> BirdFlowTracks -> BirdFlowIntervals (for evaluation)

Another related issue is how we should sample the intervals from tracking data -- we might need to reduce the temporal correlation of samples.

@ethanplunkett
Contributor

I think we are close especially on the overall structure.

Naming

I have a bunch of naming thoughts though.

It's worth reading this: https://adv-r.hadley.nz/s3.html#s3 especially the section on classes. Note this is the second edition of the book, which has expanded this chapter quite a bit, but search engines seem to often serve the first edition.

Based on that a few more name revisions:
check_[x] should be called validate_[x]
An internal constructor should be called new_[x].
A public facing "helper" (constructor) should be named with the class name - essentially drop the "to_" prefixes. Another common convention in R is to use "as_" for conversion. "to_" is less common.

I think we should also drop the "track_" prefix from the data frame columns for succinctness and to keep with BirdFlowR's naming conventions.
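
The naming conventions above (an internal `new_[x]` constructor, a `validate_[x]` validator, and a public helper named after the class) could be sketched as follows. This is only an illustration under assumptions: the column names `id`, `lon`, `lat`, and `date` are what the proposed columns would look like after dropping the "track_" prefix, and the specific checks are hypothetical.

```r
# Internal constructor: cheap, assumes its inputs are already sensible.
new_Tracks <- function(data, species, source) {
  stopifnot(is.data.frame(data))
  structure(
    data,
    species = species,
    source = source,
    class = c("Tracks", "data.frame")
  )
}

# Validator: the more expensive checks live here. The required columns
# and the latitude check are illustrative assumptions.
validate_Tracks <- function(x) {
  required <- c("id", "lon", "lat", "date")
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  if (any(abs(x$lat) > 90, na.rm = TRUE)) {
    stop("`lat` must be between -90 and 90", call. = FALSE)
  }
  x
}

# Public helper named after the class: construct, then validate.
Tracks <- function(data, species = NA_character_, source = NA) {
  validate_Tracks(new_Tracks(data, species, source))
}

# The object is still a data frame but gets its own print method.
print.Tracks <- function(x, ...) {
  cat("<Tracks>", nrow(x), "locations,", length(unique(x$id)), "tracks\n")
  invisible(x)
}

df <- data.frame(
  id = c(1, 1, 2),
  lon = c(-72.5, -74.0, -80.1),
  lat = c(42.4, 40.1, 35.2),
  date = as.Date(c("2021-09-01", "2021-09-08", "2021-09-01"))
)
tr <- Tracks(df, species = "amewoo")
bad <- tryCatch(Tracks(df[, c("id", "lon")]), error = function(e) e)
```

End users would call `Tracks()`; `new_Tracks()` stays internal for code that has already guaranteed valid input.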

Conversion to BirdFlowTracks (aligning with model)

The conversion from Track to BirdFlowTrack will likely wrap BirdFlowR::snap_to_birdflow(), so we should keep the "aggregate" argument and the "error" and "message" columns produced by that function so we can control whether and how aggregation of multiple observations within a week is done. I like your suggestion of having an explicit "masked_only" argument, although there are reasons other than the mask that a point might throw an error, and we probably want this argument to exclude those as well. Maybe "valid_only"? "mask_only" does nicely imply that, as is most often the case, the error is because the point falls outside the dynamic and/or static mask. When documenting the wrapper you can use "@inheritParams snap_to_birdflow" in the roxygen block to reuse the documentation for "aggregate" and "mask_only" - the second only after it's added to "snap_to_birdflow".

Note BirdFlowPipeline::tracks_to_rts() is a function Dave wrote to convert tracking data to the existing "BirdFlowRoutes" class. It's redundant with BirdFlowR::snap_to_birdflow() but, unlike that function, it also adds the stay_id and stay_len information and might do other processing - I haven't looked at it closely.

Converting to BirdFlowIntervals

There are a ton of choices for how to sample intervals from tracks especially if they are high frequency GPS tracks.

One big benefit of this reorganization is consolidating that code in one place. Some existing code:

I suspect this conversion will take more thought and will likely evolve as we encounter different types of data and use cases. Happily I don't think we need to anticipate all the options now as we can add to the function as needed without changing the structure of the output object. We might want to include some metadata about what parameters were used when picking the intervals.
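
As one possible starting point, here is a sketch of a simple sampling rule that also records the parameters it used. Everything in it is an assumption for illustration: the function name `sample_intervals()`, the `min_days` and `max_days` parameters, the column names, and the greedy non-overlapping rule are not taken from existing BirdFlowR or BirdFlowPipeline code.

```r
# Hypothetical sampler: within each track, chain non-overlapping intervals
# whose length is between min_days and max_days. Not reusing overlapping
# spans is one simple way to limit temporal correlation between samples.
sample_intervals <- function(tracks, min_days = 7, max_days = 180) {
  out <- list()
  for (id in unique(tracks$id)) {
    tr <- tracks[tracks$id == id, ]
    tr <- tr[order(tr$date), ]
    start <- 1
    while (start < nrow(tr)) {
      gap <- as.numeric(tr$date - tr$date[start], units = "days")
      candidates <- which(gap >= min_days & gap <= max_days)
      if (length(candidates) == 0) {
        start <- start + 1
        next
      }
      end <- candidates[1]
      out[[length(out) + 1]] <- data.frame(
        id = id,
        date1 = tr$date[start], date2 = tr$date[end],
        lon1 = tr$lon[start], lon2 = tr$lon[end],
        lat1 = tr$lat[start], lat2 = tr$lat[end]
      )
      start <- end  # the next interval begins where this one ended
    }
  }
  intervals <- if (length(out) > 0) do.call(rbind, out) else data.frame()
  # Record how the intervals were chosen so the result is reproducible.
  attr(intervals, "sampling") <- list(min_days = min_days, max_days = max_days)
  intervals
}

# A made-up daily track of 21 days for one bird.
tracks <- data.frame(
  id = 1,
  date = as.Date("2021-09-01") + 0:20,
  lon = seq(-72, -74, length.out = 21),
  lat = seq(42, 38, length.out = 21)
)
iv <- sample_intervals(tracks, min_days = 7)
```

Because the sampling parameters travel with the output as an attribute, the rule can change later without changing the structure of the intervals table.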

@dsheldon
Contributor

dsheldon commented Jan 9, 2025

Great discussion -- thanks Yangkang and Ethan.

@chenyangkang
Author

@ethanplunkett Thanks for pointing all these out. Super helpful.

Naming

  1. I read through the chapter you shared. My understanding is that the new_ functions skip the time-consuming check. For our data structure I don't expect the validation to be time consuming, therefore not super convinced by the "trade-off between performance vs. safety"-- But I will do it as it seems to be a common standard. I still think whoever initiates such class should use Tracks rather than new_Tracks.
  2. I will change it to validate_ rather than check_.
  3. Remove all the to_ functions. The class constructor itself will serve as the conversion function.
  4. Will drop the track_ prefix.

Conversion to BirdFlowTracks

  1. I propose removing BirdFlowPipeline::tracks_to_rts() and BirdFlowR::snap_to_birdflow() and adding a new snap_to_birdflow method for the Tracks class in TrackDataClass.R, since after the rearrangement the function should only be used on Tracks objects, much as it would be encapsulated in the class under R6. Alternatively, we could add a new script, TrackDataFunction.R, to hold all functions that should only be used on the track data classes.
  2. Use valid_only.

Converting to BirdFlowIntervals

  1. I will try to find and rewrite the code that takes such interval data and remove the redundant parts.

@ethanplunkett
Contributor

@chenyangkang this is great!

One, maybe last, thought is that we should keep snap_to_birdflow() as a standalone function. I like having the logic of snapping points to a model encapsulated and accessible independent of making a BirdFlowTracks object. I can imagine someone wanting to, for instance, snap a set of starting locations for synthetic routes to a BirdFlow model without having a Tracks object or wanting a BirdFlowTracks object.

Two possible approaches:

  1. We could make snap_to_birdflow() a generic function and then have a method for data frames (the existing function) and a method for Tracks (the new function) which calls the data frame method to do the snapping but also handles the other aspects of conversion. That also keeps us backwards compatible with any existing third party code that is using the current snap_to_birdflow().
  2. We could leave snap_to_birdflow() exactly as it is and use BirdFlowTracks() as the function to convert Tracks to a BirdFlowTracks.

Your call on which you prefer.
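
For comparison, the first approach (a generic with a data frame method and a Tracks method) might look roughly like this. To avoid misrepresenting the real function, the sketch uses a hypothetical stand-in called `snap_points()` that snaps to the centers of a regular grid; it is not BirdFlowR::snap_to_birdflow() and does not share its arguments.

```r
# Hypothetical generic standing in for snap_to_birdflow().
snap_points <- function(x, ...) UseMethod("snap_points")

# Data frame method: the standalone snapping logic. Here it just snaps
# lon/lat to the centers of a regular grid with the given cell size.
snap_points.data.frame <- function(x, cell_size = 1, ...) {
  x$x <- floor(x$lon / cell_size) * cell_size + cell_size / 2
  x$y <- floor(x$lat / cell_size) * cell_size + cell_size / 2
  x
}

# Tracks method: reuses the data frame method for the snapping itself,
# then handles the rest of the conversion (class and reference info).
snap_points.Tracks <- function(x, cell_size = 1, ...) {
  snapped <- snap_points(as.data.frame(x), cell_size = cell_size, ...)
  attr(snapped, "cell_size") <- cell_size
  class(snapped) <- c("BirdFlowTracks", "Tracks", "data.frame")
  snapped
}

pts <- data.frame(lon = c(-72.5, -74.2), lat = c(42.4, 40.1))

# Plain data frames still work, so existing callers are unaffected.
snapped_df <- snap_points(pts)

# A Tracks object is converted to the aligned class.
tr <- structure(pts, class = c("Tracks", "data.frame"))
bt <- snap_points(tr)
class(bt)
```

The second approach would instead leave the snapping function untouched and put the body of the Tracks method inside the BirdFlowTracks() constructor.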
