Use date column to calculate stay_len and stay_id #207

Open
chenyangkang opened this issue Jan 19, 2025 · 1 comment

Comments

@chenyangkang

The current add_stay_id function uses the number of rows as the stay length, which is only valid if the timesteps are sampled at a constant frequency.

BirdFlowR/R/route.R

Lines 176 to 184 in cef92ce

add_stay_id <- function(df) {
  # Benjamin's function
  df |>
    dplyr::mutate(stay_id = cumsum(c(1, as.numeric(diff(.data$i)) != 0)),
                  stay_len = rep(rle(.data$stay_id)$lengths,
                                 times = rle(.data$stay_id)$lengths))
}
points <- points |> dplyr::group_by(.data$route_id) |> add_stay_id()

However, custom data (e.g., tracking, Motus, banding) seldom contain equally spaced timepoints. So consider computing stay_len from the actual dates in the date column, with a default unit of days.
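To illustrate the difference, here is a minimal base R sketch on a made-up track (the data frame and the column names stay_len_rows and stay_len_days are hypothetical, not BirdFlowR code). It compares the row-count stay length with a date-based one when the observations are irregularly spaced:

```r
# Hypothetical track for a single route with irregular sampling.
track <- data.frame(
  date = as.Date(c("2024-03-01", "2024-03-02", "2024-03-20",
                   "2024-03-21", "2024-03-22")),
  i = c(10, 10, 10, 25, 25)  # location index, like the `i` column
)

# A new stay starts whenever `i` changes.
track$stay_id <- cumsum(c(TRUE, diff(track$i) != 0))

# Row-count stay length: number of rows in each stay.
track$stay_len_rows <- ave(track$i, track$stay_id, FUN = length)

# Date-based stay length in days, counting both the first and last day.
track$stay_len_days <- ave(as.numeric(track$date), track$stay_id,
                           FUN = function(d) max(d) - min(d) + 1)

print(track)
```

The first stay has 3 rows but spans 20 days (March 1 to March 20), so the row count understates its duration; the second stay has 2 rows and spans 2 days.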

add_stay_id or similar transformation will be a default behavior for BirdFlowRoutes.

I removed add_stay_id from the route function:

# add_stay_id <- function(df) {

And added the function add_stay_id_with_varied_intervals:

add_stay_id_with_varied_intervals <- function(df, timestep_col = "date",
                                              timediff_unit = "days",
                                              time_threshold = Inf) {
  # Ensure the data is sorted by timestep
  df <- df |> dplyr::arrange(.data[[timestep_col]])
  new_df <- df |>
    dplyr::mutate(
      # Time differences
      timestep_diff = c(1, as.numeric(diff(.data[[timestep_col]]),
                                      units = timediff_unit)),
      # Changes in 'i'
      i_change = c(1, as.numeric(diff(.data$i)) != 0),
      stay_id = cumsum(i_change | (timestep_diff > time_threshold))
    ) |>
    # Now the stay_id is assigned, calculate the duration (time difference) of each stay
    dplyr::group_by(route_id, stay_id) |>
    dplyr::mutate(
      stay_len = as.numeric(
        max(.data[[timestep_col]]) - min(.data[[timestep_col]]) + 1,
        units = timediff_unit
      )
    ) |>
    dplyr::select(-timestep_diff, -i_change)
  return(new_df)
}

which is applied when a new BirdFlowRoutes object is created:

## Add stay id
birdflow_route_df <- birdflow_route_df |>
  sort_by_id_and_dates() |>
  dplyr::group_by(.data$route_id) |>
  add_stay_id_with_varied_intervals(timestep_col = timestep_col,
                                    timediff_unit = timediff_unit) |>
  # Here we use add_stay_id_with_varied_intervals rather than add_stay_id.
  # It takes the timestep column as input, so it accounts for varying
  # intervals when the data are not sampled at a regular frequency.
  dplyr::ungroup() |>
  as.data.frame() |>
  preserve_s3_attributes(original = birdflow_route_df)
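The effect of the time_threshold argument may be easier to see on a small example. The following is only a base R sketch of the stay_id logic for a single route; the helper stay_ids and the sample data are hypothetical and not part of BirdFlowR. It uses 0 rather than 1 for the first row's gap, which makes no difference because the first row always starts a new stay:

```r
# Hypothetical track: the bird is recorded at the same location `i`,
# but with a long gap between the third and fourth observation.
track <- data.frame(
  date = as.Date(c("2024-05-01", "2024-05-02", "2024-05-03",
                   "2024-07-01", "2024-07-02")),
  i = c(4, 4, 4, 4, 4)
)

# Hypothetical helper mirroring the stay_id step: a new stay starts when
# `i` changes or when the gap to the previous row exceeds time_threshold.
stay_ids <- function(df, time_threshold = Inf) {
  gap_days <- c(0, as.numeric(diff(df$date), units = "days"))
  moved <- c(TRUE, diff(df$i) != 0)
  cumsum(moved | gap_days > time_threshold)
}

ids_default <- stay_ids(track)                       # 1 1 1 1 1
ids_split   <- stay_ids(track, time_threshold = 30)  # 1 1 1 2 2
print(ids_default)
print(ids_split)
```

With the default threshold of Inf the 59-day gap is ignored and all five rows form one stay; with a 30-day threshold the gap splits the rows into two stays.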

Also, the synthetic routes generated by the route function will no longer have circular dates (so the dates continue across the year boundary), but the timestep will still wrap around to 1. So we should calculate the stay based on date rather than timestep.
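A small base R sketch of why the wrap matters. The route below is made up, and the timestep numbering (52 weekly timesteps wrapping to 1) is an assumption for illustration only:

```r
# Hypothetical weekly route that crosses the year boundary while the
# bird stays at the same location.
route <- data.frame(
  date = as.Date("2024-12-16") + 7 * 0:3,  # 2024-12-16 ... 2025-01-06
  timestep = c(51, 52, 1, 2),              # assumed wrap from 52 back to 1
  i = c(7, 7, 7, 7)
)

# Differences between consecutive rows.
step_diff <- diff(route$timestep)                        # 1 -51 1
day_diff  <- as.numeric(diff(route$date), units = "days")  # 7 7 7

# Length of the single stay computed as max - min + 1.
span_timestep <- max(route$timestep) - min(route$timestep) + 1
span_days <- as.numeric(max(route$date) - min(route$date),
                        units = "days") + 1

print(c(span_timestep = span_timestep, span_days = span_days))
```

The timestep-based span comes out as 52 because of the wrap, while the date-based span is 22 days, which matches the four weekly observations.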

This change will be included in the next merge if nobody objects.

@ethanplunkett
Contributor

Make sure you do some testing with plot_routes(). It will likely need some updating for the new format. I've wanted to drop the circular dates and drop a bunch of hackish stuff I did to deal with the circular dates, so this is an overdue change. Let me know if you want me to make the changes in the plotting.
