
Track status of granule ingest #71

Closed
frankinspace opened this issue Jan 3, 2024 · 11 comments
@frankinspace
Member

Need to be able to determine which granules have been loaded into hydrocron and ensure that every granule ingested into CMR is also loaded in the database.

@nikki-t
Collaborator

nikki-t commented Mar 28, 2024

[Image: track-status-granule-ingest-short — draft architecture diagram]

Here is a draft architecture for tracking the status of which granules have been ingested into Cumulus and which have been inserted into the Hydrocron database.

I think we need to discuss:

  • What action do we want to take if there are granules ingested into Cumulus but not inserted into the Hydrocron database? This could be executing the load_granule Lambda (see the sketch after this list) or sending a list in an email to the OPS team.
  • What action do we want to take if there are granules in the Hydrocron database that are no longer in Cumulus? This could be deleting the granule from the Hydrocron database or sending a list in an email to the OPS team. (I am not sure how likely this is to happen.)
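
For the load_granule option in the first bullet, here is a minimal sketch of invoking the Lambda asynchronously for each missing granule. The function name and event payload shape are assumptions for illustration, not the actual Hydrocron interface.

```python
import json

import boto3


def reingest_granule(granule_uri: str, function_name: str = "hydrocron-load-granule") -> None:
    """Fire-and-forget invoke of the load_granule Lambda for one missing granule."""
    payload = {"granule_path": granule_uri}  # hypothetical event shape
    boto3.client("lambda").invoke(
        FunctionName=function_name,  # hypothetical function name
        InvocationType="Event",      # asynchronous invocation
        Payload=json.dumps(payload).encode(),
    )
```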

@nikki-t nikki-t moved this to 🏗 In progress in SOTO PI 24.2 Mar 29, 2024
@torimcd
Collaborator

torimcd commented Mar 30, 2024

Another thing to consider is that the relationship between granules and records in hydrocron is 1:many, since there are hundreds of river reaches in each reach granule, and thousands of nodes in each node granule. Every record in hydrocron is just one reach or node.

It is possible for some writes to succeed and others to fail when processing a granule, so we may also need to check that all the features get written. The number of features in each granule varies, but it should be constant over subsequent passes, so we could do something like hardcode the expected number of feature IDs for each pass and then check that the same number is present for those pass IDs. Or we could log the number of features in the shapefile when it is first opened and check against that number when querying for the granule name.
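
A minimal sketch of the second option (log the feature count when the shapefile is first opened, then compare against what was written), assuming fiona is available; `query_record_count` is a hypothetical stand-in for however Hydrocron is queried for the records belonging to a granule.

```python
import fiona


def count_shapefile_features(shapefile_path: str) -> int:
    """Count the reach/node features present in a SWOT shapefile."""
    with fiona.open(shapefile_path) as collection:
        return len(collection)


def granule_fully_written(shapefile_path: str, granule_name: str, query_record_count) -> bool:
    """Compare the on-disk feature count with the records stored for the granule."""
    expected = count_shapefile_features(shapefile_path)
    stored = query_record_count(granule_name)  # hypothetical query helper
    return stored == expected
```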

@nikki-t
Collaborator

nikki-t commented Apr 1, 2024

@torimcd - That is a great point! So we need to track that all features in a granule have been written to the Hydrocron database.

We can create a map of pass identifiers and associate them with reach and node identifiers. Then we can check whether the number of features in a granule matches the number of records stored in the Hydrocron database for a specific cycle_id and pass_id, which are present in the granule filename.

We can also keep track of missing features and modify the load_granule module to accept a granule shapefile and feature ID and load data only for that feature ID. Or we can just submit the entire granule for a rewrite to the Hydrocron database (assuming that we want to take some action on the missing granules and/or features).
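
As a rough illustration of the count check, here is a hedged sketch that counts items stored for one cycle_id / pass_id pair. The table name and attribute names are illustrative assumptions, not the actual Hydrocron schema, and a full-table scan is shown only for clarity; a keyed query would be preferable in practice.

```python
import boto3
from boto3.dynamodb.conditions import Attr


def count_stored_features(cycle_id: str, pass_id: str, table_name: str = "hydrocron-swot-reaches") -> int:
    """Count items stored for a cycle/pass pair (illustrative table and attribute names)."""
    table = boto3.resource("dynamodb").Table(table_name)
    scan_kwargs = {
        "FilterExpression": Attr("cycle_id").eq(cycle_id) & Attr("pass_id").eq(pass_id),
        "Select": "COUNT",  # return only counts, not items
    }
    count = 0
    while True:
        response = table.scan(**scan_kwargs)
        count += response["Count"]
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return count
```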

@frankinspace frankinspace moved this from 🏗 In progress to 🔖 Ready in SOTO PI 24.2 Apr 18, 2024
@frankinspace frankinspace moved this from 🔖 Ready to 📋 Backlog in SOTO PI 24.2 May 7, 2024
@frankinspace
Member Author

Just noticed we did log a distinct ticket for step 4, the delete feature: #140

@nikki-t
Collaborator

nikki-t commented May 13, 2024

Notes on next steps:

  1. Understand how we might query CMR only for recently ingested granules without missing any that may have been reprocessed. The goal is to not query CMR for all SWOT granules every time the track ingest workflow executes.
  2. We may want to stand up a second DynamoDB table to keep track of which granules have been validated. This should include a timestamp and a count of feature IDs which is determined for each granule during the track ingest operations.
  3. We should update the architecture to create and deliver a CNM message to the CNM Lambda function for cases where the workflow identifies a missing granule in the Hydrocron database.
  4. Create a proof of concept to maintain state using the CMR revision_date to understand how to formulate a CMR query on newly available granules so that we do not miss any that may have been reprocessed (a query sketch follows this list).
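
A hedged sketch of the revision_date query in step 4: page through CMR granule results for a bounded revision window instead of the whole collection. The short_name value is whatever the SWOT collection uses; the revision_date parameter syntax and CMR-Search-After paging should be confirmed against the CMR Search API documentation.

```python
import requests

CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"


def query_recent_granules(short_name: str, revision_start: str, revision_end: str) -> list:
    """Return granule records revised within [revision_start, revision_end]."""
    params = {
        "short_name": short_name,
        "revision_date": f"{revision_start},{revision_end}",
        "page_size": 2000,
    }
    headers = {}
    granules = []
    while True:
        response = requests.get(CMR_GRANULE_SEARCH, params=params, headers=headers, timeout=30)
        response.raise_for_status()
        granules.extend(response.json().get("items", []))
        search_after = response.headers.get("CMR-Search-After")
        if not search_after:  # no more pages
            break
        headers["CMR-Search-After"] = search_after
    return granules
```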

@nikki-t
Collaborator

nikki-t commented May 16, 2024

Here is an updated architecture based on the next steps from the tag-up and a small proof of concept I completed.

[Image: track-status-granule-ingest-v2 — updated architecture diagram]

I believe that we can query by a temporal range that does not include the entire SWOT River collection. Instead we can save the revision_date timestamp in a separate DynamoDB table.

We can retrieve the most recent revision_date from the table each time the track ingest workflow runs and use that as the starting date to a CMR query.

I am letting the proof of concept run over many hours to see how this might work. The proof of concept saves the revision_date to a JSON file mimicking items in a DynamoDB table.
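
A minimal sketch of that proof of concept, assuming the state is a single JSON file shaped like a DynamoDB item; the file name and attribute names are illustrative.

```python
import json
from pathlib import Path

STATE_FILE = Path("track_ingest_state.json")  # stands in for the DynamoDB table


def save_revision_date(revision_date: str) -> None:
    """Persist the most recent CMR revision_date seen by the workflow."""
    STATE_FILE.write_text(json.dumps({"revision_date": revision_date}))


def load_revision_start():
    """Return the stored revision_date to use as the next query's start, or None on first run."""
    if not STATE_FILE.exists():
        return None
    return json.loads(STATE_FILE.read_text()).get("revision_date")
```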

@nikki-t nikki-t moved this from 📋 Backlog to 🔖 Ready in SOTO PI 24.2 May 16, 2024
@nikki-t nikki-t moved this from 🔖 Ready to 📋 Backlog in SOTO PI 24.2 May 16, 2024
@nikki-t
Collaborator

nikki-t commented Jun 12, 2024

I think I have worked out the logic around querying CMR for a range and not returning the entire SWOT collection. The idea is to:

  1. Get a revision start date by querying the track-status table, which keeps track of which Hydrocron granules (by granule name) have been ingested with a status of "ingested" or "to_ingest", the granule's revision date, and the feature count for the granule (number of identifiers).

    1. The revision start date is determined by querying the revision dates in the database and returning the max date. This is the last revision date stored from previous CMR queries.
  2. Query CMR with the revision start date and an end datetime of either the current time or the current time minus some hours, to prevent a race condition with active Hydrocron granule ingests (CNM Lambda).

  3. Query Hydrocron and return granules with a "to_ingest" status or cases where the granule's revision date falls into the same range as the step 2 CMR query (as this will pull in any granules that may have changed).

  4. Determine the overlap between the step 2 CMR query and the step 3 Hydrocron query and create a list of granules that have not been ingested (do not exist in the track-status database).

  5. Count the number of features in the "to_ingest" list gathered from all granules returned in step 3 to determine whether each granule has been fully ingested; if it has, set its status to "ingested". If it has not been fully ingested, add it to the list from step 4.

  6. Create a CNM message and publish it to the appropriate SNS topic to kick off Hydrocron granule ingestion for the granules that have not been ingested (the list created in step 4 and modified in step 5); a sketch of this publish step follows below.

I think with this logic in place we can proceed with defining the full architecture. @frankinspace and @torimcd - let me know what you think!
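
A hedged sketch of step 6: build a CNM-style message for a missing granule and publish it to the SNS topic in front of the CNM Lambda. The field names, provider, collection, and topic ARN are illustrative; the payload must match whatever schema the existing CNM Lambda actually expects.

```python
import json
from datetime import datetime, timezone

import boto3


def publish_cnm(granule_ur: str, files: list, topic_arn: str) -> None:
    """Publish a CNM-style notification so the CNM Lambda re-ingests the granule."""
    message = {
        "version": "1.6",                    # assumed CNM schema version
        "provider": "POCLOUD",               # illustrative provider name
        "collection": "SWOT_L2_HR_RiverSP",  # illustrative collection short name
        "submissionTime": datetime.now(timezone.utc).isoformat(),
        "identifier": granule_ur,
        "product": {"name": granule_ur, "files": files},
    }
    boto3.client("sns").publish(TopicArn=topic_arn, Message=json.dumps(message))
```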

@frankinspace frankinspace added enhancement New feature or request and removed help wanted Extra attention is needed labels Jun 13, 2024
@frankinspace
Member Author

May need to add in tracking the file checksum in order to avoid re-ingesting granules when only the metadata has been changed (which causes a new revision in CMR).

@frankinspace
Member Author

Needs to take into account the fact that river/node are in a distinct collection from the prior lake collection.

@nikki-t
Collaborator

nikki-t commented Jul 2, 2024

Updated logic to accommodate the file checksum to avoid re-ingesting granules that are already ingested (a skeleton sketch of the full flow follows at the end of this comment):

  1. Get revision start time (The most recent date stored in the track-ingest database).
  2. Get revision end time (The current date and time).
  3. Query CMR with a temporal range of step 1) start time and step 2) end time on the revision_date query parameter.
    a. Store checksums for each granule.
  4. Query Hydrocron for granule URs taken from the CMR query (cannot query on a time range based on the current structure of the DynamoDB tables).
  5. Determine overlap between Hydrocron and CMR. Overlap is determined by step 4 since we cannot query on a time range.
    a. Create a "to_ingest" dictionary, with filename as key and CMR checksum as value.
  6. Query track-status for all granules set to "to_ingest" status. Add them to the "to_ingest" dictionary (step 5).
  7. Count the features of all granules in the "to_ingest" list and determine if all features have been ingested into Hydrocron.
    a. If the granule has not been fully ingested, keep it in the "to_ingest" dictionary (created in step 5 and modified in step 6).
    b. If the granule has been fully ingested, add it to an "ingested" dictionary (same structure as "to_ingest") and remove it from the "to_ingest" dictionary.
  8. For all granules in the "to_ingest" dictionary create and publish a CNM message to the CNM Lambda to trigger Hydrocron ingestion.
    a. Report on granules to be ingested.
    b. Insert granules with "to_ingest" status into database.
  9. Insert or update all granules from the "ingested" dictionary (created in step 7b) in the track-status database with a status of "ingested".

Running for the first time and populating track-status.

  1. Query CMR for all granules.
  2. Query Hydrocron for all granules.
  3. Determine the overlap; checksums are taken from CMR.
  4. All granules should be inserted into the track-status database with a status of "to_ingest", and then, when the track-status Lambda executes a second time, the features are counted for each granule and the granules are updated in the track-status database with a status of "ingested".
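
A compact, hedged skeleton of the flow above (referenced at the start of this comment). The collaborator functions passed in are hypothetical placeholders for the real Hydrocron modules; only the control flow is meant to mirror the numbered steps.

```python
from datetime import datetime, timezone


def run_track_ingest(get_revision_start, query_cmr, query_hydrocron,
                     query_track_status, all_features_ingested,
                     publish_cnm, write_track_status):
    """One pass of the track-ingest workflow; all arguments are hypothetical callables."""
    # Steps 1-2: revision window from the track-ingest state up to "now".
    # On a first run, get_revision_start() can return None so query_cmr pulls everything.
    start = get_revision_start()
    end = datetime.now(timezone.utc).isoformat()

    # Step 3: CMR granules revised in the window -> {granule_ur: checksum}.
    cmr_granules = query_cmr(start, end)

    # Step 4: Hydrocron records for the granule URs returned by CMR.
    in_hydrocron = query_hydrocron(cmr_granules.keys())

    # Step 5: granules in CMR but missing from Hydrocron -> "to_ingest".
    to_ingest = {ur: cksum for ur, cksum in cmr_granules.items() if ur not in in_hydrocron}

    # Step 6: carry over granules already marked "to_ingest" in track-status.
    to_ingest.update(query_track_status(status="to_ingest"))

    # Step 7: promote granules whose features are all present to "ingested".
    ingested = {ur: cksum for ur, cksum in to_ingest.items() if all_features_ingested(ur)}
    for ur in ingested:
        del to_ingest[ur]

    # Steps 8-9: trigger ingestion for the remainder and record both states.
    for ur, cksum in to_ingest.items():
        publish_cnm(ur, cksum)
    write_track_status(to_ingest, status="to_ingest")
    write_track_status(ingested, status="ingested")
```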

@nikki-t nikki-t mentioned this issue Jul 19, 2024
@torimcd torimcd moved this from 🆕 New to 🏗 In progress in SOTO PI 24.3 Sep 10, 2024
@torimcd torimcd removed this from SOTO PI 24.3 Sep 10, 2024
@nikki-t
Collaborator

nikki-t commented Dec 9, 2024

This work has been completed. Confirmed that track ingest operations are working as expected in OPS and are tracking river reach, node, and prior lake granule ingestion.

@nikki-t nikki-t closed this as completed Dec 9, 2024
@github-project-automation github-project-automation bot moved this to ✅ Done in SOTO PI 24.4 Dec 9, 2024