
Track status of granule ingest #71

Closed
frankinspace opened this issue Jan 3, 2024 · 11 comments
@frankinspace
Member

Need to be able to determine which granules have been loaded into hydrocron and ensure that every granule ingested into CMR is also loaded in the database.

@nikki-t
Collaborator

nikki-t commented Mar 28, 2024

[Image: track-status-granule-ingest-short — draft architecture diagram]

Here is a draft architecture for tracking the status of which granules have been ingested into Cumulus and which have been inserted into the Hydrocron database.

I think we need to discuss:

  • What action do we want to take if there are granules ingested into Cumulus but not inserted into the Hydrocron database? This could be executing the load_granule Lambda (see the sketch after this list) or sending a list in an email to the OPS team.
  • What action do we want to take if there are granules in the Hydrocron database that are no longer in Cumulus? This could be deleting the granule from the Hydrocron database or sending a list in an email to the OPS team. (I am not sure how likely this is to happen.)
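
For the load_granule option in the first bullet, here is a minimal sketch of invoking the Lambda asynchronously for each missing granule. The function name and event payload shape are assumptions for illustration, not the actual Hydrocron interface.

```python
import json

import boto3


def reingest_granule(granule_uri: str, function_name: str = "hydrocron-load-granule") -> None:
    """Fire-and-forget invoke of the load_granule Lambda for one missing granule."""
    payload = {"granule_path": granule_uri}  # hypothetical event shape
    boto3.client("lambda").invoke(
        FunctionName=function_name,  # hypothetical function name
        InvocationType="Event",      # asynchronous invocation
        Payload=json.dumps(payload).encode(),
    )
```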

@nikki-t nikki-t moved this to 🏗 In progress in SOTO PI 24.2 Mar 29, 2024
@torimcd
Collaborator

torimcd commented Mar 30, 2024

Another thing to consider is that the relationship between granules and records in hydrocron is 1:many, since there are hundreds of river reaches in each reach granule, and thousands of nodes in each node granule. Every record in hydrocron is just one reach or node.

It is possible for some writes to succeed and others to fail when processing a granule, so we may also need to check that all the features get written. The number of features in each granule varies, but it should be constant over subsequent passes, so we could do something like hardcode the expected number of feature IDs for each pass and then check that the same number is present for those pass IDs. Or we could log the number of features in the shapefile when it is first opened and check against that number when querying for the granule name.
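
A minimal sketch of the second option (log the feature count when the shapefile is first opened, then compare against what was written), assuming fiona is available; `query_record_count` is a hypothetical stand-in for however Hydrocron is queried for the records belonging to a granule.

```python
import fiona


def count_shapefile_features(shapefile_path: str) -> int:
    """Count the reach/node features present in a SWOT shapefile."""
    with fiona.open(shapefile_path) as collection:
        return len(collection)


def granule_fully_written(shapefile_path: str, granule_name: str, query_record_count) -> bool:
    """Compare the on-disk feature count with the records stored for the granule."""
    expected = count_shapefile_features(shapefile_path)
    stored = query_record_count(granule_name)  # hypothetical query helper
    return stored == expected
```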

@nikki-t
Collaborator

nikki-t commented Apr 1, 2024

@torimcd - That is a great point! So we need to track that all features in a granule have been written to the Hydrocron database.

We can create a map of pass identifiers and associate them with reach and node identifiers. Then we can check whether the number of features in a granule matches the number of records stored in the Hydrocron database for a specific cycle_id and pass_id, which are present in the granule filename.

We can also keep track of missing features and modify the load_granule module to accept a granule shapefile and feature ID and load data only for that feature ID. Or we can just submit the entire granule for a rewrite to the Hydrocron database (assuming that we want to take some action on the missing granules and/or features).
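
As a rough illustration of the count check, here is a hedged sketch that counts items stored for one cycle_id / pass_id pair. The table name and attribute names are illustrative assumptions, not the actual Hydrocron schema, and a full-table scan is shown only for clarity; a keyed query would be preferable in practice.

```python
import boto3
from boto3.dynamodb.conditions import Attr


def count_stored_features(cycle_id: str, pass_id: str, table_name: str = "hydrocron-swot-reaches") -> int:
    """Count items stored for a cycle/pass pair (illustrative table and attribute names)."""
    table = boto3.resource("dynamodb").Table(table_name)
    scan_kwargs = {
        "FilterExpression": Attr("cycle_id").eq(cycle_id) & Attr("pass_id").eq(pass_id),
        "Select": "COUNT",  # return only counts, not items
    }
    count = 0
    while True:
        response = table.scan(**scan_kwargs)
        count += response["Count"]
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return count
```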

@frankinspace frankinspace moved this from 🏗 In progress to 🔖 Ready in SOTO PI 24.2 Apr 18, 2024
@frankinspace frankinspace moved this from 🔖 Ready to 📋 Backlog in SOTO PI 24.2 May 7, 2024
@frankinspace
Member Author

Just noticed we did log a distinct ticket for step 4, the delete feature: #140

@nikki-t
Collaborator

nikki-t commented May 13, 2024

Notes on next steps:

  1. Understand how we might query CMR only for recently ingested granules without missing any that may have been reprocessed. The goal is to not query CMR for all SWOT granules every time the track ingest workflow executes.
  2. We may want to stand up a second DynamoDB table to keep track of which granules have been validated. This should include a timestamp and a count of feature IDs which is determined for each granule during the track ingest operations.
  3. We should update the architecture to create and deliver a CNM message to the CNM Lambda function for cases where the workflow identifies a missing granule in the Hydrocron database.
  4. Create a proof of concept to maintain state using the CMR revision_date to understand how to formulate a CMR query on newly available granules so that we do not miss any that may have been reprocessed (a query sketch follows this list).
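
A hedged sketch of the revision_date query in step 4: page through CMR granule results for a bounded revision window instead of the whole collection. The short_name value is whatever the SWOT collection uses; the revision_date parameter syntax and CMR-Search-After paging should be confirmed against the CMR Search API documentation.

```python
import requests

CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"


def query_recent_granules(short_name: str, revision_start: str, revision_end: str) -> list:
    """Return granule records revised within [revision_start, revision_end]."""
    params = {
        "short_name": short_name,
        "revision_date": f"{revision_start},{revision_end}",
        "page_size": 2000,
    }
    headers = {}
    granules = []
    while True:
        response = requests.get(CMR_GRANULE_SEARCH, params=params, headers=headers, timeout=30)
        response.raise_for_status()
        granules.extend(response.json().get("items", []))
        search_after = response.headers.get("CMR-Search-After")
        if not search_after:  # no more pages
            break
        headers["CMR-Search-After"] = search_after
    return granules
```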

@nikki-t
Collaborator

nikki-t commented May 16, 2024

Here is an updated architecture based on the next steps from the tag-up and a small proof of concept I completed.

[Image: track-status-granule-ingest-v2 — updated architecture diagram]

I believe that we can query by a temporal range that does not include the entire SWOT River collection. Instead we can save the revision_date timestamp in a separate DynamoDB table.

We can retrieve the most recent revision_date from the table each time the track ingest workflow runs and use that as the starting date to a CMR query.

I am letting the proof of concept run over many hours to see how this might work. The proof of concept saves the revision_date to a JSON file mimicking items in a DynamoDB table.
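
A minimal sketch of that proof of concept, assuming the state is a single JSON file shaped like a DynamoDB item; the file name and attribute names are illustrative.

```python
import json
from pathlib import Path

STATE_FILE = Path("track_ingest_state.json")  # stands in for the DynamoDB table


def save_revision_date(revision_date: str) -> None:
    """Persist the most recent CMR revision_date seen by the workflow."""
    STATE_FILE.write_text(json.dumps({"revision_date": revision_date}))


def load_revision_start():
    """Return the stored revision_date to use as the next query's start, or None on first run."""
    if not STATE_FILE.exists():
        return None
    return json.loads(STATE_FILE.read_text()).get("revision_date")
```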

@nikki-t nikki-t moved this from 📋 Backlog to 🔖 Ready in SOTO PI 24.2 May 16, 2024
@nikki-t nikki-t moved this from 🔖 Ready to 📋 Backlog in SOTO PI 24.2 May 16, 2024
@nikki-t
Collaborator

nikki-t commented Jun 12, 2024

I think I have worked out the logic around querying CMR for a range and not returning the entire SWOT collection. The idea is to:

  1. Get a revision start date by querying the track-status table, which keeps track of which Hydrocron granules (by granule name) have been ingested with a status of "ingested" or "to_ingest", the granule's revision date, and the feature count for the granule (number of identifiers).

    1. The revision start date is determined by querying the revision dates in the database and returning the max date. This is the last revision date stored from previous CMR queries.
  2. Query CMR with the revision start date and an end datetime of either the current time or the current time minus some hours, to prevent a race condition with active Hydrocron granule ingests (CNM Lambda).

  3. Query Hydrocron and return granules with a "to_ingest" status or cases where the granule's revision date falls into the same range as the step 2 CMR query (as this will pull in any granules that may have changed).

  4. Determine the overlap between the step 2 CMR query and the step 3 Hydrocron query and create a list of granules that have not been ingested (do not exist in the track-status database).

  5. Count the number of features in the "to_ingest" list gathered from all granules returned in step 3 to determine whether each granule has been fully ingested; if it has, set its status to "ingested". If it has not been fully ingested, add it to the list from step 4.

  6. Create a CNM message and publish it to the appropriate SNS topic to kick off Hydrocron granule ingestion for the granules that have not been ingested (the list created in step 4 and modified in step 5); a sketch of this publish step follows below.

I think with this logic in place we can proceed with defining the full architecture. @frankinspace and @torimcd - let me know what you think!
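
A hedged sketch of step 6: build a CNM-style message for a missing granule and publish it to the SNS topic in front of the CNM Lambda. The field names, provider, collection, and topic ARN are illustrative; the payload must match whatever schema the existing CNM Lambda actually expects.

```python
import json
from datetime import datetime, timezone

import boto3


def publish_cnm(granule_ur: str, files: list, topic_arn: str) -> None:
    """Publish a CNM-style notification so the CNM Lambda re-ingests the granule."""
    message = {
        "version": "1.6",                    # assumed CNM schema version
        "provider": "POCLOUD",               # illustrative provider name
        "collection": "SWOT_L2_HR_RiverSP",  # illustrative collection short name
        "submissionTime": datetime.now(timezone.utc).isoformat(),
        "identifier": granule_ur,
        "product": {"name": granule_ur, "files": files},
    }
    boto3.client("sns").publish(TopicArn=topic_arn, Message=json.dumps(message))
```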

@frankinspace frankinspace added enhancement New feature or request and removed help wanted Extra attention is needed labels Jun 13, 2024
@frankinspace
Member Author

May need to add in tracking the file checksum in order to avoid re-ingesting granules when only the metadata has been changed (which causes a new revision in CMR).

@frankinspace
Member Author

Needs to take into account the fact that river/node are in a distinct collection from the prior lake collection.

@nikki-t
Collaborator

nikki-t commented Jul 2, 2024

Updated logic to accommodate the file checksum to avoid re-ingesting granules that are already ingested (a skeleton sketch of the full flow follows at the end of this comment):

  1. Get revision start time (The most recent date stored in the track-ingest database).
  2. Get revision end time (The current date and time).
  3. Query CMR with a temporal range of step 1) start time and step 2) end time on the revision_date query parameter.
    a. Store checksums for each granule.
  4. Query Hydrocron for granule URs taken from the CMR query (cannot query on a time range based on the current structure of the DynamoDB tables).
  5. Determine overlap between Hydrocron and CMR. Overlap is determined by step 4 since we cannot query on a time range.
    a. Create a "to_ingest" dictionary, with filename as key and CMR checksum as value.
  6. Query track-status for all granules set to "to_ingest" status. Add them to the "to_ingest" dictionary (step 5).
  7. Count the features of all granules in the "to_ingest" list and determine if all features have been ingested into Hydrocron.
    a. If the granule has not been fully ingested, keep it in the "to_ingest" dictionary (created in step 5 and modified in step 6).
    b. If the granule has been fully ingested, add it to an "ingested" dictionary (same structure as "to_ingest") and remove it from the "to_ingest" dictionary.
  8. For all granules in the "to_ingest" dictionary create and publish a CNM message to the CNM Lambda to trigger Hydrocron ingestion.
    a. Report on granules to be ingested.
    b. Insert granules with "to_ingest" status into database.
  9. Insert or update all granules from the "ingested" dictionary (created in step 7b) in the track-status database with a status of "ingested".

Running for the first time and populating track-status.

  1. Query CMR for all granules.
  2. Query Hydrocron for all granules.
  3. Determine the overlap; checksums are taken from CMR.
  4. All granules should be inserted into the track-status database with a status of "to_ingest", and then, when the track-status Lambda executes a second time, the features are counted for each granule and the granules are updated in the track-status database with a status of "ingested".
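
A compact, hedged skeleton of the flow above (referenced at the start of this comment). The collaborator functions passed in are hypothetical placeholders for the real Hydrocron modules; only the control flow is meant to mirror the numbered steps.

```python
from datetime import datetime, timezone


def run_track_ingest(get_revision_start, query_cmr, query_hydrocron,
                     query_track_status, all_features_ingested,
                     publish_cnm, write_track_status):
    """One pass of the track-ingest workflow; all arguments are hypothetical callables."""
    # Steps 1-2: revision window from the track-ingest state up to "now".
    # On a first run, get_revision_start() can return None so query_cmr pulls everything.
    start = get_revision_start()
    end = datetime.now(timezone.utc).isoformat()

    # Step 3: CMR granules revised in the window -> {granule_ur: checksum}.
    cmr_granules = query_cmr(start, end)

    # Step 4: Hydrocron records for the granule URs returned by CMR.
    in_hydrocron = query_hydrocron(cmr_granules.keys())

    # Step 5: granules in CMR but missing from Hydrocron -> "to_ingest".
    to_ingest = {ur: cksum for ur, cksum in cmr_granules.items() if ur not in in_hydrocron}

    # Step 6: carry over granules already marked "to_ingest" in track-status.
    to_ingest.update(query_track_status(status="to_ingest"))

    # Step 7: promote granules whose features are all present to "ingested".
    ingested = {ur: cksum for ur, cksum in to_ingest.items() if all_features_ingested(ur)}
    for ur in ingested:
        del to_ingest[ur]

    # Steps 8-9: trigger ingestion for the remainder and record both states.
    for ur, cksum in to_ingest.items():
        publish_cnm(ur, cksum)
    write_track_status(to_ingest, status="to_ingest")
    write_track_status(ingested, status="ingested")
```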

@nikki-t nikki-t mentioned this issue Jul 19, 2024
@torimcd torimcd moved this from 🆕 New to 🏗 In progress in SOTO PI 24.3 Sep 10, 2024
@torimcd torimcd removed this from SOTO PI 24.3 Sep 10, 2024
@nikki-t
Collaborator

nikki-t commented Dec 9, 2024

This work has been completed. Confirmed that track ingest operations are working as expected in OPS and are tracking river reach, node, and prior lake granule ingestion.

@nikki-t nikki-t closed this as completed Dec 9, 2024
@github-project-automation github-project-automation bot moved this to ✅ Done in SOTO PI 24.4 Dec 9, 2024