Track status of granule ingest #71
Here is a draft architecture for tracking the status of which granules have been ingested into Cumulus and which have been inserted into the Hydrocron database. I think we need to discuss:
Another thing to consider is that the relationship between granules and records in hydrocron is 1:many, since there are hundreds of river reaches in each reach granule, and thousands of nodes in each node granule. Every record in hydrocron is just one reach or node. It is possible for some writes to succeed and others to fail when processing a granule, so we may also need to check that all the features get written. The number of features in each granule varies, but it should be constant across subsequent passes, so we could hardcode the number of feature ids expected for each pass somewhere and then check that the same number is present for those pass ids? Or we could log the number of features in the shapefile when it's first opened and check against that number when querying for the granule name?
@torimcd - That is a great point! So we need to track that all features in a granule have been written to the Hydrocron database. We can create a map of pass identifiers and associate them with reach and node identifiers. Then we can check whether the number of features in a granule matches the number of records stored in the Hydrocron database for the specific cycle_id and pass_id present in the granule filename. We can also keep track of missing features and modify the …
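A minimal sketch of that count check, assuming a hypothetical granule-name index on the Hydrocron DynamoDB table (the table name, index name, and key attribute below are illustrative, not the project's actual schema):

```python
# Minimal sketch: verify every feature in a granule was written to Hydrocron.
# The table name, index name, and key attribute are hypothetical placeholders.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("hydrocron-swot-reaches")  # hypothetical

def count_records_for_granule(granule_ur: str) -> int:
    """Count Hydrocron records written for one granule, paging as needed."""
    count = 0
    kwargs = {
        "IndexName": "granule_ur-index",  # hypothetical GSI on the granule name
        "KeyConditionExpression": Key("granule_ur").eq(granule_ur),
        "Select": "COUNT",
    }
    while True:
        response = table.query(**kwargs)
        count += response["Count"]
        if "LastEvaluatedKey" not in response:
            return count
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

def granule_fully_written(granule_ur: str, expected_features: int) -> bool:
    """Compare against the feature count logged when the shapefile was opened."""
    return count_records_for_granule(granule_ur) == expected_features
```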
Just noticed we did log a distinct ticket for step 4, the delete feature: #140
Notes on next steps:
Here is an updated architecture based on the tag-up next steps and a small proof of concept I completed. I believe that we can query by a temporal range that does not include the entire SWOT River collection. Instead we can save the revision_date timestamp in a separate DynamoDB table, retrieve the most recent revision_date from the table each time the track ingest workflow runs, and use it as the starting date for a CMR query. I am letting the proof of concept run over many hours to see how this might work. The proof of concept saves the revision_date to a JSON file mimicking items in a DynamoDB table.
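A minimal sketch of that flow, assuming a hypothetical table and key layout for the stored revision_date; CMR's granule search does accept a revision_date range, but the collection short name and item schema here are placeholders:

```python
# Minimal sketch: read the last revision_date saved by the previous run and
# query CMR only for granules revised since then. Table and key names are
# hypothetical; the collection short name is a placeholder.
import boto3
import requests

CMR_GRANULES_URL = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"

def get_last_revision_date(table_name: str = "hydrocron-track-status") -> str:
    """Fetch the most recent revision_date recorded by the workflow."""
    table = boto3.resource("dynamodb").Table(table_name)
    item = table.get_item(Key={"id": "last_run"})["Item"]  # hypothetical key schema
    return item["revision_date"]                           # e.g. "2024-05-01T00:00:00Z"

def query_new_granules(short_name: str) -> list:
    """Return granules whose CMR revision_date falls after the last recorded run."""
    params = {
        "short_name": short_name,                          # placeholder collection name
        "revision_date": f"{get_last_revision_date()},",   # open-ended range: since last run
        "page_size": 2000,
    }
    response = requests.get(CMR_GRANULES_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["items"]
```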
I think I have worked out the logic for querying CMR over a date range without returning the entire SWOT collection. The idea is to:
I think with this logic in place we can proceed with defining the full architecture. @frankinspace and @torimcd - let me know what you think!
May need to add tracking of the file checksum in order to avoid re-ingesting granules when only the metadata has changed (which creates a new revision in CMR).
Needs to take into account that the river/node granules are in a distinct collection from the prior lake collection.
Updated the logic to use the file checksum to avoid re-ingesting granules that were already ingested:
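A minimal sketch of the checksum comparison, assuming checksums are read from the granule's UMM-G metadata (the exact UMM-G field path is my reading of the schema and should be confirmed; `stored_checksums` stands in for whatever the track-status table records):

```python
# Minimal sketch: skip granules whose data-file checksum is unchanged, so that
# metadata-only CMR revisions do not trigger a re-ingest. The UMM-G field path
# below should be verified against the actual collection metadata.
def needs_ingest(cmr_granule: dict, stored_checksums: dict) -> bool:
    """True when the granule is new or its data file's checksum has changed."""
    granule_ur = cmr_granule["umm"]["GranuleUR"]
    archive_info = cmr_granule["umm"]["DataGranule"]["ArchiveAndDistributionInformation"]
    checksum = next(
        info["Checksum"]["Value"] for info in archive_info if "Checksum" in info
    )
    return stored_checksums.get(granule_ur) != checksum
```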
Running for the first time and populating track-status.
This work has been completed. Confirmed that track ingest operations are working as expected in OPS and are tracking river reach, node, and prior lake granule ingestion.
Need to be able to determine which granules have been loaded into hydrocron and ensure that every granule ingested into CMR is also loaded in the database.
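A minimal sketch of the reconciliation this implies, with illustrative inputs that would come from the CMR query and the per-granule count check sketched in the comments above:

```python
# Minimal sketch of the reconciliation: all inputs are illustrative and would
# be built from the CMR granule query and the Hydrocron record counts.
def reconcile(cmr_granule_urs: set,
              loaded_counts: dict,
              expected_counts: dict) -> dict:
    """Split granules into those missing from Hydrocron and those only
    partially written (fewer records than features in the source file)."""
    missing = sorted(ur for ur in cmr_granule_urs if ur not in loaded_counts)
    partial = sorted(ur for ur, n in loaded_counts.items()
                     if n < expected_counts.get(ur, n))
    return {"missing": missing, "partial": partial}
```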