This repository has been archived by the owner on Oct 16, 2024. It is now read-only.

CAP static export script updates #2184

Merged: 3 commits, Dec 5, 2023


jcushman (Contributor) commented Dec 4, 2023

This is my first pass at updating the convert_s3.py script for exporting CAP data to cap static format:

  • refactor the script to export to local disk instead of S3
  • rename to export_cap_static.py instead of convert_s3.py
  • update to use this workflow:
    • fab export_cap_static_volumes runs a Celery task for each volume (or target volume) to export it to local disk. This step should take about 200 worker-hours, so roughly 12.5 hours (less than a day) given 16 workers.
    • fab summarize_cap_static gets run manually after all volumes are done, and writes reporter metadata and summary files
    • manually upload using something like rclone
    • a third command will be needed to move PDFs and CAPTAR files on S3 once the local files are uploaded -- Bex has copy_volume_pdf in the codebase that is mostly there
  • format changes:
    • per Slack discussion, just use .json rather than .jsonlines since we're not exporting any very large files
    • include HTML case files
    • add jurisdiction metadata files and reporter-level volume files
  • Include unredacted volumes. We do this by temporarily redacting and then rolling back, which means the export task has to rely only on the database and not on the Elasticsearch index.
    • Unredacted volumes are only exported if settings.REDACTION_KEY is set. My thought here is to use CELERY_TASK_ROUTES to route volume export tasks to a separate temporary Celery worker that has that key set, but that might be overcomplicated -- TBD with Ben.
  • Add a test for the export task that stores expected output in test_data/cap_static, so it's easy to see what we expect the task to do and check for regressions.
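
To illustrate the "temporarily redact, then roll back" idea from the unredacted-volumes item above, here is a minimal sketch. It is not the code in this PR: it uses the standard-library sqlite3 module rather than the project's ORM, and the `cases` table, the `redact` helper, and `export_with_temporary_redaction` are all hypothetical names chosen for the example.

```python
import sqlite3


def redact(text, terms):
    """Hypothetical stand-in for real redaction: mask each listed term."""
    for term in terms:
        text = text.replace(term, "[redacted]")
    return text


def export_with_temporary_redaction(conn, volume_id, terms):
    """Redact a volume's cases, read the rows to export, then roll back."""
    try:
        rows = conn.execute(
            "SELECT id, text FROM cases WHERE volume_id = ?", (volume_id,)
        ).fetchall()
        for case_id, text in rows:
            # With sqlite3's default settings, the first UPDATE implicitly
            # opens a transaction on this connection.
            conn.execute(
                "UPDATE cases SET text = ? WHERE id = ?",
                (redact(text, terms), case_id),
            )
        # Read back through the same connection, so the export sees the
        # uncommitted, redacted rows.
        return conn.execute(
            "SELECT id, text FROM cases WHERE volume_id = ? ORDER BY id",
            (volume_id,),
        ).fetchall()
    finally:
        # Always discard the redaction so the stored data is unchanged.
        conn.rollback()
```

Because the redacted rows only ever exist inside the open transaction, anything that reads from outside it (a search index, for example) never sees them, which fits the point above about the export relying on the database alone.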

Next stuff:

  • This should be fine to pull now, if we're happy with the tests etc.
  • I'm interested to hear what Dakota encounters in using these files to create the CAP static website, since that's a good test of what's in there.
  • Let's try dumping some open jurisdictions and shipping to S3 to see how it goes.
  • And then we can make the script to add the PDFs and CAPTARs.
  • While I'm in here I'd like to throw in an export of the page structure for each volume, which is a small stretch goal but retains a bit of data we're otherwise not expressing anywhere.

Deployment notes:

  • See the note above about routing volume export tasks to a separate Celery worker that has settings.REDACTION_KEY set.
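
As a rough sketch of what that routing could look like in Django settings: CELERY_TASK_ROUTES maps task names to queues. The task path and queue name below are hypothetical, chosen only for illustration, and the real task name in the codebase may differ.

```python
# Django settings sketch. The task path and queue name are assumptions,
# not taken from the codebase.
CELERY_TASK_ROUTES = {
    "scripts.export_cap_static.export_cap_static_volume": {
        "queue": "cap_static_export",
    },
}
```

A temporary worker started with `-Q cap_static_export`, in an environment where REDACTION_KEY is set, would then be the only worker consuming these tasks.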

jcushman requested a review from a team as a code owner December 4, 2023 22:04
jcushman requested review from tinykite and removed request for a team December 4, 2023 22:04
jcushman changed the title CAP static updates CAP static export script updates Dec 4, 2023
bensteinberg self-requested a review December 5, 2023 14:01
bensteinberg (Contributor) left a comment

(Approved, that is, except I see tests are failing.)

jcushman force-pushed the cap-static branch 2 times, most recently from 980dbb7 to 2bfffe9 December 5, 2023 16:22
- Writes to local disk
- Includes HTML
- Includes unredacted versions
- Includes jurisdiction metadata
- Formatting updates

codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (183f001) 61.98% compared to head (cfbb23b) 63.51%.
Report is 1 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2184      +/-   ##
===========================================
+ Coverage    61.98%   63.51%   +1.52%     
===========================================
  Files          107      107              
  Lines        11818    11746      -72     
===========================================
+ Hits          7325     7460     +135     
+ Misses        4493     4286     -207     


jcushman merged commit 28a066e into harvard-lil:develop Dec 5, 2023
2 checks passed