
Save user uploads as WACZs #3679

Merged · 15 commits · Jan 9, 2025

Conversation

@bensteinberg (Contributor) commented Dec 16, 2024

This is a first cut at preserving user uploads as WACZ files rather than WARCs. I'm not sure which Linear ticket is the best one to link here. I'm making this a draft for now.

This works, in the sense that it produces a valid WACZ that replayweb.page can play back, but it does not yet play back in Perma; the error message, from wabac.js, is e.g.

Archived Page Not Found

Sorry, this page was not found in this archive:

file:///8MXD-LZ6V/upload.jpg?version=040666551871161394
...

(That page is in fact in the archive.)

I'm guessing I need a slightly different set of options to py-wacz.

@bensteinberg bensteinberg marked this pull request as draft December 16, 2024 15:16
@bensteinberg (Contributor Author)

Apart from getting this to work, the main question is what the metadata should look like.

@bensteinberg (Contributor Author)

I think the problem here is that the CDX index is broken: the entries for file:/// URIs are being rewritten to collapse consecutive slashes into one. I think this is down to cdxj_indexer just running surt on the URL, rather than doing what warcio does with getSurt() in the Scoop context:

    if (!url.startsWith("https:") && !url.startsWith("http:")) {
      return url;
    }

I'm going to see if I can demonstrate this. If this is the problem, I'm not sure where to make a fix.
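To make the suspected mechanism concrete, here is a Python sketch mirroring the JS guard above. `naive_surt` is a hypothetical stand-in for a SURT transform that collapses consecutive slashes (the behavior suspected of mangling the file:/// keys); `get_surt` applies the warcio-style scheme check first:

```python
from urllib.parse import urlsplit

def naive_surt(url: str) -> str:
    # Hypothetical stand-in for a SURT transform that, like the surt
    # library is suspected of doing here, collapses consecutive
    # slashes in the path.
    parts = urlsplit(url)
    host = ",".join(reversed(parts.netloc.split(".")))
    path = parts.path
    while "//" in path:
        path = path.replace("//", "/")
    return f"{host}){path}"

def get_surt(url: str) -> str:
    # Mirror of warcio's getSurt(): leave non-http(s) URIs untouched,
    # so file:/// keys keep their slashes and CDX lookups still match.
    if not url.startswith("https:") and not url.startswith("http:"):
        return url
    return naive_surt(url)
```

With the guard in place, a file:/// URI passes through unchanged, while http(s) URLs still get the normal SURT treatment.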

@rebeccacremona (Contributor)

(A possible thread to pull when working further with this: does using a response-type record instead of a resource record affect the indexing? It shouldn't, in theory... but I recall that in Scoop-produced WARCs everything is a response record, including attachments like the screenshot and provenance summary, and I don't know why, so I'm a little suspicious this could be related.)

@bensteinberg (Contributor Author)

Thanks. I've tried with both response and resource records; not sure it matters here. (Philosophically, these might better be resource records.) I've removed the dependency on the indexer and am now creating the WACZ inline. I'm still not producing a working file, but I've pushed the changes for now.

@bensteinberg (Contributor Author)

The thread I'm pulling now is that replayweb claims "2 hashes verified, 1 invalid" -- but I haven't been able to find an invalid hash. I am not perfectly confident in the way I'm calculating offset in the WARC, which I suppose could be the problem.
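For reference, a stdlib-only sketch (function and field names hypothetical) of the offset bookkeeping when each serialized WARC record is written as its own gzip member, which is the layout the CDXJ offset/length fields describe:

```python
import gzip
import io

def write_records(records: list[bytes]) -> tuple[bytes, list[dict]]:
    """Write each serialized WARC record as an independent gzip member,
    recording the compressed offset and length for the CDXJ index."""
    buf = io.BytesIO()
    entries = []
    for raw in records:
        offset = buf.tell()              # start of this gzip member
        buf.write(gzip.compress(raw))
        entries.append({"offset": offset, "length": buf.tell() - offset})
    return buf.getvalue(), entries
```

A common pitfall is measuring offsets against the uncompressed stream, or concatenating all records into one gzip member; either produces an index whose entries can't be sliced back out, which could plausibly surface as a hash or lookup failure in replay.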

codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.91%. Comparing base (ffa57fa) to head (010007c).
Report is 22 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3679      +/-   ##
===========================================
+ Coverage    69.67%   69.91%   +0.23%     
===========================================
  Files           54       54              
  Lines         7661     7725      +64     
===========================================
+ Hits          5338     5401      +63     
- Misses        2323     2324       +1     


@rebeccacremona (Contributor)

I can try to find the code we once used for making CDX lines, back when that was a function of Perma. Those were not CDXJ, but it could be interesting just to see.

@bensteinberg (Contributor Author)

One of my thoughts here is that since what we are trying to do is so minimal, there's no need for a general solution to indexing or any other part of WACZ production.
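To illustrate how small that surface area is, here is a stdlib-only sketch of assembling the handful of files a minimal WACZ contains, per the WACZ spec. The helper names and the exact `metadata` fields are assumptions, not Perma's implementation:

```python
import hashlib
import json
import zipfile

def sha256(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

def build_minimal_wacz(path, warc_gz: bytes, cdxj: bytes,
                       pages_jsonl: bytes, metadata: dict) -> None:
    """Assemble a minimal WACZ: the WARC, its CDXJ index, a pages
    list, plus datapackage.json and its digest."""
    files = {
        "archive/data.warc.gz": warc_gz,
        "indexes/index.cdx": cdxj,
        "pages/pages.jsonl": pages_jsonl,
    }
    resources = [
        {"name": name.split("/")[-1], "path": name,
         "hash": sha256(data), "bytes": len(data)}
        for name, data in files.items()
    ]
    datapackage = json.dumps(
        {"profile": "data-package", "resources": resources, **metadata},
        indent=2).encode()
    digest = json.dumps(
        {"path": "datapackage.json", "hash": sha256(datapackage)}).encode()
    with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as z:
        for name, data in files.items():
            z.writestr(name, data)
        z.writestr("datapackage.json", datapackage)
        z.writestr("datapackage-digest.json", digest)
```

Note that members are stored uncompressed (ZIP_STORED), since the WARC payload is already gzipped and replay tools random-access the archive.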

@bensteinberg (Contributor Author)

Thanks @rebeccacremona and @matteocargnelutti for the help!

replayweb claims "2 hashes verified, 1 invalid"

(This is true for working WACZs.)

Is it worth adding the WARC header REFERS_TO_TARGET_URI?
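For context, the header in question is `WARC-Refers-To-Target-URI`, which WARC/1.1 defines for revisit records; whether it belongs on these records is exactly what's TBD. A pure-Python sketch of a resource record carrying it (all record values hypothetical):

```python
import uuid
from datetime import datetime, timezone

def build_resource_record(target_uri: str, refers_to_uri: str,
                          payload: bytes, content_type: str) -> bytes:
    """Serialize a single WARC/1.1 resource record, including the
    optional WARC-Refers-To-Target-URI header under discussion."""
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Refers-To-Target-URI: {refers_to_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # A record is headers, a blank line, the payload, then two CRLFs.
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"
```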

@bensteinberg bensteinberg marked this pull request as ready for review January 7, 2025 21:20
@rebeccacremona (Contributor) left a comment

Thanks so much for taking this on!! This is cool!

I read through your code once so far; I'd love to take a more detailed look later, including looking at a couple of the finished WACZs. But in the meantime, shared a few thoughts.

replayweb claims "2 hashes verified, 1 invalid"
(This is true for working WACZs.)

Hmmm, interesting. I wonder if this is true just for these new WACZs or for other WACZs we produce... Maybe let's dig.

Is it worth adding the WARC header REFERS_TO_TARGET_URI?

Interesting. TBD!

perma_web/api/tests/test_link_resource.py (outdated)
<header>
<h1>Provenance Summary</h1>
<p>The data present in this capture were uploaded by a Perma user to replace a failed or unsatisfactory capture of {{ url }} on {{ now }}.</p>
</header>
Contributor

Hooray for a provenance summary!

TBD! I think probably let's add some more details here before this goes live?

Contributor Author

What details do you have in mind? This is minimal partly because it's a placeholder and opportunity for discussion, but also because I'm inclined to keep the file as short as possible. I don't think it's necessary to provide the level of provenance detail Scoop provides.

Contributor

I don't think it's necessary to provide the level of provenance detail Scoop provides.

Agreed.

How about time of original capture (if any), time of replacement, replacement mime type?

Contributor Author

See 63ca218 --

[image attachment]

Contributor

Looks good to me!

"description": f"User upload for {url}",
"mainPageURL": warc_url,
"created": ts_string,
"software": "Perma.cc", # version?
Contributor

# version?

Hmmm...

So if we wanted to consider the production "tag" the version, then we'd have to somehow expose that to the application and interpolate here... Which sounds like a bit of a pain in the butt? (Some salt state that runs a command that gets the current tag and puts it into an ENV var, and then an application setting that reads that ENV var?)

Or some new version number that we just increment if/when we change this code (we'll be prone to forget, I imagine).

Or skip entirely?

Hmmmmmmmmm.

Contributor Author

Yeah, the automated mechanism for accomplishing this would be slightly baroque and the manual way would be failure-prone. I think I still want to add it? Let me think about the best approach -- I think it will be something like what you describe, maybe writing settings directly rather than using an env var.
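One possible shape for that approach (all names hypothetical, not the mechanism Perma ended up with): a deploy-time step reads the current git tag and writes it directly into a settings module, with a fallback when git isn't available:

```python
import subprocess

def current_git_tag(default: str = "unknown") -> str:
    """Return the most recent git tag (or short hash) for the working
    tree, falling back to a default when git/the repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "describe", "--tags", "--always"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return out or default
    except (OSError, subprocess.CalledProcessError):
        return default

def write_version_setting(path: str, tag: str) -> None:
    # A deploy step could write this into e.g. a deployment settings
    # file, instead of round-tripping through an ENV var.
    with open(path, "w") as f:
        f.write(f'VERSION = "{tag}"\n')
```

The failure mode to watch for is the fallback silently shipping "unknown" when the deploy environment lacks tags.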

Contributor Author

If we do set this up, should the setting be visible in templates? Would we want to expose it elsewhere in the application?

@rebeccacremona (Contributor) Jan 8, 2025

I don't know why, but I feel like we did this once before, or something like it; it was exposed in the templates and read by a LIL service or uptimerobot or something...

Contributor Author

Here's how we did it before: 7e0a04b

Contributor Author

Now see e7d9405 -- I haven't yet set up the mechanism for interpolation in deployment.

Contributor Author

I now have a mechanism for interpolation in deployment; I don't love it, but it works, and I may revisit it later.

perma_web/perma/models.py (outdated)
@bensteinberg (Contributor Author)

replayweb claims "2 hashes verified, 1 invalid"
(This is true for working WACZs.)

Hmmm, interesting. I wonder if this is true just for these new WACZs or for other WACZs we produce... Maybe let's dig.

This is true for freshly-created Perma WACZs made with Scoop.
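One way to dig: recompute the hashes the WACZ records for itself, both the per-resource hashes in datapackage.json and the digest of datapackage.json in datapackage-digest.json, and see which one replayweb.page might be flagging. A stdlib-only sketch (not replayweb.page's actual verification code):

```python
import hashlib
import json
import zipfile

def verify_wacz_hashes(path) -> dict[str, bool]:
    """Recompute the sha256 hashes a WACZ records for its resources
    and for datapackage.json itself; return {name: matches}."""
    results = {}
    with zipfile.ZipFile(path) as z:
        dp_bytes = z.read("datapackage.json")
        dp = json.loads(dp_bytes)
        for res in dp["resources"]:
            actual = "sha256:" + hashlib.sha256(z.read(res["path"])).hexdigest()
            results[res["path"]] = (actual == res["hash"])
        digest = json.loads(z.read("datapackage-digest.json"))
        actual = "sha256:" + hashlib.sha256(dp_bytes).hexdigest()
        results["datapackage.json"] = (actual == digest["hash"])
    return results
```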

@bensteinberg bensteinberg merged commit 4dfa0f0 into harvard-lil:develop Jan 9, 2025
2 checks passed