
Save user uploads as WACZs #3679

Merged · 15 commits · Jan 9, 2025

Conversation

@bensteinberg (Contributor) commented Dec 16, 2024

This is a first cut at preserving user uploads as WACZ files rather than WARCs. I'm not sure which Linear ticket is the best one to link here. I'm making this a draft for now.

This works, in the sense that it produces a valid WACZ that replayweb.page can play back, but it does not yet play back in Perma; the error message, from wabac.js, is e.g.

Archived Page Not Found

Sorry, this page was not found in this archive:

file:///8MXD-LZ6V/upload.jpg?version=040666551871161394
...

(That page is in fact in the archive.)

I'm guessing I need a slightly different set of options to py-wacz.

@bensteinberg bensteinberg marked this pull request as draft December 16, 2024 15:16
@bensteinberg (Contributor Author)

Apart from getting this to work, the main question is what the metadata should look like.

@bensteinberg (Contributor Author)

I think the problem here is that the CDX index is broken: the entries for file:/// URIs are being rewritten to collapse consecutive slashes into one. I think this is down to cdxj_indexer just running surt on the URL, rather than doing what warcio does with getSurt() in the Scoop context:

    if (!url.startsWith("https:") && !url.startsWith("http:")) {
      return url;
    }

I'm going to see if I can demonstrate this. If this is the problem, I'm not sure where to make a fix.
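To make the suspected mechanism concrete, here is a Python sketch mirroring the JS guard above. `naive_surt` is a hypothetical stand-in for a SURT transform that collapses consecutive slashes (the behavior suspected of mangling the file:/// keys); `get_surt` applies the warcio-style scheme check first:

```python
from urllib.parse import urlsplit

def naive_surt(url: str) -> str:
    # Hypothetical stand-in for a SURT transform that, like the surt
    # library is suspected of doing here, collapses consecutive
    # slashes in the path.
    parts = urlsplit(url)
    host = ",".join(reversed(parts.netloc.split(".")))
    path = parts.path
    while "//" in path:
        path = path.replace("//", "/")
    return f"{host}){path}"

def get_surt(url: str) -> str:
    # Mirror of warcio's getSurt(): leave non-http(s) URIs untouched,
    # so file:/// keys keep their slashes and CDX lookups still match.
    if not url.startswith("https:") and not url.startswith("http:"):
        return url
    return naive_surt(url)
```

With the guard in place, a file:/// URI passes through unchanged, while http(s) URLs still get the normal SURT treatment.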

@rebeccacremona (Contributor)

(A possible thread to pull when working further with this: does using a response-type record instead of a resource record affect the indexing? It shouldn't, in theory... but I recall that in Scoop-produced WARCs everything is a response record, including attachments like the screenshot and provenance summary, and I don't know why, so I'm a little suspicious this could be related.)

@bensteinberg (Contributor Author)

Thanks. I've tried with both response and resource records; not sure it matters here. (Philosophically, these might better be resource records.) I've removed the dependency on the indexer and am now creating the WACZ inline. I'm still not producing a working file, but I've pushed the changes for now.

@bensteinberg (Contributor Author)

The thread I'm pulling now is that replayweb claims "2 hashes verified, 1 invalid" -- but I haven't been able to find an invalid hash. I am not perfectly confident in the way I'm calculating offset in the WARC, which I suppose could be the problem.
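For reference, a stdlib-only sketch (function and field names hypothetical) of the offset bookkeeping when each serialized WARC record is written as its own gzip member, which is the layout the CDXJ offset/length fields describe:

```python
import gzip
import io

def write_records(records: list[bytes]) -> tuple[bytes, list[dict]]:
    """Write each serialized WARC record as an independent gzip member,
    recording the compressed offset and length for the CDXJ index."""
    buf = io.BytesIO()
    entries = []
    for raw in records:
        offset = buf.tell()              # start of this gzip member
        buf.write(gzip.compress(raw))
        entries.append({"offset": offset, "length": buf.tell() - offset})
    return buf.getvalue(), entries
```

A common pitfall is measuring offsets against the uncompressed stream, or concatenating all records into one gzip member; either produces an index whose entries can't be sliced back out, which could plausibly surface as a hash or lookup failure in replay.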

codecov bot commented Jan 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.91%. Comparing base (ffa57fa) to head (010007c).
Report is 22 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3679      +/-   ##
===========================================
+ Coverage    69.67%   69.91%   +0.23%     
===========================================
  Files           54       54              
  Lines         7661     7725      +64     
===========================================
+ Hits          5338     5401      +63     
- Misses        2323     2324       +1     


@rebeccacremona (Contributor)

I can try to find the code we once used for making CDX lines, back when that was a function of Perma. Those were not CDXJ, but it could be interesting just to see.

@bensteinberg (Contributor Author)

One of my thoughts here is that since what we are trying to do is so minimal, there's no need for a general solution to indexing or any other part of WACZ production.
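To illustrate how small that surface area is, here is a stdlib-only sketch of assembling the handful of files a minimal WACZ contains, per the WACZ spec. The helper names and the exact `metadata` fields are assumptions, not Perma's implementation:

```python
import hashlib
import json
import zipfile

def sha256(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

def build_minimal_wacz(path, warc_gz: bytes, cdxj: bytes,
                       pages_jsonl: bytes, metadata: dict) -> None:
    """Assemble a minimal WACZ: the WARC, its CDXJ index, a pages
    list, plus datapackage.json and its digest."""
    files = {
        "archive/data.warc.gz": warc_gz,
        "indexes/index.cdx": cdxj,
        "pages/pages.jsonl": pages_jsonl,
    }
    resources = [
        {"name": name.split("/")[-1], "path": name,
         "hash": sha256(data), "bytes": len(data)}
        for name, data in files.items()
    ]
    datapackage = json.dumps(
        {"profile": "data-package", "resources": resources, **metadata},
        indent=2).encode()
    digest = json.dumps(
        {"path": "datapackage.json", "hash": sha256(datapackage)}).encode()
    with zipfile.ZipFile(path, "w", zipfile.ZIP_STORED) as z:
        for name, data in files.items():
            z.writestr(name, data)
        z.writestr("datapackage.json", datapackage)
        z.writestr("datapackage-digest.json", digest)
```

Note that members are stored uncompressed (ZIP_STORED), since the WARC payload is already gzipped and replay tools random-access the archive.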

@bensteinberg (Contributor Author)

Thanks @rebeccacremona and @matteocargnelutti for the help!

replayweb claims "2 hashes verified, 1 invalid"

(This is true for working WACZs.)

Is it worth adding the WARC header REFERS_TO_TARGET_URI?
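For context, the header in question is `WARC-Refers-To-Target-URI`, which WARC/1.1 defines for revisit records; whether it belongs on these records is exactly what's TBD. A pure-Python sketch of a resource record carrying it (all record values hypothetical):

```python
import uuid
from datetime import datetime, timezone

def build_resource_record(target_uri: str, refers_to_uri: str,
                          payload: bytes, content_type: str) -> bytes:
    """Serialize a single WARC/1.1 resource record, including the
    optional WARC-Refers-To-Target-URI header under discussion."""
    headers = [
        "WARC/1.1",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Refers-To-Target-URI: {refers_to_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # A record is headers, a blank line, the payload, then two CRLFs.
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"
```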

@bensteinberg bensteinberg marked this pull request as ready for review January 7, 2025 21:20
@rebeccacremona (Contributor) left a comment

Thanks so much for taking this on!! This is cool!

I read through your code once so far; I'd love to take a more detailed look later, including looking at a couple of the finished WACZs. But in the meantime, shared a few thoughts.

replayweb claims "2 hashes verified, 1 invalid"
(This is true for working WACZs.)

Hmmm, interesting. I wonder if this is true just for these new WACZs or for other WACZs we produce... Maybe let's dig.

Is it worth adding the WARC header REFERS_TO_TARGET_URI?

Interesting. TBD!

perma_web/api/tests/test_link_resource.py (outdated)
<header>
<h1>Provenance Summary</h1>
<p>The data present in this capture were uploaded by a Perma user to replace a failed or unsatisfactory capture of {{ url }} on {{ now }}.</p>
</header>
Contributor

Hooray for a provenance summary!

TBD! I think probably let's add some more details here before this goes live?

Contributor Author

What details do you have in mind? This is minimal partly because it's a placeholder and opportunity for discussion, but also because I'm inclined to keep the file as short as possible. I don't think it's necessary to provide the level of provenance detail Scoop provides.

Contributor

I don't think it's necessary to provide the level of provenance detail Scoop provides.

Agreed.

How about time of original capture (if any), time of replacement, replacement mime type?

Contributor Author

See 63ca218 --

[image attachment]

Contributor

Looks good to me!

"description": f"User upload for {url}",
"mainPageURL": warc_url,
"created": ts_string,
"software": "Perma.cc", # version?
Contributor

# version?

Hmmm...

So if we wanted to consider the production "tag" the version, then we'd have to somehow expose that to the application and interpolate here... Which sounds like a bit of a pain in the butt? (Some salt state that runs a command that gets the current tag and puts it into an ENV var, and then an application setting that reads that ENV var?)

Or some new version number that we just increment if/when we change this code (we'll be prone to forget, I imagine).

Or skip entirely?

Hmmmmmmmmm.

Contributor Author

Yeah, the automated mechanism for accomplishing this would be slightly baroque and the manual way would be failure-prone. I think I still want to add it? Let me think about the best approach -- I think it will be something like what you describe, maybe writing settings directly rather than using an env var.
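One possible shape for that approach (all names hypothetical, not the mechanism Perma ended up with): a deploy-time step reads the current git tag and writes it directly into a settings module, with a fallback when git isn't available:

```python
import subprocess

def current_git_tag(default: str = "unknown") -> str:
    """Return the most recent git tag (or short hash) for the working
    tree, falling back to a default when git/the repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "describe", "--tags", "--always"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return out or default
    except (OSError, subprocess.CalledProcessError):
        return default

def write_version_setting(path: str, tag: str) -> None:
    # A deploy step could write this into e.g. a deployment settings
    # file, instead of round-tripping through an ENV var.
    with open(path, "w") as f:
        f.write(f'VERSION = "{tag}"\n')
```

The failure mode to watch for is the fallback silently shipping "unknown" when the deploy environment lacks tags.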

Contributor Author

If we do set this up, should the setting be visible in templates? Would we want to expose it elsewhere in the application?

@rebeccacremona (Contributor) Jan 8, 2025

I don't know why, but I feel like we did this once before, or something like it; it was exposed in the templates and read by a LIL service or uptimerobot or something...

Contributor Author

Here's how we did it before: 7e0a04b

Contributor Author

Now see e7d9405 -- I haven't yet set up the mechanism for interpolation in deployment.

Contributor Author

I now have a mechanism for interpolation in deployment; I don't love it, but it works, and I may revisit it later.

perma_web/perma/models.py (outdated)
@bensteinberg (Contributor Author)

replayweb claims "2 hashes verified, 1 invalid"
(This is true for working WACZs.)

Hmmm, interesting. I wonder if this is true just for these new WACZs or for other WACZs we produce... Maybe let's dig.

This is true for freshly-created Perma WACZs made with Scoop.
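One way to dig: recompute the hashes the WACZ records for itself, both the per-resource hashes in datapackage.json and the digest of datapackage.json in datapackage-digest.json, and see which one replayweb.page might be flagging. A stdlib-only sketch (not replayweb.page's actual verification code):

```python
import hashlib
import json
import zipfile

def verify_wacz_hashes(path) -> dict[str, bool]:
    """Recompute the sha256 hashes a WACZ records for its resources
    and for datapackage.json itself; return {name: matches}."""
    results = {}
    with zipfile.ZipFile(path) as z:
        dp_bytes = z.read("datapackage.json")
        dp = json.loads(dp_bytes)
        for res in dp["resources"]:
            actual = "sha256:" + hashlib.sha256(z.read(res["path"])).hexdigest()
            results[res["path"]] = (actual == res["hash"])
        digest = json.loads(z.read("datapackage-digest.json"))
        actual = "sha256:" + hashlib.sha256(dp_bytes).hexdigest()
        results["datapackage.json"] = (actual == digest["hash"])
    return results
```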

@bensteinberg bensteinberg merged commit 4dfa0f0 into harvard-lil:develop Jan 9, 2025
2 checks passed