Save user uploads as WACZs #3679
Apart from getting this to work, the main question is what the metadata should look like.
I think the problem here is that the CDX index is broken, because the entries for
I'm going to see if I can demonstrate this. If this is the problem, I'm not sure where to make a fix.
(A possible thread to pull when working further with this: does using a
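One way to demonstrate whether index entries point at the right place is to check that each entry's offset and length select exactly one gzip member of the WARC. The following is only a sketch under assumptions: it assumes each record is written as its own gzip member and that the index offset and length refer to the compressed bytes. `append_member` and `read_member` are hypothetical helpers, and the payloads are placeholders rather than real WARC records.

```python
import gzip
import io


def append_member(buf: io.BytesIO, record: bytes) -> tuple[int, int]:
    """Append one record as its own gzip member; return (offset, length)."""
    offset = buf.tell()
    buf.write(gzip.compress(record))
    return offset, buf.tell() - offset


def read_member(data: bytes, offset: int, length: int) -> bytes:
    """Decompress exactly the slice an index entry points at."""
    return gzip.decompress(data[offset:offset + length])


# Placeholder payloads standing in for real WARC records.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]
warc = io.BytesIO()
entries = [append_member(warc, record) for record in records]
```

If an offset or length is wrong, `read_member` either raises or returns bytes that do not match the record, which makes a broken entry easy to spot.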
Thanks. I've tried with both response and resource records; not sure it matters here. (Philosophically, these might better be resource records.) I've removed the dependency on the indexer and am now creating the WACZ inline. I'm still not producing a working file, but I've pushed the changes for now.
The thread I'm pulling now is that replayweb claims "2 hashes verified, 1 invalid" -- but I haven't been able to find an invalid hash. I am not perfectly confident in the way I'm calculating offset in the WARC, which I suppose could be the problem.
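For hunting an invalid hash, a small stdlib-only check can recompute every hash the package declares. This is a sketch, assuming the WACZ layout of a `datapackage.json` whose `resources` entries carry `path` and `hash` (written as `algorithm:hexdigest`), plus an optional `datapackage-digest.json` that covers `datapackage.json` itself. `check_wacz_hashes` is a hypothetical helper, not part of Perma or py-wacz, and it may not mirror what replayweb.page actually verifies.

```python
import hashlib
import json
import zipfile


def check_wacz_hashes(path):
    """Return {member: bool} for every hash the package declares."""
    results = {}
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        package = json.loads(zf.read("datapackage.json"))
        declared = {r["path"]: r["hash"] for r in package["resources"]}
        if "datapackage-digest.json" in names:
            digest = json.loads(zf.read("datapackage-digest.json"))
            declared[digest["path"]] = digest["hash"]
        for member, expected in declared.items():
            algorithm, _, hexdigest = expected.partition(":")
            if member not in names:
                results[member] = False  # declared but missing from the zip
                continue
            actual = hashlib.new(algorithm, zf.read(member)).hexdigest()
            results[member] = actual == hexdigest
    return results
```

Running this against both a new upload WACZ and a known-good one would show whether a mismatch is specific to the new files.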
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #3679    +/-   ##
===========================================
+ Coverage    69.67%   69.91%   +0.23%
===========================================
  Files           54       54
  Lines         7661     7725     +64
===========================================
+ Hits          5338     5401     +63
- Misses        2323     2324      +1
I can try to find the code we used to use for making CDX lines, back when that was a function of Perma. Those were not CDXJ, but it could be interesting just to see.
One of my thoughts here is that since what we are trying to do is so minimal, there's no need for a general solution to indexing or any other part of WACZ production.
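To illustrate how small a purpose-built indexer could be, here is a sketch that emits a single CDXJ line for one record. It rests on several assumptions: `simple_surt` is a deliberately simplified stand-in for real SURT canonicalization, the field names follow the common CDXJ layout (`url`, `mime`, `status`, `digest`, `length`, `offset`, `filename`), and whether a given player expects numbers or strings for the numeric fields should be checked against a working file. Both function names are hypothetical.

```python
import json
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Simplified sort key: reversed host labels, then path and query."""
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.lower().split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{host}){path}"


def cdxj_line(url, timestamp, offset, length, filename, mime, digest, status=200):
    """Build one CDXJ line: '<key> <timestamp> <json>'."""
    fields = {
        "url": url,
        "mime": mime,
        "status": status,
        "digest": digest,
        "length": length,
        "offset": offset,
        "filename": filename,
    }
    return f"{simple_surt(url)} {timestamp} {json.dumps(fields)}"
```

For a single uploaded file, one such line (plus any lines for metadata pages) may be all the index needs to hold.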
Thanks @rebeccacremona and @matteocargnelutti for the help!
(This is true for working WACZs.) Is it worth adding the WARC header REFERS_TO_TARGET_URI?
Thanks so much for taking this on!! This is cool!
I read through your code once so far; I'd love to take a more detailed look later, including looking at a couple of the finished WACZs. But in the meantime, I've shared a few thoughts.
> replayweb claims "2 hashes verified, 1 invalid"
> (This is true for working WACZs.)
Hmmm, interesting. I wonder if this is true just for these new WACZs or for other WACZs we produce... Maybe let's dig.
> Is it worth adding the WARC header REFERS_TO_TARGET_URI?
Interesting. TBD!
<header>
<h1>Provenance Summary</h1>
<p>The data present in this capture were uploaded by a Perma user to replace a failed or unsatisfactory capture of {{ url }} on {{ now }}.</p>
</header>
Hooray for a provenance summary!
TBD! I think probably let's add some more details here before this goes live?
What details do you have in mind? This is minimal partly because it's a placeholder and opportunity for discussion, but also because I'm inclined to keep the file as short as possible. I don't think it's necessary to provide the level of provenance detail Scoop provides.
> I don't think it's necessary to provide the level of provenance detail Scoop provides.
Agreed.
How about time of original capture (if any), time of replacement, replacement mime type?
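As a starting point for discussion, here is a sketch of a summary carrying those three fields. It is not the template from the PR: `provenance_summary`, its parameter names, and the field labels are all hypothetical, and it uses plain Python string formatting rather than the template engine the page is rendered with.

```python
from datetime import datetime, timezone
from html import escape


def provenance_summary(url, replaced_at, mime_type, original_capture_at=None):
    """Render a short provenance header; all field labels are placeholders."""
    original = original_capture_at.isoformat() if original_capture_at else "none recorded"
    rows = [
        ("Original capture", original),
        ("Replacement uploaded", replaced_at.isoformat()),
        ("Replacement MIME type", mime_type),
    ]
    items = "\n".join(f"<dt>{escape(k)}</dt><dd>{escape(v)}</dd>" for k, v in rows)
    return (
        "<header>\n"
        "<h1>Provenance Summary</h1>\n"
        "<p>The data present in this capture were uploaded by a Perma user "
        f"to replace a failed or unsatisfactory capture of {escape(url)}.</p>\n"
        f"<dl>\n{items}\n</dl>\n"
        "</header>"
    )
```

Calling it with, for example, `datetime(2024, 1, 2, tzinfo=timezone.utc)` as `replaced_at` keeps the output to a handful of lines, in line with the goal of a short file.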
See 63ca218.
Looks good to me!
perma_web/perma/utils.py
Outdated
"description": f"User upload for {url}",
"mainPageURL": warc_url,
"created": ts_string,
"software": "Perma.cc",  # version?
> # version?
Hmmm...
So if we wanted to consider the production "tag" the version, then we'd have to somehow expose that to the application and interpolate here... Which sounds like a bit of a pain in the butt? (Some salt state that runs a command that gets the current tag and puts it into an ENV var, and then an application setting that reads that ENV var?)
Or we introduce a separate version number that we increment if/when we change this code (we'll be prone to forget, I imagine).
Or skip entirely?
Hmmmmmmmmm.
Yeah, the automated mechanism for accomplishing this would be slightly baroque and the manual way would be failure-prone. I think I still want to add it? Let me think about the best approach -- I think it will be something like what you describe, maybe writing settings directly rather than using an env var.
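For comparison, the env-var variant could be as small as the sketch below. It assumes the deployment exports the release tag in an environment variable; `PERMA_RELEASE_TAG` and `software_string` are hypothetical names, not existing Perma settings. It falls back to the bare name when the tag is unset, so the metadata stays valid in development.

```python
import os


def software_string(environ=os.environ, name="Perma.cc"):
    """Return e.g. 'Perma.cc <tag>' when a release tag is set, else just the name."""
    # PERMA_RELEASE_TAG is a hypothetical variable name set at deploy time.
    tag = environ.get("PERMA_RELEASE_TAG", "").strip()
    return f"{name} {tag}" if tag else name
```

Writing the value directly into settings at deploy time would replace the `environ.get` lookup with a plain constant, at the cost of a deploy step that edits a file.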
If we do set this up, should the setting be visible in templates? Would we want to expose it elsewhere in the application?
I don't know why I feel like we did this once before, or something like it, and it was exposed in the templates, and it was read by a LIL service or uptimerobot or something....
Here's how we did it before: 7e0a04b
Now see e7d9405 -- I haven't yet set up the mechanism for interpolation in deployment.
I now have a mechanism for interpolation in deployment; I don't love it, but it works, and I may revisit it later.
This is true for freshly-created Perma WACZs made with Scoop.
This is a first cut at preserving user uploads as WACZ files rather than WARCs. I'm not sure which Linear ticket is the best one to link here. I'm making this a draft for now.
This works, in the sense that it produces a valid WACZ that replayweb.page can play back, but it does not yet play back in Perma; the error message, from wabac.js, is e.g.
(That page is in fact in the archive.)
I'm guessing I need a slightly different set of options to py-wacz.