Update `pages` and `requests` schemas on staging to match crawl dataset #18

max-ostapenko · 2024-10-14T09:11:38Z

The older data schema is being reprocessed using these queries:

After we promote these new schemas to be the default, we need to update agent processing.

We should then be able to use a plain SELECT * when copying data from crawl_staging to crawl in the crawl_complete pipeline.

max-ostapenko · 2024-11-22T16:53:34Z

The transformation of crawl_staging.requests into crawl.requests took ~13h for the Nov 2024 crawl (not including failed attempts).

@pmeenan let's update the wptagent so that crawl_staging matches the crawl schema.
As agreed, we will not update the legacy table anymore, so that is ready for a cleanup.

max-ostapenko · 2024-12-03T18:45:24Z

HTTPArchive/dataform#33 syncs the pipeline with these adjustments.

max-ostapenko · 2024-12-03T19:31:33Z

I hope it won't be needed, but in case there are issues with ingesting data into JSON columns, we can fall back to parsing STRING data within the pipeline, as we do currently.
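For illustration only, here is a minimal Python sketch of that fallback idea: accept a column value that is either already a JSON value or still a STRING, and parse the latter. The function name `parse_json_column` is hypothetical, and the real pipeline does this in SQL rather than Python.

```python
import json
from typing import Any, Optional


def parse_json_column(value: Any) -> Optional[Any]:
    """Return a parsed JSON value for either native JSON or STRING input.

    Hypothetical helper for illustration; not part of the actual pipeline.
    """
    if value is None:
        return None
    if isinstance(value, (dict, list)):
        # Already ingested as a JSON value, nothing to parse.
        return value
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            # Malformed payload: keep the row and null out the column.
            return None
    # Any other type is unexpected for this column.
    return None
```

Returning `None` for malformed input is a design choice in this sketch so that one bad row does not fail the whole load; a stricter pipeline could raise instead.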

max-ostapenko · 2024-12-06T00:29:51Z

@pmeenan from data in crawl_staging it looks like pages.summary values are not yet trimmed, e.g.:

'$._adult_site',
'$.archive',
'$.avg_dom_depth',
'$.crawlid',
'$.createDate',
'$.doctype',
'$.document_height',
'$.document_width',
'$.label',
'$.localstorage_size',
'$.meta_viewport',
'$.metadata',
'$.num_iframes',
'$.num_scripts_async',
'$.num_scripts_sync',
'$.num_scripts',
'$.pageid',
'$.PageSpeed',
'$.rank',
'$.sessionstorage_size',
'$.startedDateTime',
'$.url',
'$.urlhash',
'$.urlShort',
'$.usertiming',
'$.wptid',
'$.wptrun'

and pages.metadata:

'$.page_id',
'$.parent_page_id',
'$.root_page_id'

Is that right, or are these rows outdated?
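A rough way to run this kind of check outside BigQuery, sketched in Python. It assumes each column value is available as a JSON string; the field sets below are only a subset of the lists above, and `untrimmed_fields` and `kept_field` are made-up names for illustration.

```python
import json
from typing import Iterable, List

# Subset of the fields listed above that should no longer be present.
TRIMMED_SUMMARY_FIELDS = {"crawlid", "pageid", "urlhash", "wptid", "wptrun"}
TRIMMED_METADATA_FIELDS = {"page_id", "parent_page_id", "root_page_id"}


def untrimmed_fields(payload: str, trimmed: Iterable[str]) -> List[str]:
    """Return the trimmed field names that still appear as top-level keys."""
    obj = json.loads(payload)
    # intersection() iterates the dict's keys, so only top-level names count.
    return sorted(set(trimmed).intersection(obj))


# Example: a summary that still carries two fields that should be gone.
leftover = untrimmed_fields(
    '{"crawlid": 1, "wptid": "x", "kept_field": 10}', TRIMMED_SUMMARY_FIELDS
)
```

An empty list means the row is already trimmed with respect to the fields being checked.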

max-ostapenko · 2024-12-06T00:41:00Z

Some of these fields were removed in #15.
How should we proceed?

pmeenan · 2024-12-06T02:36:36Z

Were they present in all the records? The first few were before I removed the fields but the last 1-2 should have most of them removed.


max-ostapenko · 2024-12-06T10:01:05Z

Yeah, if sorted by payload.startedDateTime, the last two are OK.
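As a sketch of that check in Python: pick the most recent rows by payload.startedDateTime. This assumes `payload` is a JSON string and that startedDateTime values share one ISO 8601 format, so that string order matches time order; `latest_rows` and the `id` key are hypothetical.

```python
import json
from typing import Dict, List


def latest_rows(rows: List[Dict[str, str]], n: int = 2) -> List[Dict[str, str]]:
    """Return the n rows with the greatest payload.startedDateTime, oldest first."""

    def started(row: Dict[str, str]) -> str:
        # Assumes uniformly formatted ISO 8601 timestamps (sortable as strings).
        return json.loads(row["payload"])["startedDateTime"]

    return sorted(rows, key=started)[-n:]


sample = [
    {"id": "b", "payload": '{"startedDateTime": "2024-12-05T10:00:00Z"}'},
    {"id": "a", "payload": '{"startedDateTime": "2024-12-04T08:00:00Z"}'},
    {"id": "c", "payload": '{"startedDateTime": "2024-12-05T23:30:00Z"}'},
]
newest_ids = [row["id"] for row in latest_rows(sample)]
```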
👍🏽

max-ostapenko changed the title ~~Align agent data structure to be written to BQ with the new all dataset schema.~~ Update pages and requests schemas written by agent to BQ to a new one · Oct 14, 2024

This was referenced · Oct 14, 2024

Trimmed summary for pages and requests #15 (Closed)

Trim page ids from pages.metadata HTTPArchive/dataform#19 (Merged)

max-ostapenko changed the title ~~Update pages and requests schemas written by agent to BQ to a new one~~ Update pages and requests schemas on staging to match crawl dataset · Dec 3, 2024

max-ostapenko mentioned this issue · Dec 4, 2024

Pipeline adjustments after staging aligned with crawl dataset HTTPArchive/dataform#33 (Merged)

pmeenan mentioned this issue · Dec 4, 2024

Moved to the new crawl schema #28 (Merged)

pmeenan closed this as completed in #28 · Dec 4, 2024
