As objects pass through the queued pipeline, we should build up a list of processes that touched the object.
For example, when `fetch` grabs a page, we create both a `.raw` and a `.json` object in S3. The JSON object should have a field, `history` (or similar), that contains a list of values. To start, it would contain `fetch`. After we walk the page, it should contain `fetch,walk`. After extraction, `fetch,walk,extract`, and so on.
`validate` can then tell whether or not an object has gone through the entire pipeline.
Part of this, however, involves us knowing when a process is "done."
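A minimal sketch of what the `history` bookkeeping could look like, assuming the JSON metadata is handled as a plain dict and that a stage only records itself once it has finished. The names `PIPELINE`, `record_stage`, `missing_stages`, and `is_complete` are hypothetical, and the stage order is assumed from the three stages named above; reading and writing the object in S3 is left out.

```python
import json

# Assumed stage order; only fetch, walk, and extract are named in this issue.
PIPELINE = ["fetch", "walk", "extract"]


def record_stage(metadata: dict, stage: str) -> dict:
    """Return a copy of the metadata with `stage` appended to its history.

    Intended to be called once the stage considers itself done.
    """
    history = list(metadata.get("history", []))
    if stage not in history:  # a retried stage is not recorded twice
        history.append(stage)
    return {**metadata, "history": history}


def missing_stages(metadata: dict, pipeline=PIPELINE) -> list:
    """List the pipeline stages that have not touched this object yet."""
    done = set(metadata.get("history", []))
    return [stage for stage in pipeline if stage not in done]


def is_complete(metadata: dict, pipeline=PIPELINE) -> bool:
    """The check that validate could run against each JSON object."""
    return not missing_stages(metadata, pipeline)


# Usage: an object that has been fetched and walked, but not extracted.
meta = record_stage({"url": "https://example.com"}, "fetch")
meta = record_stage(meta, "walk")
print(json.dumps(meta))
print(missing_stages(meta))  # ['extract']
```

Returning a copy rather than mutating in place keeps the write back to S3 as the single point where the history actually changes, which is one possible answer to when a process counts as "done."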
Food for thought
Do we want to keep this history with the object, or in a `work` database? Decoupling it could be more trouble, but it would let us analyze the pipeline more quickly and easily than retrieving all of the objects to inspect them.
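For comparison, here is a rough sketch of the decoupled option, using SQLite purely as a stand-in for whatever the work database would be. The table layout and the names `open_work_db`, `mark_done`, and `incomplete_objects` are assumptions made for illustration, not an existing schema.

```python
import sqlite3

# Assumed stage order; only fetch, walk, and extract are named in this issue.
PIPELINE = ("fetch", "walk", "extract")


def open_work_db(path=":memory:"):
    """Open the (hypothetical) work database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        " object_key TEXT NOT NULL,"
        " stage TEXT NOT NULL,"
        " PRIMARY KEY (object_key, stage))"
    )
    return conn


def mark_done(conn, object_key, stage):
    """Record that `stage` finished for the object; repeats are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO history (object_key, stage) VALUES (?, ?)",
        (object_key, stage),
    )
    conn.commit()


def incomplete_objects(conn, pipeline=PIPELINE):
    """Return keys of objects that have not been through every stage."""
    placeholders = ",".join("?" for _ in pipeline)
    rows = conn.execute(
        "SELECT object_key FROM history GROUP BY object_key "
        f"HAVING SUM(stage IN ({placeholders})) < ? ORDER BY object_key",
        (*pipeline, len(pipeline)),
    )
    return [row[0] for row in rows]


# Usage: one object finished the pipeline, another was only fetched.
conn = open_work_db()
for stage in PIPELINE:
    mark_done(conn, "pages/a.json", stage)
mark_done(conn, "pages/b.json", "fetch")
print(incomplete_objects(conn))  # ['pages/b.json']
```

The completeness check becomes a single query instead of one S3 read per object, at the cost of keeping the database in sync with what is actually in the bucket.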