Epic: Track changes to a data series and push the date of the latest change to CKAN #298

cjhendrix · 2014-10-17T15:17:30Z

The goal is to allow all the information about a ckan indicator (which is simply a ckan dataset that is coming from CPS) to be maintained in one place: CPS. CPS would then have the ability to push changes to this information (let's call it Ancillary Indicator Information, AI2) to CKAN via CKAN's action API.

The goal of this epic is to set up the framework for this using a high value test case, described below.

Consider all the indicators returned from this search: https://data.hdx.rwlabs.org/dataset?q=fts+cross-appeal

Note that the "Updated By" date for all of them is July 7, the date when the CKAN datasets were created. The data on CPS has been updated at least weekly since then, but CKAN has no way of knowing this. This epic will result in these dates being updated by CPS whenever a change is made to the data series. Later we will expand this approach to allow all of the AI2 to be managed in CPS.

The list of AI2 to be managed by CPS will ultimately include:

- Dataset AI2 (0..1 per data series)
  - most recent date of changed data or metadata (for "updated by")
  - description
  - source
  - source link
  - date range (calculated)
  - locations (1..*)
  - public/private
  - license (including other + text)
  - methodology (including other + text)
  - caveats
  - tags (0..*)
  - topics (fixed list)
  - Org
  - data display precision
- Resource AI2 (0..* per data series)
  - CPS API URL for Resource
  - Resource Name
  - Resource Note
  - Resource Format

rufuspollock · 2014-10-17T15:28:10Z

I'm a bit unclear why you wouldn't manage all of this info directly in CKAN. It already has the capability to store all this kind of info, and it saves you having to reinvent the wheel by adding support for this in CPS (and then pushing it back across into CKAN).

cjhendrix · 2014-10-17T15:59:53Z

We use CPS to import and normalize data and maintain referential integrity. Since at least some of the info has to be maintained on the CPS side, our data managers feel it would be easier to manage all of it there. This is just for those datasets that we curate, not user contributed datasets which live solely on CKAN.

rufuspollock · 2014-10-22T08:45:49Z

@cjhendrix I guess the question is why you couldn't maintain all the info on the CKAN side here based on DRY principles? Generally, I think it would be really useful (for me) to understand a bit more about the overall architecture especially of CPS to understand what is being done where and how as I can then offer more useful input :-)

cjhendrix · 2014-10-23T12:07:46Z

Note to Sam. Understood that this one will likely carry over multiple sprints given your availability.

seustachi · 2014-10-28T12:11:10Z

The biggest difficulty I see here is that CPS does not know about the curated datasets.

Instead, the curated datasets know about CPS.

If we add some kind of mapping, allowing CPS to know which curated datasets to update when some data (or metadata) changes are detected, we still have 2 places to maintain. If we add a new indicator, we have to create it in CPS, create the curated dataset, and they both must know about each other.

So we don't follow the DRY principle, and I am not sure this will be simpler for the data team.

The gain here would be that once this is set up, the updates should be replicated.

I think we should have a call dedicated to this topic.

seustachi · 2014-11-09T19:17:11Z

So after discussion, here is the plan:

There is a one-to-one relationship between dataseries and CKAN datasets. So if we detect a change in the data or metadata for a dataserie, we can push it to the dataset.

The metadata would be pushed to the dataset metadata.
For the data, this is still to be discussed. We might want to push a new file and/or invalidate the cache. Many things might have to be done; this needs to be discussed with Alex and Serban.

What I can do already is the following:

Add some fields to the dataserie table:

- Name of the dataset (to be able to push to CKAN)
- Last metadata update
- Last metadata push
- Last data update
- Last data push

Set up a job that searches for the dataseries where Last metadata update > Last metadata push, pushes their metadata to CKAN, and then updates the Last metadata push value.
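To make the job logic concrete, here is a minimal Python sketch of the "update newer than push" check. It is only an illustration under assumptions: the `DataSerie` class, its field names, and the injected `push` callable are hypothetical stand-ins for the dataserie table and the real CKAN call, not the actual CPS code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class DataSerie:
    """Hypothetical in-memory stand-in for one row of the dataserie table."""
    dataset_name: str                                 # CKAN dataset to push to
    last_metadata_update: Optional[datetime] = None
    last_metadata_push: Optional[datetime] = None


def utc_now() -> datetime:
    """Default clock; callers can inject another one."""
    return datetime.now(timezone.utc)


def needs_metadata_push(serie: DataSerie) -> bool:
    """True when the metadata changed after the last successful push."""
    if serie.last_metadata_update is None:
        return False                                  # nothing has changed yet
    if serie.last_metadata_push is None:
        return True                                   # changed but never pushed
    return serie.last_metadata_update > serie.last_metadata_push


def run_metadata_sync(series: Iterable[DataSerie],
                      push: Callable[[DataSerie], None],
                      now: Callable[[], datetime] = utc_now) -> List[str]:
    """Push every stale dataserie and record the push time on success."""
    pushed = []
    for serie in series:
        if not needs_metadata_push(serie):
            continue
        started = now()                               # read the clock before pushing
        push(serie)                                   # an exception leaves the row stale
        serie.last_metadata_push = started
        pushed.append(serie.dataset_name)
    return pushed
```

Reading the clock before the push means that a metadata change arriving while the push is running still compares as newer and is picked up on the next run.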

rufuspollock · 2014-11-10T07:59:32Z

@seustachi it would be super useful to get a bit of a diagram here to understand what is going on. As mentioned, you'll want to be careful about not ending up with your authoritative metadata in two places (and getting stuff out of sync).

cjhendrix · 2014-11-10T10:11:32Z

@seustachi The key thing we need to urgently solve is the high value test case listed in the original issue above. If I understand your last comment above, it sounds like you are putting that one as secondary. Happy to discuss, but I think you need to focus your effort on that one.

seustachi · 2014-11-10T10:34:23Z

@cjhendrix I don't put it as secondary priority.

To detect a change related to a dataserie is a prerequisite.
To know how dataseries and datasets are related is also a prerequisite.

Then we will be able to push information to CKAN.

cjhendrix · 2014-11-10T12:10:31Z

Ok, thanks for the clarification.

seustachi · 2014-11-14T12:13:37Z

So, we agreed that:

- a dataset is related to a dataserie
- we need to change the names of the datasets to be able to have several datasets for an indicatorName (title_with_underscore___sourceCode)
- we want in the extras: sourceCode, sourceName, IndicatorTypeCode, IndicatorTypeName, lastUpdateDate
- we will propose to Serban to use a static FS for reports, updated by the CkanSynchronizerJob (this job is in CPS), instead

LastUpdateDate changes only if at least one value was added or updated.

seustachi · 2014-11-18T19:59:04Z

List of the extras keys we want to use:

"dataset_source" for the sourceName
"dataset_source_code" for the source code

"indicator_type" for the IT Name
"indicator_type_code" for the IT code

"dataset_date": "11/02/2014-11/20/2014", for the date range of the data

"dataset_summary"
"methodology"
"more_info"
"terms_of_use"
"validation_notes_and_comments"
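As an illustration of how these keys might be assembled, here is a small Python sketch. It assumes the generic CKAN convention of storing extras as a list of `{"key": ..., "value": ...}` pairs and the MM/DD/YYYY-MM/DD/YYYY range shown in the `dataset_date` example; the function names and parameters are hypothetical, and the exact action format agreed for HDX may differ.

```python
from datetime import date
from typing import Dict, List, Optional

# Free-text keys from the list above; they are only emitted when a value is given.
TEXT_KEYS = (
    "dataset_summary",
    "methodology",
    "more_info",
    "terms_of_use",
    "validation_notes_and_comments",
)


def format_dataset_date(start: date, end: date) -> str:
    """Render a date range as MM/DD/YYYY-MM/DD/YYYY."""
    return f"{start:%m/%d/%Y}-{end:%m/%d/%Y}"


def build_extras(source_name: str,
                 source_code: str,
                 indicator_type_name: str,
                 indicator_type_code: str,
                 start: date,
                 end: date,
                 texts: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
    """Build the extras as a list of key/value pairs (assumed CKAN shape)."""
    extras = {
        "dataset_source": source_name,
        "dataset_source_code": source_code,
        "indicator_type": indicator_type_name,
        "indicator_type_code": indicator_type_code,
        "dataset_date": format_dataset_date(start, end),
    }
    for key in TEXT_KEYS:
        if texts and key in texts:
            extras[key] = texts[key]
    return [{"key": k, "value": v} for k, v in extras.items()]
```

For example, `format_dataset_date(date(2014, 11, 2), date(2014, 11, 20))` yields `"11/02/2014-11/20/2014"`, matching the sample value above.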

seustachi · 2014-11-18T20:00:39Z

Format of the action we want to use is documented here :
https://gist.github.com/alexandru-m-g/09155dff01e8302acf47

seustachi · 2014-11-18T20:26:57Z

More info here : https://docs.google.com/document/d/1KqOQtDGgu-HE1VFDGg8te8fP9adlh1HHMBWv5muAQmg/edit

seustachi · 2014-12-27T19:15:33Z

@cjhendrix @alexandru-m-g
I don't remember what we decided about the change to the dataset names.

Do we keep a human-readable title (title_with_underscore___sourceCode), or do we want (indTypeCode_SourceCode)?

I think I remember that CJ preferred the human-readable one. If we do that, we have to manage the title in CPS (to be able to push updates). Is that what we want?

teodorescuserban · 2014-12-28T17:32:25Z

Please, when in doubt about any names, favor human readable over anything else and url slug over human readable.

cjhendrix · 2014-12-29T10:22:50Z

@seustachi It's the former, for example: https://data.hdx.rwlabs.org/dataset/proportion_of_the_population_using_improved_sanitation_facilities___mdgs

Alex is making the change in sprint 46 (2 week sprint starting 5 Jan): OCHA-DAP/hdx-ckan#1771

As for managing the title in CPS, that should be fine. The only thing we shouldn't manage is the "name", which is used for the URL.
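For reference, a name of the `title_with_underscore___sourceCode` shape can be derived as in the Python sketch below. This is only an illustration: the `dataset_name` helper is hypothetical, and the title and source code used in the usage note are guesses inferred from the example URL above, not values taken from CPS.

```python
import re


def _slug(text: str) -> str:
    """Lowercase the text and collapse runs of other characters into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")


def dataset_name(title: str, source_code: str) -> str:
    """Join the slugged title and source code with a triple underscore."""
    return f"{_slug(title)}___{_slug(source_code)}"
```

With the assumed inputs `"Proportion of the population using improved sanitation facilities"` and `"MDGs"`, this returns `proportion_of_the_population_using_improved_sanitation_facilities___mdgs`. Since the name is used for the URL, it would be computed once when the dataset is created and not changed when the title is later edited.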


seustachi · 2015-01-08T16:53:55Z

What we want now is to trigger the metadata update if a new indicator value is added or an existing one is changed, because we need to change the date range of the values.

seustachi · 2015-01-08T16:58:12Z

And we also want to update the date of the last "update" of the dataset. Check with @alexandru-m-g whether we store it in the dataset or the resource. This is a new metadata field, updated whenever an update to the data is done.
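The trigger described in the last two comments can be sketched in Python as follows. This is a simplified illustration under assumptions: `SerieState`, `record_value`, and the field names are hypothetical, and values are keyed by date only, which ignores locations and other dimensions a real dataserie has.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Dict, Optional, Tuple


@dataclass
class SerieState:
    """Hypothetical per-dataserie state: values by date plus change timestamps."""
    values: Dict[date, float] = field(default_factory=dict)
    last_data_update: Optional[datetime] = None
    last_metadata_update: Optional[datetime] = None

    def date_range(self) -> Optional[Tuple[date, date]]:
        """Earliest and latest value dates, or None when there is no data."""
        if not self.values:
            return None
        return min(self.values), max(self.values)


def record_value(state: SerieState, day: date, value: float, now: datetime) -> bool:
    """Store a value; touch the timestamps only if something actually changed."""
    if state.values.get(day) == value:
        return False                      # re-import of identical data: no change
    state.values[day] = value
    state.last_data_update = now          # drives the "last update" date
    state.last_metadata_update = now      # the date range may have moved
    return True
```

Importing the same value twice leaves both timestamps untouched, so the last update date only moves when at least one value was added or changed.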


seustachi · 2015-01-29T08:37:27Z

@cjhendrix Moved to sprint 48.

Even if we started to implement this epic in sprint 46, and some work was also done on sprint 47, some sub-tasks are still pending and planned for sprint 48 or later

cjhendrix added enhancement High Priority labels Oct 17, 2014

cjhendrix added this to the Sprint 37 milestone Oct 23, 2014

cjhendrix added ?Prioritization? and removed ?Prioritization? labels Oct 23, 2014

cjhendrix assigned seustachi Oct 24, 2014

seustachi added a commit that referenced this issue Nov 9, 2014

Starting to work on Epic #298

e84a195

seustachi added a commit that referenced this issue Nov 14, 2014

#298, TO BE CONTINUED

e8fe49e

seustachi added a commit that referenced this issue Nov 27, 2014

#298

252e96a

seustachi added a commit that referenced this issue Dec 27, 2014

for #298, data model and incremental script

6977fb9

seustachi added a commit that referenced this issue Dec 27, 2014

#298

80a1590

seustachi added a commit that referenced this issue Dec 27, 2014

#298

2a08888

seustachi added a commit that referenced this issue Dec 27, 2014

#298

01f367d

seustachi added a commit that referenced this issue Dec 31, 2014

fixing a small bug in validators + a few changes for #298

a0b8fba

seustachi added a commit that referenced this issue Dec 31, 2014

#298

8d83245

seustachi added a commit that referenced this issue Jan 2, 2015

#298

2332dba

Upon completion of metadata update, the ts in db is updated so the job is considered as done

seustachi added a commit that referenced this issue Jan 2, 2015

#298

b39a845

seustachi added a commit that referenced this issue Jan 8, 2015

getting the list of countries with data for a given Dataserie

8280e74

#298

danmihaila mentioned this issue Jan 9, 2015

Date missing from some datasets (where source = WHO?) OCHA-DAP/hdx-ckan#2033

Closed

seustachi added a commit that referenced this issue Jan 12, 2015

#298 when updating metadata, we also push the list of countries for

ffcb3bf

which there is some data for the dataserie

seustachi added a commit that referenced this issue Jan 12, 2015

#298 setting up the data, preparing for a massive push

c705675

seustachi added a commit that referenced this issue Jan 12, 2015

#298

c7af131

seustachi added a commit that referenced this issue Jan 12, 2015

fixing the insert statements, #298

de23416

seustachi added a commit that referenced this issue Jan 14, 2015

minor modifications for #298

df9aa90

alexandru-m-g mentioned this issue Jan 14, 2015

An import in CPS should trigger a push of data to CKAN for the changed dataseries #330

Open

alexandru-m-g added a commit that referenced this issue Jan 20, 2015

#298 changing the group code json

75f3bf6

to appear under "name" instead of "id"

seustachi modified the milestones: Sprint 47, Sprint 37 Jan 23, 2015

seustachi added the Metadata Epic label Jan 23, 2015

seustachi modified the milestones: Sprint 48, Sprint 47 Jan 29, 2015

danmihaila added the CPS label Jun 22, 2016

danmihaila unassigned seustachi Jun 22, 2016
