
How to publish Open Data from MELODIES #3

jonblower opened this issue May 29, 2015 · 8 comments

@jonblower
Member

This is to start a discussion on how we should publish data from the MELODIES project as Open Data. The ideal situation is to publish five-star linked open data where everything is described as RDF, with links to other datasets and vocabularies.

The current list of open data planned from the MELODIES project is on the EMDESK Wiki (perhaps we should move it to GitHub? -> see #5), although we should consider other datasets too in an attempt to identify generically useful methods.

We consider three levels of information:

  1. Individual observations or measurements.
  2. Collections of observations/measurements, i.e. datasets.
  3. Collections of datasets, i.e. catalogues.

Our goals are:

  1. We are obliged to publish datasets in the GEOSS DataCORE. There are various ways to do this, with instructions here. For example, we can submit metadata documents, or provide an OpenSearch endpoint.
  2. We would like to appear in the Linked Open Data Cloud, which means describing our datasets with the VoID vocabulary.
  3. We would also like to appear in Google searches, which could be achieved by describing data through schema.org, although I'm not sure how this works in practice.
  4. We would like to be able to visualise and interact with data at the level of observations (not just datasets), meaning the data themselves must be available on the web in some useful web-friendly way.
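For goal 3, a schema.org description might look something like the sketch below. This is purely illustrative: all URIs, names and values are placeholders, not real MELODIES resources, and the exact modelling (e.g. of spatial coverage) would need checking against schema.org's Dataset documentation.

```turtle
@prefix schema: <http://schema.org/> .

# Hypothetical example: every URI and literal here is a placeholder.
<http://example.org/melodies/land-cover> a schema:Dataset ;
    schema:name "MELODIES land cover dataset (example)" ;
    schema:description "Illustrative dataset description for search-engine discovery." ;
    schema:spatialCoverage [ a schema:Place ;
        schema:geo [ a schema:GeoShape ;
            # Bounding box: lower corner then upper corner (lat lon)
            schema:box "51.0 -1.0 52.0 0.0" ] ] ;
    schema:distribution [ a schema:DataDownload ;
        schema:contentUrl <http://example.org/melodies/land-cover.nc> ;
        schema:encodingFormat "application/x-netcdf" ] .
```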

How can we achieve the above? Questions include:

  1. Which vocabularies/ontologies to use?
  2. Where should we host the RDF descriptions of datasets? On our own servers, or can we publish them elsewhere? Don't forget that we want to demonstrate geospatial linked data and not all data-hosting sites are geo-enabled.
  3. How do we publish the data themselves, given that an RDF dump of a large raster dataset is probably not a good idea? And how do we link data files to the metadata descriptions (and vice versa)?
  4. How do we expose data to interactive web portals (which means interacting at the level of observations, not just datasets)?

Discussion is welcome!

@letmaik
Member

letmaik commented May 29, 2015

The first three goals are all at the dataset level. For that we should, at a minimum, use the established W3C vocabularies DCAT and VoID for describing datasets (I don't think schema.org will be that useful currently, but I may be wrong). DCAT and VoID have some overlap in general metadata (both import other vocabularies like DC and FOAF). The difference between them is that DCAT is for arbitrary datasets (the actual dataset can be any kind of files), while VoID is specifically meant for LOD datasets where the whole dataset consists of RDF triples. This is made obvious by the fact that in VoID you should provide a SPARQL endpoint for the dataset, and may provide triple statistics (counts, basically dataset size). VoID can also link to an OpenSearch.xml description, but this is only for free-text search within the dataset.
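To make the DCAT/VoID distinction concrete, here is a minimal sketch of the two styles of description side by side (all URIs and values are placeholders):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# DCAT: describes an arbitrary dataset (the files can be anything).
<http://example.org/ds/ndvi> a dcat:Dataset ;
    dct:title "Example NDVI time series" ;
    dcat:distribution [ a dcat:Distribution ;
        dcat:downloadURL <http://example.org/ds/ndvi.nc> ;
        dcat:mediaType "application/x-netcdf" ] .

# VoID: describes an RDF dataset, with its SPARQL endpoint, size
# statistics, and an OpenSearch description for free-text search.
<http://example.org/ds/ndvi-lod> a void:Dataset ;
    dct:title "Example NDVI observations as RDF" ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:triples 1234567 ;
    void:openSearchDescription <http://example.org/opensearch.xml> .
```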

Some facts:

  • DCAT allows referring directly to dataset files with given MIME types (the "download URL"), which means there is no layer for describing individual observations within DCAT (if we use this separation)
  • DCAT also allows pointing to an "access URL", which is "A landing page, feed, SPARQL endpoint or other type of resource that gives access to the distribution of the dataset"
  • VoID only has a Dataset level; DCAT has Catalog and Dataset (we can probably ignore Catalog)
  • DCAT allows describing the temporal and spatial extent of a dataset, which will be useful for geo search engines (and for providing an OpenSearch Geo/Time service)
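As a sketch of the last two points, a catalogue entry with spatial and temporal extent could look like this. Placeholder URIs throughout; the start/end modelling inside dct:PeriodOfTime follows the DCAT-AP convention of reusing schema.org properties, which is one option among several:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# A DCAT catalog pointing at a dataset with spatial/temporal extent.
<http://example.org/catalog> a dcat:Catalog ;
    dcat:dataset <http://example.org/ds/ndvi> .

<http://example.org/ds/ndvi> a dcat:Dataset ;
    # Spatial extent, e.g. via a GeoNames URI (here: United Kingdom)
    dct:spatial <http://sws.geonames.org/2635167/> ;
    dct:temporal [ a dct:PeriodOfTime ;
        <http://schema.org/startDate> "2014-01-01"^^xsd:date ;
        <http://schema.org/endDate>   "2014-12-31"^^xsd:date ] .
```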

About metadata hosting: in the end the datasets will be hosted on some server anyway, providing things like a GeoSPARQL endpoint and some way of accessing raster data intelligently, for example with OPeNDAP or WCS (possibly linked to via RDF somehow). On that server, the dataset probably has its own URL where the metadata can be stored alongside it as well. VoID describes three ways of doing just that; I think that's a minimum.

The next step would be to point catalogues to this metadata so they can harvest it. However, I don't know of any catalogue that has what we want. As far as I can see, only the closed ones, from NASA for example, have rich query capabilities like bounding-box and time-range searching. We may have some luck adding a bit more temporal and geospatial sauce to CKAN (the software used for catalogue portals like datahub.io) in case the available plugins are not enough. That could be one of the MELODIES contributions on the software side.
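One of the deployment options the W3C VoID note describes is publishing the VoID description at a conventional discovery location on the server hosting the data. A sketch, again with placeholder URIs:

```turtle
# Hypothetical contents of http://example.org/.well-known/void,
# the conventional discovery location from the W3C VoID note.
@prefix void: <http://rdfs.org/ns/void#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/ds/ndvi-lod> a void:Dataset ;
    foaf:homepage <http://example.org/ndvi> ;
    void:dataDump <http://example.org/ndvi.ttl> ;
    void:sparqlEndpoint <http://example.org/sparql> .
```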

Things I haven't discussed here are how to model the datasets themselves, how observations are linked to the metadata, and how to integrate raster data. I think this has to be cleared up first, before thinking about how to expose it in graphical portals.

@p3dr0
Member

p3dr0 commented May 29, 2015

+1 for DCAT ... just please don't ignore the "Catalog" element.

However, DCAT's capabilities for geo are quite feeble ...
I've been following the discussion on the GeoDCAT application profile, which might be a solution, but I'm still not really convinced about it (probably too many INSPIRE antibodies in my bloodstream).
Nevertheless this is probably something to check:
http://joinup.ec.europa.eu/mailman/listinfo/dcat_application_profile-geo

@jonblower
Member Author

Thanks Pedro - what do you think MELODIES should do for a catalogue? Should we expose our own "demonstrator" catalogue (e.g. with OpenSearch Geo/time interfaces)? Or is there another catalogue we could plug into (e.g. on Terradue's platform) that we could use to demonstrate what we have been doing?

@p3dr0
Member

p3dr0 commented Jun 1, 2015

Currently each partner has data repositories and catalogue services as part of the cloud platform baseline services, which have been exploited in developing and integrating their MELODIES services. What we are missing is a public top-level catalogue that could aggregate/expose particular collections as Open Data.

This study in WP3 will be very useful to frame the metadata model. Among other things, it will help us check the feasibility of using DCAT to improve our catalogue solution.

@jonblower
Member Author

Currently we’re thinking of publishing the MELODIES catalogue as an RDF document using DCAT (and maybe VoID). We think that CKAN instances can harvest this. Would this work for Terradue? How might we include OpenSearch capabilities?
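As a rough sketch of what such a harvestable catalogue document could look like (placeholder URIs throughout; the exact expectations of CKAN's DCAT harvester would need to be checked):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# A single RDF document enumerating the project's public datasets.
<http://example.org/melodies/catalog> a dcat:Catalog ;
    dct:title "MELODIES open data catalogue (example)" ;
    dct:publisher <http://example.org/melodies> ;
    dcat:dataset <http://example.org/ds/ndvi> ,
                 <http://example.org/ds/land-cover> .
```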

@letmaik
Member

letmaik commented Jun 23, 2015

I think we should split this issue up, as it covers too much ground. In general we have these topics:

  • high-level interoperable dataset description (using DCAT/VoID etc.) -> for ingestion into catalogues, discovery etc.
  • exposing the data of the datasets themselves (O&M, custom ontologies, etc.) -> for data users, web portals, etc
  • linking both worlds
  • where to host the data and metadata

I think working on these separately, with some cross-referencing between the discussions, is better than having one massive thread covering everything.

@letmaik
Member

letmaik commented Jun 25, 2015

I have opened separate smaller discussions (#6, #7, #8) now. If I missed anything, please go ahead and create another issue and link to this one (#3). Please don't add further comments to this parent thread unless absolutely necessary.

@jonblower
Member Author

(I created a new issue #9 to discuss where MELODIES data should be published.)
