Support workflow resource type in Resource Catalogue (Q2) #56

Closed
kalxas opened this issue Jun 28, 2024 · 28 comments
Assignees
Labels
BR040 Provide a metadata catalog that supports all resource types required for open reproducible science EOfarm pycsw

Comments

@kalxas
Member

kalxas commented Jun 28, 2024

No description provided.

@kalxas kalxas added the BR040 Provide a metadata catalog that supports all resource types required for open reproducible science label Jun 28, 2024
@kalxas kalxas added this to the Q2 milestone Jun 28, 2024
@kalxas kalxas self-assigned this Jun 28, 2024
@kalxas kalxas added the pycsw label Jul 10, 2024
@jonas-eberle
Collaborator

jonas-eberle commented Sep 3, 2024

@kalxas What is meant by workflow? Do we want to specify this further (e.g., CWL, OpenEO process graph)?

@j08lue
Collaborator

j08lue commented Sep 3, 2024

Related: I have trawled the pycsw and pygeoapi repos for sample resources of various types - no workflows there, though, afaics:

@j08lue
Collaborator

j08lue commented Sep 3, 2024

@GarinSmith to link to related EarthCODE story, pls.

@GarinSmith

Hi @j08lue,
The reference you need is here https://github.com/orgs/ESA-EarthCODE/projects/5/views/1?pane=issue&itemId=72091040
I may need to give you access to this, but there is a summary below (which will save you time reading the link above).

In summary:

After an initial review with Angelos, we agreed that we should use OGC API Records to:

  • Provide a link to an externally hosted workflow definition (probably on GitHub)
  • This could support numerous workflow types as referenced in the suggested architecture
    • CWL
    • openEO
    • Jupyter Notebook
    • Other types
  • Add additionally agreed metadata in a consistent format

This is important because it means (as Richard suggested)

  • We can start to ingest Workflows in a very flexible manner
  • We can start to "Find" Workflows in a very flexible manner
  • We can start to "Access" Workflows in a very flexible manner
  • This can be done regardless of the format of the workflow reference (e.g. CWL, openEO, etc.), although we may include the workflow type
  • This approach could possibly also be used for Reproducible job details (Workflow Metadata) and/or a Replicable workflow (Experiment Metadata). E.g. CWL could perhaps implement either approach using default parameters, and openEO may default to a Replicable workflow implementation.

We would like a formal way of validating a schema. Can you please suggest something?

E.g. we would like EOEPCA+ guidance on how to validate schema compliance? This seems quite complicated.
E.g. see
https://json-schema.org/implementations#validators-web-(online)
or
https://json-schema.org/implementations#command-line

The online validators do not seem to cope with $ref instances, and there seem to be many $refs in the OGC API Records schemas.

I have looked at command-line solutions like Polyglottal JSON Schema Validator and these seem to struggle too.
E.g.
pajv validate -s recordGeoJSON.yaml -d record.json -r recordCommonProperties.yaml -r time.yaml -r linkBase.yaml -r linkTemplate.yaml
(this does not seem to work yet)

Could you provide a working example/solution to validate a valid OGC API Record? We could then use this approach in EarthCODE using the above strategy.

For above I used (schemas)
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordJSON.yaml
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml
and (records)
https://github.com/opengeospatial/ogcapi-records/blob/master/core/examples/json/record.json
I am not sure about the compatibility of the above, but I just wanted to get a schema validation test.

@j08lue
Collaborator

j08lue commented Sep 3, 2024

Sure thing. @kalxas, let us discuss how much Records validation should happen at the API vs UI level.

@GarinSmith

GarinSmith commented Sep 3, 2024

Thanks. We would like to know first a reliable way to perform this validation, so that:
i) We can validate OGC API Records on the EarthCODE Catalog, before we publish them from a Platform.
ii) We can validate OGC API Records on a Platform, before we try to publish them to the EarthCODE Catalog.
This will help avoid operational issues by performing validation in suitable places along the operational pipeline.

@kalxas
Member Author

kalxas commented Sep 19, 2024

This would mean validating against, directly: https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml .

The problem here is that the schema is in YAML, and tools like Python check-jsonschema do not do well with JSON Schema via YAML, especially when there are $ref’s involved.

OGC typically pushes out the YAML schemas onto http://schemas.opengis.net/. We need JSON schemas.
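
One way to sidestep the YAML and `$ref` problem is to bundle the schema into a single JSON document before validating. The following is only a minimal sketch of that idea. It assumes every `$ref` points at a whole sibling file (no JSON Pointer fragments, no circular references), and it uses a hypothetical in-memory `FILES` mapping as a stand-in for the parsed OGC schema files, whose real contents are much larger.

```python
import json

# Hypothetical stand-ins for parsed schema files; the real OGC API Records
# schemas are YAML documents with many more properties.
FILES = {
    "recordGeoJSON.yaml": {
        "type": "object",
        "required": ["id", "type", "geometry", "properties"],
        "properties": {
            "time": {"$ref": "time.yaml"},
            "links": {"type": "array", "items": {"$ref": "linkBase.yaml"}},
        },
    },
    "time.yaml": {"type": ["object", "null"]},
    "linkBase.yaml": {"type": "object", "required": ["href"]},
}


def inline_refs(node, files):
    """Return a copy of `node` with whole-file $refs replaced by their targets.

    Assumes the references are not circular; a cycle would recurse forever.
    """
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref in files:
            # Recurse so that refs inside the referenced file are inlined too.
            return inline_refs(files[ref], files)
        return {key: inline_refs(value, files) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(item, files) for item in node]
    return node


bundled = inline_refs(FILES["recordGeoJSON.yaml"], FILES)
print(json.dumps(bundled, indent=2))
```

The bundled dictionary could then be written out as a single JSON file and passed to a validator via its schema-file option.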

@kalxas
Member Author

kalxas commented Sep 19, 2024

@jonas-eberle @j08lue any type of workflow could be represented with a metadata record. The goal of this task is to define a record schema with extra properties to describe metadata about a workflow

@kalxas
Member Author

kalxas commented Sep 19, 2024

@GarinSmith I got feedback from @tomkralidis
WMO have defined their own schemas based on OGC API Records:
See https://github.com/wmo-im/wcmp2/tree/main/schemas
https://schemas.wmo.int/wcmp/2.0.0/schemas/

@GarinSmith

@kalxas Thanks. That helps. I don't care about the format (JSON or YAML) as long as it validates against a specific OGC API Records implementation that we can use for a workflow or experiment. I had the same problem above using YAML and $ref.

It is very helpful to see the spec referenced here too https://schemas.wmo.int/wcmp/2.0.0/standard/wcmp-2.0.0.pdf

I got this to work using
check-jsonschema --schemafile wcmp2-bundled.json de-dwd.surface-weather-observations-realtime.json
Note the example provided in the above document uses
check-jsonschema --schemafile schemas/wcmp2-bundled.json examples/msc-swob-realtime.json
although I cannot find msc-swob-realtime.json in examples, but never mind.

I tried it with https://github.com/opengeospatial/ogcapi-records/blob/master/core/examples/json/record.json and got
check-jsonschema --schemafile wcmp2-bundled.json record.json
record.json::$.properties.contacts[0]: 'organization' is a required property
however, this was easy to fix and seems reasonable.

I also tried it with an example openEO implementation
https://github.com/ESA-APEx/apex_algorithms/blob/main/algorithm_catalog/worldcereal_inference.json and got
check-jsonschema --schemafile wcmp2-bundled.json worldcereal_inference.json
Schema validation errors were encountered.
worldcereal_inference.json::$: 'time' is a required property
worldcereal_inference.json::$: 'geometry' is a required property
worldcereal_inference.json::$.properties.contacts[1]: 'organization' is a required property

I am wondering why a workflow would require
i) A time
ii) A geometry

@tomkralidis

Note that OGC API - Records allows for time and geometry to be encoded as null. This could be used as part of describing any resource without spatial or temporal properties, while keeping broad interoperability given use of OGC API - Records and GeoJSON.
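
As a rough illustration of the point above, a record for a resource with no spatial or temporal extent might look like the following. This is a sketch only: the `id` and `title` values are made up, the `REQUIRED` tuple is an assumption based on the validator messages quoted earlier in this thread, and the `missing_members` helper is a simplified, hypothetical stand-in for real schema validation.

```python
import json

# Hypothetical record for a workflow with no spatial or temporal extent.
record = {
    "id": "example-workflow",         # made-up identifier
    "type": "Feature",                # GeoJSON Feature
    "time": None,                     # serialised as JSON null
    "geometry": None,                 # serialised as JSON null
    "properties": {
        "title": "Example workflow",  # made-up title
        "type": "workflow",
    },
    "links": [],
}

# Assumed set of required top-level members (not copied from the schema).
REQUIRED = ("id", "type", "time", "geometry", "properties")


def missing_members(rec):
    """List required top-level members that are absent (null still counts as present)."""
    return [name for name in REQUIRED if name not in rec]


print(json.dumps(record, indent=2))
print("missing:", missing_members(record))
```

The key point is that a member set to `null` is present, so a "required property" check passes, whereas omitting the member altogether fails.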

@GarinSmith

Thanks.
I added "time": null, "geometry": null and it worked as you say. I think this is a good starting point for EarthCODE.

@kalxas , hopefully EOEPCA+ Catalog (or pycsw) will ingest in this format? I think I tried this before with STAC and I could not ingest. Hopefully this will not be an issue for OGC API Records with the latest version of pycsw.
I believe "pycsw supports OGC API - Records - Part 1: Core, version 1.0 by default."

@kalxas
Member Author

kalxas commented Sep 24, 2024

@GarinSmith pycsw can ingest both OGC API Record and STAC, it has been demonstrated in various EOEPCA demos.

We need to define/extend the record to describe the workflows

@GarinSmith

@kalxas , great thanks. That is very good to know.

Can we start by "defining" and using the current spec, so we can flexibly reference the various workflow types that can be described externally? This seems like the separation of concerns we need. We also need to start off by using what we already have if possible.
E.g. using something like?

OpenEO

"links": [
  {
    "rel": "openeo-process",
    "type": "application/json",
    "title": "openEO Process Title",
    "href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/max_ndvi_composite/openeo_udp/examples/max_ndvi_composite/max_ndvi_composite.json"
  }
]

OGC API Processes

"links": [
  {
    "rel": "ogcapi-process",
    "type": "application/json",
    "title": "OGC API Process Title",
    "href": "https://owncloud.spaceapplications.com/owncloud/index.php/s/iCk60Kmry77o2l6/download"
  }
]

Python Processes

"links": [
  {
    "rel": "python-process",
    "type": "application/json",
    "title": "Python Process Title",
    "href": "https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh"
  }
]

Jupyter Notebook Processes

"links": [
  {
    "rel": "jupyter-notebook",
    "type": "application/json",
    "title": "Python Process Title",
    "href": "https://github.com/...../..../file.ipynb"
  }
]
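
If link relations like these were adopted, a platform could pick out the workflow definitions it knows how to run by filtering on `rel`. A minimal sketch, assuming the four relation names proposed above (suggestions from this thread, not an established vocabulary), a hypothetical `workflow_links` helper, and placeholder `example.org` URLs:

```python
# Relation names proposed in this thread; not an established vocabulary.
WORKFLOW_RELS = {"openeo-process", "ogcapi-process", "python-process", "jupyter-notebook"}


def workflow_links(record, supported=WORKFLOW_RELS):
    """Return the links of `record` whose rel is a supported workflow type."""
    return [link for link in record.get("links", []) if link.get("rel") in supported]


# Hypothetical record with one workflow link and one unrelated link.
record = {
    "links": [
        {
            "rel": "openeo-process",
            "type": "application/json",
            "title": "openEO Process Title",
            "href": "https://example.org/max_ndvi_composite.json",  # placeholder URL
        },
        {"rel": "license", "href": "https://example.org/license"},  # placeholder URL
    ]
}

# A platform that only runs openEO would pass its own supported set.
for link in workflow_links(record, supported={"openeo-process"}):
    print(link["rel"], link["href"])
```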

@kalxas
Member Author

kalxas commented Sep 24, 2024

IANA defines the link relations:
https://www.iana.org/assignments/link-relations/link-relations.xhtml

Also see https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes/rel which mentions:

The current registries for the possible values of the rel attribute are the IANA link relation registry, the HTML Living Standard, and the freely-editable existing-rel-values page in the microformats wiki, as suggested by the Living Standard. If a rel attribute not present in one of the three sources above is used some HTML validators (such as the W3C Markup Validation Service) will generate a warning.

@kalxas
Member Author

kalxas commented Sep 24, 2024

My plan is to draft some initial proposal for the next demo.

@GarinSmith

Thanks @kalxas,

I saw a reference to IANA before, but could not find the links above. My current thoughts are that OGC API Records seems to provide most of what we currently seem to need.

However

  1. It would be useful to know what the workflow type is.
    E.g. openeo, ogc api processes, JNB, free format Python and so on.
  2. It would be useful to know if a process is a Workflow (generic), Experiment (specific) or Dashboard (GUI to Workflow).
    I am unsure about the best way to achieve this and would welcome your guidance.

I note the IANA links above do not seem to be interested in things like Workflows or Processes or Process Types. However, this is important to us, because we need to know the type of link we are looking at, so that we know better what platform can handle that type of link.

I note that the examples above do successfully validate when I use the check-jsonschema tool. They also correspond with the approach some platforms already use, so they are a useful starting point to move forwards from.
At this stage I will suggest that by default all EarthCODE platforms use the check-jsonschema tool for schema validation of OGC API Records when appropriate. Again this seems like a very good starting point.
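
A platform could wrap that command in a small gate that runs before a record is published. The sketch below is illustrative only: `build_command` and `validate` are hypothetical helper names, the file names are the ones used earlier in this thread, and it assumes the tool signals a failed validation with a non-zero exit code.

```python
import shutil
import subprocess


def build_command(schema_path, record_path):
    """Build the check-jsonschema invocation used throughout this thread."""
    return ["check-jsonschema", "--schemafile", schema_path, record_path]


def validate(schema_path, record_path):
    """Return True on a zero exit code, False otherwise, None if the tool is missing."""
    if shutil.which("check-jsonschema") is None:
        return None  # tool not installed; the caller decides how to handle this
    result = subprocess.run(
        build_command(schema_path, record_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


print(" ".join(build_command("wcmp2-bundled.json", "record.json")))
```

The same function could be called on the platform side before publishing and again on the catalogue side before ingesting, so that both ends apply the same check.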

@kalxas
Member Author

kalxas commented Sep 28, 2024

I have created a new repository that will host the metadata schema for EOEPCA profile(s):
https://github.com/EOEPCA/metadata-profile/

The resource schema was initialized with the OGC API Records schema:
https://github.com/EOEPCA/metadata-profile/blob/master/schemas/resource.yaml#L7

An enumeration is provided for the resource type which can be further expanded to support various types (as required above).
In my opinion the workflow type has to be defined at the resource/record level, not at the link level:
https://github.com/EOEPCA/metadata-profile/blob/master/schemas/resource.yaml#L9-L15
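
To make the record-level idea concrete, here is a small sketch of checking a record's `properties.type` against the enumeration. It assumes the enumeration is `dataset`, `service`, `process`, `workflow` (the values reported by the validator output later in this thread); `resource_type` is a hypothetical helper, not part of the schema repository.

```python
# Resource types assumed from the validator output quoted later in this thread.
RESOURCE_TYPES = ("dataset", "service", "process", "workflow")


def resource_type(record):
    """Return the record-level resource type, or raise if it is not in the enumeration."""
    value = record.get("properties", {}).get("type")
    if value not in RESOURCE_TYPES:
        raise ValueError(f"{value!r} is not one of {list(RESOURCE_TYPES)}")
    return value


print(resource_type({"properties": {"type": "workflow"}}))

try:
    resource_type({"properties": {"type": "apex_algorithm"}})
except ValueError as err:
    print("rejected:", err)
```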

From that initial resource definition, I have created a JSON Schema bundle as described in WMO by @tomkralidis
https://github.com/EOEPCA/metadata-profile/tree/master/schemas

Validation process described here:
https://github.com/EOEPCA/metadata-profile/tree/master/schemas#validating-an-emp-record

@kalxas kalxas closed this as completed Sep 28, 2024
@GarinSmith

Thanks @kalxas,

It might help to test this against a typical scenario that EarthCODE might want to use.
I can validate the current openeo-process example attached (worldcereal_inference2.json)
E.g.
check-jsonschema --schemafile wcmp2-bundled.json worldcereal_inference2.json
ok -- validation done

Could you please update it to include a typical "EOEPCA resource type" for instance a "workflow"
This could be useful when applying FAIR principles (Find, Access etc)

However, how will we know what type of workflow we are dealing with (openeo-process in this case)?
E.g.
openeo-process
ogcapi-process
JNB
etc

Does the type of "workflow" have to go at the link level if there is more than one link?
I think it helps to refer to a real world EarthCODE example that we might one day use.

Many thanks

Garin

@GarinSmith

Hi @kalxas and @rconway,
I totally agree with the point Angelos just made in the update. We just need a starting point that we can use to ingest and then evolve further (in fact I already had this using the previous schema provided by Tom). I need to get this in time for EarthCODE when we start work very shortly. Having a first version from Angelos that also validates with an EarthCODE potential example would be great.

Angelos can you please tweak worldcereal_inference2.json above, so that it validates against your latest schema?
That will be a great starting point. I can only get this to work partially and I had to guess some values to fix one validation issue.

It would help if there was clear meaning to the following EOEPCA resource types that map to the EarthCODE utilisation domain.
I think they are OK, but here are my assumptions.
- dataset (maybe this could be a product or input to a workflow)
- service (this could map to an application or thing that uses a process or a workflow including a GUI)
- process (is this a specific thing like an experiment that has a specific config for a workflow)
- workflow (assume this is a generic thing of, say, type OGC API Records or OGC API Processes or JNB etc.)

Note that EarthCODE has the concept of Workflow, Experiment, Application and Product. We need to map to these somehow, hence my comment above.

@kalxas
Member Author

kalxas commented Oct 2, 2024

Thank you @GarinSmith
I will look at the provided record and try to make it validate.
The schema provided is just a first draft, we will need to expand, so I would not provide it yet to EarthCODE for production purposes.

@kalxas
Member Author

kalxas commented Oct 2, 2024

@kalxas
Member Author

kalxas commented Oct 2, 2024

There is a bug in the schema provided; I will work to fix it.

@kalxas
Member Author

kalxas commented Oct 2, 2024

Schema updated:
EOEPCA/metadata-profile@24d2755

@kalxas
Member Author

kalxas commented Oct 2, 2024

check-jsonschema --schemafile resource.json worldcereal_inference2.json
Schema validation errors were encountered.
  worldcereal_inference2.json::$: 'geometry' is a required property
  worldcereal_inference2.json::$.properties.formats[0]: 'GeoTiff' is not of type 'object'
  worldcereal_inference2.json::$.properties.type: 'apex_algorithm' is not one of ['dataset', 'service', 'process', 'workflow']

@kalxas
Member Author

kalxas commented Oct 2, 2024

@GarinSmith this is the example that validates:

worldcereal_inference2.json

@kalxas kalxas added the EOfarm label Oct 4, 2024
@GarinSmith

Hi @kalxas ,
Thanks, that works for me too.
This is a starting point we can use to guide EarthCODE.
We will use this by default and iterate from here using your guidance along the way.
We hope to have a first production release fairly soon and hopefully we will learn by doing.

@kalxas kalxas changed the title from "Support workflow resource type in Resource Catalogue" to "Support workflow resource type in Resource Catalogue (Q2)" Dec 18, 2024

5 participants