Review latest workflow schemas for EOEPCA+ #233

Open
GarinSmith opened this issue Aug 7, 2024 · 39 comments
@GarinSmith

EOEPCA+ will look to define a schema for

  1. Reproducible job details (Workflow Metadata)
  2. Replicable workflow (Experiment Metadata)

See - System level
https://github.com/orgs/EOEPCA/projects/4/views/13?sliceBy%5Bvalue%5D=Resource+Discovery&pane=issue&itemId=60227850

See - BB level
https://github.com/orgs/EOEPCA/projects/7/views/1?filterQuery=workflow&pane=issue&itemId=69113579

Garin to discuss with @rconway and Angelos whose GitHub id I cannot find yet.
Garin also to confirm that EOEPCA+ Catalogue can ingest and discover OGC API Records. We believe that we will need to write a new Front End in the portal to Find and Access OGC API Records in the same way we do for STAC.

pycsw supports OGC API - Records - Part 1: Core, version 1.0 by default.
See https://docs.pycsw.org/en/latest/oarec-support.html
Angelos has confirmed that OSC currently has support to ingest and discover OGC API Records.

Angelos noted that "Open Science Catalog is 2 versions behind, several fixes and new features have been implemented in the last 6 months or so"

Angelos is off from 9 Aug to 9 Sep, but I will meet him tomorrow for his advice on the schemas that EOEPCA+ will look to define for

  1. Reproducible job details (Workflow Metadata)
  2. Replicable workflow (Experiment Metadata)

Please also refer to
https://github.com/orgs/ESA-EarthCODE/projects/5/views/8?pane=issue&itemId=72092886

@GarinSmith GarinSmith self-assigned this Aug 7, 2024
@GarinSmith
Author

GarinSmith commented Aug 9, 2024

After an initial review with Angelos, we agreed that we should use OGC API Records to

  • Provide a link to an externally hosted workflow definition (probably on GitHub)
  • This could support numerous workflow types as referenced in the suggested architecture
    • CWL
    • openEO
    • Jupyter Notebook
    • Others
  • Add additionally agreed metadata in a consistent format

This is important because it means (as Richard suggested)

  • We can start to ingest Workflows in a very flexible manner
  • We can start to "Find" Workflows in a very flexible manner
  • We can start to "Access" Workflows in a very flexible manner
  • This can be done regardless of the format of the workflow reference (e.g. CWL, openEO etc.), although we may include the workflow type
  • This approach could possibly also be used for Reproducible job details (Workflow Metadata) and/or a Replicable workflow (Experiment Metadata). E.g. CWL could perhaps implement either approach using default parameters, and openEO may default to a Replicable workflow implementation.
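
As a rough illustration of the idea in the list above, the following Python sketch builds a record-like structure whose link points at an externally hosted workflow definition. It is only a sketch under stated assumptions: the function name `make_workflow_record` and the `"type": "workflow"` value are hypothetical and not an agreed schema, while the `rel`, title, description and URL are taken from the openEO example referenced in this comment.

```python
import json


def make_workflow_record(record_id, title, description, href, rel):
    """Return a record-like dict that points at an externally hosted workflow."""
    return {
        "id": record_id,
        "type": "Feature",
        "geometry": None,  # a workflow definition has no spatial footprint
        "properties": {
            "type": "workflow",  # hypothetical value, not an agreed vocabulary
            "title": title,
            "description": description,
        },
        "links": [
            # The link relation carries the workflow flavour (CWL, openEO, ...)
            {"rel": rel, "type": "application/json", "title": title, "href": href}
        ],
    }


record = make_workflow_record(
    record_id="worldcereal_inference",
    title="ESA worldcereal global maize detector",
    description="A maize detection algorithm.",
    href="https://github.com/ESA-APEx/apex_algorithms/blob/main/algorithm_catalog/worldcereal_inference.json",
    rel="openeo-process",
)
print(json.dumps(record, indent=2))
```

The same function could describe a CWL file or a notebook simply by changing `rel` and `href`, which is the flexibility the list above is after.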

E.g.
openEO

https://github.com/ESA-APEx/apex_algorithms/blob/main/algorithm_catalog/worldcereal_inference.json
Link
"rel": "openeo-process"
"rel": "git"
"rel": "service"
"rel": "license"
Additional Metadata
"properties": {
"created": "2024-05-17T00:00:00Z",
"updated": "2024-05-17T00:00:00Z",
"type": "apex_algorithm",
"title": "ESA worldcereal global maize detector",
"description": "A maize detection algorithm.",
"cost_estimate": 0.1,
"cost_unit": "platform credits per km\u00b2",
etc

Open Science Catalog

catalog.osc.earthcode.eox.at/collections/metadata:main/items/HCA_L2E_CS_LTA__SIR1SAR_FR_20150331T150158_20150331T150200_D001?f=json
"links"
Additional Metadata
"properties": {
"title": "HCA_L2E_CS_LTA__SIR1SAR_FR_20150331T150158_20150331T150200_D001",
"description": "HYDROCOASTAL Final Product: ........
"datetime": "2023-02-10T08:45:21.061533Z",
"start_datetime": "2015-03-31T15:02:32.858513+00:00",
"end_datetime": "2015-03-31T15:02:34.864426+00:00",
"created": "2023-02-10T08:45:21.061533+00:00"
}

Note that the above example provided by Angelos seems to refer to a WMS service and not a CWL file. This can be clarified when Angelos returns.

Next Steps

  1. Review different examples
  2. Angelos to make clearer suggestion on return for standard schema
  3. Get feedback from platforms when we know what they are.

@GarinSmith
Author

It is important that this approach also supports the types of workflows identified by @edobrowolska in
ESA-EarthCODE/portal#17

Ewelina identified a number of scripts that can be considered workflows. E.g.
https://github.com/diarmuidcorr/Lake-Channel-Identifier/blob/v1.0/Landsat-8%20SGL%20and%20Channel%20Classifier (Python script)
https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh (shell script)

These scripts might be regarded as unstructured workflows, perhaps like a Jupyter Notebook. At some point they might be converted to a more formal workflow, for instance a CWL file (OGC API Processes). However, there is no reason why these unstructured scripts cannot be used and supported by EarthCODE using the above approach.

E.g.

{
  "rel": "git",
  "type": "application/json",
  "title": "Git source repository",
  "href": "https://github.com/diarmuidcorr/Lake-Channel-Identifier/blob/v1.0/Landsat-8%20SGL%20and%20Channel%20Classifier"
},

or

{
  "rel": "git",
  "type": "application/json",
  "title": "Git source repository",
  "href": "https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh"
},

The above syntax may not be 100% correct, but hopefully, it demonstrates what is possible with OGC API Records.

It may be that
i) Angelos can recommend an elaborated standard approach.
ii) One/all of the chosen EarthCODE contractors can suggest a suitable schema that already works with the existing platforms.
iii) A combination of the above works.

@GarinSmith
Author

GarinSmith commented Aug 28, 2024

We need to confirm the schema that will be used for validation of OGC API Records
See https://github.com/opengeospatial/ogcapi-records/tree/master/core/openapi/schemas
E.g.
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordJSON.yaml
or
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml

Hopefully we can test some of the above examples using the correct schema.

@GarinSmith
Author

GarinSmith commented Sep 3, 2024

I just reviewed this approach with EOEPCA+ and will link them to this user story to help clarify our requirements.
See EOEPCA/resource-discovery#56

I have asked for EOEPCA+ guidance on how to validate schema compliance. This seems quite complicated.
E.g. see
https://json-schema.org/implementations#validators-web-(online)
or
https://json-schema.org/implementations#command-line

The online validators do not seem to cope with $ref instances, and there seem to be lots of them in the OGC API Records schemas.

I have looked at command-line solutions like Polyglottal JSON Schema Validator and these seem to struggle too.
E.g.
pajv validate -s recordGeoJSON.yaml -d record.json -r recordCommonProperties.yaml -r time.yaml -r linkBase.yaml -r linkTemplate.yaml
(this does not seem to work yet)
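
One way to see which referenced files a validator needs to be given is to walk the parsed schema and collect every `$ref` target. The Python sketch below assumes the schema has already been parsed into a dict; the fragment shown is hypothetical and only reuses the file names from the pajv command above, it is not the real content of recordGeoJSON.yaml.

```python
def collect_refs(node, found=None):
    """Recursively collect the string values of every "$ref" in a parsed schema."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                found.add(value)
            else:
                collect_refs(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_refs(item, found)
    return found


# Hypothetical schema fragment, for illustration only
schema = {
    "allOf": [
        {"$ref": "recordCommonProperties.yaml"},
        {
            "properties": {
                "time": {"$ref": "time.yaml"},
                "links": {"type": "array", "items": {"$ref": "linkBase.yaml"}},
                "id": {"$ref": "#/definitions/identifier"},
            }
        },
    ]
}

# Keep only references to other files (drop in-document "#/..." pointers)
external = sorted(
    {ref.split("#")[0] for ref in collect_refs(schema) if not ref.startswith("#")}
)
print(external)
```

Each referenced file would need the same treatment in turn, since references can be nested several levels deep.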

For above I used (schemas)
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordJSON.yaml
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml
and (records)
https://github.com/opengeospatial/ogcapi-records/blob/master/core/examples/json/record.json

@GarinSmith
Author

I also got some useful strategic possibilities from EOEPCA

E.g.
from Jonas Sølvsteen
(potential platform integration with UK EO Data Hub)
https://github.com/os-climate/hazard/blob/main/hazard_workflow.cwl
(this is meant to run among others on the UK EO Data Hub https://eodatahub.org.uk/)
https://d1fzab3z0mlfhy.cloudfront.net/
https://radiantearth.github.io/stac-browser/#/external/pgstac.demo.cloudferro.com/

from Gérald FENOY
https://ospd-02.geolabs.fr/examples/cwls/algae-usecase-workflow-copernicus.cwl
https://ospd-02.geolabs.fr/examples/app-package.cwl

@GarinSmith
Author

Angelos is now back and will ask Peter Vretanos from the OGC API Records SWG about the tool he used to validate all the examples in the specification.

@kalxas
Contributor

kalxas commented Sep 19, 2024

@GarinSmith
Author

The link above provided by @kalxas is very helpful and provides a very good starting point to move forwards for EarthCODE.

@GarinSmith
Author

I have now successfully validated the OpenEO example above using a command line tool.
The key parts of this investigation are stored in the updated Architecture document.

In summary

  1. We have a tool to validate OGC API Records (https://github.com/wmo-im/wcmp2/tree/main/schemas)
  2. I can successfully validate the OpenEO example above although I had to fix 3 minor problems
    time must be set as null - "time": null
    geometry must be set as null - "geometry": null
    "organization" is mandatory and was missing in one instance.
  3. We could use the OpenEO approach for other workflows. See Other Examples below
  4. All these examples also validate using the tool above.
  5. This is a good starting point, but we need to agree with Angelos what is the best way to
    i) Know what a workflow type is somehow.
    E.g. openeo, ogc api processes, JNB, free format Python and so on.
    ii) Know if a process is a Workflow (generic), Experiment (specific) or Dashboard (GUI to Workflow).
    Apparently the examples below may not be best practice, although they do validate and they also conform to the current OpenEO approach, so they are perhaps a good reference/starting point.
    I have asked Angelos for guidance.
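
The three fixes listed under point 2 could be applied and checked along the following lines. This is a minimal Python sketch, not the procedure actually used: the function name `prepare_record` is hypothetical, and missing organizations are only reported rather than filled in, since the correct value has to come from the record's author.

```python
import copy


def prepare_record(record):
    """Apply the null defaults and report contacts that lack an organization.

    Returns (patched_record, indexes_of_contacts_missing_organization).
    """
    patched = copy.deepcopy(record)  # leave the caller's record untouched
    patched.setdefault("time", None)      # "time": null
    patched.setdefault("geometry", None)  # "geometry": null
    contacts = patched.get("properties", {}).get("contacts", [])
    missing = [i for i, c in enumerate(contacts) if "organization" not in c]
    return patched, missing


# Hypothetical record with two contacts, the second lacking an organization
example = {
    "id": "example",
    "properties": {
        "contacts": [
            {"name": "First Contact", "organization": "Some Organization"},
            {"name": "Second Contact"},
        ]
    },
}

patched, missing = prepare_record(example)
print(missing)
```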

Other Examples

OpenEO

"links": [
{
"rel": "openeo-process",
"type": "application/json",
"title": "openEO Process Title",
"href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/max_ndvi_composite/openeo_udp/examples/max_ndvi_composite/max_ndvi_composite.json"
}
]

OGC API Processes

"links": [
{
"rel": "ogcapi-process",
"type": "application/json",
"title": "OGC API Process Title",
"href": "https://owncloud.spaceapplications.com/owncloud/index.php/s/iCk60Kmry77o2l6/download"
}
]

Python Processes

"links": [
{
"rel": "python-process",
"type": "application/json",
"title": "Python Process Title",
"href": "https://github.com/GEUS-SICE/SICE/blob/master/S3_wrapper.sh"
}
]

Jupyter Notebook Processes

"links": [
{
"rel": "jupyter-notebook",
"type": "application/json",
"title": "Jupyter Notebook Title",
"href": "https://github.com/...../..../file.ipynb"
}
]
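
With link relations like the ones above, a client could "Access" a workflow by filtering a record's links on `rel`. The Python sketch below assumes exactly the four relation names used in these examples; they are the ones proposed in this comment, not a registered vocabulary, and the helper name `workflow_links` is hypothetical.

```python
# Link relations used in the examples above (proposed here, not registered)
WORKFLOW_RELS = {
    "openeo-process",
    "ogcapi-process",
    "python-process",
    "jupyter-notebook",
}


def workflow_links(record):
    """Return the links of a record whose rel marks a workflow definition."""
    return [
        link for link in record.get("links", [])
        if link.get("rel") in WORKFLOW_RELS
    ]


example = {
    "links": [
        {"rel": "license", "href": "https://example.org/license"},
        {
            "rel": "openeo-process",
            "type": "application/json",
            "title": "openEO Process Title",
            "href": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/max_ndvi_composite/openeo_udp/examples/max_ndvi_composite/max_ndvi_composite.json",
        },
    ]
}

for link in workflow_links(example):
    print(link["rel"], link["href"])
```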

@GarinSmith
Author

This now works with the following examples

Test using check-jsonschema

https://schemas.wmo.int/wcmp/2.0.0/standard/wcmp-2.0.0.pdf
"The WMO Core Metadata Profile (WCMP) standard defined herein is an extension of the
International Standard OGC API - Records - Part 1: Core."

sudo pip3 install check-jsonschema
wget https://schemas.wmo.int/wcmp/2.0.0/schemas/wcmp2-bundled.json  # schema
wget https://schemas.wmo.int/wcmp/2.0.0/examples/de-dwd.surface-weather-observations-realtime.json  # example
check-jsonschema --schemafile wcmp2-bundled.json de-dwd.surface-weather-observations-realtime.json  # success

Example OGC API Record
git clone https://github.com/opengeospatial/ogcapi-records.git
cp ogcapi-records/core/examples/json/record.json .
check-jsonschema --schemafile wcmp2-bundled.json record.json
record.json::$.properties.contacts[0]: 'organization' is a required property
check-jsonschema --schemafile wcmp2-bundled.json record2.json  # success, after adding the missing 'organization'

OpenEO Example
git clone https://github.com/ESA-APEx/apex_algorithms.git
cp apex_algorithms/algorithm_catalog/worldcereal_inference.json .
check-jsonschema --schemafile wcmp2-bundled.json worldcereal_inference.json
Schema validation errors were encountered.
worldcereal_inference.json::$: 'time' is a required property
worldcereal_inference.json::$: 'geometry' is a required property
worldcereal_inference.json::$.properties.contacts[1]: 'organization' is a required property
check-jsonschema --schemafile wcmp2-bundled.json worldcereal_inference2.json  # success, after fixing the three errors above

@kalxas
Contributor

kalxas commented Sep 30, 2024

@GarinSmith please check the new EOEPCA metadata schema here:
https://github.com/EOEPCA/metadata-profile/tree/master/schemas

@GarinSmith
Author

@kalxas I just looked.
Is it possible to give us schemas/resource-bundled.json, to avoid having to install Stoplight Studio etc.?
We can then re-test our existing OGC API Records JSON examples.
See the other thread regarding the suggestion of focusing on potential real-world examples to help make progress.
Please see EOEPCA/resource-discovery#56

@kalxas
Contributor

kalxas commented Sep 30, 2024

@GarinSmith
Author

Thanks, I have used this new schema, but I am having trouble converting the EOEPCA example to one that works;
please see EOEPCA/resource-discovery#56

  1. Could you please convert the EarthCODE example to one that validates? So we then have a good working reference.
  2. Could you explain how we will know what type of workflow we are dealing with (openeo-process in this case)?
    E.g.
    openeo-process
    ogcapi-process
    JNB
    etc
    This may be clear when 1) is answered.

@GarinSmith
Author

It would also help to have a reference guide such as
https://schemas.wmo.int/wcmp/2.0.0/standard/wcmp-2.0.0.pdf
although this may be less necessary when we have a working reference (see above).

@Schpidi
Member

Schpidi commented Oct 1, 2024

If I understand this correctly it would be great to prepare a PR towards https://github.com/ESA-EarthCODE/open-science-catalog-validation to add the schema and any other validation artifacts extending the validation command open-science-catalog-validation ./{eo-missions,products,projects,themes,variables}.

The consecutive task would be to add a first example workflow to https://github.com/ESA-EarthCODE/open-science-catalog-metadata and extend the validation action with the extended command.

@kalxas
Contributor

kalxas commented Oct 2, 2024

@GarinSmith
Let's see what we did for Processes, ADES, CSW in EOEPCA v1.x:
We used the type "service" and added keywords in the record to specify that it is of a specific flavor.
https://resource-catalogue.develop.eoepca.org/collections/metadata:main/items/https---demo-pygeoapi-io-stable-processes-
For CWL records, we used type=application and added "CWL" in keywords.
We did not use a custom link relation in all those cases.

I propose we follow the same pattern, using record.properties.type to specify workflows.

There are 2 options:

  1. Use type="workflow" for all flavors of workflows and specify the flavor in a keyword (that makes it queryable in the catalog)
  2. Use the flavor as a dedicated type (e.g. type="openeo-process") and add the keyword "workflow"
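
To make the difference between the two options concrete, here is a small Python sketch of how a client might recognise a workflow and its flavour under either convention. It assumes `properties.type` and `properties.keywords` (a list of strings) as described above; the flavour names are the ones used earlier in this thread and the helper names are hypothetical.

```python
# Flavour names taken from earlier examples in this thread (not a fixed list)
KNOWN_FLAVOURS = ("openeo-process", "ogcapi-process", "jupyter-notebook", "CWL")


def is_workflow(record):
    """True under option 1 (type="workflow") or option 2 (keyword "workflow")."""
    props = record.get("properties", {})
    return props.get("type") == "workflow" or "workflow" in props.get("keywords", [])


def workflow_flavour(record):
    """Return the flavour from the type (option 2) or a keyword (option 1)."""
    props = record.get("properties", {})
    if props.get("type") in KNOWN_FLAVOURS:
        return props["type"]
    for keyword in props.get("keywords", []):
        if keyword in KNOWN_FLAVOURS:
            return keyword
    return None


option_1 = {"properties": {"type": "workflow", "keywords": ["openeo-process"]}}
option_2 = {"properties": {"type": "openeo-process", "keywords": ["workflow"]}}

for rec in (option_1, option_2):
    print(is_workflow(rec), workflow_flavour(rec))
```

Either option yields the same answers for a client written this way; the practical difference is which of the two values a catalogue query on `type` can filter by directly.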

@silvester-pari
Collaborator

@kalxas there are some issues with the schema https://github.com/EOEPCA/metadata-profile/blob/master/schemas/resource.json. For example, the given URL and the $id in the schema are not consistent (e.g. a .yaml file extension instead of .json), and there are some other things that the Node-based validator we are using (AJV) complains about; it seems to be stricter than the Python one, check-jsonschema. What's the best way to iterate on this schema to make it work for OSC?

@m-mohr
Collaborator

m-mohr commented Oct 9, 2024

Yeah, the schema should be hosted in a way that the $id matches the actual URL and ideally it would also return a JSON media type instead of text/plain. Maybe host it through GitHub Pages?

AJV also complains about the keyword "example" which should be "examples" (and an array). The format "url" should probably be "uri"?! See https://json-schema.org/understanding-json-schema/reference/string#resource-identifiers

Here's a fixed version of the schema with instructions (see bash.sh) how to run it with ajv-cli:
https://gist.github.com/m-mohr/381e31fcbe23a015d080925d07424384
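
For readers who want to see what the two keyword fixes amount to, here is a rough Python sketch that rewrites "example" to an "examples" array and the format "url" to "uri" in a parsed schema. It is an illustration only and not the content of the Gist; the function name `normalise_schema` is hypothetical, and it deliberately leaves property names and data-carrying keywords untouched.

```python
# Keywords whose values are maps from names to subschemas
NAME_MAPS = ("properties", "patternProperties", "definitions", "$defs")
# Keywords whose values are plain data and must not be rewritten
DATA_KEYWORDS = ("examples", "default", "enum", "const")


def normalise_schema(node, is_name_map=False):
    """Return a copy of a parsed schema with the two keyword fixes applied."""
    if isinstance(node, list):
        return [normalise_schema(item) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    for key, value in node.items():
        if is_name_map:
            # Keys here are property names (possibly "example"), not keywords
            result[key] = normalise_schema(value)
        elif key == "example":
            # JSON Schema expects "examples" as an array
            result["examples"] = list(node.get("examples", [])) + [value]
        elif key == "examples" and "example" in node:
            continue  # merged into "examples" by the branch above
        elif key == "format" and value == "url":
            result[key] = "uri"
        elif key in DATA_KEYWORDS:
            result[key] = value
        elif key in NAME_MAPS:
            result[key] = normalise_schema(value, is_name_map=True)
        else:
            result[key] = normalise_schema(value)
    return result


# Hypothetical schema fragment, for illustration only
schema = {
    "type": "object",
    "properties": {
        "example": {"type": "string", "format": "url", "example": "https://example.org"}
    },
}
fixed = normalise_schema(schema)
print(fixed)
```

The $id still has to be updated by hand to match wherever the schema is hosted.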

@kalxas
Contributor

kalxas commented Oct 14, 2024

The schema was generated from the OGC API - Records yaml schema using the tools described here: https://github.com/EOEPCA/metadata-profile/tree/master/schemas
The upstream schema is here:
https://github.com/opengeospatial/ogcapi-records/blob/master/core/openapi/schemas/recordGeoJSON.yaml

@m-mohr
Collaborator

m-mohr commented Oct 14, 2024

Yeah OpenAPI (3.0) and JSON Schema can't be converted 1:1, it seems the tool doesn't do the conversion very strictly. Shall we propose the changes from my Gist to the upstream schema then? The $id is of course always something that needs to be updated manually after conversion, I guess.

@kalxas
Contributor

kalxas commented Oct 15, 2024

thanks @m-mohr , opened an upstream pull request

@silvester-pari
Collaborator

Thanks @kalxas, there was a second issue:

AJV also complains about the keyword "example" which should be "examples" (and an array).

Is this also something that can be fixed upstream?

@kalxas
Contributor

kalxas commented Oct 18, 2024

I will check with the SWG

@silvester-pari
Collaborator

@kalxas any updates yet?

@m-mohr
Collaborator

m-mohr commented Oct 31, 2024

Shouldn't the example provided above (i.e. https://schemas.wmo.int/wcmp/2.0.0/examples/de-dwd.surface-weather-observations-realtime.json ) include a conformance class for (static) OGC API - Records?

@kalxas
Contributor

kalxas commented Nov 1, 2024

no further updates yet

@kalxas
Contributor

kalxas commented Nov 2, 2024

I think OpenAPI allows for both example (string/object/array) and examples (array)

@m-mohr
Collaborator

m-mohr commented Nov 2, 2024

That's correct, but then it needs to be converted to examples for JSON Schema. So why not just use examples also in OpenAPI if that's the common denominator? Makes it easier to switch between JSON Schema and OpenAPI.

The question with the conformance classes is also still open.

@kalxas
Contributor

kalxas commented Nov 3, 2024

opengeospatial/ogcapi-records#396

@kalxas
Contributor

kalxas commented Nov 3, 2024

@m-mohr regarding conformance classes:
If the conformance class is based on a requirements class that has a dependency, why should we additionally specify OARec as another conformance class?

@m-mohr
Collaborator

m-mohr commented Nov 3, 2024

@kalxas How can I resolve in a client which conformance classes apply, including all dependencies? My client can't know all the conformance classes in the world and their dependencies, but it should be able to validate whether something is a valid Record. It doesn't know that it is a Record, though, because the conformance class is not listed, so it would reject the file as an invalid Record. So I think the dependencies need to be listed. I would rather ask: why would you not list it?
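
The client-side problem can be sketched in a few lines of Python. The conformance class URIs below are assumptions written from memory for illustration and should be checked against the published specifications; the point is only that a generic client can match the classes it knows and nothing else.

```python
# Assumed URIs, for illustration only; verify against the specifications
RECORD_CORE = "http://www.opengis.net/spec/ogcapi-records-1/1.0/conf/record-core"
WCMP2_CORE = "http://wis.wmo.int/spec/wcmp/2/conf/core"


def declares_record(document, known_classes=(RECORD_CORE,)):
    """True if the document lists a conformance class this client recognises."""
    return any(uri in known_classes for uri in document.get("conformsTo", []))


profile_only = {"conformsTo": [WCMP2_CORE]}
with_dependency = {"conformsTo": [WCMP2_CORE, RECORD_CORE]}

# A generic Records client has no way to know that the profile implies Records
print(declares_record(profile_only))
print(declares_record(with_dependency))
```

Unless the dependency is listed explicitly, the first document is indistinguishable from an unrelated JSON file as far as such a client is concerned.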

@kalxas
Contributor

kalxas commented Nov 26, 2024

Just out of the OGC API Records meeting, we have covered all the pending issues. Changes to be applied shortly as pull requests.

@m-mohr
Collaborator

m-mohr commented Nov 26, 2024

Can you remind me what the solution for the example vs examples issue was? @kalxas Migration to OpenAPI 3.1?

@kalxas
Contributor

kalxas commented Dec 4, 2024

Yes, there is a plan to migrate OGC APIs to OpenAPI 3.1

@kalxas
Contributor

kalxas commented Dec 30, 2024

EOEPCA/metadata-profile#2

@kalxas
Contributor

kalxas commented Jan 15, 2025

The example(s) have been removed from the schema (also upstream)

@Schpidi
Member

Schpidi commented Jan 21, 2025

Updated PR for records validation in ESA-EarthCODE/open-science-catalog-validation#12
Tested against a draft example included in https://github.com/ESA-EarthCODE/open-science-catalog-metadata-testing/tree/demo-workflow

@m-mohr
Collaborator

m-mohr commented Feb 4, 2025

Records validation now in: ESA-EarthCODE/open-science-catalog-validation#16
Example metadata in: ESA-EarthCODE#69
