fix(cff): enforce valid urls as doi #108

Merged on Feb 2, 2024 (15 commits)
8 changes: 4 additions & 4 deletions README.md
@@ -12,7 +12,7 @@ Scientific code repositories contain valuable metadata which can be used to enri

 Using Gimie: easy peasy, it's a 3 step process.

-## STEP 1: Installation
+## 1: Installation

 To install the stable version on PyPI:

@@ -32,10 +32,10 @@ Gimie is also available as a docker container hosted on the [Github container re
 docker pull ghcr.io/sdsc-ordes/gimie:latest

 # The access token can be provided as an environment variable
-docker run -e ACCESS_TOKEN=$ACCESS_TOKEN ghcr.io/sdsc-ordes/gimie:latest gimie data <repo>
+docker run -e GITHUB_TOKEN=$GITHUB_TOKEN ghcr.io/sdsc-ordes/gimie:latest gimie data <repo>
 ```

-## STEP 2 : Set your credentials
+## 2 : Set your credentials

 In order to access the github api, you need to provide a github token with the `read:org` scope.

@@ -61,7 +61,7 @@ and/or your Gitlab token:
 export GITLAB_TOKEN=
 ```

-## STEP 3: GIMIE info ! Run Gimie
+## 3: GIMIE info ! Run Gimie

 ### As a command line tool

6 changes: 5 additions & 1 deletion gimie/__init__.py
@@ -22,4 +22,8 @@
 __version__ = importlib_metadata.version(__name__)

 logger = logging.getLogger()
-logger.setLevel(logging.WARNING)
+stdout_formatter = logging.Formatter("%(levelname)s :: %(message)s")
+stream_handler = logging.StreamHandler()
+stream_handler.setLevel(logging.WARNING)
+stream_handler.setFormatter(stdout_formatter)
+logger.addHandler(stream_handler)
77 changes: 65 additions & 12 deletions gimie/parsers/cff.py
Member

This is very readable, so I'll even dare to make a suggestion, though I'm not sure about the performance implications: instead of having two if statements to handle the possible DOI formats, wouldn't it be simpler to use a regex to extract the "10.xxxx" part of the DOI from whatever format it is found in, and always prefix that with https://doi.org?

That way it can also handle a misspelled prefix in the CFF (e.g. http://doi.org/).

Member Author (@cmdoret, Feb 1, 2024)

Good point, that's a bit more robust :) I've added regex based matching.
The performance cost is negligible (in gimie time-scale):

long_doi = 'https://doi.org/10.123112/abc.def'
short_doi = '10.123112/abc.def'

# With regex 
In [6]: %timeit doi_to_url_regex(long_doi)
1.58 µs ± 64.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [9]: %timeit doi_to_url_regex(short_doi)
1.45 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# With if
In [7]: %timeit doi_to_url_if(long_doi)
230 ns ± 3.92 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [8]: %timeit doi_to_url_if(short_doi)
358 ns ± 3.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
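
The two functions timed above are not shown in the thread. Below is a minimal sketch of what they might look like, assuming the if-based variant only checks for known prefixes and the regex variant extracts the "10.xxxx/..." part. Both bodies and the `DOI_BASE` / `DOI_PATTERN` names are reconstructions for illustration, not the code that was benchmarked. The pattern is adapted from the Crossref blog post, with the dot after "10" escaped.

```python
import re
import timeit

DOI_BASE = "https://doi.org/"

# Hypothetical reconstruction of the prefix-checking variant.
def doi_to_url_if(doi: str) -> str:
    if doi.startswith(DOI_BASE):
        return doi
    if doi.startswith("doi.org/"):
        return f"https://{doi}"
    # Anything else is assumed to be a bare DOI.
    return f"{DOI_BASE}{doi}"


# Hypothetical reconstruction of the regex variant: keep only the
# trailing "10.xxxx/suffix" part and always prepend the same base URL.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+$", flags=re.IGNORECASE)


def doi_to_url_regex(doi: str) -> str:
    match = DOI_PATTERN.search(doi)
    if match is None:
        raise ValueError(f"Not a valid DOI: {doi}")
    return f"{DOI_BASE}{match.group()}"


if __name__ == "__main__":
    n = 100_000
    for doi in ("https://doi.org/10.123112/abc.def", "10.123112/abc.def"):
        for func in (doi_to_url_if, doi_to_url_regex):
            seconds = timeit.timeit(lambda: func(doi), number=n)
            print(f"{func.__name__}({doi!r}): {seconds / n * 1e9:.0f} ns per call")
```

In this sketch, only the regex variant normalizes an unexpected prefix such as http://doi.org/; the if-based variant would treat it as a bare DOI and prepend the base URL again.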

Member Author

To avoid crashing the whole program because of malformed user-entered data, I've also made it a bit more defensive:

  • Invalid yaml -> warn user and extract nothing
  • Valid yaml but no doi -> successfully extract nothing
  • doi value is not a doi -> warn the user and extract nothing

Does that make sense?
In all cases above, the program resumes successfully. Warnings are logged to stderr and will not be mixed with the output.
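
The "warn and extract nothing" pattern for the last two cases could be sketched roughly as follows. To stay dependency-free, this sketch starts from the already-parsed CFF content (what a YAML loader would return) and leaves out the invalid-YAML case, which would be one more try/except around the parsing call. The helper name `extract_doi_url`, the logger name, and the escaped-dot pattern are assumptions made for illustration.

```python
import logging
import re
from typing import Any, Optional

logger = logging.getLogger("cff_sketch")  # hypothetical logger name

DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+$", flags=re.IGNORECASE)


def extract_doi_url(cff: Any) -> Optional[str]:
    """Return the DOI as a URL, or None (after a warning) when unavailable."""
    try:
        doi = cff["doi"]
    except (KeyError, TypeError):
        # Either the mapping has no 'doi' key, or the parsed content
        # is not a mapping at all (e.g. None for an empty file).
        logger.warning("CITATION.cff does not contain a 'doi' key.")
        return None

    match = DOI_PATTERN.search(str(doi))
    if match is None:
        # The value is present but does not look like a DOI.
        logger.warning("Not a valid DOI: %s", doi)
        return None

    return f"https://doi.org/{match.group()}"
```

Every failure path returns None instead of raising, so the caller can carry on with the rest of the extraction.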

Member

I only have some doubts about this case:

Valid yaml but no doi -> successfully extract nothing

A DOI is not mandatory in CFF, but it is the only thing we look for in a CFF file here, so maybe we could produce a warning in this case too? The only cases without a warning would then be that the user has not added a CFF file, or that the CFF file they uploaded contains a valid DOI. But maybe that's being too conservative and paternalistic toward our users, as if they were not aware that their CFF doesn't contain a DOI at all.

Member Author (@cmdoret, Feb 2, 2024)

That makes sense! I added a warning in 4d43f42

Now the behavior is:

➜ gimie data https://github.com/opencv/cvat > /dev/null
WARNING :: CITATION.cff does not contain a 'doi' key.

EDIT: Added formatting to the log messages to include level (DEBUG/INFO/WARNING/ERROR)
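
For reference, here is a small standalone sketch of how a level-prefixed warning ends up on stderr while the regular output stays on stdout. It relies on `logging.StreamHandler` writing to `sys.stderr` when no stream is given; the `make_logger` helper and the logger name are hypothetical and not part of gimie.

```python
import logging


def make_logger(stream=None) -> logging.Logger:
    """Build a logger that prefixes each message with its level name."""
    logger = logging.getLogger("gimie_sketch")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setLevel(logging.WARNING)
    handler.setFormatter(logging.Formatter("%(levelname)s :: %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger()
    # Written to stderr, so it stays visible when stdout is redirected.
    log.warning("CITATION.cff does not contain a 'doi' key.")
    # Written to stdout, so `> /dev/null` discards it.
    print("<extracted metadata>")
```

Redirecting stdout, as in the command above, discards the printed output and leaves only the WARNING line visible.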

@@ -17,15 +17,17 @@
 from io import BytesIO
 import re
 from typing import List, Optional, Set
+import yaml

 from rdflib.term import URIRef

+from gimie import logger
 from gimie.graph.namespaces import SDO
 from gimie.parsers.abstract import Parser, Property


 class CffParser(Parser):
-    """Parse cff file to extract the doi into schema:citation <doi>."""
+    """Parse DOI from CITATION.cff into schema:citation <doi>."""

     def __init__(self):
         super().__init__()
@@ -43,30 +45,81 @@ def parse(self, data: bytes) -> Set[Property]:
         return props


+def doi_to_url(doi: str) -> str:
+    """Formats a doi to an https URL to doi.org.
+
+    Parameters
+    ----------
+    doi
+        doi where the scheme (e.g. https://) and
+        hostname (e.g. doi.org) may be missing.
+
+    Returns
+    -------
+    str
+        doi formatted as a valid url. Base url
+        is set to https://doi.org when missing.
+
+    Examples
+    --------
+    >>> doi_to_url("10.0000/example.abcd")
+    'https://doi.org/10.0000/example.abcd'
+    >>> doi_to_url("doi.org/10.0000/example.abcd")
+    'https://doi.org/10.0000/example.abcd'
+    >>> doi_to_url("https://doi.org/10.0000/example.abcd")
+    'https://doi.org/10.0000/example.abcd'
+    """
+
+    # regex from:
+    # https://www.crossref.org/blog/dois-and-matching-regular-expressions
+    doi_match = re.search(
+        r"10.\d{4,9}/[-._;()/:A-Z0-9]+$", doi, flags=re.IGNORECASE
+    )
+
+    if doi_match is None:
+        raise ValueError(f"Not a valid DOI: {doi}")
+
+    short_doi = doi_match.group()
+
+    return f"https://doi.org/{short_doi}"
+
+
 def get_cff_doi(data: bytes) -> Optional[str]:
     """Given a CFF file, returns the DOI, if any.

     Parameters
     ----------
-    data:
+    data
         The cff file body as bytes.

+    Returns
+    -------
+    str, optional
+        doi formatted as a valid url
+
     Examples
     --------
     >>> get_cff_doi(bytes("doi: 10.5281/zenodo.1234", encoding="utf8"))
-    '10.5281/zenodo.1234'
+    'https://doi.org/10.5281/zenodo.1234'
     >>> get_cff_doi(bytes("abc: def", encoding="utf8"))

     """

-    matches = re.search(
-        r"^doi: *(.*)$",
-        data.decode(),
-        flags=re.IGNORECASE | re.MULTILINE,
-    )
     try:
-        doi = matches.groups()[0]
-    except AttributeError:
-        doi = None
+        cff = yaml.safe_load(data.decode())
+    except yaml.scanner.ScannerError:
+        logger.warning("cannot read CITATION.cff, skipped.")
+        return None
+
+    try:
+        doi_url = doi_to_url(cff["doi"])
+    # No doi in cff file
+    except (KeyError, TypeError):
+        logger.warning("CITATION.cff does not contain a 'doi' key.")
+        doi_url = None
+    # doi is malformed
+    except ValueError as err:
+        logger.warning(err)
+        doi_url = None

-    return doi
+    return doi_url