fix(cff): enforce valid urls as doi #108

cmdoret · 2024-02-01T10:59:16Z

Addresses a crash caused by DOIs being stored as 10.xxxx/abcd in CITATION.cff instead of doi.org/10.xxxx/abcd.
We now post-process the DOI to make it a valid URL before casting to URIRef.

Additionally:

Handle situations where the CITATION.cff is broken (e.g. invalid YAML) without crashing.
Minor style tweaks to the docs.

rmfranken

Looks good to me. I'll leave it up to you whether the regex is smart and worth it :)

rmfranken · 2024-02-01T12:36:36Z

gimie/parsers/cff.py

This is very readable, so I even dare to make a suggestion, though I'm not sure about the performance implications: instead of having 2 if statements to handle the possible DOI formats, would it not be simpler to use a regex to pull the "10.xxxx" part of the DOI out of whatever format it is found in, and always prefix that with https://doi.org?

That way it can also handle any misspelling of the prefix in the CFF (e.g. http://doi.org/).

Good point, that's a bit more robust :) I've added regex-based matching.
The performance cost is negligible (in gimie time-scale):

long_doi = 'https://doi.org/10.123112/abc.def'
short_doi = '10.123112/abc.def'

# With regex
In [6]: %timeit doi_to_url_regex(long_doi)
1.58 µs ± 64.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [9]: %timeit doi_to_url_regex(short_doi)
1.45 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# With if
In [7]: %timeit doi_to_url_if(long_doi)
230 ns ± 3.92 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [8]: %timeit doi_to_url_if(short_doi)
358 ns ± 3.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
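
For readers following the thread, here is a minimal sketch of what regex-based matching could look like. The function name is borrowed from the benchmark above, but the body and the pattern are assumptions made for illustration, not the code merged in this PR. The pattern assumes a DOI starts with "10.", followed by a 4 to 9 digit registrant code, a slash, and a non-empty suffix.

```python
import re
from typing import Optional

# Assumed pattern (not taken from the PR): "10.", 4-9 digits, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_to_url_regex(doi: str) -> Optional[str]:
    """Extract the bare DOI from any input format and return it as an https URL.

    Returns None when no DOI-like substring is found, so the caller can
    decide how to report the problem.
    """
    match = DOI_PATTERN.search(doi)
    if match is None:
        return None
    # Always rebuild the URL from the bare DOI, ignoring whatever prefix was given.
    return f"https://doi.org/{match.group(0)}"
```

Because the prefix is discarded and rebuilt, inputs such as `10.123112/abc.def`, `https://doi.org/10.123112/abc.def` and `http://doi.org/10.123112/abc.def` all normalize to the same URL.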

To avoid crashing the whole program because of malformed user-entered data, I've also made it a bit more defensive:

Invalid yaml -> warn user and extract nothing

Valid yaml but no doi -> successfully extract nothing

doi value is not a doi -> warn the user and extract nothing

Does that make sense?
In all cases above, the program resumes successfully. Warnings are logged to stderr and will not be mixed with the output.

My only doubt is about this case:

Valid yaml but no doi -> successfully extract nothing

A DOI is not mandatory in CFF, but it is the only thing we read from the CFF file here, so maybe we could produce a warning in this case too? That way, the only cases in which no warning is produced are when the user has not added a CFF file, or when their CFF file contains a valid DOI. But maybe that's being too conservative and paternalistic towards our users, as if they were not aware that their CFF doesn't contain a DOI at all.

That makes sense! I added a warning in 4d43f42

Now the behavior is:

➜ gimie data https://github.com/opencv/cvat > /dev/null
WARNING :: CITATION.cff does not contain a 'doi' key.

EDIT: Added formatting to the log messages to include level (DEBUG/INFO/WARNING/ERROR)
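
The behavior agreed on in this thread (warn and extract nothing on invalid YAML, on a missing doi key, or on a value that is not a DOI, with the log level shown in each message) can be sketched roughly as follows. All names here (`configure_logging`, `extract_doi_url`, the `load` parameter, the logger name, the DOI pattern) are hypothetical and chosen for illustration; only the "does not contain a 'doi' key" message and the `LEVEL :: message` layout come from the output above. The YAML loader is passed in as a parameter so the sketch stays free of third-party imports; in practice it would presumably be something like `yaml.safe_load` from PyYAML.

```python
import logging
import re
import sys
from typing import Any, Callable, Optional

logger = logging.getLogger("cff_sketch")  # hypothetical logger name

# Assumed DOI pattern: "10.", 4-9 digits, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def configure_logging(stream=sys.stderr) -> None:
    """Send log records to stderr (by default), prefixed with their level."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s :: %(message)s"))
    logger.handlers = [handler]
    logger.setLevel(logging.WARNING)
    logger.propagate = False


def extract_doi_url(cff_text: str, load: Callable[[str], Any]) -> Optional[str]:
    """Return the DOI as an https URL, or None (with a warning) if unavailable."""
    try:
        data = load(cff_text)
    except Exception:  # broad on purpose: the loader is pluggable in this sketch
        logger.warning("CITATION.cff is not valid YAML.")
        return None
    if not isinstance(data, dict) or "doi" not in data:
        logger.warning("CITATION.cff does not contain a 'doi' key.")
        return None
    match = DOI_PATTERN.search(str(data["doi"]))
    if match is None:
        logger.warning("CITATION.cff contains an invalid DOI: %s", data["doi"])
        return None
    return f"https://doi.org/{match.group(0)}"
```

Since the handler writes to stderr, redirecting stdout (as in the `> /dev/null` example above) leaves the warnings visible without mixing them into the program output.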

cmdoret added 7 commits January 31, 2024 23:17

fix(cff): parse yaml to handle quotes + prepend scheme to doi

dc49d66

chore(fmt): quotes

d0bd177

chore: update lock

6ba62c0

docs(readme): update obsolete env var name

f125596

docs(readme): lighter fmt

2ee8edf

test(cff): update doctest to include scheme

4f40840

fix(cff): ensure doi is a valid url

43465c4

cmdoret self-assigned this Feb 1, 2024

cmdoret requested a review from rmfranken February 1, 2024 10:59

cmdoret added 2 commits February 1, 2024 12:48

fix: drop unused import

db81df7

fix(cff): error handling on invalid yaml

ff6bd5d

rmfranken approved these changes Feb 1, 2024

refactor(cff): use regex + defensive prog

7d83123

cmdoret requested a review from rmfranken February 1, 2024 16:15

cmdoret added 2 commits February 1, 2024 17:16

fix(cff): drop unused prefix var

fbae498

test(cff): update test cases with valid doi

159a65e

rmfranken approved these changes Feb 2, 2024

cmdoret added 3 commits February 2, 2024 09:43

feat(cff): add warning when doi missing from cff

4d43f42

fix(log): replace deprecated logger.warn -> logger.warning

94d2369

feat(log): format to include loglevel

028cbd7

cmdoret merged commit e68513f into main Feb 2, 2024
6 checks passed
