Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cff): enforce valid urls as doi #108

Merged
merged 15 commits into from
Feb 2, 2024
Merged

fix(cff): enforce valid urls as doi #108

merged 15 commits into from
Feb 2, 2024

Conversation

cmdoret
Copy link
Member

@cmdoret cmdoret commented Feb 1, 2024

Addresses a crash caused by DOIs being stored as 10.xxxx/abcd in CITATION.cff instead of doi.org/10.xxxx/abcd.
We now post-process the DOI to make it a valid URL before casting to URIRef.

Additionally:

  • handle situations where the CITATION.cff is broken (e.g. invalid yaml) without crashing
  • minor style tweaks to the docs.

@cmdoret cmdoret self-assigned this Feb 1, 2024
@cmdoret cmdoret requested a review from rmfranken February 1, 2024 10:59
Copy link
Member

@rmfranken rmfranken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I leave up to you whether regex is smart and worth it :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very readable - so I even dare to make a suggestion - though I'm not sure about the performance considerations: Instead of having 2 if statements taking care of multiple possible doi formats - would it not be simpler to regex the "10.xxxx" part of the doi out of whatever format is found in, and always prefix that with https://doi.org?

that way it can also handle any misspelling in the prefix in the cff (e.g. http://doi.org/)

Copy link
Member Author

@cmdoret cmdoret Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, that's a bit more robust :) I've added regex based matching.
The performance cost is negligible (in gimie time-scale):

long_doi = 'https://doi.org/10.123112/abc.def'
short_doi = '10.123112/abc.def'

# With regex 
In [6]: %timeit doi_to_url_regex(long_doi)
1.58 µs ± 64.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [9]: %timeit doi_to_url_regex(short_doi)
1.45 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

# With if
In [7]: %timeit doi_to_url_if(long_doi)
230 ns ± 3.92 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [8]: %timeit doi_to_url_if(short_doi)
358 ns ± 3.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid crashing the whole program because of malformatted user-entered data, I've also made it a bit more defensive:

  • Invalid yaml -> warn user and extract nothing
  • Valid yaml but no doi -> successfully extract nothing
  • doi value is not a doi -> warn the user and extract nothing

Does that make sense?
In all cases above, the program resumes successfully. warnings are logged to stderr and will not be mixed with the output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would only doubt a bit on the

Valid yaml but no doi -> successfully extract nothing

It's not mandatory from CFF, but a doi is the only thing we are looking at a CFF for here - maybe we could just produce a warning in this case too? Basically, the only case in which a warning is not produced is either a user has not added a CFF file, or the CFF file they uploaded contains a valid DOI. But maybe that's being too conservative and patriarchal to our users - as if they are not aware that their CFF doesn't contain a DOI at all.

Copy link
Member Author

@cmdoret cmdoret Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense! I added a warning in 4d43f42

Now the behavior is:

➜ gimie data https://github.com/opencv/cvat > /dev/null
WARNING :: CITATION.cff does not contain a 'doi' key.

EDIT: Added formatting to the log messages to include level (DEBUG/INFO/WARNING/ERROR)

@cmdoret cmdoret requested a review from rmfranken February 1, 2024 16:15
@cmdoret cmdoret merged commit e68513f into main Feb 2, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants