-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(cff): enforce valid urls as doi #108
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I leave up to you whether regex is smart and worth it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very readable - so I even dare to make a suggestion - though I'm not sure about the performance considerations: Instead of having 2 if statements taking care of multiple possible doi formats - would it not be simpler to regex the "10.xxxx" part of the doi out of whatever format is found in, and always prefix that with https://doi.org?
that way it can also handle any misspelling in the prefix in the cff (e.g. http://doi.org/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, that's a bit more robust :) I've added regex based matching.
The performance cost is negligible (in gimie time-scale):
long_doi = 'https://doi.org/10.123112/abc.def'
short_doi = '10.123112/abc.def'
# With regex
In [6]: %timeit doi_to_url_regex(long_doi)
1.58 µs ± 64.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [9]: %timeit doi_to_url_regex(short_doi)
1.45 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
# With if
In [7]: %timeit doi_to_url_if(long_doi)
230 ns ± 3.92 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [8]: %timeit doi_to_url_if(short_doi)
358 ns ± 3.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid crashing the whole program because of malformatted user-entered data, I've also made it a bit more defensive:
- Invalid yaml -> warn user and extract nothing
- Valid yaml but no doi -> successfully extract nothing
- doi value is not a doi -> warn the user and extract nothing
Does that make sense?
In all cases above, the program resumes successfully. warnings are logged to stderr and will not be mixed with the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would only doubt a bit on the
Valid yaml but no doi -> successfully extract nothing
It's not mandatory from CFF, but a doi is the only thing we are looking at a CFF for here - maybe we could just produce a warning in this case too? Basically, the only case in which a warning is not produced is either a user has not added a CFF file, or the CFF file they uploaded contains a valid DOI. But maybe that's being too conservative and patriarchal to our users - as if they are not aware that their CFF doesn't contain a DOI at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense! I added a warning in 4d43f42
Now the behavior is:
➜ gimie data https://github.com/opencv/cvat > /dev/null
WARNING :: CITATION.cff does not contain a 'doi' key.
EDIT: Added formatting to the log messages to include level (DEBUG/INFO/WARNING/ERROR)
Addresses a crash caused by DOIs being stored as
10.xxxx/abcd
inCITATION.cff
instead ofdoi.org/10.xxxx/abcd
.We now post-process the DOI to make it a valid URL before casting to
URIRef
.Additionally: