As the title states, these scripts curate (deconvolute) the yield columns contained in two public datasets.
The first is the so-called "Lowe dataset": https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873
The second is the so-called "Schwaller dataset", which is a continuation of the above set: https://pubs.rsc.org/en/content/articlelanding/2018/SC/C8SC02339E
(This in turn uses the curated set disclosed by Jin and coworkers, see https://dl.acm.org/doi/10.5555/3294996.3295021)
All credit for the original curation goes to the authors in the above links. Refer to those links as well for the appropriate licensing info for the data.
These two datasets have two pre-curated columns: text-mined yield and calculated yield. Many entries contain
no, partial, or incorrect numbers. For most reaction analyses where the yield is the only available outcome,
such entries are useless because the yield cannot be correlated to the reaction conditions. Thus, by correcting and
eliminating such entries, noise in the dataset is reduced.
The curated datasets contain only approx. 50% of the original entries.
Two scripts are available:
- curate_yield_cropped.py: removes incomplete/missing/wrong entries entirely, incl. the patent/paragraph/yield columns, and corresponds to the data available on Figshare.
- curate_yield_full.py: for users who don't agree with the filtration and want to keep all data. It curates the yield but keeps all other data (incl. columns) intact.
Requires Python 3.6 or higher. Only the standard numpy/pandas libraries are used.
Change the paths and file names in the script to your locations. The scripts are lightweight, without any error checking, command-line inputs, etc.
- Source
- Target
- CanonicalizedReaction
- OriginalReaction (= ReactionSmiles*)
- PatentNumber*
- ParagraphNum*
- Year*
- TextMinedYield*
- CalculatedYield*
TextMinedYield
- x%
- >x%, <x%, or >=x%
- ~x%
- ~x to y%
- x.xx%
- negative numbers
CalculatedYield
- x.x%
- also with some negative values or values >> 100%
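The text-mined formats listed above can all be reduced to a single number by extracting every numeric token and keeping the largest one. The following is a minimal sketch of that idea, not the code used in the scripts; the function name `parse_yield_text` and the regular expression are assumptions made for illustration. Negative values are preserved so that a later step can flag them as faulty.

```python
import re
from typing import Optional

# Optional leading minus (only when not directly preceded by a digit or dot,
# so that "70-80" is read as a range and not as 70 and -80), then digits
# with an optional decimal part.
_NUMBER = re.compile(r"(?<![\d.])-?\d+(?:\.\d+)?")


def parse_yield_text(raw) -> Optional[float]:
    """Hypothetical helper: return the largest number in a yield string.

    Returns None when the input is missing or contains no number.
    """
    if raw is None:
        return None
    # str() also handles NaN cells coming from pandas ("nan" has no digits).
    numbers = [float(token) for token in _NUMBER.findall(str(raw))]
    return max(numbers) if numbers else None
```

Taking the maximum means that a range such as "~x to y%" yields y, and comparators such as ">" or "~" are simply ignored.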
new columns (independent of filtration of input columns)
- ID (optional)
- Yield (the curated one, in %)
- Although some values look as if they are due to a calculation error (factor 10 or 100) or a typo (missing .), this remains an assumption, and thus such numbers have to be seen as faulty and are dismissed.
- If neither yield type exists, or if the value is > 100% or negative, it is set to 0.
- If a value is only in one column, that value is used.
- If a value is available in both columns, the larger value is used.
- Range x to y%: the largest value y% is used.
- Since 0 corresponds to basically nonexistent or faulty data, these entries are dismissed (filtered out) entirely.
- An estimated <0.1% of incorrect entries are not corrected and end up being filtered out as well.
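The rules above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the code in the scripts: the function name `curate_yield` is hypothetical, both yield columns are assumed to already hold plain numbers (with missing values as NaN), and a faulty value in one column is assumed to be treated like a missing one, so a valid value in the other column is still used.

```python
import pandas as pd


def curate_yield(df, text_col="TextMinedYield", calc_col="CalculatedYield"):
    """Hypothetical sketch: add a curated 'Yield' column and drop 0 entries."""
    # Coerce anything non-numeric to NaN.
    values = df[[text_col, calc_col]].apply(pd.to_numeric, errors="coerce")

    # Missing, negative, or > 100% values are set to 0.
    # (Assumption: this is applied per column, before the two are combined.)
    values = values.where((values >= 0) & (values <= 100), 0.0)

    out = df.copy()
    # One valid value: that value wins over 0. Two valid values: the larger.
    out["Yield"] = values.max(axis=1)

    # 0 marks nonexistent or faulty data, so those rows are filtered out.
    return out[out["Yield"] > 0].reset_index(drop=True)
```

Keeping the other columns in the returned frame resembles the "full" variant; dropping the patent/paragraph/yield input columns afterwards would resemble the "cropped" one.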
figshare: https://doi.org/10.6084/m9.figshare.14414039
MIT license, see license file for details.