The output of the EXSCLAIM! pipeline is an EXSCLAIM JSON, which presents data extracted from open, peer-reviewed journals in an organized structure. This structure maps Figure names to their respective Figure JSONs. Each Figure JSON has a "master images" attribute holding a Master Image JSON for each Master Image in the figure. Each Figure JSON also contains metadata and an "unassigned" attribute containing all figure components that have not been organized into a master image.

Coordinate JSON


Describes a specific pixel in an image, located x pixels from the left of the image and y pixels from the top.


  • x : An integer. The number of pixels from the left of the image to the specified coordinate.
  • y : An integer. The number of pixels from the top of the image to the specified coordinate.

Bounding Box JSON


Describes a bounding box (a rectangle) on an image as a list of its four corners. Note that the sides of this bounding box must be parallel to the sides of the image.


No attributes. A Bounding Box JSON is a list of four Coordinate JSONs, one per corner.
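
As an illustration, a Bounding Box JSON can be written in Python as a plain list of four Coordinate JSONs. The sketch below builds one and checks that its sides are parallel to the sides of the image. The corner order and the helper name is_axis_aligned are assumptions made for this example; the format described above does not specify either.

```python
# Hypothetical bounding box; the corner order (clockwise from top-left)
# is an assumption, not something the format prescribes.
bounding_box = [
    {"x": 10, "y": 20},   # top-left
    {"x": 110, "y": 20},  # top-right
    {"x": 110, "y": 80},  # bottom-right
    {"x": 10, "y": 80},   # bottom-left
]


def is_axis_aligned(box):
    """Return True if `box` holds the four corners of a rectangle
    whose sides are parallel to the sides of the image."""
    if len(box) != 4:
        return False
    xs = {corner["x"] for corner in box}
    ys = {corner["y"] for corner in box}
    corners = {(corner["x"], corner["y"]) for corner in box}
    # An axis-aligned rectangle uses exactly two x values and two y values,
    # and its corners are every combination of them.
    expected = {(x, y) for x in xs for y in ys}
    return len(xs) == 2 and len(ys) == 2 and corners == expected
```

A box with a slanted side, or with fewer than four corners, fails the check.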

Label JSON


Describes either a Subfigure Label or Scale Bar Label by giving its location and contents within the Figure.


Scale Bar JSON


Describes both a Scale Bar Label and the Scale Bar Line it describes.


  • label : A Label JSON. The Scale Bar Label, usually an integer and a unit of measurement.
  • geometry : A Bounding Box JSON. The Scale Bar Line.

Child Image JSON


A Child Image (any individual image without its own highest order Subfigure Label, i.e. a Dependent or Inset Image) contained within the Figure in question, described by its location, label, scale bar, and classification.


  • label : (optional) A Label JSON. The Subfigure Label, usually a single letter or number optionally surrounded by parentheses.
  • geometry : A Bounding Box JSON. The location of the Subfigure in the Figure.
  • scale bar : (optional) A Scale Bar JSON. Describes the Scale Bar contained within the Subfigure.
  • classification : A string. The type of image. Possible values are: Microscopy, Diffraction, Graph, Illustration, Photo, or Unclear.

Master Image JSON


A Master Image (the largest “super” image clearly associated with the neighboring highest order Subfigure Label that captures all images/subfigures the Subfigure label refers to) contained within the Figure. Fully described by its Child Images, image type, component Scale Bars, Labels, and associated caption.


  • label : (optional) A Label JSON. The Subfigure Label, usually a single letter or number optionally surrounded by parentheses.
  • geometry : A Bounding Box JSON. The location of the Master Image in the Figure.
  • scale bar : (optional) A Scale Bar JSON. Describes the Scale Bar contained within the Subfigure.
  • classification : A string. The type of image. One of: Parent, Microscopy, Diffraction, Graph, Illustration, Photo, or Unclear.
  • inset images : (optional) A list of Child Image JSONs. All Inset Images contained within the Master Image.
  • dependent images : (optional) A list of Child Image JSONs. All Dependent Images contained within the Master Image. Note: this list should only be filled if classification is Parent.
  • caption : (optional) A string. The caption chunk from the Figure's article describing the Master Image.
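
A minimal sketch of a Master Image JSON as a Python dict, assuming made-up geometry, classification, and caption values. The helpers box and child_images are hypothetical names introduced for this example. The optional label and scale bar attributes are left out.

```python
def box(x0, y0, x1, y1):
    """Build a Bounding Box JSON (the corner order is an assumption)."""
    return [
        {"x": x0, "y": y0},
        {"x": x1, "y": y0},
        {"x": x1, "y": y1},
        {"x": x0, "y": y1},
    ]


# Hypothetical Master Image JSON; every value is invented for illustration.
master_image = {
    "geometry": box(0, 0, 400, 300),
    "classification": "Parent",
    "inset images": [],
    "dependent images": [
        {"geometry": box(0, 0, 200, 300), "classification": "Microscopy"},
        {"geometry": box(200, 0, 400, 300), "classification": "Diffraction"},
    ],
    "caption": "Example caption text.",
}


def child_images(master):
    """Return all Child Image JSONs of a Master Image JSON.

    Optional lists may be absent, so missing keys fall back to empty lists.
    Design choice of this sketch: dependent images are only read when the
    classification is Parent, since the list should be empty otherwise.
    """
    insets = master.get("inset images", [])
    dependents = []
    if master["classification"] == "Parent":
        dependents = master.get("dependent images", [])
    return insets + dependents
```

Calling child_images(master_image) returns the two dependent images in the order they are listed.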

Figure JSON


Contains all known information about a given Figure. Includes metadata, all Master Images, and all unassigned objects.


  • article url : A string. URL to the article from which the figure came.
  • journal : A string. Name of the journal in which the figure appeared.
  • article title : A string. Title of the figure's article.
  • full caption : A string. The full caption describing the figure.
  • name : A string. The name of the figure. The Figure Scraper package initializes the name as the article name concatenated with _fig{i} where {i} means it was the ith figure scraped from the article.
  • url : A string. URL to the figure.
  • open : A boolean. True if the article is open access.
  • license : A string. Text of, or link to article license, if known.
  • master images : A list of Master Image JSONs. Contains all master images that compose the figure. These are extracted from the unassigned attribute in the cluster module.
  • unassigned : An Object JSON. Contains all objects the object detection package found.

Object JSON


Contains all objects detected by the object detector. Objects include Master Image, Dependent Image, Inset Image, Subfigure Label, Scale Bar Label, Scale Bar Line, and Scale Bar (only included after the cluster module pairs Scale Bar Lines with Scale Bar Labels). Each of these objects is represented as a JSON that includes its classification (if an image), text (if a label), and location (as the geometry attribute).


Image JSON


Describes unassigned Master, Dependent, and Inset Images output from the object detector.


  • geometry : A Bounding Box JSON. The location of the image in the Figure.
  • classification : A string. The type of image. One of: Parent, Microscopy, Diffraction, Graph, Illustration, Photo, or Unclear.



EXSCLAIM JSON


Main result of running the EXSCLAIM Pipeline. A dictionary mapping figure names to their respective Figure JSONs.


  • <figure_name> : A key for each figure. Its value is the respective Figure JSON.
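
Because the result is a dictionary of Figure JSONs, it can be walked with ordinary dictionary and list operations. The following Python sketch tallies Master Images by classification. The figure names, the values, and the helper count_classifications are assumptions made for illustration; most metadata attributes are omitted and the contents of unassigned are left empty.

```python
from collections import Counter

# Placeholder Bounding Box JSON reused below (the corner order is an assumption).
BOX = [{"x": 0, "y": 0}, {"x": 100, "y": 0}, {"x": 100, "y": 50}, {"x": 0, "y": 50}]

# Hypothetical, heavily trimmed result of a pipeline run.
exsclaim_json = {
    "example-article_fig1": {
        "name": "example-article_fig1",
        "open": True,
        "master images": [
            {"geometry": BOX, "classification": "Microscopy"},
            {"geometry": BOX, "classification": "Graph"},
        ],
        "unassigned": {},  # contents omitted in this sketch
    },
    "example-article_fig2": {
        "name": "example-article_fig2",
        "open": True,
        "master images": [
            {"geometry": BOX, "classification": "Microscopy"},
        ],
        "unassigned": {},  # contents omitted in this sketch
    },
}


def count_classifications(exsclaim):
    """Count Master Images per classification across all figures."""
    counts = Counter()
    for figure in exsclaim.values():
        for master in figure.get("master images", []):
            counts[master["classification"]] += 1
    return counts
```

For the sample data above, the tally is two Microscopy images and one Graph image.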

Search Query JSON


Inner portion of Query JSON. Defines specific search terms.


  • search_field_<n> : A dict with two key-value pairs:
    • term, a string to be searched primarily
    • synonyms, a list of strings to be alternatively searched.
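
A short Python sketch of a Search Query JSON with two search fields. The terms and synonyms are invented examples, and terms_per_field is a hypothetical helper. How the pipeline combines fields and synonyms into an actual search is not stated here, so the helper only lists the strings each field contributes.

```python
# Hypothetical Search Query JSON; terms and synonyms are made up.
search_query = {
    "search_field_1": {
        "term": "gold",
        "synonyms": ["Au"],
    },
    "search_field_2": {
        "term": "nanoparticle",
        "synonyms": ["nanoparticles", "nanocrystal"],
    },
}


def terms_per_field(query):
    """Map each search field to its primary term followed by its synonyms."""
    return {
        name: [field["term"], *field.get("synonyms", [])]
        for name, field in query.items()
    }
```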

Query JSON


User input to define parameters of desired article search.


  • name : A string. User defined name of search.
  • journal_family : A string. Journal family to search from. Current options: nature, rsc.
  • maximum_scraped : An integer. Maximum number of articles to scrape.
  • sortby : A string. How to sort or rank which articles to scrape first. Options are relevant and recent.
  • query : A Search Query JSON.
  • results_dir : A string or pathlib.Path. Full path to a directory to put results in. Results directory will be determined as follows:
    1. If results_dir is supplied, results will go to <results_dir>/<name>.
    2. If not supplied, but supplied previously on the current installation, results will go to <most recent results_dir>/<name>.
    3. Otherwise, results will go to <current working directory>/extracted/<name>.
  • save_format : A list of strings. A list of desired save formats. Options include:
    • mongo for saving to a mongo database.
    • csv for saving to a csv.
    • postgres for saving to a postgres database (also saves to csv).
    • visualize for saving a visualization of the resulting subfigures.
    • boxes for saving figures with bounding boxes drawn.
    • save_subfigures for saving all subfigures as images.
  • mongo_connection : (optional) A string. URL to mongo database, if one is being used.
  • open : A boolean. True if only open access results are desired.
  • logging : A list of at most two strings. If print is in the list, logging information will print to stdout (the terminal usually). The other string is, optionally, a file to save logging information to.
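
A minimal sketch of a Query JSON in Python, together with the three-step rule for choosing the results directory. All values are invented examples. The function resolve_results_dir and its previous_results_dir and cwd parameters are assumptions of this sketch: how an installation remembers an earlier results_dir is not described above, so the remembered value is simply passed in.

```python
from pathlib import Path

# Hypothetical Query JSON; every value is an example, not a default.
query = {
    "name": "example-search",
    "journal_family": "nature",
    "maximum_scraped": 10,
    "sortby": "relevant",
    "query": {
        "search_field_1": {"term": "gold", "synonyms": ["Au"]},
    },
    "results_dir": "/tmp/exsclaim-results",
    "save_format": ["csv", "boxes"],
    "open": True,
    "logging": ["print", "example-search.log"],
}


def resolve_results_dir(query, previous_results_dir=None, cwd=None):
    """Pick the output directory following the three rules listed above."""
    name = query["name"]
    # 1. An explicit results_dir wins.
    if query.get("results_dir"):
        return Path(query["results_dir"]) / name
    # 2. Otherwise reuse the most recently supplied results_dir, if any.
    if previous_results_dir is not None:
        return Path(previous_results_dir) / name
    # 3. Otherwise fall back to <current working directory>/extracted/<name>.
    base = Path(cwd) if cwd is not None else Path.cwd()
    return base / "extracted" / name
```

Passing the remembered directory as an argument keeps the function free of hidden state, which makes each of the three cases easy to check in isolation.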