The four notebooks Preprocessing Dataset.ipynb, From Reports to Gaps.ipynb, Exporting Features.ipynb and Exporting Features 2.ipynb are meant to be executed in this order. They contain every preprocessing step taken in order to produce the data provided via the link below.
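The execution order above can also be scripted. The following is only a sketch: it assumes Jupyter and nbconvert are installed and that the notebooks sit in the repository root. The helper names `build_commands` and `run_notebooks` are illustrative and not part of this repository.

```python
import subprocess
from pathlib import Path

# Notebook names taken from this README, in the order they are meant to run.
NOTEBOOKS = [
    "Preprocessing Dataset.ipynb",
    "From Reports to Gaps.ipynb",
    "Exporting Features.ipynb",
    "Exporting Features 2.ipynb",
]


def build_commands(repo_dir="."):
    """Return one nbconvert command per notebook, in execution order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--inplace", str(Path(repo_dir) / name)]
        for name in NOTEBOOKS
    ]


def run_notebooks(repo_dir="."):
    """Execute the notebooks one after another, stopping at the first failure."""
    for command in build_commands(repo_dir):
        subprocess.run(command, check=True)
```

Calling `run_notebooks()` from the repository root would then execute all four notebooks in place; opening and running them manually in the same order is equivalent.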
Directories containing this data (i.e. public datasets and their preprocessed versions) can be found here. The files should simply be unzipped into the directory where all the other files from this repository are located.
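If you prefer to extract the downloaded archives programmatically, a minimal sketch could look like the one below. It assumes the archives are `.zip` files that have already been placed in the repository root; the function name `extract_archives` is illustrative and not part of this repository.

```python
import zipfile
from pathlib import Path


def extract_archives(repo_dir="."):
    """Extract every .zip file found in repo_dir into repo_dir itself."""
    repo = Path(repo_dir)
    extracted = []
    for archive in sorted(repo.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Extract next to the other repository files, as described above.
            zf.extractall(repo)
        extracted.append(archive.name)
    return extracted
```

The function returns the names of the archives it extracted, so an empty list means no `.zip` file was found in the given directory.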
Among the Python files, features.py, utils.py and Archives/archive.py are short utilities used both in the notebook files and in the other Python scripts. learner.py contains the code used to query labels and fit a classifier. labeler.py contains the code used to generate a web-based user interface to perform labeling operations. Both files can be run standalone, but their main goal is to be used within active-learning.py, which contains the actual active learning loop and handles the communication between Learner and Labeler instances.
active-learning.py is the main Python script, from which one can start the actual Active Learning process. Active Learning parameters, such as batch sizes, should be set in this script when instantiating the Learner. The Labeler instance does not currently require any parameters. Lastly, the path chosen when instantiating an Archive is the path where the produced labels will be archived.
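To illustrate how the three pieces interact, here is a self-contained sketch of a query/label/archive loop. It does not use the real classes: `SketchLearner`, `SketchLabeler` and `SketchArchive` are hypothetical stand-ins, their constructor arguments and method names are assumptions, and the stand-in labeler takes a plain function instead of serving a web interface. The stand-in learner also simply hands out unlabeled items in order rather than fitting a classifier.

```python
import json
from pathlib import Path


class SketchLearner:
    """Hypothetical stand-in for Learner: hands out unlabeled items in batches."""

    def __init__(self, pool, batch_size):
        self.pool = list(pool)
        self.batch_size = batch_size
        self.labels = {}

    def query(self):
        # No real query strategy here: just the next unlabeled items.
        unlabeled = [item for item in self.pool if item not in self.labels]
        return unlabeled[: self.batch_size]

    def update(self, new_labels):
        self.labels.update(new_labels)


class SketchLabeler:
    """Hypothetical stand-in for Labeler: a function replaces the web UI."""

    def __init__(self, oracle):
        self.oracle = oracle

    def label(self, batch):
        return {item: self.oracle(item) for item in batch}


class SketchArchive:
    """Hypothetical stand-in for Archive: writes labels to a JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def save(self, labels):
        self.path.write_text(json.dumps(labels, indent=2))


def active_learning_loop(learner, labeler, archive):
    """Query a batch, label it, archive the labels; repeat until the pool is done."""
    rounds = 0
    while True:
        batch = learner.query()
        if not batch:
            break
        learner.update(labeler.label(batch))
        archive.save(learner.labels)
        rounds += 1
    return rounds
```

In this sketch the batch size is the only tunable parameter and the archive path is passed to the archive constructor, mirroring the description above; the actual parameters are the ones defined in active-learning.py.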