models
usage: acheron build model [-h] [--dry_run] [-c CORES] [-o OUT] [--cluster CLUSTER] -x TRAIN [-y TEST] [-v VALIDATION] [-f NUM_FEATURES] -l LABEL [--columns COLUMNS] [-m MODEL] [-p] -a ATTRIBUTE -t TYPE [--trial TRIAL] [--cv CV] [--manual MANUAL] [-k FOLDS]
Note that parameters shown in [square brackets] are optional; the others are mandatory.
These are examples of building a model on publicly available data and then using it to predict on the Canadian GRDI dataset.
acheron build model -x salm_amr -y grdi -l AMR_MIC -f 1000 -m XGB -c 8 -a AMP -t 11mer --cluster slurm
-c CORES
How many cores you want the model to run on. Unlike feature building, increasing the core count does NOT increase RAM usage. This defaults to one fewer thread than your system has.
--cluster CLUSTER
As mentioned, there is support for Slurm, so you don't need to declare CPU and RAM usage; this is automated based on the estimated requirements of the job you have run. This has only been tested on PHAC's cluster and may not work on other clusters. If that is the case, just wrap the acheron call in your own job submission script.
-x TRAIN
Which dataset to train your model on. If only train is passed (and not test or validation), the training set will be split according to the cross-validation folds (default is 5-fold CV). This means that if you just pass -x (train), the dataset is split into 5 folds: 4 for training and 1 for testing. This is repeated 5 times so that each fold gets a chance at being the testing fold. If you have hyperparameter optimizations turned on, train will be split into 5 folds: 3 for training, 1 for hyperparameter validation, and 1 for testing (unless validation or testing sets have been defined).
-y TEST
This is optional. If test is not declared, a testing set will be split from the training set. If declared, the model will be trained on -x and then tested using -y; there will be no cross-validation.
If you declare the same dataset for testing and training, acheron will ignore you and split the testing set off the training data.
-v VALIDATION
This is optional. If hyperparameter optimizations are turned on but no validation set is passed, acheron will split a validation set off from the training data before the model is trained.
Acheron will not allow you to run a model where the testing and validation sets are the same. If you pass the same dataset for training and validation, acheron will ignore you and split a validation set off from the training data prior to model training.
-x dataset1
will split dataset1 into 5 folds, training on 4 folds, testing using the fifth (repeated 5x).
-x dataset1 -y dataset2
will train the model on dataset1 and test it using dataset2.
-p -x dataset1
will split dataset1 into 5 folds. Multiple models with different hyperparameters will be built using 3 of the folds. Each of the models will be validated using 1 fold. Once the model with the best hyperparameters has been identified, it will be tested using the fifth and final fold.
-p -x dataset1 -y dataset2
will split dataset1 into 5 folds. Multiple models with different hyperparameters will be built using 4 of the folds. Each of the models will be validated using the fifth and final fold. Once the model with the best hyperparameters has been identified, it will be tested using dataset2.
-p -x dataset1 -y dataset2 -v dataset3
will train multiple models on dataset1. These models will be validated using dataset3 (-v) to see which settings are best, then the accuracy of the model with the best settings will be determined using dataset2 (-y).
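To make the fold roles concrete, here is a hypothetical sketch (not acheron's actual code) of how the five folds rotate through their roles when -p is passed with only -x; which fold serves as the validation fold in any given repetition is an assumption for illustration.

```python
# Hypothetical illustration of fold roles with -p and only -x:
# per repetition, 3 folds train, 1 validates hyperparameters, 1 tests.
n_folds = 5
for i in range(n_folds):
    test_fold = i                  # each fold gets a turn as the testing fold
    val_fold = (i + 1) % n_folds   # assumed choice of hyperparameter validation fold
    train_folds = [f for f in range(n_folds) if f not in (test_fold, val_fold)]
    print(f"repetition {i}: train={train_folds}, validate={val_fold}, test={test_fold}")
```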
-f NUM_FEATURES
How many features to train the model on. If using k-mers, passing -f 1000 will train the model on the 1000 most important k-mers, ranked by ANOVA F-value. (Features that don't vary between classes are essentially useless: if a k-mer is seen 10 times in class1 and 10 times in class2, we don't care about it. Conversely, if a k-mer is seen a lot in class1 but never in class2, that's very important in predicting classes.)
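As an illustration only, with randomly generated placeholder data: the selection is conceptually similar to scikit-learn's SelectKBest with the ANOVA F-value (acheron performs this internally, and its exact implementation may differ).

```python
# Conceptual sketch of -f 1000: keep the 1000 k-mers whose counts best
# separate the classes, ranked by ANOVA F-value. Placeholder data only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.random((200, 50_000))           # samples x k-mer counts (placeholder)
y = rng.integers(0, 2, size=200)        # placeholder class labels

selector = SelectKBest(score_func=f_classif, k=1000)
X_top = selector.fit_transform(X, y)    # shape (200, 1000): the most discriminative k-mers
```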
-l LABEL
Which labels to use. This needs to be an acheron-built label; see the label wiki page for how to make these.
-m MODEL
Which model to use; you need to use the 3- or 4-letter code defined below (a scikit-learn sketch of the mapping follows the list):
XGB: XGBoost
ANN: Artificial Neural Network (TensorFlow)
SVM: Support Vector Classifier
SGDC: Stochastic Gradient Descent Classifier
PERC: Perceptron
PAC: Passive Aggressive Classifier
NNC: Nearest Neighbours Classifier
NNR: Nearest Neighbours Regressor
GNB: Gaussian Naive Bayes
ADA: AdaBoost
GBDT: Gradient Boosted Decision Trees (scikit)
MLPC: Multi-layer Perceptron Classifier (scikit)
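For reference, a hypothetical sketch of how those codes line up with scikit-learn classes (the real mapping lives inside acheron; XGB and ANN are backed by XGBoost and TensorFlow rather than scikit-learn, so they are omitted here):

```python
# Hypothetical code-to-class mapping, shown with scikit-learn equivalents.
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, Perceptron, PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

MODELS = {
    "SVM": SVC,                          # Support Vector Classifier
    "SGDC": SGDClassifier,               # Stochastic Gradient Descent Classifier
    "PERC": Perceptron,                  # Perceptron
    "PAC": PassiveAggressiveClassifier,  # Passive Aggressive Classifier
    "NNC": KNeighborsClassifier,         # Nearest Neighbours Classifier
    "NNR": KNeighborsRegressor,          # Nearest Neighbours Regressor
    "GNB": GaussianNB,                   # Gaussian Naive Bayes
    "ADA": AdaBoostClassifier,           # AdaBoost
    "GBDT": GradientBoostingClassifier,  # Gradient Boosted Decision Trees
    "MLPC": MLPClassifier,               # Multi-layer Perceptron Classifier
}
```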
-p
No argument follows; just passing -p turns on hyperparameter optimizations according to the -x, -y, -v rules declared above. Currently only XGBoost and TensorFlow have hyperparameter optimizations defined.
Any other model can be added by declaring two things: an objective and a search space.
See acheron.workflows.hyp for an example. You need to declare which settings you want to search through (e.g. the number of layers in a neural net and the number of neurons per layer), and an objective to minimize. This can simply be set to -accuracy, so that minimizing the objective maximizes the accuracy. Please see workflows/hyp.py; it will make more sense. You can also open an issue telling me which model you want and I'll add it.
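As a minimal sketch of the idea, assuming a hyperopt-style interface (whether acheron uses hyperopt, and every name below, are assumptions for illustration; workflows/hyp.py is the authoritative example):

```python
# Sketch only: a search space plus an objective that minimizes -accuracy.
# The hyperopt interface and the train_and_score helper are assumptions.
from hyperopt import fmin, tpe, hp

# Search space: which settings to try (hypothetical neural-net example).
space = {
    "layers": hp.choice("layers", [1, 2, 3]),
    "neurons": hp.choice("neurons", [16, 64, 256]),
}

def train_and_score(params):
    # Stand-in for training a model with these hyperparameters and
    # returning its validation accuracy.
    return 0.5 + 0.05 * params["layers"] + 0.1 * params["neurons"] / 256

def objective(params):
    return -train_and_score(params)  # minimizing -accuracy maximizes accuracy

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print(best)  # indices of the best choices within the search space
```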
-a ATTRIBUTE
What the model is trained to predict. It needs to be a column as defined when creating the labels. For example, passing -a AMP will train the model to predict ampicillin resistance.
-t TYPE
What type of features to use: -t 11mer will train the model using 11-mers, -t 31mer will train using 31-mers, and -t genes will use genes. Remember, these need to be created using acheron's feature builder.
--trial TRIAL
How many trials you want. If you want the average over 10 runs, pass --trial 10. Acheron will always build UP TO the defined number of trials, NOT IN ADDITION. So if you run --trial 5 and build an average over 5, then running --trial 10 will only run the remaining 5, not 10 new ones. If you first ran 5 trials and then wanted to run 10 more, you would need to pass --trial 15.
--cv CV
How many folds you want in cross-validation; defaults to 5. This means the data is split into 5 even-ish folds of 20% of the data each. The splits are randomly stratified, so each trial will have different data in each fold. Splitting into CV folds is always the first thing to happen; once testing sets have been set aside, things like feature selection and training commence.
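For illustration with placeholder data (acheron handles this internally), the default split behaves like scikit-learn's StratifiedKFold, which keeps the class proportions similar in every fold:

```python
# Sketch of the default 5-fold stratified split (--cv changes n_splits).
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 80 + [1] * 20)   # imbalanced placeholder labels
X = np.zeros((100, 1))              # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # each testing fold holds ~20% of the data at ~the same 80/20 class ratio
    print(len(test_idx), np.bincount(y[test_idx]))
```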