This repository contains an extension to the machine learning toolkit Weka. In the context of my Study Works, a 5-class problem between Alzheimer's disease, Parkinson's disease, breast cancer, multiple sclerosis and healthy controls was tackled. The necessary data comes from functional protein microarrays printed in duplicate. In the accompanying study we used both spots on the array for classification to increase our available data; usually only one spot would be chosen. This project contains the means to parse a .gpr file into .arff files for Weka. Furthermore, a variant of leave-one-out cross-validation was implemented. This is necessary because no sample of a patient should appear in both the training and the test set, and Weka does not come with the possibility to split a corpus by a specific attribute (the patient id in our case).
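A minimal sketch of how such a patient-grouped leave-one-out split can be realized with the Weka API is shown below; the attribute name `patientId` is a hypothetical placeholder, not necessarily the name used in this repository:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;

public class PatientLoocv {
    /** One train/test pair per patient, so both spots of a patient stay on the same side. */
    public static List<Instances[]> split(Instances data, String idAttributeName) {
        Attribute id = data.attribute(idAttributeName);
        // Collect the distinct patient ids in order of appearance.
        Set<String> patients = new LinkedHashSet<>();
        for (int i = 0; i < data.numInstances(); i++) {
            patients.add(data.instance(i).stringValue(id));
        }
        List<Instances[]> folds = new ArrayList<>();
        for (String patient : patients) {
            Instances train = new Instances(data, data.numInstances());
            Instances test = new Instances(data, data.numInstances());
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                // All instances of the held-out patient go to the test set.
                (inst.stringValue(id).equals(patient) ? test : train).add(inst);
            }
            folds.add(new Instances[] { train, test });
        }
        return folds;
    }
}
```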
In addition to the previous features, multithreading has been implemented at different stages. In the study a lot of classifiers were evaluated; some of them terminated quickly, others took longer. To adapt to this, the user can specify the number of threads classifying, reading from file, and writing to file or database (for how to do this, please see further below).
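Conceptually, the stages form a reader -> classifier -> writer pipeline. The following is an illustrative sketch of that idea using standard Java concurrency utilities; the class and queue names are made up and do not appear in this repository:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class PipelineSketch {
    public static void main(String[] args) {
        BlockingQueue<String> datasets = new LinkedBlockingQueue<>();
        BlockingQueue<String> results = new LinkedBlockingQueue<>();

        ExecutorService readers = Executors.newFixedThreadPool(1);     // "reader" resource
        ExecutorService classifiers = Executors.newFixedThreadPool(4); // "classifier" resource
        ExecutorService writers = Executors.newFixedThreadPool(1);     // "writer" resource

        // Readers parse .arff files and hand them to the classifier threads.
        readers.submit(() -> { datasets.put("fold1.arff"); return null; });
        // Classifier threads evaluate a classifier on each dataset.
        classifiers.submit(() -> { results.put("evaluated " + datasets.take()); return null; });
        // Writer threads append results to the result file or database.
        writers.submit(() -> { System.out.println(results.take()); return null; });

        readers.shutdown();
        classifiers.shutdown();
        writers.shutdown();
    }
}
```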
For this contribution the following additional libraries are required:
- weka
- ConfigFileCompiler (ANTLR4 is needed for the ConfigFileCompiler)
- Clone this repository
- Get the weka jar
- Clone and build ConfigFileCompiler
- Place the jars in the folder StudyWorks/lib/ (you need to create it yourself)
- Build the jar by invoking `ant package`
- Execute the jar by calling `ant execute -Dconfig=/path/to/config/file`, where `/path/to/config/file` is a configuration file conforming to StudyWorksCompiler (see the example below)
The configuration file is used to set up your experiment (again, the facilities available in Weka do not match our use case). In the configuration file you can specify the number of classifier, file reader and result writer threads. In addition, you can specify the file results should be appended to and whether to persist results in a database or solely in a file (writing to a database takes very long, so just go for the text file :)). You must also specify the folder where the .arff files reside.
In addition, the config file gives you the opportunity to easily describe parameter tuning for a classifier. Values for a parameter can either be listed explicitly, like `1,5,8,9,10,16`, or given as a sequence, like `1,3,..,11`. The latter expression will be resolved to the numbers `1,3,5,7,9,11`.
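The following is a minimal sketch of how such a sequence expression can be expanded; the actual expansion is performed by the ConfigFileCompiler, and inferring the step size from the first two values is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

public class SequenceSketch {
    /** Expands "1,3,..,11" to [1, 3, 5, 7, 9, 11]; explicit lists like "1,5,8" pass through. */
    static List<Integer> expand(String expr) {
        String[] parts = expr.split(",");
        List<Integer> values = new ArrayList<>();
        if (parts.length == 4 && parts[2].trim().equals("..")) {
            int first = Integer.parseInt(parts[0].trim());
            int second = Integer.parseInt(parts[1].trim());
            int last = Integer.parseInt(parts[3].trim());
            int step = second - first; // step inferred from the first two values (assumption)
            for (int v = first; v <= last; v += step) {
                values.add(v);
            }
        } else {
            for (String p : parts) {
                values.add(Integer.parseInt(p.trim()));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(expand("1,3,..,11")); // prints [1, 3, 5, 7, 9, 11]
    }
}
```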
The values specified in `<resources>...</resources>` will be applied to all subsequently defined classifiers.
Available resources are (required ones are bold):
- reader default=1
- writer default=1
- classifier default=1
- bag default=1
- infogain default=-1 (no information gain filtering)
- numattributes default=-1 (all attributes); if a number n is specified, the top n features are selected for classification
- **arffFolder** the folder where the .arff files reside
- resultfile default=C:\weka_experiment_results.csv on Windows and ~/weka_experiment_results.csv on Unix
- sqlOut the folder to store derived SQL statements
```xml
<experiment>
    <resources>
        <resource name="arffFolder" value="/g/Documents/DHBW/5Semester/Study_Works/antibodies/DataAnalysis/Arff/loocv/" />
        <resource name="resultfile" value="/g/Documents/GitHub/StudyWorks/results/results.csv" />
        <resource name="sqlOut" value="/g/Documents/GitHub/results/" />
    </resources>
    <classifier name="REPTree" />
</experiment>
```
This config file will call Weka for all the specified classifiers with information gain thresholds 0.1, 0.2, ..., 1, with four classification threads, one writer thread and, since not specified, one reader thread.
```xml
<experiment>
    <resources>
        <resource name="arffFolder" value="/g/Documents/DHBW/5Semester/Study_Works/antibodies/DataAnalysis/Arff/loocv/" />
        <resource name="resultfile" value="/g/Documents/GitHub/StudyWorks/results/results.csv" />
        <resource name="sqlOut" value="/g/Documents/GitHub/results/" />
    </resources>
    <classifier name="REPTree" />
    <classifier name="Ridor" />
    <classifier name="KStar" />
    <classifier name="PART" />
    <classifier name="IBk" />
    <classifier name="IB1" />
    <classifier name="SMO" />
    <classifier name="NaiveBayes" />
    <classifier name="BayesNet" />
    <classifier name="DMNBtext" />
    <classifier name="RBFNetwork" />
    <classifier name="DecisionTable" />
</experiment>
```
This example runs a grid search on the parameters `I = {1,5,10,..,100}` and `P = {10,20,30,...,100}`. Classification will be executed for each combination, that is `runs = {{1,10},{1,20},..,{1,100},{5,10},..,{100,100}}`.
```xml
<experiment>
    <resources>
        <resource name="reader" value="1" />
        <resource name="writer" value="1" />
        <resource name="bag" value="1" />
        <resource name="infogain" value="-1" />
        <resource name="numattributes" value="-1" />
    </resources>
    <classifier name="AdaBoostM1">
        <attribute type="class" name="W">
            <classifier name="J48">
                <attribute name="U" />
                <attribute name="M" value="2" />
            </classifier>
        </attribute>
        <attribute name="P" value="10,20..100" />
        <attribute name="S" value="1" />
        <attribute name="I" value="1,5..100" />
    </classifier>
</experiment>
```
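For reference, one point of this grid corresponds roughly to the following Weka API calls. This is a sketch assuming Weka's standard option names (W base classifier, P weight threshold, S seed, I iterations), not code from this repository; the step sizes in the loops are inferred from the sequence expressions above:

```java
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;

public class GridPointSketch {
    public static void main(String[] args) {
        for (int i = 1; i <= 100; i += 4) {       // I = 1,5,..  (step inferred)
            for (int p = 10; p <= 100; p += 10) { // P = 10,20,..,100
                J48 base = new J48();             // -W J48
                base.setUnpruned(true);           // -U
                base.setMinNumObj(2);             // -M 2
                AdaBoostM1 boost = new AdaBoostM1();
                boost.setClassifier(base);
                boost.setWeightThreshold(p);      // -P
                boost.setSeed(1);                 // -S 1
                boost.setNumIterations(i);        // -I
                // build and evaluate on each patient-grouped fold here
            }
        }
    }
}
```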