Comparing an SVM-on-bag-of-words approach to BERT for text classification
This is the source code to go along with the blog article.
Figure 6. BERT leads the pack in all cases, even if not by much in some.
BERT yields the best F1 scores on three document repositories representing binary, multi-class, and multi-label classification. However, bag-of-words with tf-idf weighted one-hot word vectors and an SVM classifier is not a bad alternative to going full bore with BERT, as it is cheap.
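For orientation, here is a minimal sketch of that baseline using scikit-learn's TfidfVectorizer and LinearSVC; the repo's actual pipeline lives in bow_classify.py, and the documents and labels below are placeholders.

# Sketch of the BoW + SVM baseline; the real pipeline is in bow_classify.py.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["an effortless, moving film", "a tedious, clumsy film"]  # placeholder docs
train_labels = [1, 0]                                                  # placeholder labels

vectorizer = TfidfVectorizer()                  # tf-idf weighted one-hot word vectors
X_train = vectorizer.fit_transform(train_docs)

clf = LinearSVC()                               # linear SVM classifier
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["a moving film"])))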
transformers
tensorflow
numpy
Download: BERT-Base, Uncased. Edit the script "runBert.sh" so that it points at the download location:
BERT_BASE_DIR="$PRE_TRAINED_HOME/bert/uncased_L-12_H-768_A-12"
Download: crawl-300d-2M-subword.vec. Edit the script "bow_classify.py" so that it points at the download location:
f = open(os.environ["PRE_TRAINED_HOME"] + '/fasttext/crawl-300d-2M-subword.vec')
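The .vec file is plain text: a header line with the vocabulary size and vector dimension, then one word followed by its 300 floats per line. A sketch of loading it into a dict (bow_classify.py does the equivalent; details may differ):

import os
import numpy as np

path = os.environ["PRE_TRAINED_HOME"] + '/fasttext/crawl-300d-2M-subword.vec'
embeddings = {}
with open(path, encoding='utf-8') as f:
    n_words, dim = map(int, f.readline().split())   # header: vocab size, dimension
    for line in f:
        tokens = line.rstrip().split(' ')
        embeddings[tokens[0]] = np.asarray(tokens[1:], dtype=np.float32)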
cd data
tar zxvf aclImdb.tar.gz
cd ..
pipenv run python ./getdata.py
Figure 2. The vitals of the document repositories. All documents are used for classification, but the longer ones are truncated to their first X words. Logistic regression and SVM can handle all the words, but we need to make sure to use identically processed docs for head-to-head comparisons between BERT and its non-BERT counterparts.
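The truncation itself is a one-liner; in this sketch max_words stands in for the per-repository cap X (the actual preprocessing is in the repo's scripts):

def truncate(doc, max_words):
    # Keep only the first max_words whitespace-separated tokens so that BERT
    # and the non-BERT models see identically processed documents.
    return ' '.join(doc.split()[:max_words])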
Figure 3. Class distribution. The Reuters data set is skewed, with as few as 2 documents for some classes and 4,000 for another. The other two data sets are quite balanced.
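Counting labels is enough to see the skew for yourself; the labels below are toy stand-ins for the loaded Reuters topics:

from collections import Counter

labels = ["earn", "earn", "acq", "grain"]   # toy stand-ins for the Reuters topics
print(Counter(labels).most_common())        # most frequent classes first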
mkdir results
./runBoW.sh
./runBert.sh
model.compile(optimizer=tf.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08, clipnorm=1.0), loss=tfLoss, metrics=allMetrics)
Figure 7. The learning rate has to be small enough for BERT fine-tuning to work. Some improvement in F1 can be obtained by playing with the learning rate a bit.
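As a sketch of such a sweep, one way to try the small learning rates suggested in the BERT paper is shown below; it uses Hugging Face transformers' TFBertForSequenceClassification and toy data as stand-ins for the repo's actual model construction, loss (tfLoss), and metrics (allMetrics).

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
texts = ["an effortless, moving film", "a tedious, clumsy film"]   # toy stand-ins
labels = tf.constant([1, 0])
enc = dict(tokenizer(texts, padding=True, truncation=True, return_tensors="tf"))

for lr in [5e-5, 3e-5, 2e-5]:               # small rates, per the original BERT paper
    model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                            num_labels=2)
    model.compile(optimizer=tf.optimizers.Adam(learning_rate=lr, epsilon=1e-08,
                                               clipnorm=1.0),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    model.fit(enc, labels, epochs=1)         # in practice, compare validation F1 per rate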