An implementation of BERT for cybersecurity named entity recognition
***** New November 5th, 2020: Named Entity Recognition Models *****
We designed several joint BERT models for cybersecurity named entity recognition. The pre-trained BERT model is described in google-research. These models are suitable for mixed Chinese and English data; for other languages, the DataProcessor code needs to be modified.
Named entity recognition is the task of identifying entities with specific meanings in text, including names of people, locations, organizations, and other proper nouns.
Metric | accuracy | precision | recall | f1-score |
BERT finetuning | 97.78 | 88.73 | 92.22 | 90.44 |
BERT-CRF | 97.53 (-0.25) | 91.46 (+2.73) | 87.67 (-4.55) | 89.53 (-0.91) |
BERT-LSTM-CRF | 98.13 (+0.35) | 93.00 (+4.27) | 93.09 (+0.87) | 93.05 (+2.61) |
BERT-Bi-LSTM-CRF | 98.23 (+0.45) | 94.77 (+6.04) | 92.97 (+0.75) | 93.11 (+2.67) |
BERT-ID-CNN-CRF | 98.18 (+0.40) | 93.37 (+4.64) | 93.07 (+0.85) | 93.13 (+2.69) |
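For readers unfamiliar with how such scores are usually computed for NER, here is a minimal sketch. It assumes entity-level exact matching over BIO tags for precision, recall, and F1, and token-level matching for accuracy. The function names are hypothetical, and the evaluation script used by this project may differ in its details.

```python
from typing import List, Set, Tuple

Entity = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect (type, start, end) spans from a BIO tag sequence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only if type and boundaries match."""
    gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold_tags = ["O", "B-SW", "I-SW", "I-SW", "O", "B-SW", "I-SW"]
pred_tags = ["O", "B-SW", "I-SW", "I-SW", "O", "B-SW", "O"]
scores = precision_recall_f1(gold_tags, pred_tags)
accuracy = token_accuracy(gold_tags, pred_tags)
```

In this toy example one of the two predicted spans has the wrong boundary, so only one span counts as correct even though six of the seven tokens are tagged correctly. This is why token accuracy in the table is much higher than the f1-score.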
***** New November 6th, 2020: The Usage of Named Entity Recognition Models *****
First, download the BERT model from google-research. In this project, we use the BERT-Base Chinese model as the pre-trained model. Then clone this project and specify data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file in bert_lstm_ner.py. Replace the BERT path and project path in bert_lstm_ner.py:
if os.name == 'nt':  # Windows path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
else:  # Linux path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
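As an illustration of how the five required settings could be derived from these two paths, here is a minimal sketch. The file names bert_config.json, vocab.txt, and bert_model.ckpt are those shipped in the BERT-Base Chinese download. The data and output subdirectory names and the helper functions build_paths and missing_files are assumptions made for this example, not part of bert_lstm_ner.py.

```python
import os
from typing import Dict, List


def build_paths(bert_path: str, root_path: str) -> Dict[str, str]:
    """Derive the five required settings from the two base paths."""
    return {
        "bert_config_file": os.path.join(bert_path, "bert_config.json"),
        "vocab_file": os.path.join(bert_path, "vocab.txt"),
        # A TensorFlow checkpoint is referenced by prefix, not by a single file.
        "init_checkpoint": os.path.join(bert_path, "bert_model.ckpt"),
        "data_dir": os.path.join(root_path, "data"),      # assumed directory name
        "output_dir": os.path.join(root_path, "output"),  # assumed directory name
    }


def missing_files(paths: Dict[str, str]) -> List[str]:
    """Return the settings whose target file does not exist on disk."""
    return [
        key
        for key in ("bert_config_file", "vocab_file")
        if not os.path.isfile(paths[key])
    ]


# Placeholder values; replace them with real paths before running a check.
paths = build_paths("{your BERT model path}", "{project path}")
problems = missing_files(paths)
```

Checking for the files before training starts gives a clearer error than a failure deep inside model initialization.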
If you replace the train/dev/test datasets, each file should follow this structure, with one token and its label per line, separated by a space:
这 O
次 O
的 O
问 O
题 O
是 O
出 O
在 O
我 O
们 O
的 O
M B-SW
a I-SW
r I-SW
k I-SW
d I-SW
o I-SW
w I-SW
n I-SW
渲 O
染 O
中 O
。 O
There must be a blank line between consecutive sentences, and a single sentence must not be longer than max_seq_length, which is defined in bert_lstm_ner.py. Note: the max_seq_length parameter has a great influence on the experimental results; 128 or 256 is a good choice. If your GPU cannot meet these requirements and training fails with an error, lower max_seq_length and train_batch_size to the largest values the hardware can handle.
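To check a dataset against these rules before training, something like the following sketch may help. It assumes whitespace-separated token and label columns and reserves two positions for the [CLS] and [SEP] tokens that BERT adds. The function names are hypothetical, and the DataProcessor in this project may handle long sentences differently.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, label) pairs


def read_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token label' lines into sentences separated by blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        parts = raw.split()
        if not parts:  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(parts) < 2:
            raise ValueError(f"expected 'token label', got: {raw!r}")
        current.append((parts[0], parts[-1]))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


def too_long(sentences: List[Sentence], max_seq_length: int = 128) -> List[int]:
    """Return indices of sentences that exceed the usable length."""
    limit = max_seq_length - 2  # assumption: room is kept for [CLS] and [SEP]
    return [i for i, sent in enumerate(sentences) if len(sent) > limit]


sample = ["的 O", "M B-SW", "a I-SW", "。 O", "", "这 O", "次 O"]
parsed = read_sentences(sample)
overlong = too_long(parsed, max_seq_length=5)
```

For a real file, pass an open file handle, for example `read_sentences(open(path, encoding="utf-8"))`, and split or shorten any sentence whose index is reported.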
These models can be trained in a few hours on a GPU, or in about a day on a CPU, starting from the exact same pre-trained model. During training, TensorBoard is useful for monitoring progress. Open a terminal, change to the project directory, and run the following command:
tensorboard --logdir=./output
Then open localhost:6006 in a browser to see charts that monitor training in real time.
If you only want to test on your dataset without training again, set the do_train parameter to False.