
BERT-for-Cybersecurity-NER

An implementation of BERT for cybersecurity named entity recognition

***** New November 5th, 2020: Named Entity Recognition Models *****

Introduction

We designed several joint BERT models for cybersecurity named entity recognition. The BERT pre-training model is described in the google-research/bert repository. These models are suited to mixed Chinese and English data; to use other languages, you need to modify the DataProcessor code.
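The processors follow the DataProcessor interface from the google-research/bert code base. As a rough illustration of what a processor for another language might look like, here is a minimal sketch; the class name, file name, and label set below are placeholder assumptions, not this repository's actual code:

import codecs
import os

class MyLanguageNerProcessor:
    """Illustrative processor for a new language (hypothetical)."""

    def get_labels(self):
        # Replace with the BIO tag set used in your corpus.
        return ["O", "B-SW", "I-SW", "[CLS]", "[SEP]"]

    def get_train_examples(self, data_dir):
        # "train.txt" is an assumed file name; adjust to your dataset.
        return self._read_data(os.path.join(data_dir, "train.txt"))

    def _read_data(self, input_file):
        # One token and its tag per line; a blank line ends a sentence.
        sentences, tokens, tags = [], [], []
        with codecs.open(input_file, encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 2:
                    tokens.append(parts[0])
                    tags.append(parts[1])
                elif tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
        if tokens:
            sentences.append((tokens, tags))
        return sentences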

Named Entity Recognition (NER) task

A task to identify entities with specific meanings in text, including names of people, locations, organizations, and other domain-specific nouns.

Results

| Model | Accuracy | Precision | Recall | F1-score |
| :--- | :--- | :--- | :--- | :--- |
| BERT finetuning | 97.78 | 88.73 | 92.22 | 90.44 |
| BERT-CRF | 97.53 (-0.25) | 91.46 (+2.73) | 87.67 (-4.55) | 89.53 (-0.91) |
| BERT-LSTM-CRF | 98.13 (+0.35) | 93.00 (+4.27) | 93.09 (+0.87) | 93.05 (+2.61) |
| BERT-Bi-LSTM-CRF | 98.23 (+0.45) | 94.77 (+6.04) | 92.97 (+0.75) | 93.11 (+2.67) |
| BERT-ID-CNN-CRF | 98.18 (+0.4) | 93.37 (+4.64) | 93.07 (+0.85) | 93.13 (+2.69) |

All values are percentages; the numbers in parentheses are differences from the BERT finetuning baseline.

***** New November 6th, 2020: The Usage of Named Entity Recognition Models *****

Usage

First, download the pre-trained BERT model from google-research. In this project, we choose the BERT-Base Chinese model as the pre-training model. Clone this project; data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file must be specified in bert_lstm_ner.py. Replace the BERT path and the project path in bert_lstm_ner.py:

import os

if os.name == 'nt':  # Windows path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
else:  # Linux path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
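The other required paths are usually derived from these two. A minimal sketch, assuming the standard file names shipped with the BERT-Base Chinese release (bert_config.json, bert_model.ckpt, vocab.txt) and a data/ and output/ directory inside the project; the project layout part is an assumption, so check the defaults in bert_lstm_ner.py:

# File names from the standard BERT-Base Chinese release:
bert_config_file = os.path.join(bert_path, 'bert_config.json')
init_checkpoint = os.path.join(bert_path, 'bert_model.ckpt')
vocab_file = os.path.join(bert_path, 'vocab.txt')

# Assumed project layout; adjust to your clone:
data_dir = os.path.join(root_path, 'data')
output_dir = os.path.join(root_path, 'output')  # matches --logdir below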

If you train on your own train/dev/test datasets, each file must follow this structure (one character per line, followed by its BIO tag):

这 O
次 O
的 O
问 O
题 O
是 O
出 O
在 O
我 O
们 O
的 O
M B-SW
a I-SW
r I-SW
k I-SW
d I-SW
o I-SW
w I-SW
n I-SW
渲 O
染 O
中 O
。 O

There must be a blank line between two sentences, and a single sentence may not exceed max_seq_length tokens, which is defined in bert_lstm_ner.py. Notice: the max_seq_length parameter has a great influence on the results; 128 or 256 is a good choice. If your GPU cannot meet the memory requirements and training fails, reduce max_seq_length and train_batch_size to the largest values your hardware accepts.
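Before training on new data, it can help to confirm that the file is well-formed. A small sanity-check sketch (check_sentence_lengths is a hypothetical helper, not part of this repo; the -2 reserves room for the [CLS] and [SEP] tokens BERT adds to every sequence):

import codecs

def check_sentence_lengths(input_file, max_seq_length=128):
    # Count tokens per sentence; blank lines separate sentences.
    lengths, current = [], 0
    with codecs.open(input_file, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current += 1
            else:
                if current:
                    lengths.append(current)
                current = 0
    if current:
        lengths.append(current)
    too_long = [n for n in lengths if n > max_seq_length - 2]
    if lengths:
        print("%d sentences, longest = %d, %d over the limit"
              % (len(lengths), max(lengths), len(too_long)))

check_sentence_lengths('data/train.txt', max_seq_length=128)  # illustrative path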

These models can be trained in a few hours on a GPU, but may take about a day on a CPU, starting from the same pre-trained model. During training, TensorBoard is useful for monitoring the training process. Open a terminal, change to the project directory, and run:

tensorboard --logdir=./output

Then open localhost:6006 in a browser to see real-time charts of the training metrics.

If you only want to evaluate your dataset with an already trained model, set the do_train parameter to False.
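If bert_lstm_ner.py exposes its parameters as command-line flags in the style of the original google-research/bert run scripts (an assumption; you may instead need to edit the defaults inside the file), that would look like:

python bert_lstm_ner.py --do_train=False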