
BERT-for-Cybersecurity-NER

An implementation of BERT for cybersecurity named entity recognition

***** New November 5th, 2020: Named Entity Recognition Models *****

Introduction

We designed several joint BERT models for cybersecurity named entity recognition. The BERT pre-training model is described in the google-research/bert repository. These models are suited to mixed Chinese and English data; to use other languages, you need to modify the DataProcessor code.
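The processors follow the DataProcessor interface from the google-research/bert code base. As a rough illustration of what a processor for another language might look like, here is a minimal sketch; the class name, file name, and label set below are placeholder assumptions, not this repository's actual code:

import codecs
import os

class MyLanguageNerProcessor:
    """Illustrative processor for a new language (hypothetical)."""

    def get_labels(self):
        # Replace with the BIO tag set used in your corpus.
        return ["O", "B-SW", "I-SW", "[CLS]", "[SEP]"]

    def get_train_examples(self, data_dir):
        # "train.txt" is an assumed file name; adjust to your dataset.
        return self._read_data(os.path.join(data_dir, "train.txt"))

    def _read_data(self, input_file):
        # One token and its tag per line; a blank line ends a sentence.
        sentences, tokens, tags = [], [], []
        with codecs.open(input_file, encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split()
                if len(parts) == 2:
                    tokens.append(parts[0])
                    tags.append(parts[1])
                elif tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
        if tokens:
            sentences.append((tokens, tags))
        return sentences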

Named Entity Recognition (NER) task

A task to identify entities with specific meanings in text, including names of people, locations, organizations, and other domain-specific nouns.

Results

| Model | Accuracy | Precision | Recall | F1-score |
| :--- | :--- | :--- | :--- | :--- |
| BERT finetuning | 97.78 | 88.73 | 92.22 | 90.44 |
| BERT-CRF | 97.53 (-0.25) | 91.46 (+2.73) | 87.67 (-4.55) | 89.53 (-0.91) |
| BERT-LSTM-CRF | 98.13 (+0.35) | 93.00 (+4.27) | 93.09 (+0.87) | 93.05 (+2.61) |
| BERT-Bi-LSTM-CRF | 98.23 (+0.45) | 94.77 (+6.04) | 92.97 (+0.75) | 93.11 (+2.67) |
| BERT-ID-CNN-CRF | 98.18 (+0.4) | 93.37 (+4.64) | 93.07 (+0.85) | 93.13 (+2.69) |

All values are percentages; the numbers in parentheses are differences from the BERT finetuning baseline.

***** New November 6th, 2020: The Usage of Named Entity Recognition Models *****

Usage

First, download the pre-trained BERT model from google-research. In this project, we choose the BERT-Base Chinese model as the pre-training model. Clone this project; data_dir, bert_config_file, output_dir, init_checkpoint, and vocab_file must be specified in bert_lstm_ner.py. Replace the BERT path and the project path in bert_lstm_ner.py:

import os

if os.name == 'nt':  # Windows path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
else:  # Linux path config
    bert_path = '{your BERT model path}'
    root_path = '{project path}'
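The other required paths are usually derived from these two. A minimal sketch, assuming the standard file names shipped with the BERT-Base Chinese release (bert_config.json, bert_model.ckpt, vocab.txt) and a data/ and output/ directory inside the project; the project layout part is an assumption, so check the defaults in bert_lstm_ner.py:

# File names from the standard BERT-Base Chinese release:
bert_config_file = os.path.join(bert_path, 'bert_config.json')
init_checkpoint = os.path.join(bert_path, 'bert_model.ckpt')
vocab_file = os.path.join(bert_path, 'vocab.txt')

# Assumed project layout; adjust to your clone:
data_dir = os.path.join(root_path, 'data')
output_dir = os.path.join(root_path, 'output')  # matches --logdir below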

If you train on your own train/dev/test datasets, each file must follow this structure (one character per line, followed by its BIO tag):

这 O
次 O
的 O
问 O
题 O
是 O
出 O
在 O
我 O
们 O
的 O
M B-SW
a I-SW
r I-SW
k I-SW
d I-SW
o I-SW
w I-SW
n I-SW
渲 O
染 O
中 O
。 O

There must be a blank line between two sentences, and a single sentence may not exceed max_seq_length tokens, which is defined in bert_lstm_ner.py. Notice: the max_seq_length parameter has a great influence on the results; 128 or 256 is a good choice. If your GPU cannot meet the memory requirements and training fails, reduce max_seq_length and train_batch_size to the largest values your hardware accepts.
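Before training on new data, it can help to confirm that the file is well-formed. A small sanity-check sketch (check_sentence_lengths is a hypothetical helper, not part of this repo; the -2 reserves room for the [CLS] and [SEP] tokens BERT adds to every sequence):

import codecs

def check_sentence_lengths(input_file, max_seq_length=128):
    # Count tokens per sentence; blank lines separate sentences.
    lengths, current = [], 0
    with codecs.open(input_file, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current += 1
            else:
                if current:
                    lengths.append(current)
                current = 0
    if current:
        lengths.append(current)
    too_long = [n for n in lengths if n > max_seq_length - 2]
    if lengths:
        print("%d sentences, longest = %d, %d over the limit"
              % (len(lengths), max(lengths), len(too_long)))

check_sentence_lengths('data/train.txt', max_seq_length=128)  # illustrative path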

These models can be trained in a few hours on a GPU, but may take about a day on a CPU, starting from the same pre-trained model. During training, TensorBoard is useful for monitoring the training process. Open a terminal, change to the project directory, and run:

tensorboard --logdir=./output

Then open localhost:6006 in a browser to see real-time charts of the training metrics.

If you only want to evaluate your dataset with an already trained model, set the do_train parameter to False.
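If bert_lstm_ner.py exposes its parameters as command-line flags in the style of the original google-research/bert run scripts (an assumption; you may instead need to edit the defaults inside the file), that would look like:

python bert_lstm_ner.py --do_train=False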