The goal of the IMDB dataset task is to predict the sentiment label of a movie review from its text content.
The training set contains 20,000 movie review texts and the test set contains 5,000 movie review texts; in both, positive and negative reviews each account for half.
Preprocessing text data is relatively cumbersome, and includes word segmentation (Chinese segmentation is not involved in this example), building a vocabulary, encoding conversion, sequence padding, building a data pipeline, and so on.
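For example, sequence padding just truncates or right-pads each encoded review to a fixed length. A minimal sketch (the function name and pad id here are illustrative, not torchtext APIs):

def pad_to_length(ids, max_len, pad_id=1):
    # truncate if too long, right-pad with pad_id if too short
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

print(pad_to_length([4, 25, 7], 5))  # [4, 25, 7, 1, 1]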
For text data preprocessing in torch, one generally uses torchtext or a custom Dataset. torchtext is very powerful and can build datasets for NLP tasks such as text classification, sequence labeling, question answering, and machine translation.
The following only demonstrates how to use it to build a text classification dataset.
For a more complete tutorial, please refer to the following article: "Pytorch Study Notes—Torchtext"
https://zhuanlan.zhihu.com/p/65833208
List of common torchtext APIs
- torchtext.data.Example: represents one sample, holding its data and label.
- torchtext.vocab.Vocab: the vocabulary, into which some pre-trained word vectors can be imported.
- torchtext.data.Datasets: dataset class whose __getitem__ returns an Example instance; torchtext.data.TabularDataset is one of its subclasses.
- torchtext.data.Field: defines how a field (text field, label field) is preprocessed when creating an Example, as well as some processing operations at batch time.
- torchtext.data.Iterator: iterator used to generate batches.
- torchtext.datasets: contains common datasets. (A minimal usage sketch of these APIs follows this list.)
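As a minimal sketch of how these pieces fit together (assuming the legacy torchtext.data API used throughout this example; the toy sentences and field settings are made up):

import torchtext

TOY_TEXT = torchtext.data.Field(sequential=True, tokenize=lambda s: s.split(), lower=True)
TOY_LABEL = torchtext.data.Field(sequential=False, use_vocab=False)
fields = [("text", TOY_TEXT), ("label", TOY_LABEL)]

# build Example instances directly from raw samples
examples = [torchtext.data.Example.fromlist([s, y], fields)
            for s, y in [("A fine movie", 1), ("Truly awful acting", 0)]]
ds_toy = torchtext.data.Dataset(examples, fields)

# build the Vocab from the dataset, then iterate over it in batches
TOY_TEXT.build_vocab(ds_toy)
toy_iter = torchtext.data.Iterator(ds_toy, batch_size=2)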
import torch
import string, re
import torchtext

MAX_WORDS = 10000  # Only consider the 10000 most frequent words
MAX_LEN = 200      # Each sample keeps a length of 200 words
BATCH_SIZE = 20

# Tokenization method
tokenizer = lambda x: re.sub('[%s]' % string.punctuation, "", x).split(" ")

# Filter out low-frequency words
def filterLowFreqWords(arr, vocab):
    arr = [[x if x < MAX_WORDS else 0 for x in example]
           for example in arr]
    return arr
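As a quick check of the two helpers above (the sample sentence and ids are made up; lowercasing is handled later by the Field, not by the tokenizer):

print(tokenizer("It was a GREAT movie!"))  # ['It', 'was', 'a', 'GREAT', 'movie']
print(filterLowFreqWords([[3, 15000, 7]], None))  # [[3, 0, 7]], ids >= MAX_WORDS become 0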
# 1. Define the preprocessing method of each field
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer, lower=True,
                            fix_length=MAX_LEN, postprocessing=filterLowFreqWords)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

# 2. Build a tabular dataset
# torchtext.data.TabularDataset can read csv, tsv, json and other formats
ds_train, ds_valid = torchtext.data.TabularDataset.splits(
    path='./data/imdb', train='train.tsv', test='test.tsv', format='tsv',
    fields=[('label', LABEL), ('text', TEXT)], skip_header=False)

# 3. Build the vocabulary
TEXT.build_vocab(ds_train)

# 4. Build a data pipeline iterator
train_iter, valid_iter = torchtext.data.Iterator.splits(
    (ds_train, ds_valid), sort_within_batch=True, sort_key=lambda x: len(x.text),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE))
# View example information
print(ds_train[0].text)
print(ds_train[0].label)
['it', 'really', 'boggles', 'my', 'mind', 'when', 'someone', 'comes', 'across', 'a', 'movie', 'like', 'this', 'and', 'claims', 'it', 'to', 'be', 'one', 'of', 'the', 'worst', 'slasher', 'films', 'out', 'there', 'this', 'is', 'by', 'far', 'not', 'one', 'of', 'the', 'worst', 'out', 'there', 'still', 'not', 'a', 'good', 'movie', 'but', 'not', 'the', 'worst', 'nonetheless', 'go', 'see', 'something', 'like', 'death', 'nurse', 'or', 'blood', 'lake', 'and', 'then', 'come', 'back', 'to', 'me', 'and', 'tell', 'me', 'if', 'you', 'think', 'the', 'night', 'brings', 'charlie', 'is', 'the', 'worst', 'the', 'film', 'has', 'decent', 'camera', 'work', 'and', 'editing', 'which', 'is', 'way', 'more', 'than', 'i', 'can', 'say', 'for', 'many', 'more', 'extremely', 'obscure', 'slasher', 'filmsbr', 'br', 'the', 'film', 'doesnt', 'deliver', 'on', 'the', 'onscreen', 'deaths', 'theres', 'one', 'death', 'where', 'you', 'see', 'his', 'pruning', 'saw', 'rip', 'into', 'a', 'neck', 'but', 'all', 'other', 'deaths', 'are', 'hardly', 'interesting', 'but', 'the', 'lack', 'of', 'onscreen', 'graphic', 'violence', 'doesnt', 'mean', 'this', 'isnt', 'a', 'slasher', 'film', 'just', 'a', 'bad', 'onebr', 'br', 'the', 'film', 'was', 'obviously', 'intended', 'not', 'to', 'be', 'taken', 'too', 'seriously', 'the', 'film', 'came', 'in', 'at', 'the', 'end', 'of', 'the', 'second', 'slasher', 'cycle', 'so', 'it', 'certainly', 'was', 'a', 'reflection', 'on', 'traditional', 'slasher', 'elements', 'done', 'in', 'a', 'tongue', 'in', 'cheek', 'way', 'for', 'example', 'after', 'a', 'kill', 'charlie', 'goes', 'to', 'the', 'towns', 'welcome', 'sign', 'and', 'marks', 'the', 'population', 'down', 'one', 'less', 'this', 'is', 'something', 'that', 'can', 'only', 'get', 'a', 'laughbr', 'br', 'if', 'youre', 'into', 'slasher', 'films', 'definitely', 'give', 'this', 'film', 'a', 'watch', 'it', 'is', 'slightly', 'different', 'than', 'your', 'usual', 'slasher', 'film', 'with', 'possibility', 'of', 'two', 'killers', 'but', 'not', 'by', 'much', 'the', 'comedy', 'of', 'the', 'movie', 'is', 'pretty', 'much', 'telling', 'the', 'audience', 'to', 'relax', 'and', 'not', 'take', 'the', 'movie', 'so', 'god', 'darn', 'serious', 'you', 'may', 'forget', 'the', 'movie', 'you', 'may', 'remember', 'it', 'ill', 'remember', 'it', 'because', 'i', 'love', 'the', 'name']
0
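Note that ds_train[0].text is still a list of raw tokens; numericalization and padding only happen at batch time. Assuming the legacy Field.process API, this can be checked on a single example:

# hypothetical check: pad and numericalize one example the way the iterator would
batch_tensor = TEXT.process([ds_train[0].text])
print(batch_tensor.shape)  # expected: torch.Size([200, 1]), since fix_length = MAX_LEN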
# View dictionary information
print(len(TEXT.vocab))

# itos: index to string
print(TEXT.vocab.itos[0])
print(TEXT.vocab.itos[1])

# stoi: string to index
print(TEXT.vocab.stoi['<unk>'])  # unknown word
print(TEXT.vocab.stoi['<pad>'])  # padding

# freqs: word frequency
print(TEXT.vocab.freqs['<unk>'])
print(TEXT.vocab.freqs['a'])
print(TEXT.vocab.freqs['good'])
108197
<unk>
<pad>
0
1
0
129453
11457
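With stoi and itos the encoding can be round-tripped by hand, which is essentially what the iterator does internally. A small sketch (the exact indices depend on the built vocabulary):

tokens = tokenizer("a good movie")
ids = [TEXT.vocab.stoi[t] for t in tokens]   # frequent words get small indices
print([TEXT.vocab.itos[i] for i in ids])     # ['a', 'good', 'movie']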
# View data pipeline information
# Note the pitfall: dimension 0 of text is the sentence length
for batch in train_iter:
    features = batch.text
    labels = batch.label
    print(features)
    print(features.shape)
    print(labels)
    break
tensor([[ 17, 31, 148, ..., 54, 11, 201],
[2, 2, 904, ..., 335, 7, 109],
[1371, 1737, 44, ..., 806, 2, 11],
...,
[6, 5, 62, ..., 1, 1, 1],
[170, 0, 27, ..., 1, 1, 1],
[15, 0, 45, ..., 1, 1, 1]])
torch.Size([200, 20])
tensor([0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0])
# Wrap the data pipeline into a (features, label) output form similar to torch.utils.data.DataLoader
class DataLoader:
    def __init__(self, data_iter):
        self.data_iter = data_iter
        self.length = len(data_iter)

    def __len__(self):
        return self.length

    def __iter__(self):
        # Note: here the features are adjusted to batch-first, and the shape and dtype of the label are adjusted too
        for batch in self.data_iter:
            yield (torch.transpose(batch.text, 0, 1),
                   torch.unsqueeze(batch.label.float(), dim=1))

dl_train = DataLoader(train_iter)
dl_valid = DataLoader(valid_iter)
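A quick sanity check of the wrapper (the shapes below follow from the batch-first transposition and fix_length, assuming BATCH_SIZE = 20):

features, labels = next(iter(dl_train))
print(features.shape)  # expected: torch.Size([20, 200])
print(labels.shape)    # expected: torch.Size([20, 1])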
There are usually three ways to build a model with Pytorch: use nn.Sequential to build the model in layer order; inherit the nn.Module base class to build a custom model; or inherit the nn.Module base class and encapsulate the model with the help of model containers (nn.Sequential, nn.ModuleList, nn.ModuleDict).
Here we choose the third method.
import torch
from torch import nn
from torchkeras import LightModel, summary

torch.random.seed()
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()

        # After setting padding_idx, the padding token keeps a zero vector throughout training
        self.embedding = nn.Embedding(num_embeddings=MAX_WORDS, embedding_dim=3, padding_idx=1)

        self.conv = nn.Sequential()
        self.conv.add_module("conv_1", nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5))
        self.conv.add_module("pool_1", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_1", nn.ReLU())
        self.conv.add_module("conv_2", nn.Conv1d(in_channels=16, out_channels=128, kernel_size=2))
        self.conv.add_module("pool_2", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_2", nn.ReLU())

        self.dense = nn.Sequential()
        self.dense.add_module("flatten", nn.Flatten())
        self.dense.add_module("linear", nn.Linear(6144, 1))
        self.dense.add_module("sigmoid", nn.Sigmoid())

    def forward(self, x):
        x = self.embedding(x).transpose(1, 2)
        x = self.conv(x)
        y = self.dense(x)
        return y

net = Net()
print(net)

summary(net, input_shape=(200,), input_dtype=torch.LongTensor)
Net(
(embedding): Embedding(10000, 3, padding_idx=1)
(conv): Sequential(
(conv_1): Conv1d(3, 16, kernel_size=(5,), stride=(1,))
(pool_1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu_1): ReLU()
(conv_2): Conv1d(16, 128, kernel_size=(2,), stride=(1,))
(pool_2): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu_2): ReLU()
)
(dense): Sequential(
(flatten): Flatten()
(linear): Linear(in_features=6144, out_features=1, bias=True)
(sigmoid): Sigmoid()
)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
         Embedding-1               [-1, 200, 3]          30,000
            Conv1d-2              [-1, 16, 196]             256
         MaxPool1d-3               [-1, 16, 98]               0
              ReLU-4               [-1, 16, 98]               0
            Conv1d-5              [-1, 128, 97]           4,224
         MaxPool1d-6              [-1, 128, 48]               0
              ReLU-7              [-1, 128, 48]               0
           Flatten-8                 [-1, 6144]               0
            Linear-9                    [-1, 1]           6,145
          Sigmoid-10                    [-1, 1]               0
================================================================
Total params: 40,625
Trainable params: 40,625
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.000763
Forward/backward pass size (MB): 0.287796
Params size (MB): 0.154972
Estimated Total Size (MB): 0.443531
----------------------------------------------------------------
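The layer shapes above can also be verified with a dummy batch; a small sanity check (the all-zeros input is illustrative only):

x = torch.zeros(2, MAX_LEN, dtype=torch.long)  # a dummy batch of 2 padded sequences
print(net(x).shape)  # expected: torch.Size([2, 1])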
Training a model in Pytorch usually requires the user to write a custom training loop, and the code style of training loops varies from person to person.
There are three typical styles: script-style, function-style, and class-style training loops.
Here we use a class-style training loop: we use pytorch-lightning through the high-level model interface LightModel, encapsulated in torchkeras, which can train the model very conveniently.
import pytorch_lightning as pl
from torchkeras import LightModel

class Model(LightModel):

    # loss, and optional metrics
    def shared_step(self, batch) -> dict:
        x, y = batch
        prediction = self(x)
        loss = nn.BCELoss()(prediction, y)
        preds = torch.where(prediction > 0.5, torch.ones_like(prediction), torch.zeros_like(prediction))
        acc = pl.metrics.functional.accuracy(preds, y)
        dic = {"loss": loss, "accuracy": acc}
        return dic

    # optimizer, and optional lr_scheduler
    def configure_optimizers(self):
        optimizer = torch.optim.Adagrad(self.parameters(), lr=0.02)
        return optimizer
pl.seed_everything(1234)
net = Net()
model = Model(net)

ckpt_cb = pl.callbacks.ModelCheckpoint(monitor='val_loss')

# set gpus=0 to use cpu
# set gpus=1 to use 1 gpu
# set gpus=2 to use 2 gpus
# set gpus=-1 to use all gpus
# you can also set gpus=[0,1] to use the given gpus
# you can even set tpu_cores=2 to use two tpus

trainer = pl.Trainer(max_epochs=20, gpus=0, callbacks=[ckpt_cb])
trainer.fit(model, dl_train, dl_valid)
================================================================================2021-01-16 21:47:29
epoch =  0
{'val_loss': 0.6834630966186523, 'val_accuracy': 0.5546000003814697}
{'accuracy': 0.5224003791809082, 'loss': 0.7246873378753662}
================================================================================2021-01-16 21:48:07
epoch =  1
{'val_loss': 0.6371415257453918, 'val_accuracy': 0.63319993019104}
{'accuracy': 0.6110503673553467, 'loss': 0.6552867889404297}
================================================================================2021-01-16 21:48:50
epoch =  2
{'val_loss': 0.5896139740943909, 'val_accuracy': 0.6798002123832703}
{'accuracy': 0.6910000443458557, 'loss': 0.5874115824699402}
================================================================================2021-01-16 21:49:32
epoch =  3
{'val_loss': 0.5726749300956726, 'val_accuracy': 0.6971999406814575}
{'accuracy': 0.7391000390052795, 'loss': 0.5251786112785339}
================================================================================2021-01-16 21:50:13
epoch =  4
{'val_loss': 0.5328916311264038, 'val_accuracy': 0.7326000332832336}
{'accuracy': 0.7705488801002502, 'loss': 0.4773417115211487}
================================================================================2021-01-16 21:50:54
epoch =  5
{'val_loss': 0.5194208025932312, 'val_accuracy': 0.7413997650146484}
{'accuracy': 0.7968998551368713, 'loss': 0.43944093585014343}
================================================================================2021-01-16 21:51:35
epoch =  6
{'val_loss': 0.5199333429336548, 'val_accuracy': 0.7429998517036438}
{'accuracy': 0.8130489587783813, 'loss': 0.4102325737476349}
================================================================================2021-01-16 21:52:16
epoch =  7
{'val_loss': 0.5124538540840149, 'val_accuracy': 0.7517998814582825}
{'accuracy': 0.8314500451087952, 'loss': 0.3849221169948578}
================================================================================2021-01-16 21:52:58
epoch =  8
{'val_loss': 0.510671079158783, 'val_accuracy': 0.7554002404212952}
{'accuracy': 0.8438503742218018, 'loss': 0.3616768419742584}
================================================================================2021-01-16 21:53:39
epoch =  9
{'val_loss': 0.5184627771377563, 'val_accuracy': 0.7530001997947693}
{'accuracy': 0.8568001985549927, 'loss': 0.34138554334640503}
================================================================================2021-01-16 21:54:20
epoch =  10
{'val_loss': 0.5105863809585571, 'val_accuracy': 0.7580001354217529}
{'accuracy': 0.865899920463562, 'loss': 0.32265418767929077}
================================================================================2021-01-16 21:55:02
epoch =  11
{'val_loss': 0.5222727656364441, 'val_accuracy': 0.7586002349853516}
{'accuracy': 0.8747013211250305, 'loss': 0.306064248085022}
================================================================================2021-01-16 21:55:43
epoch =  12
{'val_loss': 0.5208917856216431, 'val_accuracy': 0.7597998976707458}
{'accuracy': 0.8820013403892517, 'loss': 0.29068493843078613}
================================================================================2021-01-16 21:56:24
epoch =  13
{'val_loss': 0.5236031413078308, 'val_accuracy': 0.7603999376296997}
{'accuracy': 0.889351487159729, 'loss': 0.2765159606933594}
================================================================================2021-01-16 21:57:04
epoch =  14
{'val_loss': 0.5428195595741272, 'val_accuracy': 0.7572000622749329}
{'accuracy': 0.8975020051002502, 'loss': 0.26261812448501587}
================================================================================2021-01-16 21:57:45
epoch =  15
{'val_loss': 0.5340956449508667, 'val_accuracy': 0.7602002024650574}
{'accuracy': 0.9049026966094971, 'loss': 0.25028231739997864}
================================================================================2021-01-16 21:58:25
epoch =  16
{'val_loss': 0.5380828380584717, 'val_accuracy': 0.7612000107765198}
{'accuracy': 0.9085531234741211, 'loss': 0.23980091512203217}
================================================================================2021-01-16 21:59:05
epoch =  17
{'val_loss': 0.5447139739990234, 'val_accuracy': 0.7638000249862671}
{'accuracy': 0.9168024659156799, 'loss': 0.22760336101055145}
================================================================================2021-01-16 21:59:45
epoch =  18
{'val_loss': 0.5505074858665466, 'val_accuracy': 0.7636001110076904}
{'accuracy': 0.921653687953949, 'loss': 0.21746191382408142}
================================================================================2021-01-16 22:00:26
epoch =  19
{'val_loss': 0.5615255236625671, 'val_accuracy': 0.7634001970291138}
{'accuracy': 0.9263033270835876, 'loss': 0.2077799290418625}
import pandas as pd
history = model.history
dfhistory = pd.DataFrame(history)
dfhistory
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib.pyplot as plt

def plot_metric(dfhistory, metric):
    train_metrics = dfhistory[metric]
    val_metrics = dfhistory['val_' + metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation ' + metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_" + metric, 'val_' + metric])
    plt.show()
plot_metric(dfhistory,"loss")
plot_metric(dfhistory,"accuracy")
# Evaluation
results = trainer.test(model, test_dataloaders=dl_valid, verbose=False)
print(results[0])
{'val_loss': 0.5056138457655907, 'val_accuracy': 0.7948000040054322}
def predict(model, dl):
    model.eval()
    result = torch.cat([model.forward(t[0].to(model.device)) for t in dl])
    return result.data
result = predict(model,dl_valid)
result
tensor([[0.0357],
[0.8699],
[0.3303],
...,
[0.9962],
[0.5566],
[0.0491]])
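The outputs are sigmoid probabilities; class labels can be obtained by thresholding at 0.5, mirroring the rule used in shared_step above:

y_pred = (result > 0.5).int()  # 1 = positive review, 0 = negative review
print(y_pred[:5])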
print(ckpt_cb.best_model_score)
model.load_from_checkpoint(ckpt_cb.best_model_path)
best_net = model.net
torch.save(best_net.state_dict(), "./data/net.pt")
net_clone = Net()
net_clone.load_state_dict(torch.load("./data/net.pt"))
model_clone = Model(net_clone)
trainer = pl.Trainer()
result = trainer.test(model_clone, test_dataloaders=dl_valid, verbose=False)
print(result)
[{'test_loss': 0.4958915710449219, 'test_accuracy': 0.75}]
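To confirm the save/load round trip, the cloned parameters can be compared with the saved ones; a small sketch:

# check that net_clone received exactly the weights saved from best_net
for p_saved, p_clone in zip(best_net.parameters(), net_clone.parameters()):
    assert torch.equal(p_saved, p_clone)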