The goal of the IMDB dataset task is to predict the sentiment label of a movie review from its text content.
The training set contains 20,000 movie review texts and the test set contains 5,000 movie review texts; in both, positive and negative reviews each account for half.
Preprocessing text data is relatively cumbersome, and includes word segmentation (Chinese segmentation is not involved in this example), building a vocabulary, encoding conversion, sequence padding, building a data pipeline, and so on.
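For example, sequence padding just truncates or right-pads each encoded review to a fixed length. A minimal sketch (the function name and pad id here are illustrative, not torchtext APIs):

def pad_to_length(ids, max_len, pad_id=1):
    # truncate if too long, right-pad with pad_id if too short
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

print(pad_to_length([4, 25, 7], 5))  # [4, 25, 7, 1, 1]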
For text data preprocessing in torch, one generally uses torchtext or a custom Dataset. torchtext is very powerful and can build datasets for NLP tasks such as text classification, sequence labeling, question answering, and machine translation.
The following only demonstrates how to use it to build a text classification dataset.
For a more complete tutorial, please refer to the following article: "Pytorch Study Notes—Torchtext"
https://zhuanlan.zhihu.com/p/65833208
List of common torchtext APIs
- torchtext.data.Example: represents one sample, holding its data and label.
- torchtext.vocab.Vocab: the vocabulary, into which some pre-trained word vectors can be imported.
- torchtext.data.Datasets: dataset class whose __getitem__ returns an Example instance; torchtext.data.TabularDataset is one of its subclasses.
- torchtext.data.Field: defines how a field (text field, label field) is preprocessed when creating an Example, as well as some processing operations at batch time.
- torchtext.data.Iterator: iterator used to generate batches.
- torchtext.datasets: contains common datasets. (A minimal usage sketch of these APIs follows this list.)
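As a minimal sketch of how these pieces fit together (assuming the legacy torchtext.data API used throughout this example; the toy sentences and field settings are made up):

import torchtext

TOY_TEXT = torchtext.data.Field(sequential=True, tokenize=lambda s: s.split(), lower=True)
TOY_LABEL = torchtext.data.Field(sequential=False, use_vocab=False)
fields = [("text", TOY_TEXT), ("label", TOY_LABEL)]

# build Example instances directly from raw samples
examples = [torchtext.data.Example.fromlist([s, y], fields)
            for s, y in [("A fine movie", 1), ("Truly awful acting", 0)]]
ds_toy = torchtext.data.Dataset(examples, fields)

# build the Vocab from the dataset, then iterate over it in batches
TOY_TEXT.build_vocab(ds_toy)
toy_iter = torchtext.data.Iterator(ds_toy, batch_size=2)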
import torch
import string, re
import torchtext

MAX_WORDS = 10000  # Only consider the 10000 most frequent words
MAX_LEN = 200      # Each sample keeps a length of 200 words
BATCH_SIZE = 20

# Tokenization method
tokenizer = lambda x: re.sub('[%s]' % string.punctuation, "", x).split(" ")

# Filter out low-frequency words
def filterLowFreqWords(arr, vocab):
    arr = [[x if x < MAX_WORDS else 0 for x in example]
           for example in arr]
    return arr
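As a quick check of the two helpers above (the sample sentence and ids are made up; lowercasing is handled later by the Field, not by the tokenizer):

print(tokenizer("It was a GREAT movie!"))  # ['It', 'was', 'a', 'GREAT', 'movie']
print(filterLowFreqWords([[3, 15000, 7]], None))  # [[3, 0, 7]], ids >= MAX_WORDS become 0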
# 1. Define the preprocessing method of each field
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer, lower=True,
                            fix_length=MAX_LEN, postprocessing=filterLowFreqWords)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

# 2. Build a tabular dataset
# torchtext.data.TabularDataset can read csv, tsv, json and other formats
ds_train, ds_valid = torchtext.data.TabularDataset.splits(
    path='./data/imdb', train='train.tsv', test='test.tsv', format='tsv',
    fields=[('label', LABEL), ('text', TEXT)], skip_header=False)

# 3. Build the vocabulary
TEXT.build_vocab(ds_train)

# 4. Build a data pipeline iterator
train_iter, valid_iter = torchtext.data.Iterator.splits(
    (ds_train, ds_valid), sort_within_batch=True, sort_key=lambda x: len(x.text),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE))
# View example information
print(ds_train[0].text)
print(ds_train[0].label)
['it', 'really', 'boggles', 'my', 'mind', 'when', 'someone', 'comes', 'across', 'a', 'movie', 'like', 'this', 'and', 'claims', 'it', 'to', 'be', 'one', 'of', 'the', 'worst', 'slasher', 'films', 'out', 'there', 'this', 'is', 'by', 'far', 'not', 'one', 'of', 'the', 'worst', 'out', 'there', 'still', 'not', 'a', 'good', 'movie', 'but', 'not', 'the', 'worst', 'nonetheless', 'go', 'see', 'something', 'like', 'death', 'nurse', 'or', 'blood', 'lake', 'and', 'then', 'come', 'back', 'to', 'me', 'and', 'tell', 'me', 'if', 'you', 'think', 'the', 'night', 'brings', 'charlie', 'is', 'the', 'worst', 'the', 'film', 'has', 'decent', 'camera', 'work', 'and', 'editing', 'which', 'is', 'way', 'more', 'than', 'i', 'can', 'say', 'for', 'many', 'more', 'extremely', 'obscure', 'slasher', 'filmsbr', 'br', 'the', 'film', 'doesnt', 'deliver', 'on', 'the', 'onscreen', 'deaths', 'theres', 'one', 'death', 'where', 'you', 'see', 'his', 'pruning', 'saw', 'rip', 'into', 'a', 'neck', 'but', 'all', 'other', 'deaths', 'are', 'hardly', 'interesting', 'but', 'the', 'lack', 'of', 'onscreen', 'graphic', 'violence', 'doesnt', 'mean', 'this', 'isnt', 'a', 'slasher', 'film', 'just', 'a', 'bad', 'onebr', 'br', 'the', 'film', 'was', 'obviously', 'intended', 'not', 'to', 'be', 'taken', 'too', 'seriously', 'the', 'film', 'came', 'in', 'at', 'the', 'end', 'of', 'the', 'second', 'slasher', 'cycle', 'so', 'it', 'certainly', 'was', 'a', 'reflection', 'on', 'traditional', 'slasher', 'elements', 'done', 'in', 'a', 'tongue', 'in', 'cheek', 'way', 'for', 'example', 'after', 'a', 'kill', 'charlie', 'goes', 'to', 'the', 'towns', 'welcome', 'sign', 'and', 'marks', 'the', 'population', 'down', 'one', 'less', 'this', 'is', 'something', 'that', 'can', 'only', 'get', 'a', 'laughbr', 'br', 'if', 'youre', 'into', 'slasher', 'films', 'definitely', 'give', 'this', 'film', 'a', 'watch', 'it', 'is', 'slightly', 'different', 'than', 'your', 'usual', 'slasher', 'film', 'with', 'possibility', 'of', 'two', 'killers', 'but', 'not', 'by', 'much', 'the', 'comedy', 'of', 'the', 'movie', 'is', 'pretty', 'much', 'telling', 'the', 'audience', 'to', 'relax', 'and', 'not', 'take', 'the', 'movie', 'so', 'god', 'darn', 'serious', 'you', 'may', 'forget', 'the', 'movie', 'you', 'may', 'remember', 'it', 'ill', 'remember', 'it', 'because', 'i', 'love', 'the', 'name']
0
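Note that ds_train[0].text is still a list of raw tokens; numericalization and padding only happen at batch time. Assuming the legacy Field.process API, this can be checked on a single example:

# hypothetical check: pad and numericalize one example the way the iterator would
batch_tensor = TEXT.process([ds_train[0].text])
print(batch_tensor.shape)  # expected: torch.Size([200, 1]), since fix_length = MAX_LEN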
# View dictionary information
print(len(TEXT.vocab))

# itos: index to string
print(TEXT.vocab.itos[0])
print(TEXT.vocab.itos[1])

# stoi: string to index
print(TEXT.vocab.stoi['<unk>'])  # unknown word
print(TEXT.vocab.stoi['<pad>'])  # padding

# freqs: word frequency
print(TEXT.vocab.freqs['<unk>'])
print(TEXT.vocab.freqs['a'])
print(TEXT.vocab.freqs['good'])
108197
<unk>
<pad>
0
1
0
129453
11457
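With stoi and itos the encoding can be round-tripped by hand, which is essentially what the iterator does internally. A small sketch (the exact indices depend on the built vocabulary):

tokens = tokenizer("a good movie")
ids = [TEXT.vocab.stoi[t] for t in tokens]   # frequent words get small indices
print([TEXT.vocab.itos[i] for i in ids])     # ['a', 'good', 'movie']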
# View data pipeline information
# Note the pitfall: dimension 0 of text is the sentence length
for batch in train_iter:
    features = batch.text
    labels = batch.label
    print(features)
    print(features.shape)
    print(labels)
    break
tensor([[ 17, 31, 148, ..., 54, 11, 201],
[2, 2, 904, ..., 335, 7, 109],
[1371, 1737, 44, ..., 806, 2, 11],
...,
[6, 5, 62, ..., 1, 1, 1],
[170, 0, 27, ..., 1, 1, 1],
[15, 0, 45, ..., 1, 1, 1]])
torch.Size([200, 20])
tensor([0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0])
# Wrap the data pipeline into a (features, label) output form similar to torch.utils.data.DataLoader
class DataLoader:
    def __init__(self, data_iter):
        self.data_iter = data_iter
        self.length = len(data_iter)

    def __len__(self):
        return self.length

    def __iter__(self):
        # Note: here the features are adjusted to batch-first, and the shape and dtype of the label are adjusted too
        for batch in self.data_iter:
            yield (torch.transpose(batch.text, 0, 1),
                   torch.unsqueeze(batch.label.float(), dim=1))

dl_train = DataLoader(train_iter)
dl_valid = DataLoader(valid_iter)
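A quick sanity check of the wrapper (the shapes below follow from the batch-first transposition and fix_length, assuming BATCH_SIZE = 20):

features, labels = next(iter(dl_train))
print(features.shape)  # expected: torch.Size([20, 200])
print(labels.shape)    # expected: torch.Size([20, 1])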
There are usually three ways to build a model with Pytorch: use nn.Sequential to build the model in layer order; inherit the nn.Module base class to build a custom model; or inherit the nn.Module base class and encapsulate the model with the help of model containers (nn.Sequential, nn.ModuleList, nn.ModuleDict).
Here we choose the third method.
import torch
from torch import nn
from torchkeras import LightModel, summary

torch.random.seed()
class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()

        # After setting padding_idx, the padding token keeps a zero vector throughout training
        self.embedding = nn.Embedding(num_embeddings=MAX_WORDS, embedding_dim=3, padding_idx=1)

        self.conv = nn.Sequential()
        self.conv.add_module("conv_1", nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5))
        self.conv.add_module("pool_1", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_1", nn.ReLU())
        self.conv.add_module("conv_2", nn.Conv1d(in_channels=16, out_channels=128, kernel_size=2))
        self.conv.add_module("pool_2", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_2", nn.ReLU())

        self.dense = nn.Sequential()
        self.dense.add_module("flatten", nn.Flatten())
        self.dense.add_module("linear", nn.Linear(6144, 1))
        self.dense.add_module("sigmoid", nn.Sigmoid())

    def forward(self, x):
        x = self.embedding(x).transpose(1, 2)
        x = self.conv(x)
        y = self.dense(x)
        return y

net = Net()
print(net)

summary(net, input_shape=(200,), input_dtype=torch.LongTensor)
Net(
(embedding): Embedding(10000, 3, padding_idx=1)
(conv): Sequential(
(conv_1): Conv1d(3, 16, kernel_size=(5,), stride=(1,))
(pool_1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu_1): ReLU()
(conv_2): Conv1d(16, 128, kernel_size=(2,), stride=(1,))
(pool_2): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(relu_2): ReLU()
)
(dense): Sequential(
(flatten): Flatten()
(linear): Linear(in_features=6144, out_features=1, bias=True)
(sigmoid): Sigmoid()
)
)
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
         Embedding-1               [-1, 200, 3]          30,000
            Conv1d-2              [-1, 16, 196]             256
         MaxPool1d-3               [-1, 16, 98]               0
              ReLU-4               [-1, 16, 98]               0
            Conv1d-5              [-1, 128, 97]           4,224
         MaxPool1d-6              [-1, 128, 48]               0
              ReLU-7              [-1, 128, 48]               0
           Flatten-8                 [-1, 6144]               0
            Linear-9                    [-1, 1]           6,145
          Sigmoid-10                    [-1, 1]               0
================================================================
Total params: 40,625
Trainable params: 40,625
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.000763
Forward/backward pass size (MB): 0.287796
Params size (MB): 0.154972
Estimated Total Size (MB): 0.443531
----------------------------------------------------------------
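The layer shapes above can also be verified with a dummy batch; a small sanity check (the all-zeros input is illustrative only):

x = torch.zeros(2, MAX_LEN, dtype=torch.long)  # a dummy batch of 2 padded sequences
print(net(x).shape)  # expected: torch.Size([2, 1])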
Training a model in Pytorch usually requires the user to write a custom training loop, and the code style of training loops varies from person to person.
There are three typical styles: script-style, function-style, and class-style training loops.
Here we use a class-style training loop: we use pytorch-lightning through the high-level model interface LightModel, encapsulated in torchkeras, which can train the model very conveniently.
import pytorch_lightning as pl
from torchkeras import LightModel

class Model(LightModel):

    # loss, and optional metrics
    def shared_step(self, batch) -> dict:
        x, y = batch
        prediction = self(x)
        loss = nn.BCELoss()(prediction, y)
        preds = torch.where(prediction > 0.5, torch.ones_like(prediction), torch.zeros_like(prediction))
        acc = pl.metrics.functional.accuracy(preds, y)
        dic = {"loss": loss, "accuracy": acc}
        return dic

    # optimizer, and optional lr_scheduler
    def configure_optimizers(self):
        optimizer = torch.optim.Adagrad(self.parameters(), lr=0.02)
        return optimizer
pl.seed_everything(1234)
net = Net()
model = Model(net)

ckpt_cb = pl.callbacks.ModelCheckpoint(monitor='val_loss')

# set gpus=0 to use cpu
# set gpus=1 to use 1 gpu
# set gpus=2 to use 2 gpus
# set gpus=-1 to use all gpus
# you can also set gpus=[0,1] to use the given gpus
# you can even set tpu_cores=2 to use two tpus

trainer = pl.Trainer(max_epochs=20, gpus=0, callbacks=[ckpt_cb])
trainer.fit(model, dl_train, dl_valid)
================================================================================2021-01-16 21:47:29
epoch =  0
{'val_loss': 0.6834630966186523, 'val_accuracy': 0.5546000003814697}
{'accuracy': 0.5224003791809082, 'loss': 0.7246873378753662}
================================================================================2021-01-16 21:48:07
epoch =  1
{'val_loss': 0.6371415257453918, 'val_accuracy': 0.63319993019104}
{'accuracy': 0.6110503673553467, 'loss': 0.6552867889404297}
================================================================================2021-01-16 21:48:50
epoch =  2
{'val_loss': 0.5896139740943909, 'val_accuracy': 0.6798002123832703}
{'accuracy': 0.6910000443458557, 'loss': 0.5874115824699402}
================================================================================2021-01-16 21:49:32
epoch =  3
{'val_loss': 0.5726749300956726, 'val_accuracy': 0.6971999406814575}
{'accuracy': 0.7391000390052795, 'loss': 0.5251786112785339}
================================================================================2021-01-16 21:50:13
epoch =  4
{'val_loss': 0.5328916311264038, 'val_accuracy': 0.7326000332832336}
{'accuracy': 0.7705488801002502, 'loss': 0.4773417115211487}
================================================================================2021-01-16 21:50:54
epoch =  5
{'val_loss': 0.5194208025932312, 'val_accuracy': 0.7413997650146484}
{'accuracy': 0.7968998551368713, 'loss': 0.43944093585014343}
================================================================================2021-01-16 21:51:35
epoch =  6
{'val_loss': 0.5199333429336548, 'val_accuracy': 0.7429998517036438}
{'accuracy': 0.8130489587783813, 'loss': 0.4102325737476349}
================================================================================2021-01-16 21:52:16
epoch =  7
{'val_loss': 0.5124538540840149, 'val_accuracy': 0.7517998814582825}
{'accuracy': 0.8314500451087952, 'loss': 0.3849221169948578}
================================================================================2021-01-16 21:52:58
epoch =  8
{'val_loss': 0.510671079158783, 'val_accuracy': 0.7554002404212952}
{'accuracy': 0.8438503742218018, 'loss': 0.3616768419742584}
================================================================================2021-01-16 21:53:39
epoch =  9
{'val_loss': 0.5184627771377563, 'val_accuracy': 0.7530001997947693}
{'accuracy': 0.8568001985549927, 'loss': 0.34138554334640503}
================================================================================2021-01-16 21:54:20
epoch =  10
{'val_loss': 0.5105863809585571, 'val_accuracy': 0.7580001354217529}
{'accuracy': 0.865899920463562, 'loss': 0.32265418767929077}
================================================================================2021-01-16 21:55:02
epoch =  11
{'val_loss': 0.5222727656364441, 'val_accuracy': 0.7586002349853516}
{'accuracy': 0.8747013211250305, 'loss': 0.306064248085022}
================================================================================2021-01-16 21:55:43
epoch =  12
{'val_loss': 0.5208917856216431, 'val_accuracy': 0.7597998976707458}
{'accuracy': 0.8820013403892517, 'loss': 0.29068493843078613}
================================================================================2021-01-16 21:56:24
epoch =  13
{'val_loss': 0.5236031413078308, 'val_accuracy': 0.7603999376296997}
{'accuracy': 0.889351487159729, 'loss': 0.2765159606933594}
================================================================================2021-01-16 21:57:04
epoch =  14
{'val_loss': 0.5428195595741272, 'val_accuracy': 0.7572000622749329}
{'accuracy': 0.8975020051002502, 'loss': 0.26261812448501587}
================================================================================2021-01-16 21:57:45
epoch =  15
{'val_loss': 0.5340956449508667, 'val_accuracy': 0.7602002024650574}
{'accuracy': 0.9049026966094971, 'loss': 0.25028231739997864}
================================================================================2021-01-16 21:58:25
epoch =  16
{'val_loss': 0.5380828380584717, 'val_accuracy': 0.7612000107765198}
{'accuracy': 0.9085531234741211, 'loss': 0.23980091512203217}
================================================================================2021-01-16 21:59:05
epoch =  17
{'val_loss': 0.5447139739990234, 'val_accuracy': 0.7638000249862671}
{'accuracy': 0.9168024659156799, 'loss': 0.22760336101055145}
================================================================================2021-01-16 21:59:45
epoch =  18
{'val_loss': 0.5505074858665466, 'val_accuracy': 0.7636001110076904}
{'accuracy': 0.921653687953949, 'loss': 0.21746191382408142}
================================================================================2021-01-16 22:00:26
epoch =  19
{'val_loss': 0.5615255236625671, 'val_accuracy': 0.7634001970291138}
{'accuracy': 0.9263033270835876, 'loss': 0.2077799290418625}
import pandas as pd
history = model.history
dfhistory = pd.DataFrame(history)
dfhistory
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import matplotlib.pyplot as plt

def plot_metric(dfhistory, metric):
    train_metrics = dfhistory[metric]
    val_metrics = dfhistory['val_' + metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and validation ' + metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_" + metric, 'val_' + metric])
    plt.show()
plot_metric(dfhistory,"loss")
plot_metric(dfhistory,"accuracy")
# Evaluation
results = trainer.test(model, test_dataloaders=dl_valid, verbose=False)
print(results[0])
{'val_loss': 0.5056138457655907, 'val_accuracy': 0.7948000040054322}
def predict(model, dl):
    model.eval()
    result = torch.cat([model.forward(t[0].to(model.device)) for t in dl])
    return result.data
result = predict(model,dl_valid)
result
tensor([[0.0357],
[0.8699],
[0.3303],
...,
[0.9962],
[0.5566],
[0.0491]])
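The outputs are sigmoid probabilities; class labels can be obtained by thresholding at 0.5, mirroring the rule used in shared_step above:

y_pred = (result > 0.5).int()  # 1 = positive review, 0 = negative review
print(y_pred[:5])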
print(ckpt_cb.best_model_score)
model.load_from_checkpoint(ckpt_cb.best_model_path)
best_net = model.net
torch.save(best_net.state_dict(), "./data/net.pt")
net_clone = Net()
net_clone.load_state_dict(torch.load("./data/net.pt"))
model_clone = Model(net_clone)
trainer = pl.Trainer()
result = trainer.test(model_clone, test_dataloaders=dl_valid, verbose=False)
print(result)
[{'test_loss': 0.4958915710449219, 'test_accuracy': 0.75}]
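To confirm the save/load round trip, the cloned parameters can be compared with the saved ones; a small sketch:

# check that net_clone received exactly the weights saved from best_net
for p_saved, p_clone in zip(best_net.parameters(), net_clone.parameters()):
    assert torch.equal(p_saved, p_clone)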