Deep learning models are generally composed of various model layers. torch.nn provides a rich variety of built-in model layers. They are all subclasses of nn.Module and come with parameter management built in.
For example:
- nn.Linear, nn.Flatten, nn.Dropout, nn.BatchNorm2d
- nn.Conv2d, nn.AvgPool2d, nn.Conv1d, nn.ConvTranspose2d
- nn.Embedding, nn.GRU, nn.LSTM
- nn.Transformer
If these built-in layers do not meet our needs, we can also build a custom model layer by inheriting from the nn.Module base class.
In fact, PyTorch does not distinguish between models and model layers; both are constructed by inheriting from nn.Module.
Therefore, to define a custom model layer we only need to inherit from the nn.Module base class and implement the forward method.
```python
import numpy as np
import torch
from torch import nn
```
Some commonly used built-in model layers are briefly introduced below.
Basic layers
- nn.Linear: Fully connected layer. Number of parameters = number of input features × number of output features (weights) + number of output features (bias). The sketch after this subsection verifies this formula.
- nn.Flatten: Flattening layer, used to compress multi-dimensional tensor samples into one-dimensional tensor samples.
- nn.BatchNorm1d: One-dimensional batch normalization layer. Scales and shifts each input batch through a linear transformation to a stable mean and standard deviation. It makes the model more robust to different input distributions, speeds up training, and has a mild regularization effect. It is generally used before the activation function. The affine parameter controls whether the layer contains trainable parameters.
- nn.BatchNorm2d: Two-dimensional batch normalization layer.
- nn.BatchNorm3d: Three-dimensional batch normalization layer.
- nn.Dropout: One-dimensional random dropout layer. A means of regularization.
- nn.Dropout2d: Two-dimensional random dropout layer.
- nn.Dropout3d: Three-dimensional random dropout layer.
- nn.Threshold: Thresholding layer. Elements that do not exceed the threshold are replaced with a specified value.
- nn.ConstantPad2d: Two-dimensional constant padding layer. Pads the edges of a two-dimensional tensor sample with a constant value.
- nn.ReplicationPad1d: One-dimensional replication padding layer. Pads the edges of a one-dimensional tensor sample by replicating the edge values.
- nn.ZeroPad2d: Two-dimensional zero padding layer. Pads the edges of a two-dimensional tensor sample with zeros.
- nn.GroupNorm: Group normalization layer. An alternative to batch normalization that divides the channels into groups and normalizes within each group. It is not limited by batch size, and it is said to perform as well as or better than BatchNorm.
- nn.LayerNorm: Layer normalization. Used less frequently.
- nn.InstanceNorm2d: Instance normalization. Used less frequently.
For an overview of the various normalization techniques, see the following article, "Kaiming He and others at FAIR propose Group Normalization: a replacement for batch normalization that is not limited by batch size":
https://zhuanlan.zhihu.com/p/34858971
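As a minimal sketch (layer sizes chosen arbitrarily for illustration), the snippet below verifies the nn.Linear parameter-count formula above and shows the effect of the affine parameter on nn.BatchNorm1d:

```python
import torch
from torch import nn

# nn.Linear(20, 30): parameters = 20 * 30 (weights) + 30 (bias) = 630
linear = nn.Linear(20, 30)
print(sum(p.numel() for p in linear.parameters()))    # 630

# with affine=True, BatchNorm1d has one trainable weight and bias per channel
bn = nn.BatchNorm1d(30, affine=True)
print(sum(p.numel() for p in bn.parameters()))        # 60 (30 weights + 30 biases)

# with affine=False there are no trainable parameters at all
bn_plain = nn.BatchNorm1d(30, affine=False)
print(sum(p.numel() for p in bn_plain.parameters()))  # 0
```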
Convolutional network layers
- nn.Conv1d: Ordinary one-dimensional convolution, often used on text. Number of parameters = number of input channels × kernel size (e.g. 3) × number of kernels + number of kernels (bias).
- nn.Conv2d: Ordinary two-dimensional convolution, often used on images. Number of parameters = number of input channels × kernel size (e.g. 3×3) × number of kernels + number of kernels (bias); this is checked in the sketch after this list. Setting the dilation parameter greater than 1 turns it into a dilated (atrous) convolution, which enlarges the receptive field of the kernel. Setting the groups parameter to a value other than 1 turns it into a grouped convolution, in which each group of input channels is convolved with its own set of kernels, significantly reducing the number of parameters. When groups equals the number of input channels, it is equivalent to the two-dimensional depthwise convolution layer tf.keras.layers.DepthwiseConv2D in TensorFlow. Combining a grouped convolution with a 1×1 convolution yields the equivalent of the depthwise separable convolution layer tf.keras.layers.SeparableConv2D in Keras.
- nn.Conv3d: Ordinary three-dimensional convolution, often used on video. Number of parameters = number of input channels × kernel size (e.g. 3×3×3) × number of kernels + number of kernels (bias).
- nn.MaxPool1d: One-dimensional max pooling.
- nn.MaxPool2d: Two-dimensional max pooling. A down-sampling method with no trainable parameters.
- nn.MaxPool3d: Three-dimensional max pooling.
- nn.AdaptiveMaxPool2d: Two-dimensional adaptive max pooling. The output size is fixed no matter how the input size changes. The implementation presumably works backward from the input size and the requested output size to derive pooling parameters such as stride and kernel size.
- nn.FractionalMaxPool2d: Two-dimensional fractional max pooling. Ordinary max pooling usually requires the input size to be an integer multiple of the output size; fractional max pooling does not. It uses a random sampling strategy, which has a certain regularization effect, and it can be used in place of ordinary max pooling plus a Dropout layer.
- nn.AvgPool2d: Two-dimensional average pooling.
- nn.AdaptiveAvgPool2d: Two-dimensional adaptive average pooling. The output size is fixed no matter how the input size changes.
- nn.ConvTranspose2d: Two-dimensional transposed convolution layer, commonly called a deconvolution layer. It is not the true inverse of convolution; rather, with the same kernel, if the input has the size of a convolution's output, then the transposed convolution's output has the size of that convolution's input. It can be used for upsampling in semantic segmentation.
- nn.Upsample: Upsampling layer; its effect is the opposite of pooling. The upsampling strategy is controlled by the mode parameter, e.g. "nearest" for nearest-neighbor or "linear" for linear interpolation.
- nn.Unfold: Sliding-window extraction layer. Its parameters mirror those of the convolution operation nn.Conv2d. In fact, a convolution can be expressed as a combination of nn.Unfold, nn.Linear, and nn.Fold: nn.Unfold extracts the value matrix of each sliding window from the input and flattens it into one dimension; an nn.Linear-style matrix multiplication multiplies the output of nn.Unfold by the flattened convolution kernel; and nn.Fold converts the result back into the output feature-map shape. A sketch of this equivalence appears after this list.
- nn.Fold: Inverse of the sliding-window extraction layer.
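As a sketch (shapes and channel counts chosen arbitrarily), the snippet below checks the nn.Conv2d parameter-count formula and reproduces a convolution with nn.Unfold plus a matrix multiplication, as described above:

```python
import torch
from torch import nn

# parameter count: 3 (input channels) * 3*3 (kernel) * 5 (kernels) + 5 (bias) = 140
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))   # 140

# convolution expressed as Unfold + matrix multiply (bias omitted for simplicity)
conv2 = nn.Conv2d(3, 5, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 3, 8, 8)
out = conv2(x)                                     # (1, 5, 8, 8)

patches = nn.Unfold(kernel_size=3, padding=1)(x)   # (1, 3*3*3, 8*8) = (1, 27, 64)
weight = conv2.weight.view(5, -1)                  # flatten kernels to (5, 27)
out2 = (weight @ patches).view(1, 5, 8, 8)         # matrix multiply, then reshape

print(torch.allclose(out, out2, atol=1e-5))        # True
```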
Recurrent network layers
- nn.Embedding: Embedding layer. A more effective way to encode discrete features than one-hot encoding. It is generally used to map input words to dense vectors. The embedding parameters need to be learned.
- nn.LSTM: Long short-term memory recurrent network layer [supports multiple layers]. The most commonly used recurrent network layer. It has a cell state (carry track) and forget, update, and output gates, which effectively alleviate the vanishing-gradient problem and allow it to handle long-range dependencies. Setting bidirectional=True gives a bidirectional LSTM. Note that the default input and output shape is (seq, batch, feature); if you want the batch dimension first, set the batch_first parameter to True, as in the sketch after this list.
- nn.GRU: Gated recurrent network layer [supports multiple layers]. A simplified version of LSTM without the cell state; it has fewer parameters than LSTM and trains faster.
- nn.RNN: Simple recurrent network layer [supports multiple layers]. Its gradients vanish easily, so it cannot handle long-range dependencies. Rarely used.
- nn.LSTMCell: Long short-term memory cell. Whereas nn.LSTM iterates over the whole sequence, a cell performs only a single step. Rarely used.
- nn.GRUCell: Gated recurrent cell. Whereas nn.GRU iterates over the whole sequence, a cell performs only a single step. Rarely used.
- nn.RNNCell: Simple recurrent cell. Whereas nn.RNN iterates over the whole sequence, a cell performs only a single step. Rarely used.
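As a sketch (all sizes are arbitrary illustrations), the snippet below runs token indices through nn.Embedding and a bidirectional nn.LSTM with batch_first=True, printing the resulting shapes:

```python
import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=1000, embedding_dim=16)  # vocabulary of 1000 words
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
               batch_first=True, bidirectional=True)

tokens = torch.randint(0, 1000, (8, 20))   # (batch, seq) of word indices
x = embedding(tokens)                      # (8, 20, 16): batch dimension first
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([8, 20, 64]) -- hidden_size * 2 directions
print(h_n.shape)     # torch.Size([4, 8, 32])  -- num_layers * num_directions, batch, hidden
```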
Transformer layers
- nn.Transformer: Transformer network structure. The Transformer was designed to replace recurrent networks; it addresses their two main shortcomings, namely that they are hard to parallelize and struggle to capture long-range dependencies. It is the main building block of current mainstream NLP models. A Transformer consists of a TransformerEncoder and a TransformerDecoder, and the core of both is the MultiheadAttention layer.
- nn.TransformerEncoder: Transformer encoder structure. It consists of multiple nn.TransformerEncoderLayer encoder layers; a small usage sketch appears after this subsection.
- nn.TransformerDecoder: Transformer decoder structure. It consists of multiple nn.TransformerDecoderLayer decoder layers.
- nn.TransformerEncoderLayer: A single encoder layer of the Transformer.
- nn.TransformerDecoderLayer: A single decoder layer of the Transformer.
- nn.MultiheadAttention: Multi-head attention layer.
For an introduction to how the Transformer works, see the following article, "A Detailed Look at the Transformer (Attention Is All You Need)":
https://zhuanlan.zhihu.com/p/48508221
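As a sketch (d_model, head count, and layer count chosen arbitrarily), the snippet below stacks three encoder layers with nn.TransformerEncoder and passes a sequence-first batch through it:

```python
import torch
from torch import nn

encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

src = torch.randn(10, 32, 64)   # (seq, batch, d_model) -- sequence-first by default
memory = encoder(src)
print(memory.shape)             # torch.Size([10, 32, 64])
```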
As noted at the start of this section, if PyTorch's built-in layers do not meet our needs, we can build a custom model layer by inheriting from the nn.Module base class and implementing the forward method, since PyTorch does not distinguish between models and model layers.
Below is the source code of PyTorch's nn.Linear layer; we can imitate it to define our own custom layer.
```python
import math

import torch
from torch import nn
import torch.nn.functional as F


class Linear(nn.Module):
    __constants__ = ['in_features', 'out_features']

    def __init__(self, in_features, out_features, bias=True):
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, input):
        return F.linear(input, self.weight, self.bias)

    def extra_repr(self):
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )
```
```python
linear = nn.Linear(20, 30)
inputs = torch.randn(128, 20)
output = linear(inputs)
print(output.size())
```

torch.Size([128, 30])
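As a quick check (a sketch, reusing the arbitrary sizes above), the custom Linear class defined above can be used in place of the built-in layer and produces the same output shape:

```python
my_linear = Linear(20, 30)   # the custom layer defined above
print(my_linear)             # Linear(in_features=20, out_features=30, bias=True)
print(my_linear(torch.randn(128, 20)).shape)  # torch.Size([128, 30])
```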