Hello, Habr!
A long-awaited book about the PyTorch library is now available for pre-order.
Since you will learn all the necessary basics of PyTorch from that book, here we remind you of the benefits of a process called "grokking", i.e. in-depth comprehension of the topic you want to learn. In today's post we retell how Kai Arulkumaran breaks down PyTorch. Details are below the cut.
PyTorch is a flexible deep learning framework that automatically differentiates through dynamic neural networks (that is, networks with dynamic control flow, such as if statements and while loops). PyTorch supports GPU acceleration, distributed training, various kinds of optimization and many other nice features. Here I present some thoughts on how, in my opinion, PyTorch should be used; not all aspects of the library and recommended practices are covered, but I hope this text will be useful to you.
Neural networks are a subclass of computational graphs. A computational graph receives data as input, and that data is routed to (and may be transformed at) the nodes where it is processed. In deep learning, neurons (nodes) usually transform data by applying parameters and differentiable functions to it, so that the parameters can be optimized to minimize a loss by gradient descent. More broadly, the functions can be stochastic and the graph can be dynamic. Thus, while neural networks fit the dataflow programming paradigm well, PyTorch's API focuses on the imperative programming paradigm, and this way of interpreting programs is much more familiar. That is why PyTorch code is easier to read and it is easier to reason about the design of complex programs, which, however, does not require serious compromises on performance: in fact, PyTorch is fast enough and provides many optimizations that you, as an end user, need not worry about at all (although if you are really interested in them, you can dig a little deeper and get acquainted with them).
The rest of this article is a walkthrough of the official example on the MNIST dataset. Here we dig into PyTorch, so I recommend reading the article only after getting acquainted with the official beginner tutorials. For convenience, the code is presented as small fragments with comments, that is, it is not split into the separate functions/files that you are used to seeing in clean, modular code.
Imports
import argparse
import os

import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
All of these are quite standard imports, with the exception of the torchvision modules, which are especially actively used for solving computer vision tasks.
Setup
parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                    help='input batch size for training (default: 64)')
parser.add_argument('--epochs', type=int, default=10, metavar='N',
                    help='number of epochs to train (default: 10)')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                    help='learning rate (default: 0.01)')
parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                    help='SGD momentum (default: 0.5)')
parser.add_argument('--no-cuda', action='store_true', default=False,
                    help='disables CUDA training')
parser.add_argument('--seed', type=int, default=1, metavar='S',
                    help='random seed (default: 1)')
parser.add_argument('--save-interval', type=int, default=10, metavar='N',
                    help='how many batches to wait before checkpointing')
parser.add_argument('--resume', action='store_true', default=False,
                    help='resume training from checkpoint')
args = parser.parse_args()

use_cuda = torch.cuda.is_available() and not args.no_cuda
device = torch.device('cuda' if use_cuda else 'cpu')
torch.manual_seed(args.seed)
if use_cuda:
    torch.cuda.manual_seed(args.seed)
argparse is the standard way to handle command-line arguments in Python.
If you need to write code that works on different devices (using GPU acceleration when it is available, but falling back to the CPU when it is not), select and save the appropriate torch.device, which can then be used to determine where tensors should be stored. For more information on writing such code, see the official documentation. PyTorch's approach is to put device selection under the user's control, which may seem undesirable in simple examples. However, this approach greatly simplifies things when you have to work with tensors directly, which a) is convenient for debugging and b) lets you use devices efficiently by hand.
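For instance, a small sketch of device-agnostic tensor creation under this scheme (the tensors themselves are just an illustration):

# Tensors created with an explicit device argument land directly on that device
x = torch.zeros(8, 32, device=device)
# Existing CPU tensors can be moved (or copied) with .to()
y = torch.randn(8, 32).to(device)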
For reproducibility of experiments, it is necessary to set random seeds for all components that use random number generation (including random or numpy, if you use them). Please note: cuDNN uses non-deterministic algorithms, and it can optionally be disabled with torch.backends.cudnn.enabled = False.
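As a minimal sketch of full seeding (the random and numpy imports and the deterministic cuDNN flags are my additions here, not part of the original example):

import random

import numpy as np
import torch

def set_seed(seed):
    # Seed every source of randomness that the script may touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for determinism in cuDNN (optional)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(args.seed)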
Data
data_path = os.path.join(os.path.expanduser('~'), '.torch', 'datasets', 'mnist')
train_data = datasets.MNIST(data_path, train=True, download=True,
                            transform=transforms.Compose([
                                transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))]))
test_data = datasets.MNIST(data_path, train=False,
                           transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))]))

train_loader = DataLoader(train_data, batch_size=args.batch_size,
                          shuffle=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_data, batch_size=args.batch_size,
                         num_workers=4, pin_memory=True)
Since torchvision models are stored under ~/.torch/models/, I prefer to store torchvision datasets under ~/.torch/datasets. This is my own convention, but it is very convenient in projects built around MNIST, CIFAR-10, etc. In general, datasets should be stored separately from the code if you intend to reuse several datasets.
torchvision.transforms contains many convenient transformations for individual images, such as cropping and normalization.
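For instance, a hypothetical training-time pipeline with a bit of augmentation might look like this (the specific transforms are my illustration, not part of the MNIST example):

augment = transforms.Compose([
    transforms.RandomCrop(28, padding=2),        # random shifts as augmentation
    transforms.ToTensor(),                       # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))   # MNIST mean and std
])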
DataLoader has many options, but besides batch_size and shuffle you should also keep in mind num_workers and pin_memory, as they help increase efficiency.
num_workers > 0 uses subprocesses to load data asynchronously rather than blocking the main process for it. A typical use case is loading data (for example, images) from disk and, possibly, transforming it; all of this can be done in parallel with the network processing the data. This may need tuning in order to a) minimize the number of workers and, consequently, the amount of CPU and RAM used (each worker loads a whole batch, rather than individual samples within a batch) and b) minimize the time the network spends waiting for data.
pin_memory uses pinned (page-locked) memory, as opposed to pageable memory, to speed up transfers of data from RAM to the GPU (and does nothing for CPU-only code).
Model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net().to(device)
optimiser = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

if args.resume:
    model.load_state_dict(torch.load('model.pth'))
    optimiser.load_state_dict(torch.load('optimiser.pth'))
Network initialization usually covers member variables, layers that contain trainable parameters and, possibly, individual trainable parameters and non-trainable buffers. During the forward pass they are then used in combination with functions from F that are purely functional and contain no parameters. Some people prefer fully functional networks (e.g. keeping the parameters manually and using F.conv2d instead of nn.Conv2d), or networks consisting entirely of layers (e.g. nn.ReLU instead of F.relu).
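As an illustration of the "layers only" preference, here is a sketch of the same Net with the activations and pooling stored as modules too (my own rewrite, not part of the official example):

class NetLayersOnly(nn.Module):
    def __init__(self):
        super(NetLayersOnly, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.relu(self.pool(self.conv1(x)))
        x = self.relu(self.pool(self.conv2_drop(self.conv2(x))))
        x = x.view(-1, 320)
        x = self.relu(self.fc1(x))
        return self.log_softmax(self.fc2(x))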
.to(device) is a convenient way of sending the model's parameters (and buffers) to the GPU if device is set to the GPU; otherwise (if the device is set to the CPU) nothing happens. It is important to move the parameters to the appropriate device before passing them to the optimizer; otherwise the optimizer will not be able to track them correctly!
Both neural networks (nn.Module) and optimizers (optim.Optimizer) can save and load their internal state, and .load_state_dict(state_dict) is the recommended way to do it: you need to reload the state of both in order to resume training from previously saved state dictionaries. Saving the entire object can be error-prone. If you saved tensors on the GPU and want to load them onto the CPU or another GPU, the easiest way is to load them directly onto the CPU using the map_location option, e.g. torch.load('model.pth', map_location='cpu').
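A minimal sketch of moving a checkpoint between devices along these lines (my own illustration):

# Save from whatever device the model currently lives on
torch.save(model.state_dict(), 'model.pth')

# Load onto the CPU first, then move to the current device
state = torch.load('model.pth', map_location='cpu')
model.load_state_dict(state)
model.to(device)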
A few other points worth mentioning, although not shown here: you can use control flow in the forward pass (for example, an if statement can depend on a member variable or on the data itself). It is also perfectly acceptable to print tensors there, which greatly simplifies debugging. Finally, the forward pass can take many arguments. I will illustrate this with a short listing that is not tied to any particular idea:
def forward(self, x, hx, drop=False):
    hx2 = self.rnn(x, hx)
    print(hx.mean().item(), hx.var().item())
    if hx.max().item() > 10 or self.can_drop and drop:
        return hx
    else:
        return hx2
Training
model.train()
train_losses = []

for i, (data, target) in enumerate(train_loader):
    data = data.to(device=device, non_blocking=True)
    target = target.to(device=device, non_blocking=True)
    optimiser.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    train_losses.append(loss.item())
    optimiser.step()

    if i % 10 == 0:
        print(i, loss.item())
        torch.save(model.state_dict(), 'model.pth')
        torch.save(optimiser.state_dict(), 'optimiser.pth')
        torch.save(train_losses, 'train_losses.pth')
Network modules are put into training mode by default, which affects the behaviour of some modules, most notably dropout and batch normalization. One way or another, it is better to set this explicitly using .train(), which propagates the training flag down to all child modules.
Here the .to() method not only accepts the device, but also sets non_blocking=True, which enables asynchronous copying of data to the GPU from pinned memory, leaving the CPU free to work during the transfer; otherwise non_blocking=True is simply a no-op.
Before collecting a new set of gradients with loss.backward() and backpropagating them with optimiser.step(), you must manually zero the gradients of the parameters being optimized using optimiser.zero_grad(). By default PyTorch accumulates gradients, which is very convenient when you do not have enough resources to compute all the gradients you need in a single pass.
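For example, a sketch of gradient accumulation over several mini-batches relying on that convention (accum_steps is a hypothetical setting, not an argument of the original script):

accum_steps = 4  # effective batch size = batch_size * accum_steps
optimiser.zero_grad()
for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    loss = F.nll_loss(model(data), target) / accum_steps  # scale so the sum averages out
    loss.backward()                                       # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimiser.step()
        optimiser.zero_grad()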
PyTorch uses a "tape-based" automatic differentiation system: it records which operations were performed on tensors and in what order, and then replays them backwards to compute gradients (reverse-mode differentiation). This is why it is so flexible and allows arbitrary computational graphs. If none of the tensors involved requires gradients (you have to set requires_grad=True when creating a tensor for that), no graph is stored. However, networks usually have parameters that require gradients, so any computation performed on the network's output will be stored in the graph. So if you want to keep data resulting from such a step, you will need to either manually disable gradients or (the more common approach) store it as a Python number (using .item() on a PyTorch scalar) or a numpy array. Read more about autograd in the official documentation.
One way to shorten the computational graph is to call .detach(), for example on the hidden state passed between steps when training an RNN with truncated backpropagation through time. It is also handy when differentiating a loss where one of the components is the output of another network that should not be optimized with respect to that loss: examples include training the discriminator on the generator's output when working with GANs, or training a policy in an actor-critic algorithm that uses the value function as a baseline (e.g. A2C). Another technique for preventing gradient computation, effective in GAN training (training the generator on the discriminator's output) and typical in fine-tuning, is to loop over the network parameters and set param.requires_grad = False.
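A minimal sketch of that freezing loop, reusing the toy Net above purely for illustration (a real fine-tuning setup would use a pretrained backbone):

# Freeze all existing parameters, then fine-tune only a freshly re-created head
for param in model.parameters():
    param.requires_grad = False
model.fc2 = nn.Linear(50, 10).to(device)  # new layer; requires_grad=True by default
optimiser = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()),
                      lr=args.lr, momentum=args.momentum)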
It is important not only to log results to the console/log file, but also to checkpoint the model parameters (and the optimizer state) periodically, just in case. You can also use torch.save() to save regular Python objects, or use another standard solution, the built-in pickle.
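One common variant (my own sketch, not from the example) is to bundle everything into a single checkpoint dictionary:

checkpoint = {
    'model': model.state_dict(),
    'optimiser': optimiser.state_dict(),
    'train_losses': train_losses,
    'epoch': epoch,  # assuming an outer epoch loop
}
torch.save(checkpoint, 'checkpoint.pth')

# Later:
checkpoint = torch.load('checkpoint.pth', map_location='cpu')
model.load_state_dict(checkpoint['model'])
optimiser.load_state_dict(checkpoint['optimiser'])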
Testing
model.eval()
test_loss, correct = 0, 0

with torch.no_grad():
    for data, target in test_loader:
        data = data.to(device=device, non_blocking=True)
        target = target.to(device=device, non_blocking=True)
        output = model(data)
        test_loss += F.nll_loss(output, target, reduction='sum').item()
        pred = output.argmax(1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

test_loss /= len(test_data)
acc = correct / len(test_data)
print(acc, test_loss)
To complement .train(), networks must be explicitly put into evaluation mode using .eval().
As mentioned above, a computational graph is normally built whenever a network is used. To prevent this, use the no_grad context manager via with torch.no_grad().
A few more things
This is an extra section, in which I collected a few more useful asides.
Here is the official documentation on memory management.
CUDA errors? They are hard to fix, and are usually related to logic errors, for which more sensible error messages are produced on the CPU than on the GPU. So if you plan to work with the GPU, it is best to be able to switch quickly between CPU and GPU. A more general development tip is to organize the code so that it can be checked quickly before launching a full-scale job: for example, prepare a small or synthetic dataset, run one epoch of train + test, etc. If the problem really is a CUDA error, or you cannot switch to the CPU at all, set CUDA_LAUNCH_BLOCKING=1. This makes CUDA kernel launches synchronous, so you get more accurate error messages.
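One way to set this is from inside the script, assuming it happens before the first CUDA call (setting the variable in the shell when launching the script works just as well):

import os

# Must be set before CUDA is initialized in the process
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'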
A note on torch.multiprocessing, or simply running several PyTorch scripts at once. Because PyTorch uses multi-threaded BLAS libraries to speed up linear algebra on the CPU, several cores are usually involved. If you want to run several things at the same time, whether via multiprocessing or several scripts, it may be wise to reduce this manually by setting the environment variable OMP_NUM_THREADS to 1 or another small value. This reduces the chance of CPU thrashing. The official documentation has further notes on multiprocessing.
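A small sketch of limiting the thread count from inside a script (torch.set_num_threads is the in-process counterpart of the environment variable, which is normally set in the shell before launch):

# Limit PyTorch's own CPU thread pool when several processes share a machine
torch.set_num_threads(1)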