Introduction
This is an extension to pystematic that adds functionality for running machine learning experiments in PyTorch. Its main contribution is the Context class and its related classes, which aim to make your code agnostic to whether you are running on CUDA, on the CPU, or in distributed data-parallel mode.
Installation
All you have to do for pystematic to load the extension is to install it:
$ pip install pystematic-torch
Experiment API
This extension publishes its API under the pystematic.torch namespace.
General
- pystematic.torch.move_to_device(device, *args)
Utility method to place a batch of data on a specific device (e.g. cuda or cpu). It handles nested dicts and lists by traversing every element and moving each one to the proper device if possible. Unrecognized objects are left as is.
- Parameters:
device (str, torch.Device) – The device to move to
*args (any) – Any objects that you want to move
- Returns:
The moved objects
- Return type:
any
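The traversal described above can be pictured with a small sketch. This is not the library's implementation: `move_nested` and `FakeTensor` are hypothetical names, and the sketch assumes that "movable" simply means "has a `to()` method".

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without torch."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        # Like torch.Tensor.to(), return a new object on the target device.
        return FakeTensor(device)


def move_nested(device, obj):
    """Recursively move obj to device, leaving unrecognized objects as is."""
    if isinstance(obj, dict):
        return {key: move_nested(device, value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_nested(device, item) for item in obj)
    if callable(getattr(obj, "to", None)):
        return obj.to(device)
    return obj  # e.g. strings and ints pass through untouched


batch = {"image": FakeTensor(), "meta": ["id-1", FakeTensor()], "index": 3}
moved = move_nested("cuda", batch)
```

Here the two fake tensors end up on the new device, while the string and the integer are returned unchanged.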
- pystematic.torch.save_checkpoint(state_dict, id) → None
Saves the provided state_dict to a file in pystematic.output_dir. When called in distributed mode, this function only saves the checkpoint in the master process.
- Parameters:
state_dict (dict) – The state dict to save, such as the one returned from Context.state_dict().
id (any) – An id that uniquely identifies this checkpoint, e.g. epoch number or step number.
- pystematic.torch.load_checkpoint(checkpoint_file_path) → dict
Loads and returns a checkpoint from the given filepath.
- Parameters:
checkpoint_file_path (str, pathlib.Path) – Path to the file to load.
- Returns:
The loaded state dict.
- Return type:
dict
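The master-only saving behaviour and the save/load round trip can be illustrated with a minimal sketch. It uses `pickle` as a stand-in for whatever serialization the library uses, and the function names, the `is_master` argument and the file name pattern are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_if_master(state_dict, output_dir, checkpoint_id, is_master):
    """Write the checkpoint only in the master process; return the path or None."""
    if not is_master:
        return None  # Worker processes skip the write entirely.
    path = Path(output_dir) / f"checkpoint-{checkpoint_id}.pkl"  # hypothetical name
    with path.open("wb") as f:
        pickle.dump(state_dict, f)
    return path


def load(path):
    """Read back a checkpoint written by save_if_master()."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    state = {"epoch": 3, "model": {"weight": [0.1, 0.2]}}
    worker_result = save_if_master(state, tmp, checkpoint_id=3, is_master=False)
    master_path = save_if_master(state, tmp, checkpoint_id=3, is_master=True)
    restored = load(master_path)
    files_written = sorted(p.name for p in Path(tmp).iterdir())
```

Guarding the write this way means every process can call the same function without two processes racing to write the same file.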
- pystematic.torch.run_parameter_sweep(experiment, list_of_params, max_num_processes=1, num_gpus_per_process=None) → None
Extends pystematic.run_parameter_sweep() with GPU limiting capabilities.
Runs an experiment multiple times with a set of different params. At most max_num_processes concurrent processes will be used. This call will block until all experiments have been run.
- Parameters:
experiment (Experiment) – The experiment to run.
list_of_params (list of dict) – A list of parameter dictionaries, each corresponding to one run of the experiment. See pystematic.param_matrix() for a convenient way of generating such a list.
max_num_processes (int, optional) – The maximum number of concurrent processes to use for running the experiments. Defaults to 1.
num_gpus_per_process (int, optional) – The number of GPUs to allocate for each experiment. If None, no allocation is done. Defaults to None.
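To make the shape of list_of_params concrete, the sketch below builds such a list by hand as a Cartesian product of option lists, and shows one plausible way of dividing GPUs into disjoint groups. Both helpers (`cartesian_params`, `gpu_slices`) are hypothetical and are not part of pystematic; how the library actually allocates GPUs is not specified here.

```python
import itertools


def cartesian_params(**options):
    """Return one parameter dict per combination of the given option lists."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(options[name] for name in names))
    ]


def gpu_slices(num_gpus, num_gpus_per_process):
    """Split GPU ids 0..num_gpus-1 into disjoint groups, one group per process."""
    return [
        list(range(start, start + num_gpus_per_process))
        for start in range(0, num_gpus - num_gpus_per_process + 1, num_gpus_per_process)
    ]


list_of_params = cartesian_params(lr=[0.1, 0.01], batch_size=[16, 32, 64])
groups = gpu_slices(num_gpus=4, num_gpus_per_process=2)
```

With two learning rates and three batch sizes the list holds six parameter dicts, one per run. With four GPUs and two GPUs per process, at most two runs could hold GPUs at the same time under this scheme.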
Distributed
- pystematic.torch.is_distributed() → bool
Alias for torch.distributed.is_initialized().
- Returns:
True if the torch distributed runtime is initialized.
- Return type:
bool
- pystematic.torch.is_master() → bool
If running in distributed mode, returns whether or not the current process is the master process. In non-distributed mode, always returns True.
- Returns:
Whether the current process is the master process.
- Return type:
bool
- pystematic.torch.get_num_processes() → int
Alias for torch.distributed.get_world_size(). In non-distributed mode, this always returns 1.
- Returns:
The total number of processes in the distributed runtime.
- Return type:
int
- pystematic.torch.get_rank() → int
Returns the global rank of the current process. If the current process is not running in distributed mode, it always returns 0. In single node training the rank is the same as the local rank.
- Returns:
The rank of the current process.
- Return type:
int
- pystematic.torch.broadcast_from_master(value)
Alias for torch.distributed.broadcast(value, 0). In non-distributed mode, this just returns the value.
- pystematic.torch.distributed_barrier() → None
Alias for torch.distributed.barrier(). In non-distributed mode, this is a no-op.
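The non-distributed fallbacks documented in this section can be sketched as thin wrappers around torch.distributed. This is an illustration of the documented behaviour rather than the extension's source. It assumes that the master process is the one with rank 0, and the guarded import is only there so that the sketch also runs on a machine without torch.

```python
try:
    import torch.distributed as dist
except ImportError:  # lets the sketch run on machines without torch
    dist = None


def is_distributed():
    return dist is not None and dist.is_available() and dist.is_initialized()


def get_rank():
    return dist.get_rank() if is_distributed() else 0


def get_num_processes():
    return dist.get_world_size() if is_distributed() else 1


def is_master():
    return get_rank() == 0  # assumption: rank 0 is the master process


def distributed_barrier():
    if is_distributed():
        dist.barrier()  # otherwise a no-op


# In a plain, single-process run every helper takes its fallback branch.
distributed_barrier()
summary = (is_distributed(), get_rank(), get_num_processes(), is_master())
```

Because each helper has a sensible single-process answer, training code can call them unconditionally instead of branching on the current mode.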
Context
When you are developing a model in PyTorch, you often want to be able to train it in many different settings, such as multi-node distributed, single GPU or even just on the CPU, depending on your work location and available resources. The main purpose of the context object is to let you transition seamlessly between these modes of training without changing your code.
If you are familiar with the torch.nn.Module object, you know that whenever you add a parameter to the object, it gets registered with it, and when you want to move the model to another device, you simply call module.cuda() or module.cpu() to move all parameters registered with the module.
A context object is like a torch module on steroids. You are meant to register every object important to your training session with it, e.g. models, optimizers, epoch counter etc. You can then transition your session with the Context.cpu(), Context.cuda() and Context.ddp() methods.
You can also serialize and restore the state of the entire session with the Context.state_dict() and Context.load_state_dict() methods, which makes checkpointing painless.
Here is a short example showing how the Context may be used:
import pystematic
import torch

@pystematic.experiment
def context_example(params):
    ctx = pystematic.torch.Context()
    ctx.epoch = 0
    ctx.recorder = pystematic.torch.Recorder()
    ctx.model = torch.nn.Sequential(
        torch.nn.Linear(2, 1),
        torch.nn.Sigmoid()
    )
    ctx.optimizer = torch.optim.SGD(ctx.model.parameters(), lr=0.01)

    # We use the smart dataloader so that batches are moved to
    # the correct device. Dataset is a placeholder for your own
    # torch.utils.data.Dataset subclass.
    ctx.dataloader = pystematic.torch.SmartDataLoader(
        dataset=Dataset(),
        batch_size=2
    )
    ctx.loss_function = torch.nn.BCELoss()

    ctx.cuda() # Move everything to cuda
    # ctx.ddp() # and maybe distributed data-parallel?

    if params["checkpoint"]:
        # Load checkpoint
        ctx.load_state_dict(pystematic.torch.load_checkpoint(params["checkpoint"]))

    # Train one epoch
    for inputs, labels in ctx.dataloader:
        # The smart dataloader makes sure the batch is placed on
        # the correct device.
        output = ctx.model(inputs)
        loss = ctx.loss_function(output, labels)

        ctx.optimizer.zero_grad()
        loss.backward()
        ctx.optimizer.step()

        ctx.recorder.scalar("train/loss", loss)
        ctx.recorder.step()

    ctx.epoch += 1

    # Save checkpoint
    pystematic.torch.save_checkpoint(ctx.state_dict(), id=ctx.epoch)
The following list specifies the transformations applied to each type of object:
- torch.nn.Module:
  - cuda: moved to torch.cuda.current_device()
  - cpu: moved to cpu
  - ddp: gets wrapped in torch.nn.parallel.DistributedDataParallel and then in an object proxy that delegates all getattr() calls for attributes missing on the wrapper to the underlying module. This means that you should be able to use any custom attributes and methods of the original module, even after it gets wrapped in the DDP module. This is needed to make the code you write agnostic to whether or not it is currently run in distributed mode.
- torch.optim.Optimizer:
  - cuda, cpu, ddp: optimizer parameters will be moved to the correct device.
  - ddp: gets silenced on non-master processes
- pystematic.torch.SmartDataLoader:
  - cuda, cpu: moves the dataloader to the proper device. If you initialize the dataloader with move_output = True, the items yielded when iterating the dataloader are moved to the correct device.
- Any object with a method named to() (such as torch.Tensor):
  - cuda, cpu, ddp: calls the to() method with the device to move the object to.
All other types of objects are left unchanged.
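The attribute delegation described for torch.nn.Module under ddp can be illustrated with a small pure-Python sketch. `Wrapper` is a hypothetical stand-in for DistributedDataParallel (which keeps the original model in its `module` attribute), and `Proxy` is an assumed, simplified version of the object proxy, not the extension's actual class.

```python
class Wrapper:
    """Hypothetical stand-in for DistributedDataParallel: keeps the model in .module."""

    def __init__(self, module):
        self.module = module

    def forward(self, x):
        return self.module.forward(x)


class Proxy:
    """Forwards attribute lookups the wrapper cannot satisfy to wrapper.module."""

    def __init__(self, wrapper):
        self._wrapper = wrapper

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup on the proxy itself fails.
        wrapper = self.__dict__["_wrapper"]
        try:
            return getattr(wrapper, name)
        except AttributeError:
            return getattr(wrapper.module, name)


class MyModel:
    def forward(self, x):
        return 2 * x

    def describe(self):  # custom method that the wrapper does not define
        return "my model"


model = Proxy(Wrapper(MyModel()))
```

Calling `model.forward(3)` goes through the wrapper, while `model.describe()` falls through to the original model, so calling code does not need to know whether the model is wrapped.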
The autotransform() method uses the parameters cuda, distributed and checkpoint to automatically determine how the context should be transformed.
- class pystematic.torch.Context
- autotransform()
Transforms the context according to the current experiment parameters. More specifically, it loads a state_dict from the parameter checkpoint if set, moves to cuda if the parameter cuda is set, and moves to distributed if the parameter distributed is set.
- cpu()
Moves the context to the cpu.
- cuda()
Moves the context to torch.cuda.current_device().
- ddp()
Moves the context to a distributed data-parallel setting. Can only be used if torch.distributed is initialized.
- load_state_dict(state: dict) → None
Sets the state for the context.
- Parameters:
state (dict) – The state to load.
- state_dict() → dict
Returns the whole state of the context by iterating all registered items and calling state_dict() on each item to retrieve its state. Primitive values will also be saved.
- Returns:
A dict representing the state of all registered objects.
- Return type:
dict
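The state handling described for state_dict() and load_state_dict() can be pictured with a minimal sketch. `MiniContext` and `Counter` are hypothetical names. The real Context may treat more kinds of objects specially, and the set of types counted as "primitive" here is an assumption.

```python
class Counter:
    """Hypothetical stateful object exposing state_dict()/load_state_dict()."""

    def __init__(self):
        self.value = 0

    def state_dict(self):
        return {"value": self.value}

    def load_state_dict(self, state):
        self.value = state["value"]


class MiniContext:
    """Treats every assigned attribute as registered; can snapshot or restore them."""

    def state_dict(self):
        state = {}
        for name, item in vars(self).items():
            if callable(getattr(item, "state_dict", None)):
                state[name] = item.state_dict()  # delegate to the item
            elif isinstance(item, (int, float, str, bool, type(None))):
                state[name] = item  # primitives are stored directly
        return state

    def load_state_dict(self, state):
        for name, value in state.items():
            item = getattr(self, name, None)
            if callable(getattr(item, "load_state_dict", None)):
                item.load_state_dict(value)
            else:
                setattr(self, name, value)


ctx = MiniContext()
ctx.epoch = 5
ctx.counter = Counter()
ctx.counter.value = 42
snapshot = ctx.state_dict()

restored = MiniContext()
restored.epoch = 0
restored.counter = Counter()
restored.load_state_dict(snapshot)
```

Note that the restoring context first registers the same objects and only then loads the snapshot, which mirrors how the example experiment builds the context before calling load_state_dict().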
Other
- class pystematic.torch.Recorder(output_dir=None, tensorboard=True, file=True, console=False)
Used for recording metrics during training and evaluation.
The recorder has an internal counter count that is recorded together with all values. The count typically represents the ‘global_step’ during training. Remember to increment the counter appropriately.
Each recorded value is also associated with a tag that uniquely determines which time series the value should be recorded to. The tag can use slashes (‘/’) to build hierarchies, e.g. train/loss, test/loss etc.
- Parameters:
output_dir (str, optional) – The output directory to store data in. Defaults to pystematic.output_dir.
tensorboard (bool, optional) – If the recorder should write tensorboard logs. Defaults to True.
file (bool, optional) – If the recorder should write to plain files. Defaults to True.
console (bool, optional) – If the recorder should write to stdout. Defaults to False.
- property count: int
Counter that represents the x-axis when logging data. You can assign a value to this property or call step() to increase the counter.
- figure(tag, fig)
Logs a matplotlib figure
- Parameters:
tag (str) – A string that determines which time series the value should be recorded to.
fig (Figure) – A matplotlib figure
- image(tag, image)
Logs an image
- Parameters:
tag (str) – A string that determines which time series the value should be recorded to.
image (PIL.Image, np.ndarray, torch.tensor) – The image
- load_state_dict(state)
Loads a state dict
- Parameters:
state (dict) – The state dict to load.
- params(params_dict)
Logs a parameter dict.
- Parameters:
params_dict (dict) – dict of param values.
- scalar(tag, scalar)
Logs a scalar value.
- Parameters:
tag (str) – A string that determines which time series the value should be recorded to.
scalar (float) – The value of the scalar.
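How the counter and the tags interact can be shown with a tiny in-memory stand-in. `MiniRecorder` is hypothetical: it only keeps (count, value) pairs per tag and does not write tensorboard logs or files like the real Recorder.

```python
from collections import defaultdict


class MiniRecorder:
    """Stores (count, value) pairs per tag; a stand-in for the real backends."""

    def __init__(self):
        self.count = 0
        self.series = defaultdict(list)

    def scalar(self, tag, value):
        self.series[tag].append((self.count, float(value)))

    def step(self):
        self.count += 1


recorder = MiniRecorder()
for loss in [0.9, 0.5, 0.3]:
    recorder.scalar("train/loss", loss)
    recorder.step()  # without this, every value would share count 0

recorder.scalar("test/loss", 0.4)
```

Each tag forms its own time series, and all series share the same counter, which is why forgetting to call step() would stack every value on the same x position.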
- class pystematic.torch.SmartDataLoader(dataset, shuffle=False, random_seed=None, sampler=None, batch_sampler=None, move_output=True, loading_bar=True, **kwargs)
Extends torch.utils.data.DataLoader with the following:
  - A loading bar is displayed when iterating the dataloader.
  - The items yielded when iterating are moved to the device previously set with to().
  - Transparently handles both distributed and non-distributed modes.
- Parameters:
dataset (torch.utils.data.Dataset) – The dataset to construct a loader for.
shuffle (bool, optional) – Whether to shuffle the data when loading. Ignored if sampler is not None. Defaults to False.
random_seed (int, optional) – Random seed to use when shuffling data. Ignored if sampler is not None. Defaults to None.
sampler (torch.utils.data.Sampler, Iterable, optional) – An object defining how to sample data items. Defaults to None.
move_output (bool, optional) – If items yielded during iteration should automatically be moved to the current device. Defaults to True.
loading_bar (bool, optional) – If a loading bar should be displayed during iteration. Defaults to True.
- to(device)
Sets the device that yielded items should be placed on when iterating.
- Parameters:
device (str, torch.Device) – The device to move the items to.
Default parameters
The following parameters are added to all experiments by default. Note that these are also listed if you run an experiment from the command line with the --help option.
- checkpoint: If using the context autotransform() method, it will load the checkpoint pointed to by this parameter (if set). Default value is None.
- cuda: If using the context autotransform() method, setting this to True will move the context to cuda. Default value is True.
- distributed: Controls if the experiment should be run in a distributed fashion (multiple GPUs). When set to True, a distributed mode will be launched (similar to torch.distributed.launch) before the experiment main function is run. If using the context autotransform() method, this parameter also tells the context whether to move to distributed mode (ddp). Default value is False.
- node_rank: The rank of the node for multi-node distributed training. Default value is 0.
- nproc_per_node: The number of processes to launch on each node. For GPU training, this is recommended to be set to the number of GPUs in your system so that each process can be bound to a single GPU. Default value is 1.
- nnodes: The number of nodes to use for distributed training. Default value is 1.
- master_addr: The master node’s (rank 0) IP address or hostname. Leave default for single node training. Default value is 127.0.0.1.
- master_port: The master node’s (rank 0) port used for communication during distributed training. Default value is 29500.
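The relationship between nnodes, nproc_per_node and node_rank can be made concrete with a little arithmetic. The sketch assumes that pystematic follows the same rank convention as torch.distributed.launch, which the distributed parameter is described as resembling. The function names are hypothetical.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Global rank of the process with the given local rank on the given node."""
    return node_rank * nproc_per_node + local_rank


# Two nodes with four processes (e.g. four GPUs) each.
total = world_size(nnodes=2, nproc_per_node=4)
ranks_on_node_1 = [global_rank(1, 4, local) for local in range(4)]
single_node_ranks = [global_rank(0, 4, local) for local in range(4)]
```

Under this convention the second node hosts global ranks 4 to 7, and on a single node (node_rank 0) the global rank equals the local rank.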