apex.amp — Apex 0.1.0 documentation


This page documents the updated API for Amp (Automatic Mixed Precision), a tool to enable Tensor Core-accelerated training in only 3 lines of Python.

A runnable, comprehensive Imagenet example demonstrating good practices can be found on the Github page.

GANs are a tricky case that many people have requested. A comprehensive DCGAN example is under construction.

If you already implemented Amp based on the instructions below, but it isn’t behaving as expected, please review Advanced Amp Usage to see if any topics match your use case. If that doesn’t help, file an issue.

opt_levels and Properties

Amp allows users to easily experiment with different pure and mixed precision modes. Commonly-used default modes are chosen by selecting an “optimization level” or opt_level; each opt_level establishes a set of properties that govern Amp’s implementation of pure or mixed precision training. Finer-grained control of how a given opt_level behaves can be achieved by passing values for particular properties directly to amp.initialize. These manually specified values override the defaults established by the opt_level.

Example:

# Declare model and optimizer as usual, with default (FP32) precision
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Allow Amp to perform casts as required by the opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...

# loss.backward() becomes:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...

Users should not manually cast their model or data to .half(), regardless of what opt_level or properties are chosen. Amp intends that users start with an existing default (FP32) script, add the three lines corresponding to the Amp API, and begin training with mixed precision. Amp can also be disabled, in which case the original script will behave exactly as it used to. In this way, there’s no risk adhering to the Amp API, and a lot of potential performance benefit.

Note

Because it’s never necessary to manually cast your model (aside from the call to amp.initialize) or input data, a script that adheres to the new API can switch between different opt-levels without having to make any other changes.
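One common way to exploit this (a minimal sketch, not taken from the Imagenet example; the --opt_level flag name and the toy model are assumptions) is to expose the opt_level as a command-line argument, so the same script can be rerun at O0 through O3 unchanged:

import argparse
import torch
from apex import amp

parser = argparse.ArgumentParser()
parser.add_argument("--opt_level", type=str, default="O1")
args = parser.parse_args()

model = torch.nn.Linear(128, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# The only Amp-specific line; nothing else needs to change across opt_levels.
model, optimizer = amp.initialize(model, optimizer, opt_level=args.opt_level)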

Properties

Currently, the under-the-hood properties that govern pure or mixed precision training are the following:

  • cast_model_type: Casts your model’s parameters and buffers to the desired type.

  • patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.

  • keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.

  • master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.

  • loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.

Again, you often don’t need to specify these properties by hand. Instead, select an opt_level, which will set them up for you. After selecting an opt_level, you can optionally pass property kwargs as manual overrides.

If you attempt to override a property that does not make sense for the selected opt_level, Amp will raise an error with an explanation. For example, selecting opt_level="O1" combined with the override master_weights=True does not make sense. O1 inserts casts around Torch functions rather than model weights. Data, activations, and weights are recast out-of-place on the fly as they flow through patched functions. Therefore, the model weights themselves can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.
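Concretely, an override is just an extra keyword argument on top of the chosen opt_level (a minimal sketch; the static scale of 128.0 is an arbitrary illustration):

# O2 defaults to loss_scale="dynamic"; this override pins a static loss scale instead.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", loss_scale=128.0)

# By contrast, the nonsensical combination described above would raise an error:
# model, optimizer = amp.initialize(model, optimizer, opt_level="O1", master_weights=True)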

opt_levels

Recognized opt_levels are "O0", "O1", "O2", and "O3".

O0 and O3 are not true mixed precision, but they are useful for establishing accuracy and speed baselines, respectively.

O1 and O2 are different implementations of mixed precision. Try both, and see what gives the best speedup and accuracy for your model.

O0: FP32 training

Your incoming model should be FP32 already, so this is likely a no-op. O0 can be useful to establish an accuracy baseline.

Default properties set by O0:

cast_model_type=torch.float32

patch_torch_functions=False

keep_batchnorm_fp32=None (effectively, “not applicable,” everything is FP32)

master_weights=False

loss_scale=1.0

O1: Mixed Precision (recommended for typical use)

Patch all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist model. Whitelist ops (for example, Tensor Core-friendly ops like GEMMs and convolutions) are performed in FP16. Blacklist ops that benefit from FP32 precision (for example, softmax) are performed in FP32. O1 also uses dynamic loss scaling, unless overridden.

Default properties set by O1:

cast_model_type=None (not applicable)

patch_torch_functions=True

keep_batchnorm_fp32=None (again, not applicable, all model weights remain FP32)

master_weights=None (not applicable, model weights remain FP32)

loss_scale="dynamic"

O2: “Almost FP16” Mixed Precision

O2 casts the model weights to FP16, patches the model’s forward method to cast input data to FP16, keeps batchnorms in FP32, maintains FP32 master weights, updates the optimizer’s param_groups so that optimizer.step() acts directly on the FP32 weights (followed by FP32 master weight->FP16 model weight copies if necessary), and implements dynamic loss scaling (unless overridden). Unlike O1, O2 does not patch Torch functions or Tensor methods.

Default properties set by O2:

cast_model_type=torch.float16

patch_torch_functions=False

keep_batchnorm_fp32=True

master_weights=True

loss_scale="dynamic"

O3: FP16 training

O3 may not achieve the stability of the true mixed precision options O1 and O2. However, it can be useful to establish a speed baseline for your model, against which the performance of O1 and O2 can be compared. If your model uses batch normalization, to establish “speed of light” you can try O3 with the additional property override keep_batchnorm_fp32=True (which enables cudnn batchnorm, as stated earlier).
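Concretely, that override looks like this (a sketch):

# Pure FP16 except batchnorm, which stays FP32 so that cudnn batchnorm can be used.
model, optimizer = amp.initialize(model, optimizer, opt_level="O3", keep_batchnorm_fp32=True)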

Default properties set by O3:

cast_model_type=torch.float16

patch_torch_functions=False

keep_batchnorm_fp32=False

master_weights=False

loss_scale=1.0

Unified API

apex.amp.initialize(models, optimizers=None, enabled=True, opt_level='O1', cast_model_type=None, patch_torch_functions=None, keep_batchnorm_fp32=None, master_weights=None, loss_scale=None, cast_model_outputs=None, num_losses=1, verbosity=1, min_loss_scale=None, max_loss_scale=16777216.0)[source]

Initialize your models, optimizers, and the Torch tensor and functional namespace according to the chosen opt_level and overridden properties, if any.

amp.initialize should be called after you have finished constructing your model(s) and optimizer(s), but before you send your model through any DistributedDataParallel wrapper. See Distributed training in the Imagenet example.

Currently, amp.initialize should only be called once, although it can process an arbitrary number of models and optimizers (see the corresponding Advanced Amp Usage topic). If you think your use case requires amp.initialize to be called more than once, let us know.

Any property keyword argument that is not None will be interpreted as a manual override.

To prevent having to rewrite anything else in your script, name the returned models/optimizers to replace the passed models/optimizers, as in the code sample below.

Parameters
  • models (torch.nn.Module or list of torch.nn.Modules) – Models to modify/cast.

  • optimizers (optional, torch.optim.Optimizer or list of torch.optim.Optimizers) – Optimizers to modify/cast. REQUIRED for training, optional for inference.

  • enabled (bool, optional, default=True) – If False, renders all Amp calls no-ops, so your script should run as if Amp were not present.

  • opt_level (str, optional, default="O1") – Pure or mixed precision optimization level. Accepted values are “O0”, “O1”, “O2”, and “O3”, explained in detail above.

  • cast_model_type (torch.dtype, optional, default=None) – Optional property override, see above.

  • patch_torch_functions (bool, optional, default=None) – Optional property override.

  • keep_batchnorm_fp32 (bool or str, optional, default=None) – Optional property override. If passed as a string, must be the string “True” or “False”.

  • master_weights (bool, optional, default=None) – Optional property override.

  • loss_scale (float or str, optional, default=None) – Optional property override. If passed as a string, must be a string representing a number, e.g., “128.0”, or the string “dynamic”.

  • cast_model_outputs (torch.dtype, optional, default=None) – Option to ensure that the outputs of your model(s) are always cast to a particular type regardless of opt_level.

  • num_losses (int, optional, default=1) – Option to tell Amp in advance how many losses/backward passes you plan to use. When used in conjunction with the loss_id argument to amp.scale_loss, enables Amp to use a different loss scale per loss/backward pass, which can improve stability. See “Multiple models/optimizers/losses” under Advanced Amp Usage for examples; a brief sketch also follows the invocation examples below. If num_losses is left to 1, Amp will still support multiple losses/backward passes, but use a single global loss scale for all of them.

  • verbosity (int, default=1) – Set to 0 to suppress Amp-related output.

  • min_loss_scale (float, default=None) – Sets a floor for the loss scale values that can be chosen by dynamic loss scaling. The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.

  • max_loss_scale (float, default=2.**24) – Sets a ceiling for the loss scale values that can be chosen by dynamic loss scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.

Returns

Model(s) and optimizer(s) modified according to the opt_level. If either the models or optimizers args were lists, the corresponding return value will also be a list.

Permissible invocations:

model, optim = amp.initialize(model, optim,...)
model, [optim1, optim2] = amp.initialize(model, [optim1, optim2],...)
[model1, model2], optim = amp.initialize([model1, model2], optim,...)
[model1, model2], [optim1, optim2] = amp.initialize([model1, model2], [optim1, optim2],...)

# This is not an exhaustive list of the cross product of options that are possible,
# just a set of examples.
model, optim = amp.initialize(model, optim, opt_level="O0")
model, optim = amp.initialize(model, optim, opt_level="O0", loss_scale="dynamic"|128.0|"128.0")

model, optim = amp.initialize(model, optim, opt_level="O1")  # uses loss_scale="dynamic" default
model, optim = amp.initialize(model, optim, opt_level="O1", loss_scale=128.0|"128.0")

model, optim = amp.initialize(model, optim, opt_level="O2")  # uses loss_scale="dynamic" default
model, optim = amp.initialize(model, optim, opt_level="O2", loss_scale=128.0|"128.0")
model, optim = amp.initialize(model, optim, opt_level="O2", keep_batchnorm_fp32=True|False|"True"|"False")

model, optim = amp.initialize(model, optim, opt_level="O3")  # uses loss_scale=1.0 default
model, optim = amp.initialize(model, optim, opt_level="O3", loss_scale="dynamic"|128.0|"128.0")
model, optim = amp.initialize(model, optim, opt_level="O3", keep_batchnorm_fp32=True|False|"True"|"False")

The Imagenet example demonstrates live use of various opt_levels and overrides.
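As a brief sketch of the num_losses option mentioned above (the two losses and the single shared optimizer are assumptions; the canonical examples live under Advanced Amp Usage):

model, optimizer = amp.initialize(model, optimizer, opt_level="O1", num_losses=2)

# Each backward pass identifies itself with loss_id, so Amp can maintain
# a separate loss scale per loss.
with amp.scale_loss(loss1, optimizer, loss_id=0) as scaled_loss:
    scaled_loss.backward()
with amp.scale_loss(loss2, optimizer, loss_id=1) as scaled_loss:
    scaled_loss.backward()

optimizer.step()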

apex.amp.scale_loss(loss, optimizers, loss_id=0, model=None, delay_unscale=False, delay_overflow_check=False)[source]

On context manager entrance, creates scaled_loss = (loss.float())*current loss scale. scaled_loss is yielded so that the user can call scaled_loss.backward():

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

On context manager exit (if delay_unscale=False), the gradients are checked for infs/NaNs and unscaled, so that optimizer.step() can be called.

Note

If Amp is using explicit FP32 master params (which is the default for opt_level=O2, and can also be manually enabled by supplying master_weights=True to amp.initialize), any FP16 gradients are copied to FP32 master gradients before being unscaled. optimizer.step() will then apply the unscaled master gradients to the master params.

Warning

If Amp is using explicit FP32 master params, only the FP32 master gradients will be unscaled. The direct .grad attributes of any FP16 model params will remain scaled after context manager exit. This subtlety affects gradient clipping. See “Gradient clipping” under Advanced Amp Usage for best practices.

Parameters
  • loss (Tensor) – Typically a scalar Tensor. The scaled_loss that the context manager yields is simply loss.float()*loss_scale, so in principle loss could have more than one element, as long as you call backward() on scaled_loss appropriately within the context manager body.

  • optimizers – All optimizer(s) for which the current backward pass is creating gradients. Must be an optimizer or list of optimizers returned from an earlier call to amp.initialize. For example use with multiple optimizers, see “Multiple models/optimizers/losses” under Advanced Amp Usage.

  • loss_id (int, optional, default=0) – When used in conjunction with the num_losses argument to amp.initialize, enables Amp to use a different loss scale per loss. loss_id must be an integer between 0 and num_losses that tells Amp which loss is being used for the current backward pass. See “Multiple models/optimizers/losses” under Advanced Amp Usage for examples. If loss_id is left unspecified, Amp will use the default global loss scaler for this backward pass.

  • model (torch.nn.Module, optional, default=None) – Currently unused, reserved to enable future optimizations.

  • delay_unscale (bool, optional, default=False) – delay_unscale is never necessary, and the default value of False is strongly recommended. If True, Amp will not unscale the gradients or perform model->master gradient copies on context manager exit. delay_unscale=True is a minor ninja performance optimization and can result in weird gotchas (especially with multiple models/optimizers/losses), so only use it if you know what you’re doing. “Gradient accumulation across iterations” under Advanced Amp Usage illustrates a situation where this CAN (but does not need to) be used; a brief sketch also follows the warning below.

Warning

If delay_unscale is True for a given backward pass, optimizer.step() cannot yet be called after context manager exit, and must wait for another, later backward context manager invocation with delay_unscale left to False.
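For orientation, the gradient-accumulation pattern referenced above looks roughly like this (a sketch, assuming iters_to_accumulate micro-batches per optimizer step and a standard loader/criterion; see Advanced Amp Usage for the authoritative version):

for i, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)
    step_now = ((i + 1) % iters_to_accumulate == 0)

    # Intermediate passes may delay unscaling; the final pass of the window
    # unscales the accumulated gradients so that optimizer.step() is legal.
    with amp.scale_loss(loss, optimizer, delay_unscale=not step_now) as scaled_loss:
        scaled_loss.backward()

    if step_now:
        optimizer.step()
        optimizer.zero_grad()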

apex.amp.master_params(optimizer)[source]

Generator expression that iterates over the params owned by optimizer.

Parameters

optimizer – An optimizer previously returned from amp.initialize.
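A typical use, following the gradient-clipping warning above (a sketch; max_norm is a placeholder value), is to clip the FP32 master gradients rather than the model’s .grad attributes:

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

# Clip the unscaled master gradients; the FP16 model grads may still be scaled.
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm)
optimizer.step()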

Checkpointing

To properly save and load your Amp training, we introduce amp.state_dict(), which contains all loss_scalers and their corresponding unskipped steps, as well as amp.load_state_dict() to restore these attributes.

In order to get bitwise accuracy, we recommend the following workflow:

# Initialization
opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)

# Train your model
...

# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...

# Restore
model = ...
optimizer = ...
checkpoint = torch.load('amp_checkpoint.pt')

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
amp.load_state_dict(checkpoint['amp'])

# Continue training
...

Note that we recommend restoring the model using the same opt_level. We also recommend calling the load_state_dict methods after amp.initialize.

Advanced use cases

The unified Amp API supports gradient accumulation across iterations, multiple backward passes per iteration, multiple models/optimizers, custom/user-defined autograd functions, and custom data batch classes. Gradient clipping and GANs also require special treatment, but this treatment does not need to change for different opt_levels. Further details can be found here:

  • Advanced Amp Usage

Transition guide for old API users

We strongly encourage moving to the new Amp API, because it’s more versatile, easier to use, and future-proof. The original FP16_Optimizer and the old “Amp” API are deprecated, and subject to removal at any time.

For users of the old “Amp” API

In the new API, opt-level O1 performs the same patching of the Torch namespace as the old thing called “Amp.” However, the new API allows static or dynamic loss scaling, while the old API only allowed dynamic loss scaling.

In the new API, the old call to amp_handle = amp.init(), and the returned amp_handle, are no longer exposed or necessary. The new amp.initialize() does the duty of amp.init() (and more). Therefore, any existing calls to amp_handle = amp.init() should be deleted.

The functions formerly exposed through amp_handle are now free functions accessible through the amp module.

The backward context manager must be changed accordingly:

# old API
with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

# new API
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

For now, the deprecated “Amp” API documentation can still be found on the Github README: https://github.com/NVIDIA/apex/tree/master/apex/amp. The old API calls that annotate user functions to run with a particular precision are still honored by the new API.
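For reference, such an annotation looks roughly like the following (a sketch; my_ops and its stable_softmax function are hypothetical, and the registration call belongs to the old API documented in the README above):

import my_ops            # hypothetical user module containing a custom op
from apex import amp

# Old-API annotation, still honored: force my_ops.stable_softmax to run in FP32.
# Register before calling amp.initialize.
amp.register_float_function(my_ops, 'stable_softmax')

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")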

For users of the old FP16_Optimizer

opt-level O2 is equivalent to FP16_Optimizer with dynamic_loss_scale=True. Once again, the backward pass must be changed to the unified version:

# old API
optimizer.backward(loss)

# new API
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

One annoying aspect of FP16_Optimizer was that the user had to manually convert their model to half (either by calling .half() on it, or using a function or module wrapper from apex.fp16_utils), and also manually call .half() on input data. Neither of these is necessary in the new API. No matter what opt_level you choose, you can and should simply build your model and pass input data in the default FP32 format. The new Amp API will perform the right conversions during model, optimizer = amp.initialize(model, optimizer, opt_level=....) based on the opt_level and any overridden flags. Floating point input data may be FP32 or FP16, but you may as well just let it be FP32, because the model returned by amp.initialize will have its forward method patched to cast the input data appropriately.
