tinyms.optimizers

The optimizers module provides common optimizers for training, such as SGD, Adam, and Momentum. An optimizer is used to calculate and apply gradient updates.

class tinyms.optimizers.Optimizer(learning_rate, parameters, weight_decay=0.0, loss_scale=1.0)[source]

Base class for all optimizers.

Note

This class defines the API to add Ops to train a model. Never use this class directly, but instead instantiate one of its subclasses.

Different parameter groups can set different learning_rate and weight_decay.

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight_decay is positive. For most optimizers, when not separating parameters, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float.

  • parameters (Union[list[Parameter], list[dict]]) –

    When parameters is a list of Parameter objects to be updated, each element must be of class Parameter. When parameters is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • weight_decay (float) – A floating point value for the weight decay. It must be equal to or greater than 0. If the type of weight_decay input is int, it will be converted to float. Default: 0.0.

  • loss_scale (float) – A floating point value for the loss scale. It must be greater than 0. If the type of loss_scale input is int, it will be converted to float. Default: 1.0.

Raises
  • ValueError – If the learning_rate is a Tensor, but the dimension of the tensor is greater than 1.

  • TypeError – If the learning_rate is not a float, a Tensor, or an Iterable.

Supported Platforms:

Ascend GPU
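The parameter-group rules above can be summarized in plain Python. The sketch below is illustrative only; the helper name resolve_groups and the string placeholders are hypothetical and not part of the tinyms API:

>>> def resolve_groups(group_dicts, default_lr, default_weight_decay):
...     # Illustrative only: mirrors the fallback rules for the "lr" and
...     # "weight_decay" keys described above.
...     resolved = []
...     for group in group_dicts:
...         if 'order_params' in group:
...             # This dict only fixes the parameter order; it carries no
...             # learning rate or weight decay of its own.
...             continue
...         resolved.append({'params': group['params'],
...                          'lr': group.get('lr', default_lr),
...                          'weight_decay': group.get('weight_decay', default_weight_decay)})
...     return resolved
>>> groups = [{'params': ['conv.weight'], 'weight_decay': 0.01},
...           {'params': ['fc.weight'], 'lr': 0.01}]
>>> resolve_groups(groups, default_lr=0.1, default_weight_decay=0.0)
[{'params': ['conv.weight'], 'lr': 0.1, 'weight_decay': 0.01}, {'params': ['fc.weight'], 'lr': 0.01, 'weight_decay': 0.0}]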

broadcast_params(optim_result)[source]

Apply Broadcast operations in the sequential order of parameter groups.

Returns

bool, the status flag.

decay_weight(gradients)[source]

Weight decay.

An approach to reduce overfitting in a deep neural network model.

Parameters

gradients (tuple[Tensor]) – The gradients of self.parameters, with the same shape as self.parameters.

Returns

tuple[Tensor], the gradients after weight decay.

get_lr()[source]

Get the learning rate of the current step.

Returns

float, the learning rate of the current step.

get_lr_parameter(param)[source]

Get the learning rate of the parameter.

Parameters

param (Union[Parameter, list[Parameter]]) – The Parameter or list of Parameter.

Returns

Parameter, single Parameter or list[Parameter] according to the input type.

scale_grad(gradients)[source]

Loss scale for mixed precision.

An approach used in mixed precision training to improve the speed and energy efficiency of training deep neural networks.

Parameters

gradients (tuple[Tensor]) – The gradients of self.parameters, with the same shape as self.parameters.

Returns

tuple[Tensor], the gradients after loss scaling.
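Taken together, decay_weight and scale_grad amount to element-wise transformations over the gradient tuple. A minimal NumPy sketch of the underlying math (illustrative only, not the tinyms implementation):

>>> import numpy as np
>>> params = (np.array([1.0, -2.0]),)
>>> grads = (np.array([0.5, 0.5]),)
>>> weight_decay, loss_scale = 0.01, 1024.0
>>> # decay_weight: add weight_decay * parameter to each gradient
>>> decayed = tuple(g + weight_decay * p for g, p in zip(grads, params))
>>> # scale_grad: undo loss scaling by dividing each gradient by loss_scale
>>> unscaled = tuple(g / loss_scale for g in decayed)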

property target

The method is used to determine whether the parameter is updated on host or device. The input type is str and can only be ‘CPU’, ‘Ascend’ or ‘GPU’.

property unique

Whether to make the gradients unique. The input type is bool. This property is read-only.

class tinyms.optimizers.Momentum(params, learning_rate, momentum, weight_decay=0.0, loss_scale=1.0, use_nesterov=False)[source]

Implements the Momentum algorithm.

Refer to the paper On the importance of initialization and momentum in deep learning for more details.

\[v_{t} = v_{t-1} \ast u + gradients\]

If use_nesterov is True:

\[p_{t} = p_{t-1} - (grad \ast lr + v_{t} \ast u \ast lr)\]

If use_nesterov is False:

\[p_{t} = p_{t-1} - lr \ast v_{t}\]

where grad, lr, p, v and u denote the gradients, learning_rate, params, moments and momentum, respectively.
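A NumPy sketch of the update above makes the two branches concrete; this is illustrative only (the helper name momentum_step is hypothetical), not the tinyms kernel:

>>> import numpy as np
>>> def momentum_step(p, v, grad, lr=0.1, u=0.9, use_nesterov=False):
...     v = v * u + grad                      # v_t = v_{t-1} * u + gradients
...     if use_nesterov:
...         p = p - (grad * lr + v * u * lr)  # p_t = p_{t-1} - (grad*lr + v_t*u*lr)
...     else:
...         p = p - lr * v                    # p_t = p_{t-1} - lr * v_t
...     return p, v
>>> p, v = momentum_step(np.array([1.0]), np.zeros(1), np.array([0.5]))
>>> p, v  # (array([0.95]), array([0.5]))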

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float.

  • momentum (float) – Hyperparameter of type float representing the momentum for the moving average. It must be at least 0.0.

  • weight_decay (int, float) – Weight decay (L2 penalty). It must be equal to or greater than 0.0. Default: 0.0.

  • loss_scale (int, float) – A floating point value for the loss scale. It must be greater than 0.0. Default: 1.0.

  • use_nesterov (bool) – Enable Nesterov momentum. Default: False.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

tuple[bool], all elements are True.

Raises
  • ValueError – If the momentum is less than 0.0.

  • TypeError – If the momentum is not a float or use_nesterov is not a bool.

Supported Platforms:

Ascend GPU CPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9)
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.Momentum(group_params, learning_rate=0.1, momentum=0.9, weight_decay=0.0)
>>> # The conv_params's parameters will use a learning rate of default value 0.1 and a weight decay of 0.01.
>>> # The no_conv_params's parameters will use a learning rate of 0.01 and a weight decay of default value 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim, metrics=None)
class tinyms.optimizers.LARS(optimizer, epsilon=1e-05, coefficient=0.001, use_clip=False, lars_filter=<function LARS.<lambda>>)[source]

Implements the LARS algorithm with LARSUpdate Operator.

LARS is an optimization algorithm employing a large batch optimization technique. Refer to paper LARGE BATCH TRAINING OF CONVOLUTIONAL NETWORKS.

The updating formulas are as follows,

\[\begin{split}\begin{array}{ll} \\ \lambda = \frac{\theta \text{ * } || \omega || }{|| g_{t} || \text{ + } \delta \text{ * } || \omega || } \\ \lambda = \begin{cases} \min(\frac{\lambda}{\alpha }, 1) & \text{ if } clip = True \\ \lambda & \text{ otherwise } \end{cases}\\ g_{t+1} = \lambda * (g_{t} + \delta * \omega) \end{array}\end{split}\]

\(\theta\) represents coefficient, \(\omega\) represents parameters, \(g\) represents gradients, \(t\) represents updating step, \(\delta\) represents weight_decay, \(\alpha\) represents learning_rate, \(clip\) represents use_clip.
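The local learning-rate computation can be sketched in NumPy as follows; this is illustrative only (the helper name lars_scaled_grad is hypothetical), and the actual LARSUpdate operator runs on device:

>>> import numpy as np
>>> def lars_scaled_grad(w, g, coefficient=0.001, weight_decay=0.0,
...                      epsilon=1e-05, use_clip=False, lr=0.1):
...     # lambda = coefficient * ||w|| / (||g|| + weight_decay * ||w|| + epsilon)
...     w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
...     lam = coefficient * w_norm / (g_norm + weight_decay * w_norm + epsilon)
...     if use_clip:
...         lam = min(lam / lr, 1.0)          # clip the trust ratio
...     return lam * (g + weight_decay * w)   # scaled gradient g_{t+1}
>>> g_new = lars_scaled_grad(np.array([3.0, 4.0]), np.array([0.3, 0.4]))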

Parameters
  • optimizer (Optimizer) – MindSpore optimizer for which to wrap and modify gradients.

  • epsilon (float) – Term added to the denominator to improve numerical stability. Default: 1e-05.

  • coefficient (float) – Trust coefficient for calculating the local learning rate. Default: 0.001.

  • use_clip (bool) – Whether to use clip operation for calculating the local learning rate. Default: False.

  • lars_filter (Function) – A function to determine whether to apply the LARS algorithm. Default: lambda x: ‘LayerNorm’ not in x.name and ‘bias’ not in x.name.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params in the optimizer, the shape is the same as the params in the optimizer.

Outputs:

Union[Tensor[bool], tuple[Parameter]], it depends on the output of optimizer.

Supported Platforms:

Ascend

Examples

>>> net = Net()
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> opt = nn.Momentum(net.trainable_params(), 0.1, 0.9)
>>> opt_lars = nn.LARS(opt, epsilon=1e-08, coefficient=0.02)
>>> model = Model(net, loss_fn=loss, optimizer=opt_lars, metrics=None)
class tinyms.optimizers.Adam(params, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-08, use_locking=False, use_nesterov=False, weight_decay=0.0, loss_scale=1.0)[source]

Updates gradients by the Adaptive Moment Estimation (Adam) algorithm.

The Adam algorithm is proposed in Adam: A Method for Stochastic Optimization.

The updating formulas are as follows,

\[\begin{split}\begin{array}{ll} \\ m = \beta_1 * m + (1 - \beta_1) * g \\ v = \beta_2 * v + (1 - \beta_2) * g * g \\ l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\ w = w - l * \frac{m}{\sqrt{v} + \epsilon} \end{array}\end{split}\]

\(m\) represents the 1st moment vector moment1, \(v\) represents the 2nd moment vector moment2, \(g\) represents gradients, \(l\) represents scaling factor lr, \(\beta_1, \beta_2\) represent beta1 and beta2, \(t\) represents updating step while \(beta_1^t\) and \(beta_2^t\) represent beta1_power and beta2_power, \(\alpha\) represents learning_rate, \(w\) represents params, \(\epsilon\) represents eps.
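A NumPy sketch of these formulas (illustrative only; the helper name adam_step is hypothetical, and this is not the tinyms kernel):

>>> import numpy as np
>>> def adam_step(w, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
...     m = beta1 * m + (1 - beta1) * g                      # 1st moment
...     v = beta2 * v + (1 - beta2) * g * g                  # 2nd moment
...     l = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)  # scaling factor
...     w = w - l * m / (np.sqrt(v) + eps)
...     return w, m, v
>>> w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
>>> for t in range(1, 4):              # three steps on a fixed gradient
...     w, m, v = adam_step(w, m, v, g=np.array([0.5]), t=t)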

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

The sparse strategy is applied when the SparseGatherV2 operator is used in the forward network. The sparse feature is under continuous development. To execute the sparse strategy on the host, set the target to the CPU.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 1e-3.

  • beta1 (float) – The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0). Default: 0.9.

  • beta2 (float) – The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0). Default: 0.999.

  • eps (float) – Term added to the denominator to improve numerical stability. Should be greater than 0. Default: 1e-8.

  • use_locking (bool) – Whether to enable a lock to protect variable tensors from being updated. If true, updates of the var, m, and v tensors will be protected by a lock. If false, the result is unpredictable. Default: False.

  • use_nesterov (bool) – Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. If true, update the gradients using NAG. If false, update the gradients without using NAG. Default: False.

  • weight_decay (float) – Weight decay (L2 penalty). It must be equal to or greater than 0. Default: 0.0.

  • loss_scale (float) – A floating point value for the loss scale. Should be greater than 0. Default: 1.0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

Tensor[bool], the value is True.

Supported Platforms:

Ascend GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.Adam(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.Adam(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
property target

The method is used to determine whether the parameter is updated on host or device. The input type is str and can only be ‘CPU’, ‘Ascend’ or ‘GPU’.

class tinyms.optimizers.AdamWeightDecay(params, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-06, weight_decay=0.0)[source]

Implements the Adam algorithm with decoupled weight decay.
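Unlike Adam above, the weight decay here is decoupled from the adaptive update: it is added to the update applied to the parameters rather than folded into the gradients. A rough sketch, assuming the usual AdamW-style step (illustrative only; the helper name adamw_step is hypothetical):

>>> import numpy as np
>>> def adamw_step(w, m, v, g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6,
...                weight_decay=0.0):
...     m = beta1 * m + (1 - beta1) * g
...     v = beta2 * v + (1 - beta2) * g * g
...     update = m / (np.sqrt(v) + eps)
...     if weight_decay > 0:
...         update = update + weight_decay * w   # decay applied to the parameter
...     return w - lr * update, m, v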

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 1e-3.

  • beta1 (float) – The exponential decay rate for the 1st moment estimations. Default: 0.9. Should be in range (0.0, 1.0).

  • beta2 (float) – The exponential decay rate for the 2nd moment estimations. Default: 0.999. Should be in range (0.0, 1.0).

  • eps (float) – Term added to the denominator to improve numerical stability. Default: 1e-6. Should be greater than 0.

  • weight_decay (float) – Weight decay (L2 penalty). It must be equal to or greater than 0. Default: 0.0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

tuple[bool], all elements are True.

Supported Platforms:

Ascend GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.AdamWeightDecay(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.AdamWeightDecay(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
class tinyms.optimizers.LazyAdam(params, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-08, use_locking=False, use_nesterov=False, weight_decay=0.0, loss_scale=1.0)[source]

This optimizer applies a lazy Adam algorithm when the gradient is sparse.

The original Adam algorithm is proposed in Adam: A Method for Stochastic Optimization.

The updating formulas are as follows,

\[\begin{split}\begin{array}{ll} \\ m = \beta_1 * m + (1 - \beta_1) * g \\ v = \beta_2 * v + (1 - \beta_2) * g * g \\ l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\ w = w - l * \frac{m}{\sqrt{v} + \epsilon} \end{array}\end{split}\]

\(m\) represents the 1st moment vector moment1, \(v\) represents the 2nd moment vector moment2, \(g\) represents gradients, \(l\) represents scaling factor lr, \(\beta_1, \beta_2\) represent beta1 and beta2, \(t\) represents updating step while \(beta_1^t\) and \(beta_2^t\) represent beta1_power and beta2_power, \(\alpha\) represents learning_rate, \(w\) represents params, \(\epsilon\) represents eps.

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

The sparse strategy is applied when the SparseGatherV2 operator is used in the forward network. Note that the sparse behavior is not equivalent to the original Adam algorithm: only the parameters at the current indices are updated. The sparse feature is under continuous development. To execute the sparse strategy on the host, set the target to the CPU.
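The lazy behavior can be pictured as touching only the rows named by the sparse gradient's indices. The sketch below is an illustrative NumPy rendering of that idea, not the tinyms kernel; the helper name lazy_adam_rows is hypothetical.

>>> import numpy as np
>>> def lazy_adam_rows(w, m, v, indices, values, t, lr=1e-3,
...                    beta1=0.9, beta2=0.999, eps=1e-8):
...     # Illustrative only: rows not listed in `indices` keep stale moments.
...     l = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
...     for idx, g in zip(indices, values):
...         m[idx] = beta1 * m[idx] + (1 - beta1) * g
...         v[idx] = beta2 * v[idx] + (1 - beta2) * g * g
...         w[idx] -= l * m[idx] / (np.sqrt(v[idx]) + eps)
...     return w, m, v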

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 1e-3.

  • beta1 (float) – The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0). Default: 0.9.

  • beta2 (float) – The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0). Default: 0.999.

  • eps (float) – Term added to the denominator to improve numerical stability. Should be greater than 0. Default: 1e-8.

  • use_locking (bool) – Whether to enable a lock to protect variable tensors from being updated. If true, updates of the var, m, and v tensors will be protected by a lock. If false, the result is unpredictable. Default: False.

  • use_nesterov (bool) – Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. If true, update the gradients using NAG. If false, update the gradients without using NAG. Default: False.

  • weight_decay (float) – Weight decay (L2 penalty). Default: 0.0.

  • loss_scale (float) – A floating point value for the loss scale. Should be equal to or greater than 1. Default: 1.0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

Tensor[bool], the value is True.

Supported Platforms:

Ascend

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.LazyAdam(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.LazyAdam(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
property target

The method is used to determine whether the parameter is updated on host or device. The input type is str and can only be ‘CPU’, ‘Ascend’ or ‘GPU’.

class tinyms.optimizers.AdamOffload(params, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-08, use_locking=False, use_nesterov=False, weight_decay=0.0, loss_scale=1.0)[source]

This optimizer offloads the Adam computation to the host CPU while keeping the parameters updated on the device, in order to minimize memory cost. Although this increases performance overhead, the optimizer can be used to run larger models.

The Adam algorithm is proposed in Adam: A Method for Stochastic Optimization.

The updating formulas are as follows,

\[\begin{split}\begin{array}{ll} \\ m = \beta_1 * m + (1 - \beta_1) * g \\ v = \beta_2 * v + (1 - \beta_2) * g * g \\ l = \alpha * \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \\ w = w - l * \frac{m}{\sqrt{v} + \epsilon} \end{array}\end{split}\]

\(m\) represents the 1st moment vector moment1, \(v\) represents the 2nd moment vector moment2, \(g\) represents gradients, \(l\) represents scaling factor lr, \(\beta_1, \beta_2\) represent beta1 and beta2, \(t\) represents updating step while \(beta_1^t\) and \(beta_2^t\) represent beta1_power and beta2_power, \(\alpha\) represents learning_rate, \(w\) represents params, \(\epsilon\) represents eps.

Note

This optimizer only supports GRAPH_MODE currently.

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 1e-3.

  • beta1 (float) – The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0). Default: 0.9.

  • beta2 (float) – The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0). Default: 0.999.

  • eps (float) – Term added to the denominator to improve numerical stability. Should be greater than 0. Default: 1e-8.

  • use_locking (bool) – Whether to enable a lock to protect variable tensors from being updated. If true, updates of the var, m, and v tensors will be protected by a lock. If false, the result is unpredictable. Default: False.

  • use_nesterov (bool) – Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. If true, update the gradients using NAG. If false, update the gradients without using NAG. Default: False.

  • weight_decay (float) – Weight decay (L2 penalty). It must be equal to or greater than 0. Default: 0.0.

  • loss_scale (float) – A floating point value for the loss scale. Should be greater than 0. Default: 1.0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

Tensor[bool], the value is True.

Supported Platforms:

Ascend GPU CPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.AdamOffload(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.AdamOffload(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
class tinyms.optimizers.Lamb(params, learning_rate, beta1=0.9, beta2=0.999, eps=1e-06, weight_decay=0.0)[source]

Implements the Lamb algorithm with dynamic learning rate.

LAMB is an optimization algorithm employing a layerwise adaptive large batch optimization technique. Refer to the paper LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 MINUTES.
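At a high level, LAMB rescales each layer's Adam-style update by a layerwise trust ratio ||w|| / ||update||. The following simplified per-layer sketch is illustrative only; the helper name lamb_layer_step is hypothetical, and the real implementation also handles bias correction and exclusion rules:

>>> import numpy as np
>>> def lamb_layer_step(w, m, v, g, lr, beta1=0.9, beta2=0.999, eps=1e-6,
...                     weight_decay=0.0):
...     m = beta1 * m + (1 - beta1) * g
...     v = beta2 * v + (1 - beta2) * g * g
...     update = m / (np.sqrt(v) + eps) + weight_decay * w
...     w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
...     trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
...     return w - lr * trust * update, m, v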

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float.

  • beta1 (float) – The exponential decay rate for the 1st moment estimations. Default: 0.9. Should be in range (0.0, 1.0).

  • beta2 (float) – The exponential decay rate for the 2nd moment estimations. Default: 0.999. Should be in range (0.0, 1.0).

  • eps (float) – Term added to the denominator to improve numerical stability. Default: 1e-6. Should be greater than 0.

  • weight_decay (float) – Weight decay (L2 penalty). Default: 0.0. Should be equal to or greater than 0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

tuple[bool], all elements are True.

Supported Platforms:

Ascend GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.Lamb(params=net.trainable_params(), learning_rate=0.1)
>>>
>>> #2) Use parameter groups and set different values
>>> poly_decay_lr = learning_rate_schedule.PolynomialDecayLR(learning_rate=0.1, end_learning_rate=0.01,
...                                                    decay_steps=4, power=0.5)
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': poly_decay_lr},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.Lamb(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use dynamic learning rate of poly decay learning rate and default
>>> # weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
class tinyms.optimizers.SGD(params, learning_rate=0.1, momentum=0.0, dampening=0.0, weight_decay=0.0, nesterov=False, loss_scale=1.0)[source]

Implements stochastic gradient descent. Momentum is optional.

Introduction to SGD can be found at https://en.wikipedia.org/wiki/Stochastic_gradient_descent. Nesterov momentum is based on the formula from paper On the importance of initialization and momentum in deep learning.

\[v_{t+1} = u \ast v_{t} + gradient \ast (1-dampening)\]

If nesterov is True:

\[p_{t+1} = p_{t} - lr \ast (gradient + u \ast v_{t+1})\]

If nesterov is False:

\[p_{t+1} = p_{t} - lr \ast v_{t+1}\]

Note that for the first step, \(v_{t+1} = gradient\).

where p, v and u denote the parameters, accum and momentum, respectively.
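A NumPy sketch of the update above (illustrative only; the helper name sgd_step is hypothetical, not the tinyms kernel):

>>> import numpy as np
>>> def sgd_step(p, v, grad, lr=0.1, u=0.0, dampening=0.0, nesterov=False,
...              first_step=False):
...     # v_{t+1} = u * v_t + gradient * (1 - dampening); first step uses the raw gradient
...     v = grad if first_step else u * v + grad * (1 - dampening)
...     if nesterov:
...         p = p - lr * (grad + u * v)
...     else:
...         p = p - lr * v
...     return p, v
>>> p, v = sgd_step(np.array([1.0]), np.zeros(1), np.array([0.5]),
...                 u=0.9, first_step=True)
>>> p, v  # (array([0.95]), array([0.5]))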

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 0.1.

  • momentum (float) – A floating point value for the momentum. It must be at least 0.0. Default: 0.0.

  • dampening (float) – A floating point value of dampening for momentum. It must be at least 0.0. Default: 0.0.

  • weight_decay (float) – Weight decay (L2 penalty). It must be equal to or greater than 0. Default: 0.0.

  • nesterov (bool) – Enables Nesterov momentum. If nesterov is used, momentum must be positive and dampening must be equal to 0.0. Default: False.

  • loss_scale (float) – A floating point value for the loss scale, which must be larger than 0.0. Default: 1.0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

Tensor[bool], the value is True.

Raises

ValueError – If the momentum, dampening or weight_decay value is less than 0.0.

Supported Platforms:

Ascend GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.SGD(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.SGD(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use a learning rate of default value 0.1 and a weight decay of 0.01.
>>> # The no_conv_params's parameters will use a learning rate of 0.01 and a weight decay of default value 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
class tinyms.optimizers.FTRL(params, initial_accum=0.1, learning_rate=0.001, lr_power=-0.5, l1=0.0, l2=0.0, use_locking=False, loss_scale=1.0, weight_decay=0.0)[source]

Implements the FTRL algorithm with ApplyFtrl Operator.

FTRL is an online convex optimization algorithm that adaptively chooses its regularization function based on the loss functions. Refer to the paper Adaptive Bound Optimization for Online Convex Optimization. Refer to the paper Ad Click Prediction: a View from the Trenches for an engineering perspective.

The updating formulas are as follows,

\[\begin{split}\begin{array}{ll} \\ m_{t+1} = m_{t} + g^2 \\ u_{t+1} = u_{t} + g - \frac{m_{t+1}^\text{-p} - m_{t}^\text{-p}}{\alpha } * \omega_{t} \\ \omega_{t+1} = \begin{cases} \frac{(sign(u_{t+1}) * l1 - u_{t+1})}{\frac{m_{t+1}^\text{-p}}{\alpha } + 2 * l2 } & \text{ if } |u_{t+1}| > l1 \\ 0.0 & \text{ otherwise } \end{cases}\\ \end{array}\end{split}\]

\(m\) represents accum, \(g\) represents grads, \(t\) represents updating step, \(u\) represents linear, \(p\) represents lr_power, \(\alpha\) represents learning_rate, \(\omega\) represents params.
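An element-wise NumPy sketch of these formulas (illustrative only; the helper name ftrl_step is hypothetical). With the default lr_power of -0.5, \(m^\text{-p}\) reduces to \(\sqrt{m}\):

>>> import numpy as np
>>> def ftrl_step(w, accum, linear, g, lr=0.001, lr_power=-0.5, l1=0.0, l2=0.0):
...     new_accum = accum + g * g                    # m_{t+1} = m_t + g^2
...     sigma = (new_accum ** -lr_power - accum ** -lr_power) / lr
...     linear = linear + g - sigma * w              # u_{t+1}
...     quad = new_accum ** -lr_power / lr + 2 * l2
...     w = np.where(np.abs(linear) > l1,
...                  (np.sign(linear) * l1 - linear) / quad, 0.0)
...     return w, new_accum, linear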

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on all of the parameters.

To improve performance when using parameter groups, a customized order of parameters is supported.

The sparse strategy is applied when the SparseGatherV2 operator is used in the forward network. The sparse feature is under continuous development. To execute the sparse strategy on the host, set the target to the CPU.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Using a different learning rate for separate parameter groups is currently not supported.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • initial_accum (float) – The starting value for accumulators, must be zero or positive values. Default: 0.1.

  • learning_rate (float) – The learning rate value, which must be zero or positive. A dynamic learning rate is currently not supported. Default: 0.001.

  • lr_power (float) – Learning rate power controls how the learning rate decreases during training, must be less than or equal to zero. Use fixed learning rate if lr_power is zero. Default: -0.5.

  • l1 (float) – l1 regularization strength, must be greater than or equal to zero. Default: 0.0.

  • l2 (float) – l2 regularization strength, must be greater than or equal to zero. Default: 0.0.

  • use_locking (bool) – If true, use locks for updating operation. Default: False.

  • loss_scale (float) – Value for the loss scale. It must be equal to or greater than 1.0. Default: 1.0.

  • weight_decay (float) – Weight decay value to multiply weight, must be zero or positive value. Default: 0.0.

Inputs:
  • grads (tuple[Tensor]) - The gradients of params in the optimizer, the shape is the same as the params in optimizer.

Outputs:

tuple[Parameter], the updated parameters, the shape is the same as params.

Supported Platforms:

Ascend GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.FTRL(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.FTRL(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use weight decay of 0.01.
>>> # The no_conv_params's parameters will use default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
property target

The method is used to determine whether the parameter is updated on host or device. The input type is str and can only be ‘CPU’, ‘Ascend’ or ‘GPU’.

class tinyms.optimizers.RMSProp(params, learning_rate=0.1, decay=0.9, momentum=0.0, epsilon=1e-10, use_locking=False, centered=False, loss_scale=1.0, weight_decay=0.0)[source]

Implements Root Mean Squared Propagation (RMSProp) algorithm.

Update params according to the RMSProp algorithm.

The equation is as follows:

\[s_{t} = \rho s_{t-1} + (1 - \rho)(\nabla Q_{i}(w))^2\]
\[m_{t} = \beta m_{t-1} + \frac{\eta} {\sqrt{s_{t} + \epsilon}} \nabla Q_{i}(w)\]
\[w = w - m_{t}\]

The first equation calculates a moving average of the squared gradient for each weight. The gradient is then divided by \(\sqrt{s_{t} + \epsilon}\).

If centered is True:

\[g_{t} = \rho g_{t-1} + (1 - \rho)\nabla Q_{i}(w)\]
\[s_{t} = \rho s_{t-1} + (1 - \rho)(\nabla Q_{i}(w))^2\]
\[m_{t} = \beta m_{t-1} + \frac{\eta} {\sqrt{s_{t} - g_{t}^2 + \epsilon}} \nabla Q_{i}(w)\]
\[w = w - m_{t}\]

where \(w\) represents params, which will be updated. \(g_{t}\) is mean gradients, \(g_{t-1}\) is the last moment of \(g_{t}\). \(s_{t}\) is the mean square gradients, \(s_{t-1}\) is the last moment of \(s_{t}\), \(m_{t}\) is moment, the delta of w, \(m_{t-1}\) is the last moment of \(m_{t}\). \(\rho\) represents decay. \(\beta\) is the momentum term, represents momentum. \(\epsilon\) is a smoothing term to avoid division by zero, represents epsilon. \(\eta\) is learning rate, represents learning_rate. \(\nabla Q_{i}(w)\) is gradients, represents gradients.
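A NumPy sketch of both variants (illustrative only; the helper name rmsprop_step is hypothetical, not the tinyms kernel):

>>> import numpy as np
>>> def rmsprop_step(w, s, m, g_avg, grad, lr=0.1, rho=0.9, beta=0.0,
...                  eps=1e-10, centered=False):
...     s = rho * s + (1 - rho) * grad * grad        # mean square gradient s_t
...     if centered:
...         g_avg = rho * g_avg + (1 - rho) * grad   # mean gradient g_t
...         m = beta * m + lr * grad / np.sqrt(s - g_avg ** 2 + eps)
...     else:
...         m = beta * m + lr * grad / np.sqrt(s + eps)
...     return w - m, s, m, g_avg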

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 0.1.

  • decay (float) – Decay rate. Should be equal to or greater than 0. Default: 0.9.

  • momentum (float) – Hyperparameter of type float representing the momentum for the moving average. Should be equal to or greater than 0. Default: 0.0.

  • epsilon (float) – Term added to the denominator to improve numerical stability. Should be greater than 0. Default: 1e-10.

  • use_locking (bool) – Whether to enable a lock to protect the variable and accumulation tensors from being updated. Default: False.

  • centered (bool) – If true, gradients are normalized by the estimated variance of the gradient. Default: False.

  • loss_scale (float) – A floating point value for the loss scale. Should be greater than 0. Default: 1.0.

  • weight_decay (float) – Weight decay (L2 penalty). Should be equal to or greater than 0. Default: 0.0.

Inputs:
  • gradients (tuple[Tensor]) - The gradients of params, the shape is the same as params.

Outputs:

Tensor[bool], the value is True.

Supported Platforms:

Ascend GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.RMSProp(params=net.trainable_params(), learning_rate=0.1)
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.RMSProp(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use a learning rate of default value 0.1 and a weight decay of 0.01.
>>> # The no_conv_params's parameters will use a learning rate of 0.01 and a weight decay of default value 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
class tinyms.optimizers.ProximalAdagrad(params, accum=0.1, learning_rate=0.001, l1=0.0, l2=0.0, use_locking=False, loss_scale=1.0, weight_decay=0.0)[source]

Implements the ProximalAdagrad algorithm with ApplyProximalAdagrad Operator.

ProximalAdagrad is an algorithm for online learning and stochastic optimization. Refer to the paper Efficient Learning using Forward-Backward Splitting.

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

The sparse strategy is applied when the SparseGatherV2 operator is used in the forward network. The sparse feature is under continuous development. To execute the sparse strategy on the host, set the target to the CPU.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • accum (float) – The starting value for accumulators, must be zero or positive values. Default: 0.1.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 0.001.

  • l1 (float) – l1 regularization strength, must be greater than or equal to zero. Default: 0.0.

  • l2 (float) – l2 regularization strength, must be greater than or equal to zero. Default: 0.0.

  • use_locking (bool) – If true, use locks for updating operation. Default: False.

  • loss_scale (float) – Value for the loss scale. It must be greater than 0.0. Default: 1.0.

  • weight_decay (float) – Weight decay value to multiply weight, must be zero or positive value. Default: 0.0.

Inputs:
  • grads (tuple[Tensor]) - The gradients of params in the optimizer, the shape is the same as the params in optimizer.

Outputs:

Tensor[bool], the value is True.

Supported Platforms:

Ascend

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.ProximalAdagrad(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.ProximalAdagrad(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)
property target

The method is used to determine whether the parameter is updated on host or device. The input type is str and can only be ‘CPU’, ‘Ascend’ or ‘GPU’.

class tinyms.optimizers.Adagrad(params, accum=0.1, learning_rate=0.001, update_slots=True, loss_scale=1.0, weight_decay=0.0)[source]

Implements the Adagrad algorithm with ApplyAdagrad Operator.

Adagrad is an algorithm for online learning and stochastic optimization. Refer to the paper Efficient Learning using Forward-Backward Splitting.

Note

When separating parameter groups, the weight decay in each group will be applied on the parameters if the weight decay is positive. When not separating parameter groups, the weight_decay in the API will be applied on the parameters without ‘beta’ or ‘gamma’ in their names if weight_decay is positive.

To improve performance when using parameter groups, a customized order of parameters is supported.

Parameters
  • params (Union[list[Parameter], list[dict]]) –

    When params is a list of Parameter objects to be updated, each element must be of class Parameter. When params is a list of dict, the “params”, “lr”, “weight_decay” and “order_params” are the keys that can be parsed.

    • params: Required. The value must be a list of Parameter.

    • lr: Optional. If “lr” is in the keys, the corresponding learning rate will be used. If not, the learning_rate in the API will be used.

    • weight_decay: Optional. If “weight_decay” is in the keys, the corresponding weight decay will be used. If not, the weight_decay in the API will be used.

    • order_params: Optional. If “order_params” is in the keys, the value must be the order of parameters, and this order will be followed in the optimizer. The dict must contain no other keys, and the parameters in the value of ‘order_params’ must belong to one of the parameter groups.

  • accum (float) – The starting value for accumulators, must be zero or positive values. Default: 0.1.

  • learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]) – A value or a graph for the learning rate. When the learning_rate is an Iterable or a 1-D Tensor, a dynamic learning rate is used, and the i-th step will take the i-th value as the learning rate. When the learning_rate is a LearningRateSchedule, a dynamic learning rate is used, and the i-th learning rate will be calculated during training according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D Tensor, a fixed learning rate is used. Other cases are not supported. The float learning rate must be equal to or greater than 0. If the type of learning_rate is int, it will be converted to float. Default: 0.001.

  • update_slots (bool) – If true, update accumulation. Default: True.

  • loss_scale (float) – Value for the loss scale. It must be greater than 0.0. Default: 1.0.

  • weight_decay (float) – Weight decay value to multiply weight, must be zero or positive value. Default: 0.0.

Inputs:
  • grads (tuple[Tensor]) - The gradients of params in the optimizer, the shape is the same as the params in optimizer.

Outputs:

Tensor[bool], the value is True.

Supported Platforms:

Ascend CPU GPU

Examples

>>> net = Net()
>>> #1) All parameters use the same learning rate and weight decay
>>> optim = nn.Adagrad(params=net.trainable_params())
>>>
>>> #2) Use parameter groups and set different values
>>> conv_params = list(filter(lambda x: 'conv' in x.name, net.trainable_params()))
>>> no_conv_params = list(filter(lambda x: 'conv' not in x.name, net.trainable_params()))
>>> group_params = [{'params': conv_params, 'weight_decay': 0.01},
...                 {'params': no_conv_params, 'lr': 0.01},
...                 {'order_params': net.trainable_params()}]
>>> optim = nn.Adagrad(group_params, learning_rate=0.1, weight_decay=0.0)
>>> # The conv_params's parameters will use default learning rate of 0.1 and weight decay of 0.01.
>>> # The no_conv_params's parameters will use learning rate of 0.01 and default weight decay of 0.0.
>>> # The final parameters order in which the optimizer will be followed is the value of 'order_params'.
>>>
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> model = Model(net, loss_fn=loss, optimizer=optim)