tinyms.callbacks¶

Callback-related classes and functions for the model training phase.

class tinyms.callbacks.LossTimeMonitor(lr_init=None)[source]¶

Monitor loss and time.

Parameters: lr_init (numpy.ndarray) – Learning rate used in training. Default: None.
Returns: None

Examples

>>> from tinyms import Tensor
>>> from tinyms.callbacks import LossTimeMonitor
>>>
>>> LossTimeMonitor(lr_init=Tensor([0.05] * 100).asnumpy())

class tinyms.callbacks.BertLossCallBack(dataset_size=1)[source]¶

Monitor the loss in training. If the loss is NaN or INF, training is terminated.

Parameters: dataset_size (int) – Print the loss every time. Default: 1.
Returns: None

Examples

>>> from tinyms.callbacks import BertLossCallBack
>>>
>>> BertLossCallBack(dataset_size=1)

step_end(run_context)[source]¶: Print the loss after each step.

class tinyms.callbacks.Callback[source]¶

Abstract base class used to build a callback class. Callbacks are context managers that are entered and exited when passed into the Model. You can use this mechanism to initialize and release resources automatically.

A callback function executes some operations at the current step or epoch.
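
The context-manager behaviour described above can be sketched in plain Python. This is a minimal stand-in, not the tinyms class: ResourceCallback and its log_lines attribute are hypothetical names, and step_end takes a bare step number instead of a run_context to keep the sketch self-contained.

```python
class ResourceCallback:
    """Stand-in for a Callback subclass; NOT the tinyms class itself."""

    def __init__(self):
        self.log_lines = None
        self.closed = False

    def __enter__(self):
        # Acquire the resource when the callback is entered.
        self.log_lines = []
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Release the resource, even if the body raised an exception.
        self.closed = True
        return False  # do not swallow exceptions

    def step_end(self, step):
        self.log_lines.append(f"step {step} done")


# A training loop would enter the callback, call its hooks, then exit it.
with ResourceCallback() as cb:
    for step in range(1, 4):
        cb.step_end(step)
```

Putting acquisition in __enter__ and release in __exit__ means the resource is freed whether training finishes normally or fails part-way.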

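Examples

>>> from tinyms.callbacks import Callback
>>>
>>> class Print_info(Callback):
...     def step_end(self, run_context):
...         # original_args() returns the arguments of the current training run
...         cb_params = run_context.original_args()
...         print(cb_params.cur_epoch_num)
...         print(cb_params.cur_step_num)
...
>>> print_cb = Print_info()
>>> # model, epoch and dataset are assumed to be defined elsewhere
>>> model.train(epoch, dataset, callbacks=print_cb)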

begin(run_context)[source]¶

Called once before the network starts executing.

Parameters: run_context (RunContext) – Includes some information about the model.

end(run_context)[source]¶

Called once after network training.

Parameters: run_context (RunContext) – Includes some information about the model.

epoch_begin(run_context)[source]¶

Called before each epoch begins.

Parameters: run_context (RunContext) – Includes some information about the model.

epoch_end(run_context)[source]¶

Called after each epoch finishes.

Parameters: run_context (RunContext) – Includes some information about the model.

step_begin(run_context)[source]¶

Called before each step begins.

Parameters: run_context (RunContext) – Includes some information about the model.

step_end(run_context)[source]¶

Called after each step finishes.

Parameters: run_context (RunContext) – Includes some information about the model.
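
To make the order of the six hooks concrete, here is a minimal sketch that follows only the descriptions above. RecordingCallback and run_hooks are hypothetical stand-ins written for illustration; they are not how tinyms drives callbacks internally, and the empty dict merely takes the place of a RunContext.

```python
class RecordingCallback:
    """Stand-in with the six hook names listed above; records the call order."""

    def __init__(self):
        self.calls = []

    def begin(self, run_context):
        self.calls.append("begin")

    def epoch_begin(self, run_context):
        self.calls.append("epoch_begin")

    def step_begin(self, run_context):
        self.calls.append("step_begin")

    def step_end(self, run_context):
        self.calls.append("step_end")

    def epoch_end(self, run_context):
        self.calls.append("epoch_end")

    def end(self, run_context):
        self.calls.append("end")


def run_hooks(callback, epochs, steps_per_epoch):
    """Hypothetical driver: calls each hook at the point its description names."""
    run_context = {}  # placeholder for a RunContext
    callback.begin(run_context)
    for _ in range(epochs):
        callback.epoch_begin(run_context)
        for _ in range(steps_per_epoch):
            callback.step_begin(run_context)
            callback.step_end(run_context)
        callback.epoch_end(run_context)
    callback.end(run_context)


recorder = RecordingCallback()
run_hooks(recorder, epochs=1, steps_per_epoch=2)
print(recorder.calls)
```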

class tinyms.callbacks.LossMonitor(per_print_times=1)[source]¶

Monitor the loss in training.

If the loss is NaN or INF, it will terminate training.

Note

If per_print_times is 0, do not print loss.

Parameters: per_print_times (int) – Print the loss every per_print_times steps. Default: 1.
Raises: ValueError – If per_print_times is not an integer or is less than zero.
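
The rules above (terminate on NaN or INF, print at a fixed step interval, 0 disables printing) can be sketched as two small helpers. The names loss_is_invalid and should_print are hypothetical and the logic is an assumption drawn from this description, not a copy of the library code.

```python
import math


def loss_is_invalid(loss):
    """Return True when the loss is NaN or infinite."""
    return math.isnan(loss) or math.isinf(loss)


def should_print(cur_step_num, per_print_times):
    """Return True when the loss would be printed at this step."""
    if not isinstance(per_print_times, int) or per_print_times < 0:
        raise ValueError("per_print_times must be an integer >= 0")
    # 0 disables printing entirely, as the note above states.
    return per_print_times != 0 and cur_step_num % per_print_times == 0


# Steps (out of the first 10) at which the loss would be printed.
printed_steps = [s for s in range(1, 11) if should_print(s, per_print_times=5)]
print(printed_steps)
```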

class tinyms.callbacks.TimeMonitor(data_size=None)[source]¶

Monitor the time in training.

Parameters: data_size (int) – Dataset size. Default: None.

class tinyms.callbacks.ModelCheckpoint(prefix='CKP', directory=None, config=None)[source]¶

The checkpoint callback class.

It is used together with the training process to save the model and network parameters after training.

Note

In the distributed training scenario, please specify different directories for each training process to save the checkpoint file. Otherwise, the training may fail.
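
One way to follow this advice is to derive the directory from the process rank. The sketch below is only an illustration: checkpoint_dir_for and rank_id are hypothetical names, and how you obtain the rank depends on your distributed setup. The resulting path would then be passed as the directory argument.

```python
import os


def checkpoint_dir_for(base_dir, rank_id):
    """Return a checkpoint directory unique to one training process."""
    return os.path.join(base_dir, f"rank_{rank_id}")


# Four processes, four distinct directories.
dirs = [checkpoint_dir_for("./checkpoint", rank) for rank in range(4)]
print(dirs)
```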

Parameters

prefix (str) – The prefix name of checkpoint files. Default: "CKP".
directory (str) – The path of the folder in which the checkpoint file will be saved. Default: None.
config (CheckpointConfig) – Checkpoint strategy configuration. Default: None.

Raises

ValueError – If the prefix is invalid.
TypeError – If the config is not of type CheckpointConfig.

end(run_context)[source]¶

Save the last checkpoint after training finishes.

Parameters: run_context (RunContext) – Context of the training run.

property latest_ckpt_file_name¶: Return the latest checkpoint path and file name.

step_end(run_context)[source]¶

Save the checkpoint at the end of the step.

Parameters: run_context (RunContext) – Context of the training run.

class tinyms.callbacks.SummaryCollector(summary_dir, collect_freq=10, collect_specified_data=None, keep_default_action=True, custom_lineage_data=None, collect_tensor_freq=None, max_file_size=None, export_options=None)[source]¶

SummaryCollector can help you to collect some common information.

It can help you to collect the loss, learning rate, computational graph and so on. SummaryCollector also enables the summary operator to collect data to summary files.

Note

Multiple SummaryCollector instances in the callback list are not allowed.
Not all information is collected at the training phase or at the eval phase.
SummaryCollector always records the data collected by the summary operator.
SummaryCollector only supports Linux systems.

Parameters

summary_dir (str) – The collected data will be persisted to this directory. If the directory does not exist, it will be created automatically.
collect_freq (int) – Set the frequency of data collection. It should be greater than zero, and the unit is step. If a frequency is set, data is collected when (current steps % freq) equals 0, and the first step is always collected. Note that if the data sink mode is used, the unit becomes the epoch. Collecting data too frequently is not recommended, because it can affect performance. Default: 10.
collect_specified_data (Union[None, dict]) –
Perform custom operations on the collected data. If set to None, all data is collected by default. You can customize the collected data with a dictionary. For example, you can set {'collect_metric': False} to disable metric collection. The supported keys are listed below. Default: None.
- collect_metric (bool): Whether to collect training metrics, currently only the loss is collected. The first output will be treated as the loss and it will be averaged. Optional: True/False. Default: True.
- collect_graph (bool): Whether to collect the computational graph. Currently, only training computational graph is collected. Optional: True/False. Default: True.
- collect_train_lineage (bool): Whether to collect lineage data for the training phase. This field will be displayed on the lineage page of MindInsight. Optional: True/False. Default: True.
- collect_eval_lineage (bool): Whether to collect lineage data for the evaluation phase. This field will be displayed on the lineage page of MindInsight. Optional: True/False. Default: True.
- collect_input_data (bool): Whether to collect dataset for each training. Currently only image data is supported. If there are multiple columns of data in the dataset, the first column should be image data. Optional: True/False. Default: True.
- collect_dataset_graph (bool): Whether to collect dataset graph for the training phase. Optional: True/False. Default: True.
- histogram_regular (Union[str, None]): Collect weights and biases for the parameter distribution page displayed in MindInsight. This field accepts a regular expression string to control which parameters are collected. Collecting too many parameters at once is not recommended, because it can affect performance. Note that if you collect too many parameters and run out of memory, the training will fail. Default: None, which means only the first five parameters are collected.
keep_default_action (bool) – This field affects the collection behavior of the 'collect_specified_data' field. True: after specified data is set, non-specified data is collected as the default behavior. False: after specified data is set, only the specified data is collected, and the rest is not collected. Optional: True/False. Default: True.
custom_lineage_data (Union[dict, None]) – Allows you to customize the data and present it on the MindInsight lineage page. In the custom data, keys must be of type str, and values may be of type str, int or float. Default: None, which means there is no custom data.
collect_tensor_freq (Optional[int]) – The same semantics as collect_freq, but controls TensorSummary only. Because TensorSummary data is much larger than other summary data, this parameter is used to reduce its collection. By default, the maximum number of steps for collecting TensorSummary data is 20, but it will not exceed the number of steps for collecting other summary data. For example, given collect_freq=10, when the total steps is 600, TensorSummary will be collected at 20 steps, while other summary data will be collected at 61 steps; when the total steps is 20, both TensorSummary and other summary data will be collected at 3 steps. Also note that in parallel mode, the total steps will be split evenly, which affects the number of steps at which TensorSummary is collected. Default: None, which means to follow the behavior described above.
max_file_size (Optional[int]) – The maximum size in bytes of each file that can be written to the disk. For example, to write not larger than 4GB, specify max_file_size=4*1024**3. Default: None, which means no limit.
export_options (Union[None, dict]) –
Perform custom operations on the exported data. Note that the size of export files is not limited by max_file_size. You can customize the exported data with a dictionary. For example, you can set {'tensor_format': 'npy'} to export tensors as npy files. The supported keys are listed below. Default: None, which means that the data is not exported.
- tensor_format (Union[str, None]): Customize the export tensor format. Supports ["npy", None]. Default: None, which means that the tensor is not exported.
  - npy: export tensor as npy file.
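
The collect_freq rule above (collect when current steps % freq equals 0, plus the first step) can be checked with a few lines of plain Python. The helper name collected_steps is hypothetical. It models only the collect_freq rule for non-tensor summary data, not the TensorSummary cap or the data sink mode.

```python
def collected_steps(total_steps, collect_freq):
    """Steps at which data is collected: step 1, then every multiple of collect_freq."""
    if collect_freq <= 0:
        raise ValueError("collect_freq should be greater than zero")
    return [s for s in range(1, total_steps + 1)
            if s == 1 or s % collect_freq == 0]


print(len(collected_steps(600, 10)))  # 61
print(len(collected_steps(20, 10)))   # 3
```

These counts match the figures quoted in the collect_tensor_freq description: 61 steps for 600 total steps and 3 steps for 20 total steps.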

Raises

ValueError – If the parameter value is not expected.
TypeError – If the parameter type is not expected.
RuntimeError – If an error occurs during data collection.

Examples

>>> import mindspore.nn as nn
>>> from mindspore import context
>>> from mindspore.train.callback import SummaryCollector
>>> from mindspore.train import Model
>>> from mindspore.nn.metrics import Accuracy
>>>
>>> if __name__ == '__main__':
...     # If the device_target is GPU, set the device_target to "GPU"
...     context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
...     mnist_dataset_dir = '/path/to/mnist_dataset_directory'
...     # The detail of create_dataset method shown in model_zoo.official.cv.lenet.src.dataset.py
...     ds_train = create_dataset(mnist_dataset_dir, 32)
...     # The detail of LeNet5 shown in model_zoo.official.cv.lenet.src.lenet.py
...     network = LeNet5(10)
...     net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
...     net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
...     model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()}, amp_level="O2")
...
...     # Simple usage:
...     summary_collector = SummaryCollector(summary_dir='./summary_dir')
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)
...
...     # Do not collect metric and collect the first layer parameter, others are collected by default
...     specified={'collect_metric': False, 'histogram_regular': '^conv1.*'}
...     summary_collector = SummaryCollector(summary_dir='./summary_dir', collect_specified_data=specified)
...     model.train(1, ds_train, callbacks=[summary_collector], dataset_sink_mode=False)

class tinyms.callbacks.CheckpointConfig(save_checkpoint_steps=1, save_checkpoint_seconds=0, keep_checkpoint_max=5, keep_checkpoint_per_n_minutes=0, integrated_save=True, async_save=False, saved_network=None)[source]¶

The configuration of model checkpoint.

Note

During the training process, if the dataset is transmitted through the data channel, it is suggested to set 'save_checkpoint_steps' to an integer multiple of loop_size. Otherwise, the time at which the checkpoint is saved may deviate.
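
A small helper can round a desired interval up to a multiple of loop_size. This is a sketch under assumptions: align_to_loop_size is a hypothetical name, and the loop_size value of 32 is made up for illustration, so substitute the value your data channel actually uses.

```python
def align_to_loop_size(desired_steps, loop_size):
    """Round desired_steps up to the nearest positive multiple of loop_size."""
    if desired_steps <= 0 or loop_size <= 0:
        raise ValueError("both arguments must be positive")
    multiples = -(-desired_steps // loop_size)  # ceiling division
    return multiples * loop_size


# The result would be passed as save_checkpoint_steps.
save_checkpoint_steps = align_to_loop_size(desired_steps=100, loop_size=32)
print(save_checkpoint_steps)
```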

Parameters

save_checkpoint_steps (int) – Steps to save checkpoint. Default: 1.
save_checkpoint_seconds (int) – Seconds to save checkpoint. Cannot be used with save_checkpoint_steps at the same time. Default: 0.
keep_checkpoint_max (int) – Maximum number of checkpoint files that can be saved. Default: 5.
keep_checkpoint_per_n_minutes (int) – Keep one checkpoint every n minutes. Cannot be used with keep_checkpoint_max at the same time. Default: 0.
integrated_save (bool) – Whether to perform the integrated save function in the automatic model parallel scene. The integrated save function is only supported in the automatic parallel scene, not in manual parallel. Default: True.
async_save (bool) – Whether to save the checkpoint to a file asynchronously. Default: False.
saved_network (Cell) – Network to be saved in the checkpoint file. If saved_network has no relation to the network in training, the initial value of saved_network will be saved. Default: None.

Raises

ValueError – If the input_param is None or 0.

Examples

>>> import mindspore.nn as nn
>>> from mindspore.common.initializer import Normal
>>> from mindspore.train import Model
>>> from tinyms.callbacks import CheckpointConfig, ModelCheckpoint
>>>
>>> class LeNet5(nn.Cell):
...     def __init__(self, num_class=10, num_channel=1):
...         super(LeNet5, self).__init__()
...         self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')
...         self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
...         self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))
...         self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))
...         self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))
...         self.relu = nn.ReLU()
...         self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
...         self.flatten = nn.Flatten()
...
...     def construct(self, x):
...         x = self.max_pool2d(self.relu(self.conv1(x)))
...         x = self.max_pool2d(self.relu(self.conv2(x)))
...         x = self.flatten(x)
...         x = self.relu(self.fc1(x))
...         x = self.relu(self.fc2(x))
...         x = self.fc3(x)
...         return x
...
>>> net = LeNet5()
>>> loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
>>> optim = nn.Momentum(net.trainable_params(), 0.01, 0.9)
>>> model = Model(net, loss_fn=loss, optimizer=optim)
>>> data_path = './MNIST_Data'
>>> # create_dataset is assumed to be defined elsewhere
>>> dataset = create_dataset(data_path)
>>> # Save the parameters of net, using the default checkpoint policy
>>> config = CheckpointConfig(saved_network=net)
>>> ckpoint_cb = ModelCheckpoint(prefix='LeNet5', directory='./checkpoint', config=config)
>>> model.train(10, dataset, callbacks=ckpoint_cb)

property async_save¶: Get the value of _async_save.

get_checkpoint_policy()[source]¶: Get the policy of checkpoint.

property integrated_save¶: Get the value of _integrated_save.

property keep_checkpoint_max¶: Get the value of _keep_checkpoint_max.

property keep_checkpoint_per_n_minutes¶: Get the value of _keep_checkpoint_per_n_minutes.

property save_checkpoint_seconds¶: Get the value of _save_checkpoint_seconds.

property save_checkpoint_steps¶: Get the value of _save_checkpoint_steps.

property saved_network¶: Get the value of _saved_network.

class tinyms.callbacks.RunContext(original_args)[source]¶

Provide information about the model.

Provide information about the original request to the model function. Callback objects can stop the loop by calling request_stop() of run_context.

Parameters: original_args (dict) – Holds the related information of the model.

get_stop_requested()[source]¶

Return whether a stop is requested or not.

Returns: bool, if True, model.train() stops iterating.

original_args()[source]¶

Get the _original_args object.

Returns: Dict, an object that holds the original arguments of the model.

request_stop()[source]¶

Set the stop request during training.

Callbacks can use this function to request that iterations stop. model.train() checks whether this has been called.
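
The request/check handshake can be sketched with stand-ins. StopFlag mirrors only the two stop-related method names of RunContext, and train is a hypothetical loop. Neither is the tinyms implementation, and the loss threshold is an invented stopping rule.

```python
class StopFlag:
    """Stand-in for RunContext; mirrors only the two stop-related method names."""

    def __init__(self):
        self._stop_requested = False

    def request_stop(self):
        self._stop_requested = True

    def get_stop_requested(self):
        return self._stop_requested


def train(losses, threshold, run_context):
    """Hypothetical loop: a callback-like check requests a stop, the loop honors it."""
    steps_run = 0
    for loss in losses:
        steps_run += 1
        if loss < threshold:  # what a callback's step_end might decide
            run_context.request_stop()
        if run_context.get_stop_requested():
            break
    return steps_run


steps_run = train([0.9, 0.5, 0.2, 0.1], threshold=0.3, run_context=StopFlag())
print(steps_run)
```

The callback only sets a flag and the loop decides when to act on it, so a stop takes effect at a step boundary rather than in the middle of a step.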

class tinyms.callbacks.LearningRateScheduler(learning_rate_function)[source]¶

Change the learning_rate during training.

Note

This class is not supported on CPU.

Parameters: learning_rate_function (Function) – The function that defines how the learning rate changes during training.

Examples

>>> from tinyms.callbacks import LearningRateScheduler
>>> import mindspore.nn as nn
>>> from mindspore.train import Model
...
>>> def learning_rate_function(lr, cur_step_num):
...     if cur_step_num%1000 == 0:
...         lr = lr*0.1
...     return lr
...
>>> lr = 0.1
>>> momentum = 0.9
>>> net = Net()
>>> loss = nn.SoftmaxCrossEntropyWithLogits()
>>> optim = nn.Momentum(net.trainable_params(), learning_rate=lr, momentum=momentum)
>>> model = Model(net, loss_fn=loss, optimizer=optim)
...
>>> dataset = create_custom_dataset("custom_dataset_path")
>>> model.train(1, dataset, callbacks=[LearningRateScheduler(learning_rate_function)],
...             dataset_sink_mode=False)
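
To see what schedule this learning_rate_function produces, it can be run on its own without a model. The simulate helper is hypothetical and assumes the value returned at one step is passed back in as lr at the next step. That is an assumption about how the callback applies the function, not something stated above.

```python
import math


def learning_rate_function(lr, cur_step_num):
    # Same rule as in the example above: multiply by 0.1 every 1000 steps.
    if cur_step_num % 1000 == 0:
        lr = lr * 0.1
    return lr


def simulate(lr, total_steps):
    """Apply the function once per step, feeding each result into the next step."""
    history = {}
    for step in range(1, total_steps + 1):
        lr = learning_rate_function(lr, step)
        history[step] = lr
    return history


history = simulate(lr=0.1, total_steps=2000)
for step in (999, 1000, 2000):
    print(step, history[step])
```

Under that assumption the rate stays at 0.1 through step 999, drops to about 0.01 at step 1000, and to about 0.001 at step 2000. The values are floating-point, so compare them with math.isclose rather than ==.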