tinyms.data

class tinyms.data.UnalignedDataset(dataset_path, phase, max_dataset_size=inf, shuffle=True)[source]

This dataset class can load unaligned/unpaired datasets.

Parameters
  • dataset_path (str) – The path of images (should have subfolders trainA, trainB, testA, testB, etc.).

  • phase (str) – Train or test. It requires two directories in dataset_path, like trainA and trainB, to host training images from domain A ‘{dataset_path}/trainA’ and from domain B ‘{dataset_path}/trainB’ respectively.

  • max_dataset_size (int) – Maximum number of image paths to return.

  • shuffle (bool) – Whether to shuffle the image paths (default=True).

Returns

Two lists of image paths, one for each domain.
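
Examples

A minimal usage sketch based on the parameters above; the path is a placeholder and the directory is assumed to follow the trainA/trainB layout:

>>> from tinyms.data import UnalignedDataset
>>>
>>> # dataset_path must contain the subfolders trainA and trainB
>>> unaligned = UnalignedDataset("/path/to/unaligned_dataset", phase='train')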

class tinyms.data.GanImageFolderDataset(dataset_path, max_dataset_size=inf)[source]

This dataset class can load images from an image folder.

Parameters
  • dataset_path (str) – ‘{dataset_path}/testA’, ‘{dataset_path}/testB’, etc.

  • max_dataset_size (int) – Maximum number of image paths to return.

Returns

Image path list.
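
Examples

A minimal usage sketch based on the parameters above; the path is a placeholder (the expected layout under dataset_path follows the parameter description):

>>> from tinyms.data import GanImageFolderDataset
>>>
>>> # Collect image paths from an image folder such as '{dataset_path}/testA'
>>> gan_images = GanImageFolderDataset("/path/to/gan_image_folder")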

class tinyms.data.DistributedSampler(dataset_size, num_replicas=None, rank=None, shuffle=True)[source]

Distributed sampler.

Parameters
  • dataset_size (int) – Length of the dataset.

  • num_replicas (int) – Number of replicas.

  • rank (int) – Device rank.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

Returns

DistributedSampler instance.
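
Examples

A minimal usage sketch based on the parameters above, assuming an 8-device setup in which this process is rank 0:

>>> from tinyms.data import DistributedSampler
>>>
>>> # Shard a dataset of 10000 samples across 8 replicas for device 0
>>> sampler = DistributedSampler(10000, num_replicas=8, rank=0, shuffle=True)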

class tinyms.data.CelebADataset(dataset_dir, num_parallel_workers=None, shuffle=None, usage='all', sampler=None, decode=False, extensions=None, num_samples=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing CelebA dataset. Currently supported: list_attr_celeba.txt only.

Note

The generated dataset has two columns [‘image’, ‘attr’]. The type of the image tensor is uint8. The attribute tensor is of type uint32 and is one-hot encoded.

Citation of CelebA dataset.

@article{DBLP:journals/corr/LiuLWT14,
author    = {Ziwei Liu and Ping Luo and Xiaogang Wang and Xiaoou Tang},
title     = {Deep Learning Face Attributes in the Wild},
journal   = {CoRR},
volume    = {abs/1411.7766},
year      = {2014},
url       = {http://arxiv.org/abs/1411.7766},
archivePrefix = {arXiv},
eprint    = {1411.7766},
timestamp = {Tue, 10 Dec 2019 15:37:26 +0100},
biburl    = {https://dblp.org/rec/journals/corr/LiuLWT14.bib},
bibsource = {dblp computer science bibliography, https://dblp.org},
howpublished = {http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html},
description  = {CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset
                with more than 200K celebrity images, each with 40 attribute annotations.
                The images in this dataset cover large pose variations and background clutter.
                CelebA has large diversities, large quantities, and rich annotations, including
                * 10,177 number of identities,
                * 202,599 number of face images, and
                * 5 landmark locations, 40 binary attributes annotations per image.
                The dataset can be employed as the training and test sets for the following computer
                vision tasks: face attribute recognition, face detection, landmark (or facial part)
                localization, and face editing & synthesis.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=value set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None).

  • usage (str) – one of ‘all’, ‘train’, ‘valid’ or ‘test’.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None).

  • decode (bool, optional) – decode the images after reading (default=False).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/celeba_directory"
>>> dataset = ds.CelebADataset(dataset_dir=dataset_dir, usage='train')
class tinyms.data.Cifar100Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads cifar100 data.

The generated dataset has three columns [‘image’, ‘coarse_label’, ‘fine_label’]. The type of the image tensor is uint8. The coarse and fine labels are each a scalar uint32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Citation of Cifar100 dataset.

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html},
description  = {This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images
                each. There are 500 training images and 100 testing images per class. The 100 classes in
                the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the
                class to which it belongs) and a "coarse" label (the superclass to which it belongs).}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be “train”, “test” or “all” . “train” will read from 50,000 train samples, “test” will read from 10,000 test samples, “all” will read from all 60,000 samples. (default=None, all samples)

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/cifar100_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR100 dataset in sequence
>>> cifar100_dataset = ds.Cifar100Dataset(dataset_dir=dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR100 dataset
>>> cifar100_dataset = ds.Cifar100Dataset(dataset_dir=dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # In CIFAR100 dataset, each dictionary has 3 keys: "image", "fine_label" and "coarse_label"
class tinyms.data.Cifar10Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads cifar10 data.

The generated dataset has two columns [‘image’, ‘label’]. The type of the image tensor is uint8. The label is a scalar uint32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Citation of Cifar10 dataset.

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html},
description  = {The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes,
                with 6000 images per class. There are 50000 training images and 10000 test images.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be “train”, “test” or “all” . “train” will read from 50,000 train samples, “test” will read from 10,000 test samples, “all” will read from all 60,000 samples. (default=None, all samples)

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/cifar10_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR10 dataset in sequence
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR10 dataset
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from CIFAR10 dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.Cifar10Dataset(dataset_dir=dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In CIFAR10 dataset, each dictionary has keys "image" and "label"
class tinyms.data.CLUEDataset(dataset_files, task='AFQMC', usage='train', num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses CLUE datasets. CLUE, the Chinese Language Understanding Evaluation Benchmark, is a collection of datasets, baselines, pre-trained models, corpus and leaderboard. Supported CLUE classification tasks: ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ and ‘CSL’.

Citation of CLUE dataset.

@article{CLUEbenchmark,
title   = {CLUE: A Chinese Language Understanding Evaluation Benchmark},
author  = {Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li,
           Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Jun Zeng,
           Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou,
           Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Zhenzhong Lan},
journal = {arXiv preprint arXiv:2004.05986},
year    = {2020},
howpublished = {https://github.com/CLUEbenchmark/CLUE},
description  = {CLUE, a Chinese Language Understanding Evaluation benchmark. It contains eight different
                tasks, including single-sentence classification, sentence pair classification, and machine
                reading comprehension.}
}
Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • task (str, optional) – The kind of task, one of ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ and ‘CSL’ (default=’AFQMC’).

  • usage (str, optional) – Whether to read ‘train’, ‘test’ or ‘eval’ data (default=”train”).

  • num_samples (int, optional) – Number of samples (rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple text files
>>> dataset = ds.CLUEDataset(dataset_files=dataset_files, task='AFQMC', usage='train')
class tinyms.data.CocoDataset(dataset_dir, annotation_file, task='Detection', num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing COCO dataset.

CocoDataset supports four kinds of tasks: 2017 Train/Val/Test Detection, Keypoints, Stuff and Panoptic.

The generated dataset has multiple columns:

  • task=’Detection’, column: [[‘image’, dtype=uint8], [‘bbox’, dtype=float32], [‘category_id’, dtype=uint32], [‘iscrowd’, dtype=uint32]].

  • task=’Stuff’, column: [[‘image’, dtype=uint8], [‘segmentation’, dtype=float32], [‘iscrowd’, dtype=uint32]].

  • task=’Keypoint’, column: [[‘image’, dtype=uint8], [‘keypoints’, dtype=float32], [‘num_keypoints’, dtype=uint32]].

  • task=’Panoptic’, column: [[‘image’, dtype=uint8], [‘bbox’, dtype=float32], [‘category_id’, dtype=uint32], [‘iscrowd’, dtype=uint32], [‘area’, dtype=uint32]].

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. CocoDataset doesn’t support PKSampler. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Citation of Coco dataset.

@article{DBLP:journals/corr/LinMBHPRDZ14,
author        = {Tsung{-}Yi Lin and Michael Maire and Serge J. Belongie and
                 Lubomir D. Bourdev and  Ross B. Girshick and James Hays and
                 Pietro Perona and Deva Ramanan and Piotr Doll{'{a}}r and C. Lawrence Zitnick},
title         = {Microsoft {COCO:} Common Objects in Context},
journal       = {CoRR},
volume        = {abs/1405.0312},
year          = {2014},
url           = {http://arxiv.org/abs/1405.0312},
archivePrefix = {arXiv},
eprint        = {1405.0312},
timestamp     = {Mon, 13 Aug 2018 16:48:13 +0200},
biburl        = {https://dblp.org/rec/journals/corr/LinMBHPRDZ14.bib},
bibsource     = {dblp computer science bibliography, https://dblp.org},
description   = {COCO is a large-scale object detection, segmentation, and captioning dataset.
                 It contains 91 common object categories with 82 of them having more than 5,000
                 labeled instances. In contrast to the popular ImageNet dataset, COCO has fewer
                 categories but more instances per category.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • annotation_file (str) – Path to the annotation JSON.

  • task (str) – Set the task type for reading COCO data. Supported task types: ‘Detection’, ‘Stuff’, ‘Panoptic’ and ‘Keypoint’ (default=’Detection’).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the configuration file).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If parsing the JSON file fails.

  • ValueError – If task is not in [‘Detection’, ‘Stuff’, ‘Panoptic’, ‘Keypoint’].

  • ValueError – If annotation_file does not exist.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/coco_dataset_directory/image_folder"
>>> annotation_file = "/path/to/coco_dataset_directory/annotation_folder/annotation.json"
>>>
>>> # 1) Read COCO data for Detection task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Detection')
>>>
>>> # 2) Read COCO data for Stuff task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Stuff')
>>>
>>> # 3) Read COCO data for Panoptic task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Panoptic')
>>>
>>> # 4) Read COCO data for Keypoint task
>>> coco_dataset = ds.CocoDataset(dataset_dir, annotation_file=annotation_file, task='Keypoint')
>>>
>>> # In COCO dataset, each dictionary has keys "image" and "annotation"
get_class_indexing()[source]

Get the class index.

Returns

dict, a str-to-list<int> mapping from label name to index.
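
Examples

A minimal sketch reusing the coco_dataset object constructed in the CocoDataset examples above:

>>> # Map each label name to its category index list
>>> class_index = coco_dataset.get_class_indexing()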

class tinyms.data.CSVDataset(dataset_files, field_delim=',', column_defaults=None, column_names=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses comma-separated values (CSV) datasets.

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • field_delim (str, optional) – A string that indicates the char delimiter to separate fields (default=’,’).

  • column_defaults (list, optional) – List of default values for the CSV fields (default=None). Each item in the list is either a valid type (float, int, or string). If this is not provided, all columns are treated as strings.

  • column_names (list[str], optional) – List of column names of the dataset (default=None). If this is not provided, infers the column_names from the first row of CSV file.

  • num_samples (int, optional) – Number of samples (rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple text files
>>> dataset = ds.CSVDataset(dataset_files=dataset_files, column_names=['col1', 'col2', 'col3', 'col4'])
class tinyms.data.GeneratorDataset(source, column_names=None, column_types=None, schema=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None, python_multiprocessing=True)[source]

A source dataset that generates data from Python by invoking Python data source each epoch.

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Parameters
  • source (Union[Callable, Iterable, Random Accessible]) – A generator callable object, an iterable Python object or a random accessible Python object. Callable source is required to return a tuple of NumPy arrays as a row of the dataset on source().next(). Iterable source is required to return a tuple of NumPy arrays as a row of the dataset on iter(source).next(). Random accessible source is required to return a tuple of NumPy arrays as a row of the dataset on source[idx].

  • column_names (Union[str, list[str]], optional) – List of column names of the dataset (default=None). Users are required to provide either column_names or schema.

  • column_types (list[mindspore.dtype], optional) – List of column data types of the dataset (default=None). If provided, sanity check will be performed on generator output.

  • schema (Union[Schema, str], optional) – Path to the JSON schema file or schema object (default=None). Users are required to provide either column_names or schema. If both are provided, schema will be used.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, ‘num_samples’ will not be used. Random accessible input is required.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument must be specified only when num_shards is also specified. Random accessible input is required.

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker process. This option could be beneficial if the Python operation is computational heavy (default=True).

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>>
>>> # 1) Multidimensional generator function as callable input
>>> def GeneratorMD():
>>>     for i in range(64):
>>>         yield (np.array([[i, i + 1], [i + 2, i + 3]]),)
>>> # Create multi_dimension_generator_dataset with GeneratorMD and column name "multi_dimensional_data"
>>> multi_dimension_generator_dataset = ds.GeneratorDataset(GeneratorMD, ["multi_dimensional_data"])
>>>
>>> # 2) Multi-column generator function as callable input
>>> def GeneratorMC(maxid = 64):
>>>     for i in range(maxid):
>>>         yield (np.array([i]), np.array([[i, i + 1], [i + 2, i + 3]]))
>>> # Create multi_column_generator_dataset with GeneratorMC and column names "col1" and "col2"
>>> multi_column_generator_dataset = ds.GeneratorDataset(GeneratorMC, ["col1", "col2"])
>>>
>>> # 3) Iterable dataset as iterable input
>>> class MyIterable():
>>>     def __iter__(self):
>>>         return # User implementation
>>> # Create iterable_generator_dataset with MyIterable object
>>> iterable_generator_dataset = ds.GeneratorDataset(MyIterable(), ["col1"])
>>>
>>> # 4) Random accessible dataset as random accessible input
>>> class MyRA():
>>>     def __getitem__(self, index):
>>>         return # User implementation
>>> # Create ra_generator_dataset with MyRA object
>>> ra_generator_dataset = ds.GeneratorDataset(MyRA(), ["col1"])
>>> # List/Dict/Tuple is also random accessible
>>> list_generator = ds.GeneratorDataset([(np.array(0),), (np.array(1),), (np.array(2),)], ["col1"])
>>>
>>> # 5) Built-in Sampler (my_ds is a user-defined random accessible dataset)
>>> my_generator = ds.GeneratorDataset(my_ds, ["img", "label"], sampler=ds.RandomSampler())
class tinyms.data.GraphData(dataset_file, num_parallel_workers=None, working_mode='local', hostname='127.0.0.1', port=50051, num_client=1, auto_shutdown=True)[source]

Reads the graph dataset used for GNN training from the shared file and database.

Parameters
  • dataset_file (str) – One of file names in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • working_mode (str, optional) –

    Set working mode, now supports ‘local’/’client’/’server’ (default=’local’).

    • ’local’, used in non-distributed training scenarios.

    • ’client’, used in distributed training scenarios. The client does not load data, but obtains data from the server.

    • ’server’, used in distributed training scenarios. The server loads the data and is available to the client.

  • hostname (str, optional) – Hostname of the graph data server. This parameter is only valid when working_mode is set to ‘client’ or ‘server’ (default=’127.0.0.1’).

  • port (int, optional) – Port of the graph data server. The range is 1024-65535. This parameter is only valid when working_mode is set to ‘client’ or ‘server’ (default=50051).

  • num_client (int, optional) – Maximum number of clients expected to connect to the server. The server will allocate resources according to this parameter. This parameter is only valid when working_mode is set to ‘server’ (default=1).

  • auto_shutdown (bool, optional) – Valid when working_mode is set to ‘server’. When the number of connected clients reaches num_client and no client is still connected, the server automatically exits (default=True).

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> features = data_graph.get_node_feature(nodes, [1])
get_all_edges(edge_type)[source]

Get all edges in the graph.

Parameters

edge_type (int) – Specify the type of edge.

Returns

numpy.ndarray, array of edges.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_edges(0)
Raises

TypeError – If edge_type is not integer.

get_all_neighbors(node_list, neighbor_type)[source]

Get neighbor_type neighbors of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_type (int) – Specify the type of neighbor.

Returns

numpy.ndarray, array of neighbors.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> neighbors = data_graph.get_all_neighbors(nodes, 0)
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_type is not integer.

get_all_nodes(node_type)[source]

Get all nodes in the graph.

Parameters

node_type (int) – Specify the type of node.

Returns

numpy.ndarray, array of nodes.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
Raises

TypeError – If node_type is not integer.

get_edge_feature(edge_list, feature_types)[source]

Get feature_types feature of the edges in edge_list.

Parameters
  • edge_list (Union[list, numpy.ndarray]) – The given list of edges.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types.

Returns

numpy.ndarray, array of features.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> edges = data_graph.get_all_edges(0)
>>> features = data_graph.get_edge_feature(edges, [1])
Raises
  • TypeError – If edge_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_neg_sampled_neighbors(node_list, neg_neighbor_num, neg_neighbor_type)[source]

Get neg_neighbor_type negative sampled neighbors of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neg_neighbor_num (int) – Number of neighbors sampled.

  • neg_neighbor_type (int) – Specify the type of negative neighbor.

Returns

numpy.ndarray, array of neighbors.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> neg_neighbors = data_graph.get_neg_sampled_neighbors(nodes, 5, 0)
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neg_neighbor_num is not integer.

  • TypeError – If neg_neighbor_type is not integer.

get_node_feature(node_list, feature_types)[source]

Get feature_types feature of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types.

Returns

numpy.ndarray, array of features.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> features = data_graph.get_node_feature(nodes, [1])
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_nodes_from_edges(edge_list)[source]

Get nodes from the edges.

Parameters

edge_list (Union[list, numpy.ndarray]) – The given list of edges.

Returns

numpy.ndarray, array of nodes.

Raises

TypeError – If edge_list is not list or ndarray.
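
Examples

A minimal sketch following the other GraphData examples here ('dataset_file' is a placeholder):

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> edges = data_graph.get_all_edges(0)
>>> nodes = data_graph.get_nodes_from_edges(edges)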

get_sampled_neighbors(node_list, neighbor_nums, neighbor_types)[source]

Get sampled neighbor information.

This API supports multi-hop neighbor sampling: the previous sampling result is used as the input of the next-hop sampling. A maximum of 6 hops is allowed.

The sampling result is tiled into a list in the format of [input node, 1-hop sampling result, 2-hop sampling result …].

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_nums (Union[list, numpy.ndarray]) – Number of neighbors sampled per hop.

  • neighbor_types (Union[list, numpy.ndarray]) – Type of the neighbors sampled per hop.

Returns

numpy.ndarray, array of neighbors.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.get_all_nodes(0)
>>> neighbors = data_graph.get_sampled_neighbors(nodes, [2, 2], [0, 0])
Raises
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_nums is not list or ndarray.

  • TypeError – If neighbor_types is not list or ndarray.

graph_info()[source]

Get the meta information of the graph, including the number of nodes, the type of nodes, the feature information of nodes, the number of edges, the type of edges, and the feature information of edges.

Returns

dict, meta information of the graph. The keys are node_type, edge_type, node_num, edge_num, node_feature_type and edge_feature_type.
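
Examples

A minimal sketch following the other GraphData examples here ('dataset_file' is a placeholder):

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> info = data_graph.graph_info()
>>> node_types = info['node_type']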

random_walk(target_nodes, meta_path, step_home_param=1.0, step_away_param=1.0, default_node=-1)[source]

Random walk in nodes.

Parameters
  • target_nodes (list[int]) – Start node list in the random walk.

  • meta_path (list[int]) – Node type for each walk step.

  • step_home_param (float, optional) – Return hyperparameter in the node2vec algorithm (Default = 1.0).

  • step_away_param (float, optional) – Inout hyperparameter in the node2vec algorithm (Default = 1.0).

  • default_node (int, optional) – Default node if no more neighbors are found (Default = -1). A default value of -1 indicates that no node is given.

Returns

numpy.ndarray, array of nodes.

Examples

>>> import mindspore.dataset as ds
>>>
>>> data_graph = ds.GraphData('dataset_file', 2)
>>> nodes = data_graph.random_walk([1,2], [1,2,1,2,1])
Raises
  • TypeError – If target_nodes is not list or ndarray.

  • TypeError – If meta_path is not list or ndarray.

class tinyms.data.ImageFolderDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, extensions=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads images from a tree of directories.

All images within one folder have the same label. The generated dataset has two columns [‘image’, ‘label’]. The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise. The type of the image tensor is uint8. The label is a scalar int32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • class_indexing (dict, optional) – A str-to-int mapping from folder name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> # Set path to the imagefolder directory.
>>> # This directory needs to contain sub-directories which contain the images
>>> dataset_dir = "/path/to/imagefolder_directory"
>>>
>>> # 1) Read all samples (image files) in dataset_dir with 8 threads
>>> imagefolder_dataset = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8)
>>>
>>> # 2) Read all samples (image files) from folder cat and folder dog with label 0 and 1
>>> imagefolder_dataset = ds.ImageFolderDataset(dataset_dir, class_indexing={"cat":0, "dog":1})
>>>
>>> # 3) Read all samples (image files) in dataset_dir with extensions .JPEG and .png (case sensitive)
>>> imagefolder_dataset = ds.ImageFolderDataset(dataset_dir, extensions=[".JPEG", ".png"])
class tinyms.data.ManifestDataset(dataset_file, usage='train', num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads images from a manifest file.

The generated dataset has two columns [‘image’, ‘label’]. The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise. The type of the image tensor is uint8. The label is a scalar uint64 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Parameters
  • dataset_file (str) – File to be read.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘eval’ and ‘inference’ (default=”train”).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_file = "/path/to/manifest_file.manifest"
>>>
>>> # 1) Read all samples specified in manifest_file dataset with 8 threads for training
>>> manifest_dataset = ds.ManifestDataset(dataset_file, usage="train", num_parallel_workers=8)
>>>
>>> # 2) Read samples (specified in manifest_file.manifest) for shard 0
>>> # in a 2-way distributed training setup
>>> manifest_dataset = ds.ManifestDataset(dataset_file, num_shards=2, shard_id=0)
get_class_indexing()[source]

Get the class index.

Returns

dict, a str-to-int mapping from label name to index.
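
Examples

A minimal sketch reusing the manifest_dataset object constructed in the ManifestDataset examples above:

>>> class_index = manifest_dataset.get_class_indexing()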

class tinyms.data.MindDataset(dataset_file, columns_list=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, sampler=None, padded_sample=None, num_padded=None, num_samples=None)[source]

A source dataset that reads MindRecord files.

Parameters
  • dataset_file (Union[str, list[str]]) – If dataset_file is a str, it represents a file name of one component of a mindrecord source; other files with an identical source in the same path will be found and loaded automatically. If dataset_file is a list, it represents a list of dataset files to be read directly.

  • columns_list (list[str], optional) – List of columns to be read (default=None).

  • num_parallel_workers (int, optional) – The number of readers (default=None).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, performs shuffle).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, sampler is exclusive with shuffle and block_reader). Support list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.

  • padded_sample (dict, optional) – Samples will be appended to the dataset, where keys are the same as columns_list.

  • num_padded (int, optional) – Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

Raises
  • ValueError – If num_shards is specified but shard_id is None.

  • ValueError – If shard_id is specified but num_shards is None.
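
Examples

A minimal usage sketch based on the parameters above; the file path and column names are placeholders:

>>> import mindspore.dataset as ds
>>>
>>> # Read a MindRecord source; sibling files of the same source are located automatically
>>> mind_dataset = ds.MindDataset("/path/to/mindrecord_file", columns_list=["data", "label"])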

class tinyms.data.MnistDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing the MNIST dataset.

The generated dataset has two columns [‘image’, ‘label’]. The type of the image tensor is uint8. The label is a scalar uint32 tensor. This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Citation of Mnist dataset.

@article{lecun2010mnist,
title        = {MNIST handwritten digit database},
author       = {LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal      = {ATT Labs [Online]},
volume       = {2},
year         = {2010},
howpublished = {http://yann.lecun.com/exdb/mnist},
description  = {The MNIST database of handwritten digits has a training set of 60,000 examples,
                and a test set of 10,000 examples. It is a subset of a larger set available from
                NIST. The digits have been size-normalized and centered in a fixed-size image.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be “train”, “test” or “all” . “train” will read from 60,000 train samples, “test” will read from 10,000 test samples, “all” will read from all 70,000 samples. (default=None, all samples)

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Raises
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/mnist_folder"
>>> # Read 3 samples from MNIST dataset
>>> mnist_dataset = ds.MnistDataset(dataset_dir=dataset_dir, num_samples=3)
>>> # Note: In mnist_dataset dataset, each dictionary has keys "image" and "label"
class tinyms.data.NumpySlicesDataset(data, column_names=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

Create a dataset with given data slices, mainly for loading Python data into a dataset.

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Parameters
  • data (Union[list, tuple, dict]) – Input data in list, tuple, dict or other NumPy formats. The input data will be sliced along the first dimension to generate dataset rows. If the input is a list, each row will have a single column; otherwise rows tend to have multiple columns. Loading large data this way is not recommended, since the data will be loaded into memory.

  • column_names (list[str], optional) – List of column names of the dataset (default=None). If column_names is not provided, when data is dict, column_names will be its keys, otherwise it will be like column_0, column_1 …

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, ‘num_samples’ will not be used. Random accessible input is required.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument must be specified only when num_shards is also specified. Random accessible input is required.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # 1) Input data can be a list
>>> data = [1, 2, 3]
>>> dataset1 = ds.NumpySlicesDataset(data, column_names=["column_1"])
>>>
>>> # 2) Input data can be a dictionary, and column_names will be its keys
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> dataset2 = ds.NumpySlicesDataset(data)
>>>
>>> # 3) Input data can be a tuple of lists (or NumPy arrays), each tuple element refers to data in each column
>>> data = ([1, 2], [3, 4], [5, 6])
>>> dataset3 = ds.NumpySlicesDataset(data, column_names=["column_1", "column_2", "column_3"])
>>>
>>> # 4) Load data from CSV file
>>> import pandas as pd
>>> df = pd.read_csv("file.csv")
>>> dataset4 = ds.NumpySlicesDataset(dict(df), shuffle=False)
class tinyms.data.PaddedDataset(padded_samples)[source]

Create a dataset with fake data provided by the user. It is mainly used to append padded samples to the original dataset and assign them to the corresponding shard.

Parameters

padded_samples (list(dict)) – Samples provided by user.

Raises
  • TypeError – If padded_samples is not an instance of list.

  • TypeError – If the element of padded_samples is not an instance of dict.

  • ValueError – If the padded_samples is empty.

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>>
>>> data1 = [{'image': np.zeros(1, np.uint8)}, {'image': np.zeros(2, np.uint8)}]
>>> ds1 = ds.PaddedDataset(data1)
class tinyms.data.TextFileDataset(dataset_files, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses datasets stored on disk in text format. The generated dataset has one column [‘text’].

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • num_samples (int, optional) – Number of samples (rows) to read (default=None, reads the full dataset).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple text files
>>> dataset = ds.TextFileDataset(dataset_files=dataset_files)
class tinyms.data.TFRecordDataset(dataset_files, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, shard_equal_rows=False, cache=None)[source]

A source dataset that reads and parses datasets stored on disk in TFData format.

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • schema (Union[str, Schema], optional) – Path to the JSON schema file or schema object (default=None). If the schema is not provided, the meta data from the TFData file is considered the schema.

  • columns_list (list[str], optional) – List of columns to be read (default=None, read all columns)

  • num_samples (int, optional) – Number of samples (rows) to read (default=None). If num_samples is None and numRows(parsed from schema) does not exist, read the full dataset; If num_samples is None and numRows(parsed from schema) is greater than 0, read numRows rows; If both num_samples and numRows(parsed from schema) are greater than 0, read num_samples rows.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • shard_equal_rows (bool, optional) – Get equal rows for all shards (default=False). If shard_equal_rows is False, the number of rows of each shard may not be equal.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None which means no cache is used).

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.common.dtype as mstype
>>>
>>> dataset_files = ["/path/to/1", "/path/to/2"] # contains 1 or multiple tf data files
>>>
>>> # 1) Get all rows from dataset_files with no explicit schema
>>> # The meta-data in the first row will be used as a schema.
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files)
>>>
>>> # 2) Get all rows from dataset_files with user-defined schema
>>> schema = ds.Schema()
>>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files, schema=schema)
>>>
>>> # 3) Get all rows from dataset_files with schema file "./schema.json"
>>> tfdataset = ds.TFRecordDataset(dataset_files=dataset_files, schema="./schema.json")
class tinyms.data.VOCDataset(dataset_dir, task='Segmentation', usage='train', class_indexing=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing VOC dataset.

The generated dataset has multiple columns:

  • task=’Detection’, column: [[‘image’, dtype=uint8], [‘bbox’, dtype=float32], [‘label’, dtype=uint32], [‘difficult’, dtype=uint32], [‘truncate’, dtype=uint32]].

  • task=’Segmentation’, column: [[‘image’, dtype=uint8], [‘target’, dtype=uint8]].

This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using ‘sampler’ and ‘shuffle’

  Parameter ‘sampler’   Parameter ‘shuffle’   Expected Order Behavior
  -------------------   -------------------   ------------------------
  None                  None                  random order
  None                  True                  random order
  None                  False                 sequential order
  Sampler object        None                  order defined by sampler
  Sampler object        True                  not allowed
  Sampler object        False                 not allowed

Citation of VOC dataset.

@article{Everingham10,
author       = {Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.},
title        = {The Pascal Visual Object Classes (VOC) Challenge},
journal      = {International Journal of Computer Vision},
volume       = {88},
year         = {2010},
number       = {2},
month        = {jun},
pages        = {303--338},
biburl       = {http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.html#bibtex},
howpublished = {http://host.robots.ox.ac.uk/pascal/VOC/voc{year}/index.html},
description  = {The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual
                object category recognition and detection, providing the vision and machine
                learning communities with a standard dataset of images and annotation, and
                standard evaluation procedures.}
}
Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str) – Set the task type of reading VOC data; only “Segmentation” and “Detection” are supported (default=”Segmentation”).

  • usage (str) – The type of data list text file to be read (default=”train”).

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index, only valid in “Detection” task (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None).

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing (default=None, which means no cache is used).

Raises
  • RuntimeError – If the XML file under Annotations has an invalid format.

  • RuntimeError – If the XML file under Annotations lacks the attribute “object”.

  • RuntimeError – If the XML file under Annotations lacks the attribute “bndbox”.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If task is not ‘Segmentation’ or ‘Detection’.

  • ValueError – If task is ‘Segmentation’ but class_indexing is not None.

  • ValueError – If the txt file for the given usage does not exist.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "/path/to/voc_dataset_directory"
>>>
>>> # 1) Read VOC data for segmentation training
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Segmentation", usage="train")
>>>
>>> # 2) Read VOC data for detection training
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train")
>>>
>>> # 3) Read all VOC dataset samples in dataset_dir with 8 threads in random order
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train", num_parallel_workers=8)
>>>
>>> # 4) Read then decode all VOC dataset samples in dataset_dir in sequence
>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train", decode=True, shuffle=False)
>>>
>>> # In VOC dataset, if task='Segmentation', each dictionary has keys "image" and "target"
>>> # In VOC dataset, if task='Detection', each dictionary has keys "image" and "annotation"
get_class_indexing()[source]

Get the class index.

Returns

dict, a str-to-int mapping from label name to index.
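
Examples

A short usage sketch, reusing dataset_dir from the VOCDataset examples above (the concrete mapping depends on the classes present in the dataset on disk):

>>> voc_dataset = ds.VOCDataset(dataset_dir, task="Detection", usage="train")
>>> class_indexing = voc_dataset.get_class_indexing()
>>> # class_indexing is a dict such as {'aeroplane': 0, 'bicycle': 1, ...}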

class tinyms.data.PKSampler(num_val, num_class=None, shuffle=False, class_column='label', num_samples=None)[source]

Samples K elements for each of the P classes in the dataset.

Parameters
  • num_val (int) – Number of elements to sample for each class.

  • num_class (int, optional) – Number of classes to sample (default=None, all classes). This parameter is currently not supported.

  • shuffle (bool, optional) – If True, the class IDs are shuffled (default=False).

  • class_column (str, optional) – Name of column with class labels for MindDataset (default=’label’).

  • num_samples (int, optional) – The number of samples to draw (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a PKSampler that will get 3 samples from every class.
>>> sampler = ds.PKSampler(3)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
class tinyms.data.RandomSampler(replacement=False, num_samples=None)[source]

Samples the elements randomly.

Parameters
  • replacement (bool, optional) – If True, put the sample ID back for the next draw (default=False).

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a RandomSampler
>>> sampler = ds.RandomSampler()
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
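>>>
>>> # To draw with replacement instead, a minimal variant (the sample count is arbitrary):
>>> sampler = ds.RandomSampler(replacement=True, num_samples=5)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)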
class tinyms.data.SequentialSampler(start_index=None, num_samples=None)[source]

Samples the dataset elements sequentially, same as not having a sampler.

Parameters
  • start_index (int, optional) – Index to start sampling at (default=None, start at first ID).

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> # creates a SequentialSampler
>>> sampler = ds.SequentialSampler()
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
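>>>
>>> # To skip the beginning of the dataset, pass start_index; a minimal variant
>>> # (the offset and count are arbitrary):
>>> sampler = ds.SequentialSampler(start_index=5, num_samples=10)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)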
class tinyms.data.SubsetRandomSampler(indices, num_samples=None)[source]

Samples the elements randomly from a sequence of indices.

Parameters
  • indices (list[int]) – A sequence of indices.

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> indices = [0, 1, 2, 3, 7, 88, 119]
>>>
>>> # creates a SubsetRandomSampler, will sample from the provided indices
>>> sampler = ds.SubsetRandomSampler(indices)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
class tinyms.data.WeightedRandomSampler(weights, num_samples=None, replacement=True)[source]

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).

Parameters
  • weights (list[float, int]) – A sequence of weights, not necessarily summing up to 1.

  • num_samples (int, optional) – Number of elements to sample (default=None, all elements).

  • replacement (bool) – If True, put the sample ID back for the next draw (default=True).

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>>
>>> weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
>>>
>>> # creates a WeightedRandomSampler that will sample 4 elements without replacement
>>> # (replacement defaults to True, so it must be disabled explicitly)
>>> sampler = ds.WeightedRandomSampler(weights, 4, replacement=False)
>>> data = ds.ImageFolderDataset(dataset_dir, num_parallel_workers=8, sampler=sampler)
class tinyms.data.DatasetCache(session_id, size=0, spilling=False, hostname=None, port=None, num_connections=None, prefetch_size=None)[source]

A client to interface with the tensor caching service.

For details, please refer to the tutorial and the programming guide.

Parameters
  • session_id (int) – A user assigned session id for the current pipeline.

  • size (int, optional) – Size of the memory set aside for the row caching (default=0 which means unlimited, note that it might bring in the risk of running out of memory on the machine).

  • spilling (bool, optional) – Whether or not spilling to disk if out of memory (default=False).

  • hostname (str, optional) – Host name (default=”127.0.0.1”).

  • port (int, optional) – Port to connect to server (default=50052).

  • num_connections (int, optional) – Number of tcp/ip connections (default=12).

  • prefetch_size (int, optional) – Prefetch size (default=20).
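
Examples

A minimal usage sketch: create a cache client and pass it to a dataset through its cache parameter. This assumes a cache server is already running and a session has been created with the cache_admin command-line tool; the session id below is a placeholder.

>>> import mindspore.dataset as ds
>>>
>>> # session_id must come from a session created on the running cache server
>>> some_cache = ds.DatasetCache(session_id=1, size=0, spilling=False)
>>> data = ds.ImageFolderDataset("path/to/imagefolder_directory", cache=some_cache)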

class tinyms.data.Schema(schema_file=None)[source]

Class to represent a schema of a dataset.

Parameters

schema_file (str) – Path of schema file (default=None).

Returns

Schema object, schema info about the dataset.

Raises

RuntimeError – If the schema file fails to load.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.common.dtype as mstype
>>>
>>> # Create schema; specify column name, mindspore.dtype and shape of the column
>>> schema = ds.Schema()
>>> schema.add_column('col1', de_type=mstype.int64, shape=[2])
add_column(name, de_type, shape=None)[source]

Add new column to the schema.

Parameters
  • name (str) – Name of the column.

  • de_type (str) – Data type of the column.

  • shape (list[int], optional) – Shape of the column (default=None, [-1] which is an unknown shape of rank 1).

Raises

ValueError – If column type is unknown.

from_json(json_obj)[source]

Get the schema from a JSON object.

Parameters

json_obj (dict) – Parsed JSON object.
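
Examples

A minimal sketch: parse a schema file with the standard json module, then load it into the schema (the file name is a placeholder):

>>> import json
>>>
>>> with open("./schema.json", "r") as f:
...     json_obj = json.load(f)
>>> schema = ds.Schema()
>>> schema.from_json(json_obj)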

parse_columns(columns)[source]

Parse the columns and add them to the schema.

Parameters

columns (Union[dict, list[dict], tuple[dict]]) – Dataset attribute information, decoded from the schema file.

  • list[dict]: each dict must contain the keys ‘name’ and ‘type’; ‘shape’ is optional.

  • dict: columns.keys() are the column names; each value is a dict containing ‘type’ and, optionally, ‘shape’.

Examples

>>> schema = Schema()
>>> columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
...             {'name': 'label', 'type': 'int8', 'shape': [1]}]
>>> schema.parse_columns(columns1)
>>> columns2 = {'image': {'shape': [3, 3], 'type': 'int8'}, 'label': {'shape': [1], 'type': 'int8'}}
>>> schema.parse_columns(columns2)
to_json()[source]

Get a JSON string of the schema.

Returns

str, JSON string of the schema.
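
Examples

A minimal sketch building on the add_column example above:

>>> schema = ds.Schema()
>>> schema.add_column('col1', de_type=mstype.int64, shape=[2])
>>> json_str = schema.to_json()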

tinyms.data.zip(datasets)[source]

Zip the datasets in the input tuple of datasets.

Parameters

datasets (tuple of class Dataset) – A tuple of datasets to be zipped together. The number of datasets must be more than 1.

Returns

ZipDataset, dataset zipped.

Examples

>>> import mindspore.dataset as ds
>>>
>>> dataset_dir1 = "path/to/imagefolder_directory1"
>>> dataset_dir2 = "path/to/imagefolder_directory2"
>>> ds1 = ds.ImageFolderDataset(dataset_dir1, num_parallel_workers=8)
>>> ds2 = ds.ImageFolderDataset(dataset_dir2, num_parallel_workers=8)
>>>
>>> # Create a dataset which is the combination of ds1 and ds2
>>> data = ds.zip((ds1, ds2))
tinyms.data.download_dataset(dataset_name, local_path='.')[source]

This function makes it easy to download a supported public dataset without specifying many details.

Parameters
  • dataset_name (str) – The official name of the dataset; currently supports mnist, cifar10 and cifar100.

  • local_path (str) – Specifies the local location where the dataset will be downloaded. Default: ‘.’.

Returns

str, the location of the downloaded dataset.

Examples

>>> from tinyms.data import download_dataset
>>>
>>> ds_path = download_dataset('mnist')
tinyms.data.generate_image_list(dir_path, max_dataset_size=inf)[source]

Traverse the directory to generate a list of image paths.

Parameters
  • dir_path (str) – Image directory.

  • max_dataset_size (int) – Maximum number of image paths to return.

Returns

Image path list.
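
Examples

A minimal sketch (the directory is a placeholder):

>>> from tinyms.data import generate_image_list
>>>
>>> img_list = generate_image_list('/path/to/testA', max_dataset_size=100)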

tinyms.data.load_resized_img(path, width=256, height=256)[source]

Load an image in RGB mode and resize it to (width, height), (256, 256) by default.

Parameters
  • path (str) – Image path.

  • width (int) – Image width. Default: 256.

  • height (int) – Image height. Default: 256.

Returns

PIL image object.
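
Examples

A minimal sketch (the image path is a placeholder):

>>> from tinyms.data import load_resized_img
>>>
>>> img = load_resized_img('/path/to/image.jpg', width=256, height=256)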

tinyms.data.load_img(path)[source]

Load an image in RGB mode.

Parameters

path (str) – Image path.

Returns

PIL image object.
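
Examples

A minimal sketch (the image path is a placeholder):

>>> from tinyms.data import load_img
>>>
>>> img = load_img('/path/to/image.jpg')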