tinyms.data

class tinyms.data.UnalignedDataset(dataset_path, phase, max_dataset_size=inf, shuffle=True)[source]

This dataset class can load unaligned/unpaired datasets.

Parameters
  • dataset_path (str) – The path of images (should have subfolders trainA, trainB, testA, testB, etc.).

  • phase (str) – Train or test. It requires two directories in dataset_path, like trainA and trainB, to host training images from domain A ‘{dataset_path}/trainA’ and from domain B ‘{dataset_path}/trainB’ respectively.

  • max_dataset_size (int) – Maximum number of image paths to return.

Returns

Two lists of image paths, one for each domain.
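
A minimal usage sketch (the dataset path below is illustrative, not part of the original documentation):

>>> from tinyms.data import UnalignedDataset
>>>
>>> unaligned_ds = UnalignedDataset('/path/to/cyclegan_dataset', 'train')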

class tinyms.data.GanImageFolderDataset(dataset_path, max_dataset_size=inf)[source]

This dataset class can load images from an image folder.

Parameters
  • dataset_path (str) – ‘{dataset_path}/testA’, ‘{dataset_path}/testB’, etc.

  • max_dataset_size (int) – Maximum number of image paths to return.

Returns

List of image paths.
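
A minimal usage sketch (the dataset path below is illustrative):

>>> from tinyms.data import GanImageFolderDataset
>>>
>>> gan_image_ds = GanImageFolderDataset('/path/to/gan_dataset')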

class tinyms.data.ImdbDataset(imdb_path, glove_path, embed_size=300)[source]

Parse the aclImdb dataset into features and labels. Pipeline: sentence -> tokenized -> encoded -> padding -> features.

Parameters
  • imdb_path (str) – The path where the aclImdb dataset is stored.

  • glove_path (str) – The path where the GloVe vectors are stored.

  • embed_size (int) – Embedding size. Default: 300.

Examples

>>> from tinyms.data import ImdbDataset
>>>
>>> imdb_ds = ImdbDataset('./aclImdb', './glove')
convert_to_mindrecord(preprocess_path, shard_num=1)[source]

Convert the aclImdb dataset to a MindRecord dataset.

get_datas(seg)[source]

Get features, labels, and weight via gensim.

parse()[source]

Parse the aclImdb data into memory.
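
A hedged sketch of the typical call order (the preprocessing path is illustrative, and the assumption that parse() must be invoked explicitly before conversion may not hold in every TinyMS version):

>>> imdb_ds = ImdbDataset('./aclImdb', './glove')
>>> imdb_ds.parse()  # load the raw reviews into memory (assumed explicit call)
>>> imdb_ds.convert_to_mindrecord('./preprocess', shard_num=1)  # write MindRecord files under './preprocess'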

class tinyms.data.BertDataset(data_dir, schema_dir=None, shuffle=True, num_parallel_workers=None)[source]

This dataset class can load BERT pre-training data from a data folder.

Parameters
  • data_dir (str) – ‘{data_dir}/result1.tfrecord’, ‘{data_dir}/result2.tfrecord’, etc.

  • num_parallel_workers (int) – The number of concurrent workers. Default: None.

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • schema_dir (Union[str, Schema], optional) – Path to the JSON schema file or schema object (default=None). If the schema is not provided, the metadata from the TFRecord files is considered the schema.

Examples

>>> from tinyms.data import BertDataset
>>>
>>> bert_ds = BertDataset('data')
class tinyms.data.KaggleDisplayAdvertisingDataset(data_dir, num_parallel_workers=None, shuffle=True)[source]

Parse the Kaggle Display Advertising Challenge dataset into features and labels.

Parameters
  • data_dir (str) – The path where the uncompressed dataset is stored.

  • num_parallel_workers (int) – The number of concurrent workers. Default: None.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

Examples

>>> from tinyms.data import KaggleDisplayAdvertisingDataset
>>>
>>> kaggle_display_advertising_ds = KaggleDisplayAdvertisingDataset('data')
>>> kaggle_display_advertising_ds.stats_data()
>>> kaggle_display_advertising_ds.convert_to_mindrecord()
>>> train_ds = kaggle_display_advertising_ds.load_mindreocrd_dataset(usage='train')
>>> test_ds = kaggle_display_advertising_ds.load_mindreocrd_dataset(usage='test')
load_mindreocrd_dataset(usage='train', batch_size=1000)[source]

Load the MindRecord dataset.

Parameters
  • usage (str) – Dataset mode. Default: ‘train’.

  • batch_size (int) – Batch size. Default: 1000.

Returns

MindDataset

stats_data()[source]

Compute statistics of the data.

class tinyms.data.DistributedSampler(dataset_size, num_replicas=None, rank=None, shuffle=True)[source]

Distributed sampler.

Parameters
  • dataset_size (int) – Dataset list length.

  • num_replicas (int) – Number of replicas.

  • rank (int) – Device rank.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

Returns

DistributedSampler instance.
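
A minimal sketch for a 2-replica setup (the sizes are illustrative, and it is assumed that the sampler can be iterated to yield the sample indices assigned to the current replica):

>>> from tinyms.data import DistributedSampler
>>>
>>> sampler = DistributedSampler(dataset_size=100, num_replicas=2, rank=0, shuffle=True)
>>> indices = list(sampler)  # indices assigned to replica 0 (assumes the sampler is iterable)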

class tinyms.data.CelebADataset(dataset_dir, num_parallel_workers=None, shuffle=None, usage='all', sampler=None, decode=False, extensions=None, num_samples=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing the CelebA dataset. Currently it only supports reading list_attr_celeba.txt, which contains the attribute annotations of the dataset.

The generated dataset has two columns: [image, attr]. The tensor of column image is of the uint8 type. The tensor of column attr is of the uint32 type and is one-hot encoded.

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, will use value set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None).

  • usage (str, optional) – Specify the train, valid, test part or all parts of dataset (default=`all`, will read all samples).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None).

  • decode (bool, optional) – decode the images after reading (default=False).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, will include all images).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> celeba_dataset_dir = "/path/to/celeba_dataset_directory"
>>>
>>> # Read 5 samples from CelebA dataset
>>> dataset = ds.CelebADataset(dataset_dir=celeba_dataset_dir, usage='train', num_samples=5)
>>>
>>> # Note: In celeba dataset, each data dictionary owns keys "image" and "attr"

About CelebA dataset:

CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations.

The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including

  • 10,177 identities,

  • 202,599 face images, and

  • 5 landmark locations and 40 binary attribute annotations per image.

The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face detection, landmark (or facial part) localization, and face editing & synthesis.

Original CelebA dataset structure:

.
└── CelebA
     ├── README.md
     ├── Img
     │    ├── img_celeba.7z
     │    ├── img_align_celeba_png.7z
     │    └── img_align_celeba.zip
     ├── Eval
     │    └── list_eval_partition.txt
     └── Anno
          ├── list_landmarks_celeba.txt
          ├── list_landmarks_align_celeba.txt
          ├── list_bbox_celeba.txt
          ├── list_attr_celeba.txt
          └── identity_CelebA.txt

You can unzip the dataset files into the following structure and read by MindSpore’s API.

.
└── celeba_dataset_directory
    ├── list_attr_celeba.txt
    ├── 000001.jpg
    ├── 000002.jpg
    ├── 000003.jpg
    ├── ...

Citation:

@article{DBLP:journals/corr/LiuLWT14,
author        = {Ziwei Liu and Ping Luo and Xiaogang Wang and Xiaoou Tang},
title         = {Deep Learning Face Attributes in the Wild},
journal       = {CoRR},
volume        = {abs/1411.7766},
year          = {2014},
url           = {http://arxiv.org/abs/1411.7766},
archivePrefix = {arXiv},
eprint        = {1411.7766},
timestamp     = {Tue, 10 Dec 2019 15:37:26 +0100},
biburl        = {https://dblp.org/rec/journals/corr/LiuLWT14.bib},
bibsource     = {dblp computer science bibliography, https://dblp.org},
howpublished  = {http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html}
}
class tinyms.data.Cifar100Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing the Cifar100 dataset.

The generated dataset has three columns [image, coarse_label, fine_label]. The tensor of column image is of the uint8 type. The tensors of columns coarse_label and fine_label are each a scalar of the uint32 type.

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be train, test or all. train will read from 50,000 train samples, test will read from 10,000 test samples, all will read from all 60,000 samples (default=None, all samples).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, ‘num_samples’ reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> cifar100_dataset_dir = "/path/to/cifar100_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR100 dataset in sequence
>>> dataset = ds.Cifar100Dataset(dataset_dir=cifar100_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR100 dataset
>>> dataset = ds.Cifar100Dataset(dataset_dir=cifar100_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # In CIFAR100 dataset, each dictionary has 3 keys: "image", "fine_label" and "coarse_label"

About CIFAR-100 dataset:

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).

Here is the original CIFAR-100 dataset structure. You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── cifar-100-binary
    ├── train.bin
    ├── test.bin
    ├── fine_label_names.txt
    └── coarse_label_names.txt

Citation:

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html}
}
class tinyms.data.Cifar10Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading and parsing the Cifar10 dataset. This API only supports parsing Cifar10 files in the binary version now.

The generated dataset has two columns [image, label]. The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be train, test or all. train will read from 50,000 train samples, test will read from 10,000 test samples, all will read from all 60,000 samples (default=None, all samples).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> cifar10_dataset_dir = "/path/to/cifar10_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR10 dataset in sequence
>>> dataset = ds.Cifar10Dataset(dataset_dir=cifar10_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR10 dataset
>>> dataset = ds.Cifar10Dataset(dataset_dir=cifar10_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from CIFAR10 dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.Cifar10Dataset(dataset_dir=cifar10_dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In CIFAR10 dataset, each dictionary has keys "image" and "label"

About CIFAR-10 dataset:

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

Here is the original CIFAR-10 dataset structure. You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── cifar-10-batches-bin
     ├── data_batch_1.bin
     ├── data_batch_2.bin
     ├── data_batch_3.bin
     ├── data_batch_4.bin
     ├── data_batch_5.bin
     ├── test_batch.bin
     ├── readme.html
     └── batches.meta.txt

Citation:

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html}
}
class tinyms.data.CLUEDataset(dataset_files, task='AFQMC', usage='train', num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses CLUE datasets. Supported CLUE classification tasks: AFQMC, TNEWS, IFLYTEK, CMNLI, WSC and CSL.

The generated dataset with different task setting has different output columns:

  • task = AFQMC
    • usage = train, output columns: [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string].

    • usage = test, output columns: [id, dtype=uint8], [sentence1, dtype=string], [sentence2, dtype=string].

    • usage = eval, output columns: [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string].

  • task = TNEWS
    • usage = train, output columns: [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string], [keywords, dtype=string].

    • usage = test, output columns: [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string], [keywords, dtype=string].

    • usage = eval, output columns: [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string], [keywords, dtype=string].

  • task = IFLYTEK
    • usage = train, output columns: [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string].

    • usage = test, output columns: [id, dtype=string], [sentence, dtype=string].

    • usage = eval, output columns: [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string].

  • task = CMNLI
    • usage = train, output columns: [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string].

    • usage = test, output columns: [id, dtype=uint8], [sentence1, dtype=string], [sentence2, dtype=string].

    • usage = eval, output columns: [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string].

  • task = WSC
    • usage = train, output columns: [span1_index, dtype=uint8], [span2_index, dtype=uint8], [span1_text, dtype=string], [span2_text, dtype=string], [idx, dtype=uint8], [text, dtype=string], [label, dtype=string].

    • usage = test, output columns: [span1_index, dtype=uint8], [span2_index, dtype=uint8], [span1_text, dtype=string], [span2_text, dtype=string], [idx, dtype=uint8], [text, dtype=string].

    • usage = eval, output columns: [span1_index, dtype=uint8], [span2_index, dtype=uint8], [span1_text, dtype=string], [span2_text, dtype=string], [idx, dtype=uint8], [text, dtype=string], [label, dtype=string].

  • task = CSL
    • usage = train, output columns: [id, dtype=uint8], [abst, dtype=string], [keyword, dtype=string], [label, dtype=string].

    • usage = test, output columns: [id, dtype=uint8], [abst, dtype=string], [keyword, dtype=string].

    • usage = eval, output columns: [id, dtype=uint8], [abst, dtype=string], [keyword, dtype=string], [label, dtype=string].

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • task (str, optional) – The kind of task, one of AFQMC, TNEWS, IFLYTEK, CMNLI, WSC and CSL. (default=AFQMC).

  • usage (str, optional) – Specify the train, test or eval part of dataset (default=”train”).

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, will include all samples).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_files are not valid or do not exist.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

Examples

>>> clue_dataset_dir = ["/path/to/clue_dataset_file"] # contains 1 or multiple clue files
>>> dataset = ds.CLUEDataset(dataset_files=clue_dataset_dir, task='AFQMC', usage='train')

About CLUE dataset:

CLUE is a Chinese Language Understanding Evaluation benchmark. It contains multiple tasks, including single-sentence classification, sentence pair classification, and machine reading comprehension.

You can unzip the dataset files into the following structure and read by MindSpore’s API, such as afqmc dataset:

.
└── afqmc_public
     ├── train.json
     ├── test.json
     └── dev.json

Citation:

@article{CLUEbenchmark,
title   = {CLUE: A Chinese Language Understanding Evaluation Benchmark},
author  = {Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li,
        Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Jun Zeng,
        Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou,
        Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Zhenzhong Lan},
journal = {arXiv preprint arXiv:2004.05986},
year    = {2020},
howpublished = {https://github.com/CLUEbenchmark/CLUE}
}
class tinyms.data.CocoDataset(dataset_dir, annotation_file, task='Detection', num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None, extra_metadata=False)[source]

A source dataset for reading and parsing COCO dataset.

CocoDataset supports four kinds of tasks, which are Object Detection, Keypoint Detection, Stuff Segmentation and Panoptic Segmentation of 2017 Train/Val/Test dataset.

The generated dataset with different task setting has different output columns:

  • task = Detection, output columns: [image, dtype=uint8], [bbox, dtype=float32], [category_id, dtype=uint32], [iscrowd, dtype=uint32].

  • task = Stuff, output columns: [image, dtype=uint8], [segmentation, dtype=float32], [iscrowd, dtype=uint32].

  • task = Keypoint, output columns: [image, dtype=uint8], [keypoints, dtype=float32], [num_keypoints, dtype=uint32].

  • task = Panoptic, output columns: [image, dtype=uint8], [bbox, dtype=float32], [category_id, dtype=uint32], [iscrowd, dtype=uint32], [area, dtype=uint32].

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • annotation_file (str) – Path to the annotation JSON file.

  • task (str, optional) – Set the task type for reading COCO data. Supported task types: Detection, Stuff, Panoptic and Keypoint (default=`Detection`).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the configuration file).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

  • extra_metadata (bool, optional) – Flag to add extra meta-data to row. If True, an additional column will be output at the end [_meta-filename, dtype=string] (default=False).

Raises
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If parsing the JSON file failed.

  • ValueError – If task is not in [Detection, Stuff, Panoptic, Keypoint].

  • ValueError – If annotation_file does not exist.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • Column ‘[_meta-filename, dtype=string]’ won’t be output unless an explicit rename dataset op is added to remove the prefix(‘_meta-‘) (a hedged sketch follows the examples below).

  • CocoDataset doesn’t support PKSampler.

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> coco_dataset_dir = "/path/to/coco_dataset_directory/images"
>>> coco_annotation_file = "/path/to/coco_dataset_directory/annotation_file"
>>>
>>> # 1) Read COCO data for Detection task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Detection')
>>>
>>> # 2) Read COCO data for Stuff task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Stuff')
>>>
>>> # 3) Read COCO data for Panoptic task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Panoptic')
>>>
>>> # 4) Read COCO data for Keypoint task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Keypoint')
>>>
>>> # In COCO dataset, each dictionary has keys "image" and "annotation"
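
A hedged sketch of exposing the extra metadata column mentioned in the note above (assumes extra_metadata=True and that a rename op is used to strip the ‘_meta-‘ prefix):

>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Detection',
...                          extra_metadata=True)
>>> dataset = dataset.rename(input_columns=["_meta-filename"], output_columns=["filename"])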

About COCO dataset:

COCO (Microsoft Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset with several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with keypoints. In contrast to the popular ImageNet dataset, COCO has fewer categories but more instances per category.

You can unzip the original COCO-2017 dataset files into this directory structure and read by MindSpore’s API.

.
└── coco_dataset_directory
     ├── train2017
     │    ├── 000000000009.jpg
     │    ├── 000000000025.jpg
     │    ├── ...
     ├── test2017
     │    ├── 000000000001.jpg
     │    ├── 000000058136.jpg
     │    ├── ...
     ├── val2017
     │    ├── 000000000139.jpg
     │    ├── 000000057027.jpg
     │    ├── ...
     └── annotations
          ├── captions_train2017.json
          ├── captions_val2017.json
          ├── instances_train2017.json
          ├── instances_val2017.json
          ├── person_keypoints_train2017.json
          └── person_keypoints_val2017.json

Citation:

@article{DBLP:journals/corr/LinMBHPRDZ14,
author        = {Tsung{-}Yi Lin and Michael Maire and Serge J. Belongie and
                Lubomir D. Bourdev and  Ross B. Girshick and James Hays and
                Pietro Perona and Deva Ramanan and Piotr Doll{'{a}}r and C. Lawrence Zitnick},
title         = {Microsoft {COCO:} Common Objects in Context},
journal       = {CoRR},
volume        = {abs/1405.0312},
year          = {2014},
url           = {http://arxiv.org/abs/1405.0312},
archivePrefix = {arXiv},
eprint        = {1405.0312},
timestamp     = {Mon, 13 Aug 2018 16:48:13 +0200},
biburl        = {https://dblp.org/rec/journals/corr/LinMBHPRDZ14.bib},
bibsource     = {dblp computer science bibliography, https://dblp.org}
}
get_class_indexing()[source]

Get the class index.

Returns

dict, a str-to-list<int> mapping from label name to index.

Examples

>>> coco_dataset_dir = "/path/to/coco_dataset_directory/images"
>>> coco_annotation_file = "/path/to/coco_dataset_directory/annotation_file"
>>>
>>> # Read COCO data for Detection task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Detection')
>>>
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.CSVDataset(dataset_files, field_delim=',', column_defaults=None, column_names=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses comma-separated values (CSV) datasets. The columns of generated dataset depend on the source CSV files.

Parameters
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • field_delim (str, optional) – A string that indicates the char delimiter to separate fields (default=’,’).

  • column_defaults (list, optional) – List of default values for the CSV fields (default=None). Each item in the list is a valid type (float, int, or string). If this is not provided, all columns are treated as string type.

  • column_names (list[str], optional) – List of column names of the dataset (default=None). If this is not provided, the column names are inferred from the first row of the CSV file.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, will include all samples).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_files are not valid or do not exist.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

Examples

>>> csv_dataset_dir = ["/path/to/csv_dataset_file"] # contains 1 or multiple csv files
>>> dataset = ds.CSVDataset(dataset_files=csv_dataset_dir, column_names=['col1', 'col2', 'col3', 'col4'])
class tinyms.data.GeneratorDataset(source, column_names=None, column_types=None, schema=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None, python_multiprocessing=True, max_rowsize=6)[source]

A source dataset that generates data from Python by invoking Python data source each epoch.

The column names and column types of generated dataset depend on Python data defined by users.

Parameters
  • source (Union[Callable, Iterable, Random Accessible]) – A generator callable object, an iterable Python object or a random accessible Python object. Callable source is required to return a tuple of NumPy arrays as a row of the dataset on source().next(). Iterable source is required to return a tuple of NumPy arrays as a row of the dataset on iter(source).next(). Random accessible source is required to return a tuple of NumPy arrays as a row of the dataset on source[idx].

  • column_names (Union[str, list[str]], optional) – List of column names of the dataset (default=None). Users are required to provide either column_names or schema.

  • column_types (list[mindspore.dtype], optional) – List of column data types of the dataset (default=None). If provided, sanity check will be performed on generator output.

  • schema (Union[Schema, str], optional) – Path to the JSON schema file or schema object (default=None). Users are required to provide either column_names or schema. If both are provided, schema will be used.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). Random accessible input is required. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument must be specified only when num_shards is also specified. Random accessible input is required.

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option could be beneficial if the Python operation is computationally heavy (default=True).

  • max_rowsize (int, optional) – Maximum size of row in MB that is used for shared memory allocation to copy data between processes. This is only used if python_multiprocessing is set to True (default 6 MB).

Raises
  • RuntimeError – If source raises an exception during execution.

  • RuntimeError – If len of column_names does not match output len of source.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> import numpy as np
>>>
>>> # 1) Multidimensional generator function as callable input.
>>> def generator_multidimensional():
...     for i in range(64):
...         yield (np.array([[i, i + 1], [i + 2, i + 3]]),)
>>>
>>> dataset = ds.GeneratorDataset(source=generator_multidimensional, column_names=["multi_dimensional_data"])
>>>
>>> # 2) Multi-column generator function as callable input.
>>> def generator_multi_column():
...     for i in range(64):
...         yield np.array([i]), np.array([[i, i + 1], [i + 2, i + 3]])
>>>
>>> dataset = ds.GeneratorDataset(source=generator_multi_column, column_names=["col1", "col2"])
>>>
>>> # 3) Iterable dataset as iterable input.
>>> class MyIterable:
...     def __init__(self):
...         self._index = 0
...         self._data = np.random.sample((5, 2))
...         self._label = np.random.sample((5, 1))
...
...     def __next__(self):
...         if self._index >= len(self._data):
...             raise StopIteration
...         else:
...             item = (self._data[self._index], self._label[self._index])
...             self._index += 1
...             return item
...
...     def __iter__(self):
...         self._index = 0
...         return self
...
...     def __len__(self):
...         return len(self._data)
>>>
>>> dataset = ds.GeneratorDataset(source=MyIterable(), column_names=["data", "label"])
>>>
>>> # 4) Random accessible dataset as random accessible input.
>>> class MyAccessible:
...     def __init__(self):
...         self._data = np.random.sample((5, 2))
...         self._label = np.random.sample((5, 1))
...
...     def __getitem__(self, index):
...         return self._data[index], self._label[index]
...
...     def __len__(self):
...         return len(self._data)
>>>
>>> dataset = ds.GeneratorDataset(source=MyAccessible(), column_names=["data", "label"])
>>>
>>> # list, dict, tuple of Python is also random accessible
>>> dataset = ds.GeneratorDataset(source=[(np.array(0),), (np.array(1),), (np.array(2),)], column_names=["col"])
class tinyms.data.GraphData(dataset_file, num_parallel_workers=None, working_mode='local', hostname='127.0.0.1', port=50051, num_client=1, auto_shutdown=True)[source]

Reads the graph dataset used for GNN training from the shared file and database.

Parameters
  • dataset_file (str) – One of file names in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel (default=None).

  • working_mode (str, optional) –

    Set working mode, now supports ‘local’/’client’/’server’ (default=’local’).

    • ’local’, used in non-distributed training scenarios.

    • ’client’, used in distributed training scenarios. The client does not load data, but obtains data from the server.

    • ’server’, used in distributed training scenarios. The server loads the data and is available to the client.

  • hostname (str, optional) – Hostname of the graph data server. This parameter is only valid when working_mode is set to ‘client’ or ‘server’ (default=’127.0.0.1’).

  • port (int, optional) – Port of the graph data server. The range is 1024-65535. This parameter is only valid when working_mode is set to ‘client’ or ‘server’ (default=50051).

  • num_client (int, optional) – Maximum number of clients expected to connect to the server. The server will allocate resources according to this parameter. This parameter is only valid when working_mode is set to ‘server’ (default=1).

  • auto_shutdown (bool, optional) – Valid when working_mode is set to ‘server’, when the number of connected clients reaches num_client and no client is being connected, the server automatically exits (default=True).

Examples

>>> graph_dataset_dir = "/path/to/graph_dataset_file"
>>> graph_dataset = ds.GraphData(dataset_file=graph_dataset_dir, num_parallel_workers=2)
>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> features = graph_dataset.get_node_feature(node_list=nodes, feature_types=[1])
get_all_edges(edge_type)[source]

Get all edges in the graph.

Parameters

edge_type (int) – Specify the type of edge.

Returns

numpy.ndarray, array of edges.

Examples

>>> edges = graph_dataset.get_all_edges(edge_type=0)
Raises

TypeError – If edge_type is not an integer.

get_all_neighbors(node_list, neighbor_type, output_format=<OutputFormat.NORMAL: 0>)[source]

Get neighbor_type neighbors of the nodes in node_list. The following example illustrates the definition of the output formats: 1 represents that two nodes are connected, and 0 represents that they are not connected.

Adjacent Matrix

        0   1   2   3
    0   0   1   0   0
    1   0   0   1   0
    2   1   0   0   1
    3   1   0   0   0

Normal Format

    src     0   1   2   3
    dst_0   1   2   0   1
    dst_1  -1  -1   3  -1

COO Format

    src   0   1   2   2   3
    dst   1   2   0   3   1

CSR Format

    offsetTable   0   1   2   4
    dstTable      1   2   0   3   1

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_type (int) – Specify the type of neighbor.

  • output_format (OutputFormat, optional) – Output storage format (default=OutputFormat.NORMAL). It can be any of [OutputFormat.NORMAL, OutputFormat.COO, OutputFormat.CSR].

Returns

For the NORMAL and COO formats, a numpy.ndarray representing the array of neighbors is returned. If the CSR format is specified, two numpy.ndarrays are returned: the first one is the offset table, the second one is the neighbors.

Examples

>>> from mindspore.dataset.engine import OutputFormat
>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> neighbors = graph_dataset.get_all_neighbors(node_list=nodes, neighbor_type=2)
>>> neighbors_coo = graph_dataset.get_all_neighbors(node_list=nodes, neighbor_type=2,
...                                                 output_format=OutputFormat.COO)
>>> offset_table, neighbors_csr = graph_dataset.get_all_neighbors(node_list=nodes, neighbor_type=2,
...                                                               output_format=OutputFormat.CSR)
Raises
  • TypeError – If node_list is not a list or ndarray.

  • TypeError – If neighbor_type is not an integer.

get_all_nodes(node_type)[source]

Get all nodes in the graph.

Parameters

node_type (int) – Specify the type of node.

Returns

numpy.ndarray, array of nodes.

Examples

>>> nodes = graph_dataset.get_all_nodes(node_type=1)
Raises

TypeError – If node_type is not an integer.

get_edge_feature(edge_list, feature_types)[source]

Get feature_types feature of the edges in edge_list.

Parameters
  • edge_list (Union[list, numpy.ndarray]) – The given list of edges.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types.

Returns

numpy.ndarray, array of features.

Examples

>>> edges = graph_dataset.get_all_edges(edge_type=0)
>>> features = graph_dataset.get_edge_feature(edge_list=edges, feature_types=[1])
Raises
  • TypeError – If edge_list is not a list or ndarray.

  • TypeError – If feature_types is not a list or ndarray.

get_edges_from_nodes(node_list)[source]

Get edges from the nodes.

Parameters

node_list (Union[list[tuple], numpy.ndarray]) – The given list of node ID pairs.

Returns

numpy.ndarray, array of edge IDs.

Examples

>>> edges = graph_dataset.get_edges_from_nodes(node_list=[(101, 201), (103, 207)])
Raises

TypeError – If node_list is not a list or ndarray.

get_neg_sampled_neighbors(node_list, neg_neighbor_num, neg_neighbor_type)[source]

Get neg_neighbor_type negative sampled neighbors of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neg_neighbor_num (int) – Number of neighbors sampled.

  • neg_neighbor_type (int) – Specify the type of negative neighbor.

Returns

numpy.ndarray, array of neighbors.

Examples

>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> neg_neighbors = graph_dataset.get_neg_sampled_neighbors(node_list=nodes, neg_neighbor_num=5,
...                                                         neg_neighbor_type=2)
Raises
  • TypeError – If node_list is not a list or ndarray.

  • TypeError – If neg_neighbor_num is not an integer.

  • TypeError – If neg_neighbor_type is not an integer.

get_node_feature(node_list, feature_types)[source]

Get feature_types feature of the nodes in node_list.

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types.

Returns

numpy.ndarray, array of features.

Examples

>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> features = graph_dataset.get_node_feature(node_list=nodes, feature_types=[2, 3])
Raises
  • TypeError – If node_list is not a list or ndarray.

  • TypeError – If feature_types is not a list or ndarray.

get_nodes_from_edges(edge_list)[source]

Get nodes from the edges.

Parameters

edge_list (Union[list, numpy.ndarray]) – The given list of edges.

Returns

numpy.ndarray, array of nodes.

Raises

TypeError – If edge_list is not a list or ndarray.
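
A hedged usage sketch, reusing the graph_dataset object from the class-level example (edge_type=0 is illustrative):

>>> edges = graph_dataset.get_all_edges(edge_type=0)
>>> nodes = graph_dataset.get_nodes_from_edges(edge_list=edges)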

get_sampled_neighbors(node_list, neighbor_nums, neighbor_types, strategy=<SamplingStrategy.RANDOM: 0>)[source]

Get sampled neighbor information.

The API supports multi-hop neighbor sampling, that is, the previous sampling result is used as the input of the next-hop sampling. A maximum of 6 hops is allowed.

The sampling result is tiled into a list in the format of [input node, 1-hop sampling result, 2-hop sampling result …].

Parameters
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_nums (Union[list, numpy.ndarray]) – Number of neighbors sampled per hop.

  • neighbor_types (Union[list, numpy.ndarray]) – Neighbor type sampled per hop.

  • strategy (SamplingStrategy, optional) –

    Sampling strategy (default=SamplingStrategy.RANDOM). It can be any of [SamplingStrategy.RANDOM, SamplingStrategy.EDGE_WEIGHT].

    • SamplingStrategy.RANDOM, random sampling with replacement.

    • SamplingStrategy.EDGE_WEIGHT, sampling with edge weight as probability.

Returns

numpy.ndarray, array of neighbors.

Examples

>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> neighbors = graph_dataset.get_sampled_neighbors(node_list=nodes, neighbor_nums=[2, 2],
...                                                 neighbor_types=[2, 1])
Raises
  • TypeError – If node_list is not a list or ndarray.

  • TypeError – If neighbor_nums is not a list or ndarray.

  • TypeError – If neighbor_types is not a list or ndarray.

graph_info()[source]

Get the meta information of the graph, including the number of nodes, the type of nodes, the feature information of nodes, the number of edges, the type of edges, and the feature information of edges.

Returns

dict, meta information of the graph. The keys are node_type, edge_type, node_num, edge_num, node_feature_type and edge_feature_type.
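
A minimal sketch, reusing the graph_dataset object from the class-level example (the key lookup assumes the returned dict exposes the keys listed above):

>>> info = graph_dataset.graph_info()
>>> node_types = info['node_type']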

random_walk(target_nodes, meta_path, step_home_param=1.0, step_away_param=1.0, default_node=-1)[source]

Random walk in nodes.

Parameters
  • target_nodes (list[int]) – Start node list in the random walk.

  • meta_path (list[int]) – Node type for each walk step.

  • step_home_param (float, optional) – Return hyperparameter in the node2vec algorithm (default=1.0).

  • step_away_param (float, optional) – In-out hyperparameter in the node2vec algorithm (default=1.0).

  • default_node (int, optional) – Default node if no more neighbors are found (default=-1). A default value of -1 indicates that no node is given.

Returns

numpy.ndarray, array of nodes.

Examples

>>> nodes = graph_dataset.get_all_nodes(node_type=1)
>>> walks = graph_dataset.random_walk(target_nodes=nodes, meta_path=[2, 1, 2])
Raises
  • TypeError – If target_nodes is not a list or ndarray.

  • TypeError – If meta_path is not a list or ndarray.

class tinyms.data.ImageFolderDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, extensions=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads images from a tree of directories. All images within one folder have the same label.

The generated dataset has two columns: [image, label]. The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • extensions (list[str], optional) – List of file extensions to be included in the dataset (default=None).

  • class_indexing (dict, optional) – A str-to-int mapping from folder name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise.

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> image_folder_dataset_dir = "/path/to/image_folder_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in image_folder_dataset_dir with 8 threads
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir,
...                                 num_parallel_workers=8)
>>>
>>> # 2) Read all samples (image files) from folder cat and folder dog with label 0 and 1
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir,
...                                 class_indexing={"cat":0, "dog":1})
>>>
>>> # 3) Read all samples (image files) in image_folder_dataset_dir with extensions .JPEG and .png (case sensitive)
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir,
...                                 extensions=[".JPEG", ".png"])

About ImageFolderDataset:

You can construct the following directory structure from your dataset files and read by MindSpore’s API.

.
└── image_folder_dataset_directory
     ├── class1
     │    ├── 000000000001.jpg
     │    ├── 000000000002.jpg
     │    ├── ...
     ├── class2
     │    ├── 000000000001.jpg
     │    ├── 000000000002.jpg
     │    ├── ...
     ├── class3
     │    ├── 000000000001.jpg
     │    ├── 000000000002.jpg
     │    ├── ...
     ├── classN
     ├── ...
class tinyms.data.ManifestDataset(dataset_file, usage='train', num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading images from a Manifest file.

The generated dataset has two columns: [image, label]. The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint64 type.

Parameters
  • dataset_file (str) – File to be read.

  • usage (str, optional) – Acceptable usages include train, eval and inference (default=`train`).

  • num_samples (int, optional) – The number of images to be included in the dataset. (default=None, will include all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, will use value set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • decode (bool, optional) – decode the images after reading (default=False).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the max number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

Raises
  • RuntimeError – If dataset_files are not valid or do not exist.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

Note

  • The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise.

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

  Parameter sampler | Parameter shuffle | Expected Order Behavior
  ------------------|-------------------|--------------------------
  None              | None              | random order
  None              | True              | random order
  None              | False             | sequential order
  Sampler object    | None              | order defined by sampler
  Sampler object    | True              | not allowed
  Sampler object    | False             | not allowed

Examples

>>> manifest_dataset_dir = "/path/to/manifest_dataset_file"
>>>
>>> # 1) Read all samples specified in manifest_dataset_dir dataset with 8 threads for training
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, usage="train", num_parallel_workers=8)
>>>
>>> # 2) Read samples (specified in manifest_file.manifest) for shard 0 in a 2-way distributed training setup
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, num_shards=2, shard_id=0)
get_class_indexing()[源代码]

Get the class index.

返回

dict, a str-to-int mapping from label name to index.

实际案例

>>> manifest_dataset_dir = "/path/to/manifest_dataset_file"
>>>
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir)
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.MindDataset(dataset_file, columns_list=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, sampler=None, padded_sample=None, num_padded=None, num_samples=None, cache=None)[源代码]

A source dataset for reading and parsing MindRecord dataset.

The columns of generated dataset depend on the source MindRecord files.

参数
  • dataset_file (Union[str, list[str]]) – If dataset_file is a str, it represents a file name of one component of a MindRecord source, and other files with an identical source in the same path will be found and loaded automatically. If dataset_file is a list, it represents a list of dataset files to be read directly.

  • columns_list (list[str], optional) – List of columns to be read (default=None).

  • num_parallel_workers (int, optional) – The number of readers (default=None).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=None, performs global shuffle). If shuffle is False, no shuffling will be performed; If shuffle is True, the behavior is the same as setting shuffle to be Shuffle.GLOBAL Otherwise, there are three levels of shuffling:

    • Shuffle.GLOBAL: Global shuffle of all rows of data in dataset.

    • Shuffle.FILES: Shuffle the file sequence but keep the order of data within each file.

    • Shuffle.INFILE: Keep the file sequence the same but shuffle the data within each file.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, sampler is exclusive with shuffle and block_reader). Support list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.

  • padded_sample (dict, optional) – Samples will be appended to the dataset, where keys are the same as columns_list.

  • num_padded (int, optional) – Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

引发
  • RuntimeError – If dataset_files are not valid or do not exist.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

注解

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler   Parameter shuffle   Expected Order Behavior
None                None                random order
None                True                random order
None                False               sequential order
Sampler object      None                order defined by sampler
Sampler object      True                not allowed
Sampler object      False               not allowed

实际案例

>>> mind_dataset_dir = ["/path/to/mind_dataset_file"] # contains 1 or multiple MindRecord files
>>> dataset = ds.MindDataset(dataset_file=mind_dataset_dir)
class tinyms.data.MnistDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[源代码]

A source dataset for reading and parsing the MNIST dataset.

The generated dataset has two columns [image, label]. The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

参数
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be train, test or all . train will read from 60,000 train samples, test will read from 10,000 test samples, all will read from all 70,000 samples. (default=None, will read all samples)

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, will read all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, will use value set in the config).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

引发
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

注解

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler   Parameter shuffle   Expected Order Behavior
None                None                random order
None                True                random order
None                False               sequential order
Sampler object      None                order defined by sampler
Sampler object      True                not allowed
Sampler object      False               not allowed

实际案例

>>> mnist_dataset_dir = "/path/to/mnist_dataset_directory"
>>>
>>> # Read 3 samples from MNIST dataset
>>> dataset = ds.MnistDataset(dataset_dir=mnist_dataset_dir, num_samples=3)
>>>
>>> # Note: In mnist_dataset dataset, each dictionary has keys "image" and "label"

About MNIST dataset:

The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

Here is the original MNIST dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── mnist_dataset_dir
     ├── t10k-images-idx3-ubyte
     ├── t10k-labels-idx1-ubyte
     ├── train-images-idx3-ubyte
     └── train-labels-idx1-ubyte

Citation:

@article{lecun2010mnist,
title        = {MNIST handwritten digit database},
author       = {LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal      = {ATT Labs [Online]},
volume       = {2},
year         = {2010},
howpublished = {http://yann.lecun.com/exdb/mnist}
}
class tinyms.data.NumpySlicesDataset(data, column_names=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None)[源代码]

Creates a dataset with given data slices, mainly for loading Python data into dataset.

The column names and column types of generated dataset depend on Python data defined by users.

参数
  • data (Union[list, tuple, dict]) – Input data in list, tuple, dict or other NumPy formats. The input data will be sliced along the first dimension to generate rows; if the input is a list, there will be one column in each row, otherwise there tend to be multiple columns. Loading large data this way is not recommended, since the data is loaded into memory.

  • column_names (list[str], optional) – List of column names of the dataset (default=None). If column_names is not provided, the output column names will be named as the keys of dict when the input data is a dict, otherwise they will be named like column_0, column_1 …

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, all samples).

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel (default=1).

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. (default=None, expected order behavior shown in the table).

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). Random accessible input is required. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument must be specified only when num_shards is also specified. Random accessible input is required.

注解

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler   Parameter shuffle   Expected Order Behavior
None                None                random order
None                True                random order
None                False               sequential order
Sampler object      None                order defined by sampler
Sampler object      True                not allowed
Sampler object      False               not allowed

引发
  • RuntimeError – If len of column_names does not match output len of data.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

实际案例

>>> # 1) Input data can be a list
>>> data = [1, 2, 3]
>>> dataset = ds.NumpySlicesDataset(data=data, column_names=["column_1"])
>>>
>>> # 2) Input data can be a dictionary, and column_names will be its keys
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> dataset = ds.NumpySlicesDataset(data=data)
>>>
>>> # 3) Input data can be a tuple of lists (or NumPy arrays), each tuple element refers to data in each column
>>> data = ([1, 2], [3, 4], [5, 6])
>>> dataset = ds.NumpySlicesDataset(data=data, column_names=["column_1", "column_2", "column_3"])
>>>
>>> # 4) Load data from CSV file
>>> import pandas as pd
>>> df = pd.read_csv(filepath_or_buffer=csv_dataset_dir[0])
>>> dataset = ds.NumpySlicesDataset(data=dict(df), shuffle=False)
class tinyms.data.PaddedDataset(padded_samples)[源代码]

Creates a dataset with filler data provided by the user. It is mainly used to append padding samples to the original dataset and assign them to the corresponding shard.

参数

padded_samples (list(dict)) – Samples provided by user.

引发
  • TypeError – If padded_samples is not an instance of list.

  • TypeError – If the element of padded_samples is not an instance of dict.

  • ValueError – If the padded_samples is empty.

实际案例

>>> import numpy as np
>>> data = [{'image': np.zeros(1, np.uint8)}, {'image': np.zeros(2, np.uint8)}]
>>> dataset = ds.PaddedDataset(padded_samples=data)
class tinyms.data.TextFileDataset(dataset_files, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[源代码]

A source dataset that reads and parses datasets stored on disk in text format. The generated dataset has one column [text] with type string.

参数
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • num_samples (int, optional) – The number of samples to be included in the dataset (default=None, will include all samples).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed; If shuffle is True, the behavior is the same as setting shuffle to be Shuffle.GLOBAL Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

引发
  • RuntimeError – If dataset_files are not valid or do not exist.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

实际案例

>>> text_file_dataset_dir = ["/path/to/text_file_dataset_file"] # contains 1 or multiple text files
>>> dataset = ds.TextFileDataset(dataset_files=text_file_dataset_dir)
class tinyms.data.TFRecordDataset(dataset_files, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, shard_equal_rows=False, cache=None)[源代码]

A source dataset for reading and parsing datasets stored on disk in TFData format.

The columns of generated dataset depend on the source TFRecord files.

参数
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • schema (Union[str, Schema], optional) – Path to the JSON schema file or schema object (default=None). If the schema is not provided, the meta data from the TFData file is considered the schema.

  • columns_list (list[str], optional) – List of columns to be read (default=None, read all columns).

  • num_samples (int, optional) – The number of samples (rows) to be included in the dataset (default=None). If num_samples is None and numRows(parsed from schema) does not exist, read the full dataset; If num_samples is None and numRows(parsed from schema) is greater than 0, read numRows rows; If both num_samples and numRows(parsed from schema) are greater than 0, read num_samples rows.

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch (default=Shuffle.GLOBAL). If shuffle is False, no shuffling will be performed; If shuffle is True, the behavior is the same as setting shuffle to be Shuffle.GLOBAL Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • shard_equal_rows (bool, optional) – Get equal rows for all shards (default=False). If shard_equal_rows is False, the number of rows in each shard may differ, which may lead to a failure in distributed training. It is suggested to set this to True when the number of samples per TFRecord file is not equal. This argument should only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

引发
  • RuntimeError – If dataset_files are not valid or do not exist.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

实际案例

>>> from mindspore import dtype as mstype
>>>
>>> tfrecord_dataset_dir = ["/path/to/tfrecord_dataset_file"] # contains 1 or multiple TFRecord files
>>> tfrecord_schema_file = "/path/to/tfrecord_schema_file"
>>>
>>> # 1) Get all rows from tfrecord_dataset_dir with no explicit schema.
>>> # The meta-data in the first row will be used as a schema.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir)
>>>
>>> # 2) Get all rows from tfrecord_dataset_dir with user-defined schema.
>>> schema = ds.Schema()
>>> schema.add_column(name='col_1d', de_type=mstype.int64, shape=[2])
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir, schema=schema)
>>>
>>> # 3) Get all rows from tfrecord_dataset_dir with schema file.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir, schema=tfrecord_schema_file)
class tinyms.data.VOCDataset(dataset_dir, task='Segmentation', usage='train', class_indexing=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None, extra_metadata=False)[源代码]

A source dataset for reading and parsing VOC dataset.

The generated dataset with different task setting has different output columns:

  • task = Detection, output columns: [image, dtype=uint8], [bbox, dtype=float32], [label, dtype=uint32], [difficult, dtype=uint32], [truncate, dtype=uint32].

  • task = Segmentation, output columns: [image, dtype=uint8], [target,dtype=uint8].

参数
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str, optional) – Set the task type of reading VOC data; currently only Segmentation or Detection is supported (default=`Segmentation`).

  • usage (str, optional) – Set the usage of ImageSets (default=`train`). If task is Segmentation, the image and annotation list will be loaded from ./ImageSets/Segmentation/usage + “.txt”; if task is Detection, the image and annotation list will be loaded from ./ImageSets/Main/usage + “.txt”; if task and usage are not set, the image and annotation list will be loaded from ./ImageSets/Segmentation/train.txt by default.

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index, only valid in Detection task (default=None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0).

  • num_samples (int, optional) – The number of images to be included in the dataset (default=None, all images).

  • num_parallel_workers (int, optional) – Number of workers to read the data (default=None, number set in the config).

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset (default=None, expected order behavior shown in the table).

  • decode (bool, optional) – Decode the images after reading (default=False).

  • sampler (Sampler, optional) – Object used to choose samples from the dataset (default=None, expected order behavior shown in the table).

  • num_shards (int, optional) – Number of shards that the dataset will be divided into (default=None). When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards (default=None). This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. (default=None, which means no cache is used).

  • extra_metadata (bool, optional) – Flag to add extra meta-data to row. If True, an additional column named [_meta-filename, dtype=string] will be output at the end (default=False).

引发
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If the XML of Annotations is in an invalid format.

  • RuntimeError – If the XML of Annotations lacks the attribute of object.

  • RuntimeError – If the XML of Annotations lacks the attribute of bndbox.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If task is not equal to ‘Segmentation’ or ‘Detection’.

  • ValueError – If task equals ‘Segmentation’ but class_indexing is not None.

  • ValueError – If the txt file related to usage does not exist.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards).

注解

  • Column ‘[_meta-filename, dtype=string]’ won’t be output unless an explicit rename dataset op is added to remove the prefix(‘_meta-‘).

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler   Parameter shuffle   Expected Order Behavior
None                None                random order
None                True                random order
None                False               sequential order
Sampler object      None                order defined by sampler
Sampler object      True                not allowed
Sampler object      False               not allowed

实际案例

>>> voc_dataset_dir = "/path/to/voc_dataset_directory"
>>>
>>> # 1) Read VOC data for segmentation training
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Segmentation", usage="train")
>>>
>>> # 2) Read VOC data for detection training
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train")
>>>
>>> # 3) Read all VOC dataset samples in voc_dataset_dir with 8 threads in random order
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train",
...                         num_parallel_workers=8)
>>>
>>> # 4) Read then decode all VOC dataset samples in voc_dataset_dir in sequence
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train",
...                         decode=True, shuffle=False)
>>>
>>> # In VOC dataset, if task='Segmentation', each dictionary has keys "image" and "target"
>>> # In VOC dataset, if task='Detection', each dictionary has keys "image" and "annotation"

About VOC dataset.

The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures.

You can unzip the original VOC-2012 dataset files into this directory structure and read by MindSpore’s API.

.
└── voc2012_dataset_dir
    ├── Annotations
    │    ├── 2007_000027.xml
    │    ├── 2007_000032.xml
    │    ├── ...
    ├── ImageSets
    │    ├── Action
    │    ├── Layout
    │    ├── Main
    │    └── Segmentation
    ├── JPEGImages
    │    ├── 2007_000027.jpg
    │    ├── 2007_000032.jpg
    │    ├── ...
    ├── SegmentationClass
    │    ├── 2007_000032.png
    │    ├── 2007_000033.png
    │    ├── ...
    └── SegmentationObject
         ├── 2007_000032.png
         ├── 2007_000033.png
         ├── ...

Citation:

@article{Everingham10,
author       = {Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.},
title        = {The Pascal Visual Object Classes (VOC) Challenge},
journal      = {International Journal of Computer Vision},
volume       = {88},
year         = {2010},
number       = {2},
month        = {jun},
pages        = {303--338},
biburl       = {http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.html#bibtex},
howpublished = {http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html}
}
get_class_indexing()[源代码]

Get the class index.

返回

dict, a str-to-int mapping from label name to index.

实际案例

>>> voc_dataset_dir = "/path/to/voc_dataset_directory"
>>>
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir)
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.DistributedSampler(dataset_size, num_replicas=None, rank=None, shuffle=True)[源代码]

Distributed sampler.

参数
  • dataset_size (int) – Length of the dataset.

  • num_replicas (int) – Number of replicas. Default: None.

  • rank (int) – Rank of the current device. Default: None.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

返回

DistributedSampler instance.
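
No usage example is given above; the following minimal sketch (the dataset size of 100, the replica count of 2, the rank of 0, and the attachment to a GeneratorDataset mentioned in the comment are illustrative assumptions) shows how such a sampler is typically constructed:

>>> from tinyms.data import DistributedSampler
>>>
>>> # assume a dataset of 100 samples split across 2 devices; this process is rank 0
>>> sampler = DistributedSampler(dataset_size=100, num_replicas=2, rank=0, shuffle=True)
>>> # the sampler yields the indices assigned to this rank and can be passed to a
>>> # source dataset that accepts a sampler argument, e.g. a GeneratorDataset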

class tinyms.data.PKSampler(num_val, num_class=None, shuffle=False, class_column='label', num_samples=None)[源代码]

Samples K elements for each P class in the dataset.

参数
  • num_val (int) – Number of elements to sample for each class.

  • num_class (int, optional) – Number of classes to sample (default=None, sample all classes). This parameter is currently not supported.

  • shuffle (bool, optional) – If True, the class IDs are shuffled, otherwise it will not be shuffled (default=False).

  • class_column (str, optional) – Name of column with class labels for MindDataset (default=’label’).

  • num_samples (int, optional) – The number of samples to draw (default=None, which means sample all elements).

实际案例

>>> # creates a PKSampler that will get 3 samples from every class.
>>> sampler = ds.PKSampler(3)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
parse()[源代码]

Parse the sampler.

parse_for_minddataset()[源代码]

Parse the sampler for MindRecord.

class tinyms.data.RandomSampler(replacement=False, num_samples=None)[源代码]

Samples the elements randomly.

参数
  • replacement (bool, optional) – If True, put the sample ID back for the next draw (default=False).

  • num_samples (int, optional) – Number of elements to sample (default=None, which means sample all elements).

实际案例

>>> # creates a RandomSampler
>>> sampler = ds.RandomSampler()
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
引发
  • TypeError – If replacement is not a boolean value.

  • TypeError – If num_samples is not an integer value.

  • ValueError – If num_samples is a negative value.

parse()[源代码]

Parse the sampler.

parse_for_minddataset()[源代码]

Parse the sampler for MindRecord.

class tinyms.data.SequentialSampler(start_index=None, num_samples=None)[源代码]

Samples the dataset elements sequentially, which is equivalent to not using a sampler.

参数
  • start_index (int, optional) – Index to start sampling at. (default=None, start at first ID)

  • num_samples (int, optional) – Number of elements to sample (default=None, which means sample all elements).

实际案例

>>> # creates a SequentialSampler
>>> sampler = ds.SequentialSampler()
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
引发
  • TypeError – If start_index is not an integer value.

  • TypeError – If num_samples is not an integer value.

  • RuntimeError – If start_index is a negative value.

  • ValueError – If num_samples is a negative value.

parse()[源代码]

Parse the sampler.

parse_for_minddataset()[源代码]

Parse the sampler for MindRecord.

class tinyms.data.SubsetRandomSampler(indices, num_samples=None)[源代码]

Samples the elements randomly from a sequence of indices.

参数
  • indices (Any iterable Python object except string) – A sequence of indices.

  • num_samples (int, optional) – Number of elements to sample (default=None, which means sample all elements).

实际案例

>>> indices = [0, 1, 2, 3, 7, 88, 119]
>>>
>>> # create a SubsetRandomSampler, will sample from the provided indices
>>> sampler = ds.SubsetRandomSampler(indices)
>>> data = ds.ImageFolderDataset(image_folder_dataset_dir, num_parallel_workers=8, sampler=sampler)
引发
  • TypeError – If type of indices element is not a number.

  • TypeError – If num_samples is not an integer value.

  • ValueError – If num_samples is a negative value.

parse()[源代码]

Parse the sampler.

parse_for_minddataset()[源代码]

Parse the sampler for MindRecord.

class tinyms.data.WeightedRandomSampler(weights, num_samples=None, replacement=True)[源代码]

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).

参数
  • weights (list[float, int]) – A sequence of weights, not necessarily summing up to 1.

  • num_samples (int, optional) – Number of elements to sample (default=None, which means sample all elements).

  • replacement (bool) – If True, put the sample ID back for the next draw (default=True).

实际案例

>>> weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
>>>
>>> # creates a WeightedRandomSampler that will sample 4 elements without replacement
>>> sampler = ds.WeightedRandomSampler(weights, 4)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
引发
  • TypeError – If type of weights element is not a number.

  • TypeError – If num_samples is not an integer value.

  • TypeError – If replacement is not a boolean value.

  • RuntimeError – If weights is empty or all zero.

  • ValueError – If num_samples is a negative value.

parse()[源代码]

Parse the sampler.

class tinyms.data.SubsetSampler(indices, num_samples=None)[源代码]

Samples the elements from a sequence of indices.

参数
  • indices (Any iterable Python object except string) – A sequence of indices.

  • num_samples (int, optional) – Number of elements to sample (default=None, which means sample all elements).

实际案例

>>> indices = [0, 1, 2, 3, 4, 5]
>>>
>>> # creates a SubsetSampler, will sample from the provided indices
>>> sampler = ds.SubsetSampler(indices)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
引发
  • TypeError – If type of indices element is not a number.

  • TypeError – If num_samples is not an integer value.

  • ValueError – If num_samples is a negative value.

parse()[源代码]

Parse the sampler.

parse_for_minddataset()[源代码]

Parse the sampler for MindRecord.

class tinyms.data.DatasetCache(session_id, size=0, spilling=False, hostname=None, port=None, num_connections=None, prefetch_size=None)[源代码]

A client to interface with tensor caching service.

For details, please check Tutorial, Programming guide.

参数
  • session_id (int) – A user assigned session id for the current pipeline.

  • size (int, optional) – Size of the memory set aside for the row caching (default=0, which means unlimited, note that it might bring in the risk of running out of memory on the machine).

  • spilling (bool, optional) – Whether or not spilling to disk if out of memory (default=False).

  • hostname (str, optional) – Host name (default=None, use default hostname ‘127.0.0.1’).

  • port (int, optional) – Port to connect to server (default=None, use default port 50052).

  • num_connections (int, optional) – Number of tcp/ip connections (default=None, use default value 12).

  • prefetch_size (int, optional) – The size of the cache queue between operations (default=None, use default value 20).

实际案例

>>> import mindspore.dataset as ds
>>>
>>> # create a cache instance, in which session_id is generated from command line `cache_admin -g`
>>> some_cache = ds.DatasetCache(session_id=session_id, size=0)
>>>
>>> dataset_dir = "path/to/imagefolder_directory"
>>> ds1 = ds.ImageFolderDataset(dataset_dir, cache=some_cache)
get_stat()[源代码]

Get the statistics from a cache.

class tinyms.data.DSCallback(step_size=1)[源代码]

Abstract base class used to build a dataset callback class.

参数

step_size (int, optional) – The number of steps between calls to step_begin and step_end (Default=1).

实际案例

>>> from mindspore.dataset import DSCallback
>>>
>>> class PrintInfo(DSCallback):
...     def ds_epoch_end(self, ds_run_context):
...         print(ds_run_context.cur_epoch_num)
...         print(ds_run_context.cur_step_num)
>>>
>>> # dataset is an instance of Dataset object
>>> dataset = dataset.map(operations=op, callbacks=PrintInfo())
create_runtime_obj()[源代码]

Creates a runtime (C++) object from the callback methods defined by the user.

返回

_c_dataengine.PyDSCallback.

ds_begin(ds_run_context)[源代码]

Called before the data pipeline is started.

参数

ds_run_context (RunContext) – Include some information of the pipeline.

ds_epoch_begin(ds_run_context)[源代码]

Called before a new epoch is started.

参数

ds_run_context (RunContext) – Include some information of the pipeline.

ds_epoch_end(ds_run_context)[源代码]

Called after an epoch is finished.

参数

ds_run_context (RunContext) – Include some information of the pipeline.

ds_step_begin(ds_run_context)[源代码]

Called before each step start.

参数

ds_run_context (RunContext) – Include some information of the pipeline.

ds_step_end(ds_run_context)[源代码]

Called after each step finished.

参数

ds_run_context (RunContext) – Include some information of the pipeline.

class tinyms.data.Schema(schema_file=None)[源代码]

Class to represent a schema of a dataset.

参数

schema_file (str) – Path of the schema file (default=None).

返回

Schema object, schema info about dataset.

引发

RuntimeError – If schema file failed to load.

实际案例

>>> from mindspore import dtype as mstype
>>>
>>> # Create schema; specify column name, mindspore.dtype and shape of the column
>>> schema = ds.Schema()
>>> schema.add_column(name='col1', de_type=mstype.int64, shape=[2])
add_column(name, de_type, shape=None)[源代码]

Add new column to the schema.

参数
  • name (str) – The name of the column.

  • de_type (str) – Data type of the column.

  • shape (list[int], optional) – Shape of the column (default=None, [-1] which is an unknown shape of rank 1).

引发

ValueError – If column type is unknown.

from_json(json_obj)[源代码]

Get the schema from a JSON object.

参数

json_obj (dictionary) – Parsed JSON object.

parse_columns(columns)[源代码]

Parse the columns and add them to self.

参数

columns (Union[dict, list[dict], tuple[dict]]) –

Dataset attribute information, decoded from schema file.

  • list[dict], ‘name’ and ‘type’ must be in keys, ‘shape’ optional.

  • dict, columns.keys() as name, columns.values() is dict, and ‘type’ inside, ‘shape’ optional.

实际案例

>>> schema = Schema()
>>> columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
>>>             {'name': 'label', 'type': 'int8', 'shape': [1]}]
>>> schema.parse_columns(columns1)
>>> columns2 = {'image': {'shape': [3, 3], 'type': 'int8'}, 'label': {'shape': [1], 'type': 'int8'}}
>>> schema.parse_columns(columns2)
to_json()[源代码]

Get a JSON string of the schema.

返回

str, JSON string of the schema.
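
As a quick illustration (a minimal sketch reusing the add_column example above), a schema built in Python can be dumped to its JSON form:

>>> from mindspore import dtype as mstype
>>>
>>> schema = ds.Schema()
>>> schema.add_column(name='col1', de_type=mstype.int64, shape=[2])
>>> json_str = schema.to_json()
>>> print(json_str)  # JSON string describing the column added above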

class tinyms.data.WaitedDSCallback(step_size=1)[源代码]

Abstract base class used to build a dataset callback class that is synchronized with the training callback.

This class can be used to execute a user defined logic right after the previous step or epoch. For example, one augmentation needs the loss from the previous trained epoch to update some of its parameters.

参数

step_size (int, optional) – The number of rows in each step. Usually the step size will be equal to the batch size (Default=1).

实际案例

>>> from mindspore.dataset import WaitedDSCallback
>>>
>>> my_cb = WaitedDSCallback(32)
>>> # dataset is an instance of Dataset object
>>> dataset = dataset.map(operations=AugOp(), callbacks=my_cb)
>>> dataset = dataset.batch(32)
>>> # define the model
>>> model.train(epochs, data, callbacks=[my_cb])
create_runtime_obj()[源代码]

Creates a runtime (C++) object from the callback methods defined by the user. This method is internal.

返回

_c_dataengine.PyDSCallback.

ds_epoch_begin(ds_run_context)[源代码]

Internal method, do not call/override. Defines ds_epoch_begin of DSCallback to wait for MS epoch_end callback.

参数

ds_run_context – Include some information of the pipeline.

ds_step_begin(ds_run_context)[源代码]

Internal method, do not call/override. Defines ds_step_begin of DSCallback to wait for MS step_end callback.

参数

ds_run_context – Include some information of the pipeline.

end(run_context)[源代码]

Internal method, release the wait if training is ended.

参数

run_context – Include some information of the model.

epoch_end(run_context)[源代码]

Internal method, do not call/override. Defines epoch_end of Callback to release the wait in ds_epoch_begin.

参数

run_context – Include some information of the model.

step_end(run_context)[源代码]

Internal method, do not call/override. Defines step_end of Callback to release the wait in ds_step_begin.

参数

run_context – Include some information of the model.

sync_epoch_begin(train_run_context, ds_run_context)[源代码]

Called before a new dataset epoch is started and after the previous training epoch is ended.

参数
  • train_run_context – Include some information of the model with feedback from the previous epoch.

  • ds_run_context – Include some information of the dataset pipeline.

sync_step_begin(train_run_context, ds_run_context)[源代码]

Called before a new dataset step is started and after the previous training step is ended.

参数
  • train_run_context – Include some information of the model with feedback from the previous step.

  • ds_run_context – Include some information of the dataset pipeline.
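
The class example above only attaches a plain WaitedDSCallback; to actually consume feedback from training, sync_epoch_begin (or sync_step_begin) can be overridden. The following is a minimal sketch; update_from_feedback is a hypothetical hook on the augmentation operation, not part of this API:

>>> from mindspore.dataset import WaitedDSCallback
>>>
>>> class SyncAugCallback(WaitedDSCallback):
...     def __init__(self, aug_op, step_size=32):
...         super().__init__(step_size)
...         self.aug_op = aug_op
...
...     def sync_epoch_begin(self, train_run_context, ds_run_context):
...         # train_run_context carries feedback from the previous training epoch
...         if train_run_context is not None:
...             self.aug_op.update_from_feedback(train_run_context)  # hypothetical hook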

tinyms.data.compare(pipeline1, pipeline2)[源代码]

Compare if two dataset pipelines are the same.

参数
  • pipeline1 (Dataset) – a dataset pipeline.

  • pipeline2 (Dataset) – a dataset pipeline.

返回

Whether pipeline1 is equal to pipeline2.

实际案例

>>> pipeline1 = ds.MnistDataset(mnist_dataset_dir, 100)
>>> pipeline2 = ds.Cifar10Dataset(cifar_dataset_dir, 100)
>>> ds.compare(pipeline1, pipeline2)
tinyms.data.deserialize(input_dict=None, json_filepath=None)[源代码]

Construct dataset pipeline from a JSON file produced by de.serialize().

注解

Currently, deserialization of Python functions in the map operator is not supported.

参数
  • input_dict (dict) – A Python dictionary containing a serialized dataset graph.

  • json_filepath (str) – A path to the JSON file.

返回

de.Dataset or None if error occurs.

引发

OSError – Can not open the JSON file.

实际案例

>>> dataset = ds.MnistDataset(mnist_dataset_dir, 100)
>>> one_hot_encode = c_transforms.OneHot(10)  # num_classes is input argument
>>> dataset = dataset.map(operations=one_hot_encode, input_columns="label")
>>> dataset = dataset.batch(batch_size=10, drop_remainder=True)
>>> # Use case 1: to/from JSON file
>>> ds.engine.serialize(dataset, json_filepath="/path/to/mnist_dataset_pipeline.json")
>>> dataset = ds.engine.deserialize(json_filepath="/path/to/mnist_dataset_pipeline.json")
>>> # Use case 2: to/from Python dictionary
>>> serialized_data = ds.engine.serialize(dataset)
>>> dataset = ds.engine.deserialize(input_dict=serialized_data)
tinyms.data.serialize(dataset, json_filepath='')[源代码]

Serialize dataset pipeline into a JSON file.

注解

Currently, some Python objects cannot be serialized. For Python functions used in the map operator, de.serialize will only record the function name.

参数
  • dataset (Dataset) – The starting node.

  • json_filepath (str) – The filepath where a serialized JSON file will be generated.

返回

Dict, The dictionary contains the serialized dataset graph.

引发

OSError – Can not open a file

实际案例

>>> dataset = ds.MnistDataset(mnist_dataset_dir, 100)
>>> one_hot_encode = c_transforms.OneHot(10)  # num_classes is input argument
>>> dataset = dataset.map(operations=one_hot_encode, input_columns="label")
>>> dataset = dataset.batch(batch_size=10, drop_remainder=True)
>>> # serialize it to JSON file
>>> ds.engine.serialize(dataset, json_filepath="/path/to/mnist_dataset_pipeline.json")
>>> serialized_data = ds.engine.serialize(dataset)  # serialize it to Python dict
tinyms.data.show(dataset, indentation=2)[源代码]

Write the dataset pipeline graph to logger.info file.

参数
  • dataset (Dataset) – The starting node.

  • indentation (int, optional) – The indentation used by the JSON print. Do not indent if indentation is None.

实际案例

>>> dataset = ds.MnistDataset(mnist_dataset_dir, 100)
>>> one_hot_encode = c_transforms.OneHot(10)
>>> dataset = dataset.map(operations=one_hot_encode, input_columns="label")
>>> dataset = dataset.batch(batch_size=10, drop_remainder=True)
>>> ds.show(dataset)
tinyms.data.zip(datasets)[源代码]

Zip the datasets in the input tuple of datasets.

参数

datasets (tuple of class Dataset) – A tuple of datasets to be zipped together. The number of datasets must be more than 1.

返回

ZipDataset, dataset zipped.

实际案例

>>> # Create a dataset which is the combination of dataset_1 and dataset_2
>>> dataset = ds.zip((dataset_1, dataset_2))
class tinyms.data.FileWriter(file_name, shard_num=1)[源代码]

Class to write user defined raw data into MindRecord files.

注解

After the MindRecord file is generated, if the file name is changed, the file may fail to be read.

参数
  • file_name (str) – File name of MindRecord file.

  • shard_num (int, optional) – The number of MindRecord files. Default: 1. It should be between [1, 1000].

引发

ParamValueError – If file_name or shard_num is invalid.

实际案例

>>> from mindspore.mindrecord import FileWriter
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> indexes = ["file_name", "label"]
>>> data = [{"file_name": "1.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"},
...         {"file_name": "2.jpg", "label": 56,
...          "data": b"\xe6\xda\xd1\xae\x07\xb8>\xd4\x00\xf8\x129\x15\xd9\xf2q\xc0\xa2\x91YFUO\x1dsE1"},
...         {"file_name": "3.jpg", "label": 99,
...          "data": b"\xaf\xafU<\xb8|6\xbd}\xc1\x99[\xeaj+\x8f\x84\xd3\xcc\xa0,i\xbb\xb9-\xcdz\xecp{T\xb1"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> writer.add_schema(schema_json, "test_schema")
0
>>> writer.add_index(indexes)
MSRStatus.SUCCESS
>>> writer.write_raw_data(data)
MSRStatus.SUCCESS
>>> writer.commit()
MSRStatus.SUCCESS
add_index(index_fields)[源代码]

Select index fields from schema to accelerate reading.

注解

The index fields should be of primitive types, e.g. int/float/str. If the function is not called, the fields of the primitive type in the schema are set as indexes by default.

Please refer to the Examples of class: mindspore.mindrecord.FileWriter.

参数

index_fields (list[str]) – fields from schema.

返回

MSRStatus, SUCCESS or FAILED.

引发
  • ParamTypeError – If index field is invalid.

  • MRMDefineIndexError – If index field is not primitive type.

  • MRMAddIndexError – If failed to add index field.

  • MRMGetMetaError – If the schema is not set or failed to get meta.

add_schema(content, desc=None)[源代码]

The schema is added to describe the raw data to be written.

注解

Please refer to the Examples of class: mindspore.mindrecord.FileWriter.

参数
  • content (dict) – Dictionary of schema content.

  • desc (str, optional) – String of schema description, Default: None.

返回

int, schema id.

引发
  • MRMInvalidSchemaError – If schema is invalid.

  • MRMBuildSchemaError – If failed to build schema.

  • MRMAddSchemaError – If failed to add schema.

commit()[源代码]

Flush data in memory to disk and generate the corresponding database files.

注解

Please refer to the Examples of class: mindspore.mindrecord.FileWriter.

返回

MSRStatus, SUCCESS or FAILED.

引发
  • MRMOpenError – If failed to open MindRecord file.

  • MRMSetHeaderError – If failed to set header.

  • MRMIndexGeneratorError – If failed to create index generator.

  • MRMGenerateIndexError – If failed to write to database.

  • MRMCommitError – If failed to flush data to disk.

open_and_set_header()[源代码]

Open the writer and set the header. This function is only used for parallel writing and must be called before write_raw_data.

返回

MSRStatus, SUCCESS or FAILED.

引发
  • MRMOpenError – If failed to open MindRecord file.

  • MRMSetHeaderError – If failed to set header.
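
As a minimal sketch of the parallel-writing flow (the file name, schema and data below are illustrative, and the exact call order is an assumption based on the description above):

>>> from mindspore.mindrecord import FileWriter
>>>
>>> writer = FileWriter(file_name="test_parallel.mindrecord", shard_num=4)
>>> writer.add_schema({"label": {"type": "int32"}}, "test_schema")
>>> writer.open_and_set_header()  # open the writer and set the header before parallel writing
>>> writer.write_raw_data([{"label": i} for i in range(8)], parallel_writer=True)
>>> writer.commit()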

classmethod open_for_append(file_name)[源代码]

Open MindRecord file and get ready to append data.

参数

file_name (str) – String of MindRecord file name.

返回

FileWriter, file writer object for the opened MindRecord file.

引发
  • ParamValueError – If file_name is invalid.

  • FileNameError – If path contains invalid characters.

  • MRMOpenError – If failed to open MindRecord file.

  • MRMOpenForAppendError – If failed to open file for appending data.

实际案例

>>> from mindspore.mindrecord import FileWriter
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> data = [{"file_name": "1.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> writer.add_schema(schema_json, "test_schema")
0
>>> writer.write_raw_data(data)
MSRStatus.SUCCESS
>>> writer.commit()
MSRStatus.SUCCESS
>>> write_append = FileWriter.open_for_append("test.mindrecord")
>>> write_append.write_raw_data(data)
MSRStatus.SUCCESS
>>> write_append.commit()
MSRStatus.SUCCESS
set_header_size(header_size)[源代码]

Set the size of header which contains shard information, schema information, page meta information, etc. The larger a header, the more data the MindRecord file can store. If the size of header is larger than the default size (16MB), users need to call the API to set a proper size.

参数

header_size (int) – Size of the header, between 16*1024 (16KB) and 128*1024*1024 (128MB) (default=16MB).

返回

MSRStatus, SUCCESS or FAILED.

引发

MRMInvalidHeaderSizeError – If failed to set header size.

实际案例

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> writer.set_header_size(1 << 25) # 32MB
MSRStatus.SUCCESS
set_page_size(page_size)[源代码]

Set the size of page that represents the area where data is stored, and the areas are divided into two types: raw page and blob page. The larger a page, the more data the page can store. If the size of a sample is larger than the default size (32MB), users need to call the API to set a proper size.

参数

page_size (int) – Size of a page, between 32*1024 (32KB) and 256*1024*1024 (256MB) (default=32MB).

返回

MSRStatus, SUCCESS or FAILED.

引发

MRMInvalidPageSizeError – If failed to set page size.

实际案例

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> writer.set_page_size(1 << 26) # 64MB
MSRStatus.SUCCESS
write_raw_data(raw_data, parallel_writer=False)[源代码]

Convert raw data into a series of consecutive MindRecord files after the raw data is verified against the schema.

注解

Please refer to the Examples of class: mindspore.mindrecord.FileWriter.

参数
  • raw_data (list[dict]) – List of raw data.

  • parallel_writer (bool, optional) – Write raw data in parallel if it equals to True. Default: False.

返回

MSRStatus, SUCCESS or FAILED.

引发
  • ParamTypeError – If index field is invalid.

  • MRMOpenError – If failed to open MindRecord file.

  • MRMValidateDataError – If data does not match blob fields.

  • MRMSetHeaderError – If failed to set header.

  • MRMWriteDatasetError – If failed to write dataset.

class tinyms.data.FileReader(file_name, num_consumer=4, columns=None, operator=None)[源代码]

Class to read MindRecord files.

注解

If file_name is a filename string, it tries to load all MindRecord files generated in a conversion, and throws an exception if a MindRecord file is missing. If file_name is a filename list, only the MindRecord files in the list are loaded.

参数
  • file_name (str, list[str]) – One of MindRecord file or a file list.

  • num_consumer (int, optional) – Number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

  • columns (list[str], optional) – A list of fields where corresponding data would be read. Default: None.

  • operator (int, optional) – Reserved parameter for operators. Default: None.

引发

ParamValueError – If file_name, num_consumer or columns is invalid.

close()[源代码]

Stop reader worker and close File.

get_next()[源代码]

Yield a batch of data according to columns at a time.

生成器

Dict – a batch whose keys are the same as columns.

引发

MRMUnsupportedSchemaError – If schema is invalid.
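
No example is given for FileReader; a minimal sketch (the file name and column names mirror the FileWriter example above and are placeholders) could look like this:

>>> from mindspore.mindrecord import FileReader
>>>
>>> reader = FileReader(file_name="test.mindrecord", num_consumer=4, columns=["file_name", "label"])
>>> for item in reader.get_next():
...     print(item["label"])
>>> reader.close()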

class tinyms.data.MindPage(file_name, num_consumer=4)[源代码]

Class to read MindRecord files in pagination.

参数
  • file_name (str) – One of MindRecord files or a file list.

  • num_consumer (int, optional) – The number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

引发
  • ParamValueError – If file_name, num_consumer or columns is invalid.

  • MRMInitSegmentError – If failed to initialize ShardSegment.

property candidate_fields

Return candidate category fields.

返回

list[str], by which data could be grouped.

property category_field

Getter function for category fields.

返回

list[str], by which data could be grouped.

get_category_fields()[源代码]

Return candidate category fields.

返回

list[str], by which data could be grouped.

read_at_page_by_id(category_id, page, num_row)[源代码]

Query by category id in pagination.

参数
  • category_id (int) – Category id, referred to the return of read_category_info.

  • page (int) – Index of page.

  • num_row (int) – Number of rows in a page.

返回

list[dict], data queried by category id.

引发
  • ParamValueError – If any parameter is invalid.

  • MRMFetchDataError – If failed to fetch data by category.

  • MRMUnsupportedSchemaError – If schema is invalid.

read_at_page_by_name(category_name, page, num_row)[源代码]

Query by category name in pagination.

参数
  • category_name (str) – String of category field’s value, referred to the return of read_category_info.

  • page (int) – Index of page.

  • num_row (int) – Number of rows in a page.

返回

list[dict], data queried by category name.

read_category_info()[源代码]

Return category information when data is grouped by indicated category field.

返回

str, description of group information.

引发

MRMReadCategoryInfoError – If failed to read category information.

set_category_field(category_field)[源代码]

Set category field for reading.

注解

Should be a candidate category field.

参数

category_field (str) – String of category field name.

返回

MSRStatus, SUCCESS or FAILED.
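
A minimal sketch of paginated reading (the file name and the "label" category field are placeholders for whatever fields exist in your MindRecord file):

>>> from mindspore.mindrecord import MindPage
>>>
>>> mind_page = MindPage("test.mindrecord")
>>> fields = mind_page.candidate_fields            # fields by which data can be grouped
>>> mind_page.set_category_field("label")          # must be one of the candidate fields
>>> info = mind_page.read_category_info()          # string describing the groups
>>> rows = mind_page.read_at_page_by_id(category_id=0, page=0, num_row=2)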

class tinyms.data.Cifar10ToMR(source, destination)[源代码]

A class to transform from cifar10 to MindRecord.

注解

For details about Examples, please refer to Converting the CIFAR-10 Dataset.

参数
  • source (str) – the cifar10 directory to be transformed.

  • destination (str) – the MindRecord file path to transform into.

引发

ValueError – If source or destination is invalid.

run(fields=None)[源代码]

Execute transformation from cifar10 to MindRecord.

参数

fields (list[str], optional) – A list of index fields. Default: None.

返回

MSRStatus, whether cifar10 is successfully transformed to MindRecord.

transform(fields=None)[源代码]

Encapsulate the run function to ensure it exits normally.

参数

fields (list[str], optional) – A list of index fields. Default: None.

返回

MSRStatus, whether cifar10 is successfully transformed to MindRecord.
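
A minimal sketch of the conversion (both paths are placeholders, and using "label" as the index field is an assumption; see the linked conversion guide for the full workflow):

>>> from tinyms.data import Cifar10ToMR
>>>
>>> cifar10_transformer = Cifar10ToMR("/path/to/cifar10", "/path/to/cifar10.mindrecord")
>>> cifar10_transformer.transform(fields=["label"])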

class tinyms.data.Cifar100ToMR(source, destination)[源代码]

A class to transform from cifar100 to MindRecord.

注解

For details about Examples, please refer to Converting the CIFAR-10 Dataset.

参数
  • source (str) – the cifar100 directory to be transformed.

  • destination (str) – the MindRecord file path to transform into.

引发

ValueError – If source or destination is invalid.

run(fields=None)[源代码]

Execute transformation from cifar100 to MindRecord.

参数

fields (list[str]) – A list of index field, e.g.[“fine_label”, “coarse_label”].

返回

MSRStatus, whether cifar100 is successfully transformed to MindRecord.

transform(fields=None)[源代码]

Encapsulate the run function to ensure it exits normally.

参数

fields (list[str]) – A list of index field, e.g.[“fine_label”, “coarse_label”].

返回

MSRStatus, whether cifar100 is successfully transformed to MindRecord.

class tinyms.data.CsvToMR(source, destination, columns_list=None, partition_number=1)[源代码]

A class to transform from csv to MindRecord.

注解

For details about Examples, please refer to Converting CSV Dataset.

参数
  • source (str) – the file path of csv.

  • destination (str) – the MindRecord file path to transform into.

  • columns_list (list[str], optional) – A list of columns to be read. Default: None.

  • partition_number (int, optional) – partition size, Default: 1.

引发
  • ValueError – If source, destination, partition_number is invalid.

  • RuntimeError – If columns_list is invalid.

run()[源代码]

Execute transformation from csv to MindRecord.

返回

MSRStatus, whether csv is successfully transformed to MindRecord.

transform()[源代码]

Encapsulate the run function to ensure it exits normally.
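
A minimal sketch of the conversion (the paths and column names are placeholders):

>>> from tinyms.data import CsvToMR
>>>
>>> csv_transformer = CsvToMR("/path/to/data.csv", "/path/to/data.mindrecord",
...                           columns_list=["col1", "col2"], partition_number=1)
>>> csv_transformer.transform()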

class tinyms.data.ImageNetToMR(map_file, image_dir, destination, partition_number=1)[源代码]

A class to transform from imagenet to MindRecord.

注解

For details about Examples, please refer to Converting the ImageNet Dataset.

参数
  • map_file (str) –

    the map file that indicates label. The map file content should be like this:

    n02119789 0
    n02100735 1
    n02110185 2
    n02096294 3
    

  • image_dir (str) – image directory contains n02119789, n02100735, n02110185 and n02096294 directory.

  • destination (str) – the MindRecord file path to transform into.

  • partition_number (int, optional) – partition size. Default: 1.

引发

ValueError – If map_file, image_dir or destination is invalid.

run()[源代码]

Execute transformation from imagenet to MindRecord.

返回

MSRStatus, whether imagenet is successfully transformed to MindRecord.

transform()[源代码]

Encapsulate the run function to ensure it exits normally.
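
A minimal sketch of the conversion (all paths are placeholders; map_file follows the format shown above):

>>> from tinyms.data import ImageNetToMR
>>>
>>> imagenet_transformer = ImageNetToMR("/path/to/map_file.txt", "/path/to/images",
...                                     "/path/to/imagenet.mindrecord", partition_number=1)
>>> imagenet_transformer.transform()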

class tinyms.data.MnistToMR(source, destination, partition_number=1)[源代码]

A class to transform from Mnist to MindRecord.

参数
  • source (str) – directory that contains t10k-images-idx3-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and train-labels-idx1-ubyte.gz.

  • destination (str) – the MindRecord file directory to transform into.

  • partition_number (int, optional) – partition size. Default: 1.

引发

ValueError – If source, destination, partition_number is invalid.

run()[源代码]

Execute transformation from Mnist to MindRecord.

返回

MSRStatus, whether successfully written into MindRecord.

transform()[源代码]

Encapsulate the run function to ensure it exits normally.

class tinyms.data.TFRecordToMR(source, destination, feature_dict, bytes_fields=None)[源代码]

A class to transform from TFRecord to MindRecord.

注解

For details about Examples, please refer to Converting TFRecord Dataset.

参数
  • source (str) – the TFRecord file to be transformed.

  • destination (str) – the MindRecord file path to transform into.

  • feature_dict (dict) – a dictionary that states the feature type, and VarLenFeature is not supported.

  • bytes_fields (list, optional) – the bytes fields which are in feature_dict and can be images bytes.

引发
  • ValueError – If parameter is invalid.

  • Exception – when tensorflow module is not found or version is not correct.

run()[源代码]

Execute transformation from TFRecord to MindRecord.

返回

MSRStatus, whether TFRecord is successfully transformed to MindRecord.

tfrecord_iterator()[源代码]

Yield a dictionary whose keys are fields in schema.

生成器

dict, data dictionary whose keys are the same as columns.

tfrecord_iterator_oldversion()[源代码]

Yield a dict whose keys are fields in the schema and whose values are the data. This function is for old versions of TensorFlow whose version number is < 2.1.0.

生成器

dict, data dictionary whose keys are the same as columns.

transform()[源代码]

Encapsulate the run function to ensure it exits normally.
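
A minimal sketch of the conversion (the paths and field names are placeholders; the feature_dict below follows the TensorFlow fixed-length feature-spec style, which is an assumption about the expected input here):

>>> import tensorflow as tf
>>> from tinyms.data import TFRecordToMR
>>>
>>> feature_dict = {"file_name": tf.io.FixedLenFeature([], tf.string),
...                 "label": tf.io.FixedLenFeature([], tf.int64),
...                 "data": tf.io.FixedLenFeature([], tf.string)}
>>> tfrecord_transformer = TFRecordToMR("/path/to/data.tfrecord", "/path/to/data.mindrecord",
...                                     feature_dict, bytes_fields=["data"])
>>> tfrecord_transformer.transform()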

tinyms.data.download_dataset(dataset_name, local_path='.')[源代码]

This function is defined to easily download any public dataset without specifying many details.

参数
  • dataset_name (str) – The official name of dataset, currently supports mnist, cifar10 and cifar100.

  • local_path (str) – Specifies the local location of the dataset to be downloaded. Default: '.'.

返回

str, the source location of dataset downloaded.

实际案例

>>> from tinyms.data import download_dataset
>>>
>>> ds_path = download_dataset('mnist')
tinyms.data.generate_image_list(dir_path, max_dataset_size=inf)[源代码]

Traverse the directory to generate a list of image paths.

参数
  • dir_path (str) – image directory.

  • max_dataset_size (int) – Maximum number of return image paths.

返回

Image path list.
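
A minimal sketch (the directory path is a placeholder):

>>> from tinyms.data import generate_image_list
>>>
>>> image_paths = generate_image_list("/path/to/testA", max_dataset_size=200)
>>> len(image_paths)  # at most 200 paths are returned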

tinyms.data.load_resized_img(path, width=256, height=256)[源代码]

Load an image in RGB mode and resize it to (width, height), default (256, 256).

参数
  • path (str) – image path.

  • width (int) – image width, default: 256.

  • height (int) – image height, default: 256.

返回

PIL image class.

tinyms.data.load_img(path)[源代码]

Load an image in RGB mode.

参数

path (str) – image path.

返回

PIL image class.
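
A minimal sketch showing both helpers together (the image path is a placeholder):

>>> from tinyms.data import load_img, load_resized_img
>>>
>>> img = load_img("/path/to/image.jpg")              # PIL image in RGB mode
>>> resized = load_resized_img("/path/to/image.jpg")  # resized to (256, 256) by default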