tinyms.data

class tinyms.data.UnalignedDataset(dataset_path, phase, max_dataset_size=inf, shuffle=True)[source]

This dataset class can load unaligned/unpaired datasets.

Parameters:
  • dataset_path (str) – The path of images (should have subfolders trainA, trainB, testA, testB, etc.).

  • phase (str) – ‘train’ or ‘test’. It requires two directories in dataset_path, such as trainA and trainB, to host training images from domain A (‘{dataset_path}/trainA’) and from domain B (‘{dataset_path}/trainB’) respectively.

  • max_dataset_size (int) – Maximum number of image paths to return.

Returns:

Two lists of image paths, one for each domain (A and B).
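
Examples

A minimal usage sketch; the dataset path is illustrative and assumes the trainA/trainB layout described above:

>>> from tinyms.data import UnalignedDataset
>>>
>>> unaligned_ds = UnalignedDataset('./datasets/facades', 'train', max_dataset_size=300)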

class tinyms.data.GanImageFolderDataset(dataset_path, max_dataset_size=inf)[source]

This dataset class can load images from an image folder.

Parameters:
  • dataset_path (str) – ‘{dataset_path}/testA’, ‘{dataset_path}/testB’, etc.

  • max_dataset_size (int) – Maximum number of image paths to return.

Returns:

A list of image paths.
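
Examples

A minimal usage sketch; the dataset path is illustrative and assumes a testA (or testB) subfolder as described above:

>>> from tinyms.data import GanImageFolderDataset
>>>
>>> gan_ds = GanImageFolderDataset('./datasets/facades', max_dataset_size=200)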

class tinyms.data.ImdbDataset(imdb_path, glove_path, embed_size=300)[source]

Parse aclImdb data into features and labels. Pipeline: sentence -> tokenized -> encoded -> padded -> features.

Parameters:
  • imdb_path (str) – The path where the aclImdb dataset is stored.

  • glove_path (str) – The path where the GloVe embeddings are stored.

  • embed_size (int) – The embedding vector size. Default: 300.

Examples

>>> from tinyms.data import ImdbDataset
>>>
>>> imdb_ds = ImdbDataset('./aclImdb', './glove')
convert_to_mindrecord(preprocess_path, shard_num=1)[source]

Convert the IMDB dataset to a MindRecord dataset.

get_datas(seg)[source]

Get features, labels, and embedding weights via gensim.

parse()[source]

Parse the IMDB data into memory.
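
Putting the methods above together, a hedged sketch; the preprocess path is illustrative, seg is assumed to take values such as 'train' or 'test', and the three-way return unpacking follows the get_datas docstring above:

>>> from tinyms.data import ImdbDataset
>>>
>>> imdb_ds = ImdbDataset('./aclImdb', './glove')
>>> imdb_ds.parse()  # parse the raw aclImdb text into memory
>>> features, labels, weight = imdb_ds.get_datas('train')  # gensim-backed features, labels and weights
>>> imdb_ds.convert_to_mindrecord('./preprocess', shard_num=1)  # write MindRecord files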

class tinyms.data.BertDataset(data_dir, schema_dir=None, shuffle=True, num_parallel_workers=None)[source]

This dataset class can load BERT pre-training data (TFRecord files) from a data folder.

Parameters:
  • data_dir (str) – ‘{data_dir}/result1.tfrecord’, ‘{data_dir}/result2.tfrecord’, etc.

  • num_parallel_workers (int) – The number of concurrent workers. Default: None.

  • shuffle (Union[bool, Shuffle level], optional) –

    Perform reshuffling of the data every epoch. Default: Shuffle.GLOBAL. If shuffle is False, no shuffling will be performed. If shuffle is True, the behavior is the same as setting shuffle to Shuffle.GLOBAL. Otherwise, there are two levels of shuffling:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • schema_dir (Union[str, Schema], optional) – Path to the JSON schema file or schema object. Default: None. If the schema is not provided, the metadata from the TFData file is considered the schema.

Examples

>>> from tinyms.data import BertDataset
>>>
>>> bert_ds = BertDataset('data')
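
A variant sketch using file-level shuffling; this assumes mindspore.dataset.Shuffle is importable, as it is for MindSpore's TFRecord loaders:

>>> import mindspore.dataset as msds
>>> from tinyms.data import BertDataset
>>>
>>> # Shuffle only the order of the .tfrecord files, not the samples within them
>>> bert_ds = BertDataset('data', shuffle=msds.Shuffle.FILES)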
class tinyms.data.KaggleDisplayAdvertisingDataset(data_dir, num_parallel_workers=None, shuffle=True)[source]

Parse the Kaggle Display Advertising Challenge (Criteo) dataset into features and labels.

Parameters:
  • data_dir (str) – The path where the uncompressed dataset stored.

  • num_parallel_workers (int) – The number of concurrent workers. Default: None.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

Examples

>>> from tinyms.data import KaggleDisplayAdvertisingDataset
>>>
>>> kaggle_display_advertising_ds = KaggleDisplayAdvertisingDataset('data')
>>> kaggle_display_advertising_ds.stats_data()
>>> kaggle_display_advertising_ds.convert_to_mindrecord()
>>> train_ds = kaggle_display_advertising_ds.load_mindreocrd_dataset(usage='train')
>>> test_ds = kaggle_display_advertising_ds.load_mindreocrd_dataset(usage='test')
load_mindreocrd_dataset(usage='train', batch_size=1000)[source]

Load the MindRecord dataset.

Parameters:
  • usage (str) – Dataset mode. Default: ‘train’.

  • batch_size (int) – Batch size. Default: 1000.

Returns:

MindDataset
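
Continuing the class example above, a hedged sketch of loading with a non-default batch size:

>>> train_ds = kaggle_display_advertising_ds.load_mindreocrd_dataset(usage='train', batch_size=16000)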

stats_data()[source]

Collect statistics from the dataset.

class tinyms.data.DistributedSampler(dataset_size, num_replicas=None, rank=None, shuffle=True)[source]

Distributed sampler.

Parameters:
  • dataset_size (int) – Length of the dataset.

  • num_replicas (int) – Number of replicas.

  • rank (int) – Rank of the current device.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

Returns:

DistributedSampler instance.
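
Examples

A minimal sketch; the sizes and rank are illustrative, and iterating the sampler to obtain shard indices is assumed from its sampler role:

>>> from tinyms.data import DistributedSampler
>>>
>>> # Shard a 10000-sample dataset across 2 replicas; this process is rank 0
>>> sampler = DistributedSampler(10000, num_replicas=2, rank=0, shuffle=True)
>>> indices = list(sampler)  # indices assigned to this shard (assumes the sampler is iterable)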

class tinyms.data.Caltech101Dataset(dataset_dir, target_type=None, num_samples=None, num_parallel_workers=1, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None)[source]

Caltech 101 dataset.

The columns of the generated dataset depend on the value of target_type .

  • When target_type is ‘category’, the columns are [image, category] .

  • When target_type is ‘annotation’, the columns are [image, annotation] .

  • When target_type is ‘all’, the columns are [image, category, annotation] .

The tensor of column image is of the uint8 type. The tensor of column category is of the uint32 type. The tensor of column annotation is a 2-dimensional ndarray that stores the contour of the image and consists of a series of points.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset. This root directory contains two subdirectories, one is called 101_ObjectCategories, which stores images, and the other is called Annotations, which stores annotations.

  • target_type (str, optional) – Target of the image. If target_type is ‘category’, return category represents the target class. If target_type is ‘annotation’, return annotation. If target_type is ‘all’, return category and annotation. Default: None, means ‘category’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker subprocesses to read the data. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Whether or not to decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If target_type is not ‘category’, ‘annotation’ or ‘all’.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> caltech101_dataset_directory = "/path/to/caltech101_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in caltech101_dataset_directory with 8 threads
>>> dataset = ds.Caltech101Dataset(dataset_dir=caltech101_dataset_directory, num_parallel_workers=8)
>>>
>>> # 2) Read all samples (image files) with the target_type "annotation"
>>> dataset = ds.Caltech101Dataset(dataset_dir=caltech101_dataset_directory, target_type="annotation")

About Caltech101Dataset:

Pictures of objects belonging to 101 categories, with about 40 to 800 images per category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels. The dataset officially provides the contour data of each object in each picture, which serves as the annotation.

Here is the original Caltech101 dataset structure. You can unzip the dataset files into the following directory structure, which can be read by MindSpore’s API.

.
└── caltech101_dataset_directory
    ├── 101_ObjectCategories
    │    ├── Faces
    │    │    ├── image_0001.jpg
    │    │    ├── image_0002.jpg
    │    │    ...
    │    ├── Faces_easy
    │    │    ├── image_0001.jpg
    │    │    ├── image_0002.jpg
    │    │    ...
    │    ├── ...
    └── Annotations
         ├── Airplanes_Side_2
         │    ├── annotation_0001.mat
         │    ├── annotation_0002.mat
         │    ...
         ├── Faces_2
         │    ├── annotation_0001.mat
         │    ├── annotation_0002.mat
         │    ...
         ├── ...

Citation:

@article{FeiFei2004LearningGV,
author    = {Li Fei-Fei and Rob Fergus and Pietro Perona},
title     = {Learning Generative Visual Models from Few Training Examples:
            An Incremental Bayesian Approach Tested on 101 Object Categories},
journal   = {Computer Vision and Pattern Recognition Workshop},
year      = {2004},
url       = {https://data.caltech.edu/records/mzrjq-6wc02},
}
get_class_indexing()[source]

Get the class index.

Returns:

dict, a str-to-int mapping from label name to index.
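
Examples

A usage sketch mirroring the CocoDataset.get_class_indexing example later on this page:

>>> import tinyms.data as ds
>>>
>>> caltech101_dataset_directory = "/path/to/caltech101_dataset_directory"
>>> dataset = ds.Caltech101Dataset(dataset_dir=caltech101_dataset_directory)
>>> class_indexing = dataset.get_class_indexing()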

class tinyms.data.Caltech256Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Caltech 256 dataset.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Whether or not to decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) – Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> caltech256_dataset_dir = "/path/to/caltech256_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in caltech256_dataset_dir with 8 threads
>>> dataset = ds.Caltech256Dataset(dataset_dir=caltech256_dataset_dir, num_parallel_workers=8)

About Caltech256Dataset:

Caltech-256 is an object recognition dataset containing 30,607 real-world images, of different sizes, spanning 257 classes (256 object classes and an additional clutter class). Each class is represented by at least 80 images. The dataset is a superset of the Caltech-101 dataset.

.
└── caltech256_dataset_directory
     ├── 001.ak47
     │    ├── 001_0001.jpg
     │    ├── 001_0002.jpg
     │    ...
     ├── 002.american-flag
     │    ├── 002_0001.jpg
     │    ├── 002_0002.jpg
     │    ...
     ├── 003.backpack
     │    ├── 003_0001.jpg
     │    ├── 003_0002.jpg
     │    ...
     ├── ...

Citation:

@article{griffin2007caltech,
title     = {Caltech-256 object category dataset},
added-at  = {2021-01-21T02:54:42.000+0100},
author    = {Griffin, Gregory and Holub, Alex and Perona, Pietro},
biburl    = {https://www.bibsonomy.org/bibtex/21f746f23ff0307826cca3e3be45f8de7/s364315},
interhash = {bfe1e648c1778c04baa60f23d1223375},
intrahash = {1f746f23ff0307826cca3e3be45f8de7},
publisher = {California Institute of Technology},
timestamp = {2021-01-21T02:54:42.000+0100},
year      = {2007}
}
class tinyms.data.CelebADataset(dataset_dir, num_parallel_workers=None, shuffle=None, usage='all', sampler=None, decode=False, extensions=None, num_samples=None, num_shards=None, shard_id=None, cache=None, decrypt=None)[source]

CelebA(CelebFaces Attributes) dataset.

Currently, only reading list_attr_celeba.txt, the attribute annotation file of the dataset, is supported. The generated dataset has two columns: [image, attr] . The tensor of column image is of the uint8 type. The tensor of column attr is of the uint32 type and one-hot encoded.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None.

  • usage (str, optional) – Specify the ‘train’, ‘valid’, ‘test’ part or ‘all’ parts of dataset. Default: ‘all’, will read all samples.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None.

  • decode (bool, optional) – Whether to decode the images after reading. Default: False.

  • extensions (list[str], optional) – List of file extensions to be included in the dataset. Default: None.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will include all images.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

  • decrypt (callable, optional) – Image decryption function, which accepts the path of the encrypted image file and returns the decrypted bytes data. Default: None, no decryption.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If usage is not ‘train’, ‘valid’, ‘test’ or ‘all’.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> celeba_dataset_dir = "/path/to/celeba_dataset_directory"
>>>
>>> # Read 5 samples from CelebA dataset
>>> dataset = ds.CelebADataset(dataset_dir=celeba_dataset_dir, usage='train', num_samples=5)
>>>
>>> # Note: In celeba dataset, each data dictionary owns keys "image" and "attr"

About CelebA dataset:

CelebFaces Attributes Dataset (CelebA) is a large-scale dataset with more than 200K celebrity images, each with 40 attribute annotations.

The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including

  • 10,177 number of identities,

  • 202,599 number of images,

  • 5 landmark locations and 40 binary attribute annotations per image.

The dataset can be employed as the training and test sets for the following computer vision tasks: attribute recognition, detection, and landmark (or facial part) localization.

Original CelebA dataset structure:

.
└── CelebA
     ├── README.md
     ├── Img
     │    ├── img_celeba.7z
     │    ├── img_align_celeba_png.7z
     │    └── img_align_celeba.zip
     ├── Eval
     │    └── list_eval_partition.txt
     └── Anno
          ├── list_landmarks_celeba.txt
          ├── list_landmarks_align_celeba.txt
          ├── list_bbox_celeba.txt
          ├── list_attr_celeba.txt
          └── identity_CelebA.txt

You can unzip the dataset files into the following structure and read by MindSpore’s API.

.
└── celeba_dataset_directory
    ├── list_attr_celeba.txt
    ├── 000001.jpg
    ├── 000002.jpg
    ├── 000003.jpg
    ├── ...

Citation:

@article{DBLP:journals/corr/LiuLWT14,
author        = {Ziwei Liu and Ping Luo and Xiaogang Wang and Xiaoou Tang},
title         = {Deep Learning Face Attributes in the Wild},
journal       = {CoRR},
volume        = {abs/1411.7766},
year          = {2014},
url           = {http://arxiv.org/abs/1411.7766},
archivePrefix = {arXiv},
eprint        = {1411.7766},
timestamp     = {Tue, 10 Dec 2019 15:37:26 +0100},
biburl        = {https://dblp.org/rec/journals/corr/LiuLWT14.bib},
bibsource     = {dblp computer science bibliography, https://dblp.org},
howpublished  = {http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html}
}
class tinyms.data.Cifar10Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

CIFAR-10 dataset.

Currently, this API only supports parsing the binary version of the CIFAR-10 files. The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’ . ‘train’ will read from 50,000 train samples, ‘test’ will read from 10,000 test samples, ‘all’ will read from all 60,000 samples. Default: None, all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If usage is not ‘train’, ‘test’ or ‘all’.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> cifar10_dataset_dir = "/path/to/cifar10_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR10 dataset in sequence
>>> dataset = ds.Cifar10Dataset(dataset_dir=cifar10_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR10 dataset
>>> dataset = ds.Cifar10Dataset(dataset_dir=cifar10_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from CIFAR10 dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.Cifar10Dataset(dataset_dir=cifar10_dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In CIFAR10 dataset, each dictionary has keys "image" and "label"

About CIFAR-10 dataset:

The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

Here is the original CIFAR-10 dataset structure. You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── cifar-10-batches-bin
     ├── data_batch_1.bin
     ├── data_batch_2.bin
     ├── data_batch_3.bin
     ├── data_batch_4.bin
     ├── data_batch_5.bin
     ├── test_batch.bin
     ├── readme.html
     └── batches.meta.txt

Citation:

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html}
}
class tinyms.data.Cifar100Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

CIFAR-100 dataset.

The generated dataset has three columns [image, coarse_label, fine_label] . The tensor of column image is of the uint8 type. The tensors of columns coarse_label and fine_label are each a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’ . ‘train’ will read from 50,000 train samples, ‘test’ will read from 10,000 test samples, ‘all’ will read from all 60,000 samples. Default: None, all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If usage is not ‘train’, ‘test’ or ‘all’.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> cifar100_dataset_dir = "/path/to/cifar100_dataset_directory"
>>>
>>> # 1) Get all samples from CIFAR100 dataset in sequence
>>> dataset = ds.Cifar100Dataset(dataset_dir=cifar100_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from CIFAR100 dataset
>>> dataset = ds.Cifar100Dataset(dataset_dir=cifar100_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # In CIFAR100 dataset, each dictionary has 3 keys: "image", "fine_label" and "coarse_label"

About CIFAR-100 dataset:

This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).

Here is the original CIFAR-100 dataset structure. You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── cifar-100-binary
    ├── train.bin
    ├── test.bin
    ├── fine_label_names.txt
    └── coarse_label_names.txt

Citation:

@techreport{Krizhevsky09,
author       = {Alex Krizhevsky},
title        = {Learning multiple layers of features from tiny images},
institution  = {},
year         = {2009},
howpublished = {http://www.cs.toronto.edu/~kriz/cifar.html}
}
class tinyms.data.CityscapesDataset(dataset_dir, usage='train', quality_mode='fine', task='instance', num_samples=None, num_parallel_workers=None, shuffle=None, decode=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Cityscapes dataset.

The generated dataset has two columns [image, task] . The tensor of column image is of the uint8 type. The tensor of column task is of the uint8 type if task is not ‘polygon’; otherwise it is a string tensor containing serialized JSON.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘test’, ‘val’ or ‘all’ if quality_mode is ‘fine’ otherwise ‘train’, ‘train_extra’, ‘val’ or ‘all’. Default: ‘train’.

  • quality_mode (str, optional) – Acceptable quality_modes include ‘fine’ or ‘coarse’. Default: ‘fine’.

  • task (str, optional) – Acceptable tasks include ‘instance’, ‘semantic’, ‘polygon’ or ‘color’. Default: ‘instance’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir is invalid or does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If task is invalid.

  • ValueError – If quality_mode is invalid.

  • ValueError – If usage is invalid.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> cityscapes_dataset_dir = "/path/to/cityscapes_dataset_directory"
>>>
>>> # 1) Get all samples from Cityscapes dataset in sequence
>>> dataset = ds.CityscapesDataset(dataset_dir=cityscapes_dataset_dir, task="instance", quality_mode="fine",
...                                usage="train", shuffle=False, num_parallel_workers=1)
>>>
>>> # 2) Randomly select 350 samples from Cityscapes dataset
>>> dataset = ds.CityscapesDataset(dataset_dir=cityscapes_dataset_dir, num_samples=350, shuffle=True,
...                                num_parallel_workers=1)
>>>
>>> # 3) Get samples from Cityscapes dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.CityscapesDataset(dataset_dir=cityscapes_dataset_dir, num_shards=2, shard_id=0,
...                                num_parallel_workers=1)
>>>
>>> # In Cityscapes dataset, each dictionary has keys "image" and "task"

About Cityscapes dataset:

The Cityscapes dataset consists of 5000 color images with high-quality dense pixel annotations and 19998 color images with coarser polygonal annotations, collected in 50 cities. There are 30 classes in this dataset, and the polygonal annotations include dense semantic segmentation as well as instance segmentation for vehicles and people.

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

Taking the quality_mode of fine as an example.

.
└── Cityscapes
     ├── leftImg8bit
     |    ├── train
     |    |    ├── aachen
     |    |    |    ├── aachen_000000_000019_leftImg8bit.png
     |    |    |    ├── aachen_000001_000019_leftImg8bit.png
     |    |    |    ├── ...
     |    |    ├── bochum
     |    |    |    ├── ...
     |    |    ├── ...
     |    ├── test
     |    |    ├── ...
     |    ├── val
     |    |    ├── ...
     └── gtFine
          ├── train
          |    ├── aachen
          |    |    ├── aachen_000000_000019_gtFine_color.png
          |    |    ├── aachen_000000_000019_gtFine_instanceIds.png
          |    |    ├── aachen_000000_000019_gtFine_labelIds.png
          |    |    ├── aachen_000000_000019_gtFine_polygons.json
          |    |    ├── aachen_000001_000019_gtFine_color.png
          |    |    ├── aachen_000001_000019_gtFine_instanceIds.png
          |    |    ├── aachen_000001_000019_gtFine_labelIds.png
          |    |    ├── aachen_000001_000019_gtFine_polygons.json
          |    |    ├── ...
          |    ├── bochum
          |    |    ├── ...
          |    ├── ...
          ├── test
          |    ├── ...
          └── val
               ├── ...

Citation:

@inproceedings{Cordts2016Cityscapes,
title       = {The Cityscapes Dataset for Semantic Urban Scene Understanding},
author      = {Cordts, Marius and Omran, Mohamed and Ramos, Sebastian and Rehfeld, Timo and Enzweiler,
                Markus and Benenson, Rodrigo and Franke, Uwe and Roth, Stefan and Schiele, Bernt},
booktitle   = {Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year        = {2016}
}
class tinyms.data.CocoDataset(dataset_dir, annotation_file, task='Detection', num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None, extra_metadata=False, decrypt=None)[source]

COCO(Common Objects in Context) dataset.

CocoDataset supports five kinds of tasks, which are Object Detection, Keypoint Detection, Stuff Segmentation, Panoptic Segmentation and Captioning of 2017 Train/Val/Test dataset.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • annotation_file (str) – Path to the annotation JSON file.

  • task (str, optional) – Set the task type for reading COCO data. Supported task types: ‘Detection’, ‘Stuff’, ‘Panoptic’, ‘Keypoint’ and ‘Captioning’. Default: ‘Detection’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

  • extra_metadata (bool, optional) – Flag to add extra meta-data to row. If True, an additional column will be output at the end [_meta-filename, dtype=string] . Default: False.

  • decrypt (callable, optional) – Image decryption function, which accepts the path of the encrypted image file and returns the decrypted bytes data. Default: None, no decryption.

The generated dataset has different output columns depending on the task setting:

task         Output columns
---------    -----------------------------
Detection    [image, dtype=uint8]
             [bbox, dtype=float32]
             [category_id, dtype=uint32]
             [iscrowd, dtype=uint32]
Stuff        [image, dtype=uint8]
             [segmentation, dtype=float32]
             [iscrowd, dtype=uint32]
Keypoint     [image, dtype=uint8]
             [keypoints, dtype=float32]
             [num_keypoints, dtype=uint32]
Panoptic     [image, dtype=uint8]
             [bbox, dtype=float32]
             [category_id, dtype=uint32]
             [iscrowd, dtype=uint32]
             [area, dtype=uint32]

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If parse JSON file failed.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If task is not in [‘Detection’, ‘Stuff’, ‘Panoptic’, ‘Keypoint’, ‘Captioning’].

  • ValueError – If annotation_file does not exist.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • Column ‘[_meta-filename, dtype=string]’ won’t be output unless an explicit rename dataset op is added to remove the prefix ‘_meta-’.

  • mindspore.dataset.PKSampler is not yet supported for the sampler parameter.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> coco_dataset_dir = "/path/to/coco_dataset_directory/images"
>>> coco_annotation_file = "/path/to/coco_dataset_directory/annotation_file"
>>>
>>> # 1) Read COCO data for Detection task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Detection')
>>>
>>> # 2) Read COCO data for Stuff task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Stuff')
>>>
>>> # 3) Read COCO data for Panoptic task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Panoptic')
>>>
>>> # 4) Read COCO data for Keypoint task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Keypoint')
>>>
>>> # 5) Read COCO data for Captioning task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Captioning')
>>>
>>> # In COCO dataset, each dictionary has keys "image" and "annotation"

About COCO dataset:

COCO (Microsoft Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset with several features: object segmentation, recognition in context, superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, and 250,000 people with keypoints. In contrast to the popular ImageNet dataset, COCO has fewer categories but more instances per category.

You can unzip the original COCO-2017 dataset files into this directory structure and read by MindSpore’s API.

.
└── coco_dataset_directory
     ├── train2017
     │    ├── 000000000009.jpg
     │    ├── 000000000025.jpg
     │    ├── ...
     ├── test2017
     │    ├── 000000000001.jpg
     │    ├── 000000058136.jpg
     │    ├── ...
     ├── val2017
     │    ├── 000000000139.jpg
     │    ├── 000000057027.jpg
     │    ├── ...
     └── annotations
          ├── captions_train2017.json
          ├── captions_val2017.json
          ├── instances_train2017.json
          ├── instances_val2017.json
          ├── person_keypoints_train2017.json
          └── person_keypoints_val2017.json

Citation:

@article{DBLP:journals/corr/LinMBHPRDZ14,
author        = {Tsung{-}Yi Lin and Michael Maire and Serge J. Belongie and
                Lubomir D. Bourdev and  Ross B. Girshick and James Hays and
                Pietro Perona and Deva Ramanan and Piotr Doll{\'{a}}r and C. Lawrence Zitnick},
title         = {Microsoft {COCO:} Common Objects in Context},
journal       = {CoRR},
volume        = {abs/1405.0312},
year          = {2014},
url           = {http://arxiv.org/abs/1405.0312},
archivePrefix = {arXiv},
eprint        = {1405.0312},
timestamp     = {Mon, 13 Aug 2018 16:48:13 +0200},
biburl        = {https://dblp.org/rec/journals/corr/LinMBHPRDZ14.bib},
bibsource     = {dblp computer science bibliography, https://dblp.org}
}
get_class_indexing()[source]

Get the class index.

Returns:

dict, a str-to-list<int> mapping from label name to index.

Examples

>>> import tinyms.data as ds
>>>
>>> coco_dataset_dir = "/path/to/coco_dataset_directory/images"
>>> coco_annotation_file = "/path/to/coco_dataset_directory/annotation_file"
>>>
>>> # Read COCO data for Detection task
>>> dataset = ds.CocoDataset(dataset_dir=coco_dataset_dir,
...                          annotation_file=coco_annotation_file,
...                          task='Detection')
>>>
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.DIV2KDataset(dataset_dir, usage='train', downgrade='bicubic', scale=2, num_samples=None, num_parallel_workers=None, shuffle=None, decode=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

DIV2K(DIVerse 2K resolution image) dataset.

The generated dataset has two columns [hr_image, lr_image] . The tensor of column hr_image and the tensor of column lr_image are of the uint8 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘valid’ or ‘all’. Default: ‘train’.

  • downgrade (str, optional) – Acceptable downgrades include ‘bicubic’, ‘unknown’, ‘mild’, ‘difficult’ or ‘wild’. Default: ‘bicubic’.

  • scale (int, optional) – Acceptable scales include 2, 3, 4 or 8. Default: 2. When downgrade is ‘bicubic’, scale can be 2, 3, 4 or 8. When downgrade is ‘unknown’, scale can only be 2, 3 or 4. When downgrade is ‘mild’, ‘difficult’ or ‘wild’, scale can only be 4.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir is invalid or does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If usage is invalid.

  • ValueError – If downgrade is invalid.

  • ValueError – If scale is invalid.

  • ValueError – If scale is 8 and downgrade is not ‘bicubic’.

  • ValueError – If downgrade is in [‘mild’, ‘difficult’, ‘wild’] and scale is not 4.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> div2k_dataset_dir = "/path/to/div2k_dataset_directory"
>>>
>>> # 1) Get all samples from DIV2K dataset in sequence
>>> dataset = ds.DIV2KDataset(dataset_dir=div2k_dataset_dir, usage="train", scale=2, downgrade="bicubic",
...                           shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from DIV2K dataset
>>> dataset = ds.DIV2KDataset(dataset_dir=div2k_dataset_dir, usage="train", scale=2, downgrade="bicubic",
...                           num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from DIV2K dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.DIV2KDataset(dataset_dir=div2k_dataset_dir, usage="train", scale=2, downgrade="bicubic",
...                           num_shards=2, shard_id=0)
>>>
>>> # In DIV2K dataset, each dictionary has keys "hr_image" and "lr_image"

About DIV2K dataset:

The DIV2K dataset consists of 1000 2K resolution images, among which 800 images are for training, 100 images are for validation and 100 images are for testing. NTIRE 2017 and NTIRE 2018 include only training dataset and validation dataset.

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

Take the training set as an example.

.
└── DIV2K
     ├── DIV2K_train_HR
     |    ├── 0001.png
     |    ├── 0002.png
     |    ├── ...
     ├── DIV2K_train_LR_bicubic
     |    ├── X2
     |    |    ├── 0001x2.png
     |    |    ├── 0002x2.png
     |    |    ├── ...
     |    ├── X3
     |    |    ├── 0001x3.png
     |    |    ├── 0002x3.png
     |    |    ├── ...
     |    └── X4
     |         ├── 0001x4.png
     |         ├── 0002x4.png
     |         ├── ...
     ├── DIV2K_train_LR_unknown
     |    ├── X2
     |    |    ├── 0001x2.png
     |    |    ├── 0002x2.png
     |    |    ├── ...
     |    ├── X3
     |    |    ├── 0001x3.png
     |    |    ├── 0002x3.png
     |    |    ├── ...
     |    └── X4
     |         ├── 0001x4.png
     |         ├── 0002x4.png
     |         ├── ...
     ├── DIV2K_train_LR_mild
     |    ├── 0001x4m.png
     |    ├── 0002x4m.png
     |    ├── ...
     ├── DIV2K_train_LR_difficult
     |    ├── 0001x4d.png
     |    ├── 0002x4d.png
     |    ├── ...
     ├── DIV2K_train_LR_wild
     |    ├── 0001x4w.png
     |    ├── 0002x4w.png
     |    ├── ...
     └── DIV2K_train_LR_x8
          ├── 0001x8.png
          ├── 0002x8.png
          ├── ...

Citation:

@InProceedings{Agustsson_2017_CVPR_Workshops,
author    = {Agustsson, Eirikur and Timofte, Radu},
title     = {NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
url       = "http://www.vision.ee.ethz.ch/~timofter/publications/Agustsson-CVPRW-2017.pdf",
month     = {July},
year      = {2017}
}
class tinyms.data.EMnistDataset(dataset_dir, name, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

EMNIST(Extended MNIST) dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • name (str) – Name of splits for this dataset, can be ‘byclass’, ‘bymerge’, ‘balanced’, ‘letters’, ‘digits’ or ‘mnist’.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. For the ‘mnist’ split, ‘train’ will read from 60,000 train samples, ‘test’ will read from 10,000 test samples, and ‘all’ will read from all 70,000 samples. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> emnist_dataset_dir = "/path/to/emnist_dataset_directory"
>>>
>>> # Read 3 samples from EMNIST dataset
>>> dataset = ds.EMnistDataset(dataset_dir=emnist_dataset_dir, name="mnist", num_samples=3)
>>>
>>> # Note: In emnist_dataset dataset, each dictionary has keys "image" and "label"

About EMNIST dataset:

The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset. Further information on the dataset contents and conversion process can be found in the paper available at https://arxiv.org/abs/1702.05373v1.

The numbers of characters and classes of each split of EMNIST are as follows:

  • By Class: 814,255 characters and 62 unbalanced classes.

  • By Merge: 814,255 characters and 47 unbalanced classes.

  • Balanced: 131,600 characters and 47 balanced classes.

  • Letters: 145,600 characters and 26 balanced classes.

  • Digits: 280,000 characters and 10 balanced classes.

  • MNIST: 70,000 characters and 10 balanced classes.

Here is the original EMNIST dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── mnist_dataset_dir
     ├── emnist-mnist-train-images-idx3-ubyte
     ├── emnist-mnist-train-labels-idx1-ubyte
     ├── emnist-mnist-test-images-idx3-ubyte
     ├── emnist-mnist-test-labels-idx1-ubyte
     ├── ...

Citation:

@article{cohen_afshar_tapson_schaik_2017,
title        = {EMNIST: Extending MNIST to handwritten letters},
DOI          = {10.1109/ijcnn.2017.7966217},
journal      = {2017 International Joint Conference on Neural Networks (IJCNN)},
author       = {Cohen, Gregory and Afshar, Saeed and Tapson, Jonathan and Schaik, Andre Van},
year         = {2017},
howpublished = {https://www.westernsydney.edu.au/icns/reproducible_research/
                publication_support_materials/emnist}
}
class tinyms.data.FakeImageDataset(num_images=1000, image_size=(224, 224, 3), num_classes=10, base_seed=0, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for generating fake images.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The column label is a scalar of the uint32 type.

Parameters:
  • num_images (int, optional) – Number of images to generate in the dataset. Default: 1000.

  • image_size (tuple, optional) – Size of the fake image. Default: (224, 224, 3).

  • num_classes (int, optional) – Number of classes in the dataset. Default: 10.

  • base_seed (int, optional) – Offsets the index-based random seed used to generate each image. Default: 0.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, which uses the global default number of workers (8); this can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler    Parameter shuffle    Expected Order Behavior
-----------------    -----------------    ------------------------
None                 None                 random order
None                 True                 random order
None                 False                sequential order
Sampler object       None                 order defined by sampler
Sampler object       True                 not allowed
Sampler object       False                not allowed

Examples

>>> import tinyms.data as ds
>>>
>>> # Read 3 samples from FakeImage dataset
>>> dataset = ds.FakeImageDataset(num_images=1000, image_size=(224,224,3),
...                               num_classes=10, base_seed=0, num_samples=3)
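A minimal sketch of the sampler alternative, assuming the standard MindSpore samplers (such as SequentialSampler) are re-exported by tinyms.data: since sampler and shuffle are mutually exclusive, the sampler alone controls ordering.

>>> # Use an explicit sampler instead of shuffle (the two cannot be combined)
>>> sampler = ds.SequentialSampler(start_index=0, num_samples=3)
>>> dataset = ds.FakeImageDataset(num_images=1000, sampler=sampler)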
class tinyms.data.FashionMnistDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Fashion-MNIST dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. ‘train’ will read from 60,000 train samples, ‘test’ will read from 10,000 test samples, ‘all’ will read from all 70,000 samples. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> fashion_mnist_dataset_dir = "/path/to/fashion_mnist_dataset_directory"
>>>
>>> # Read 3 samples from FASHIONMNIST dataset
>>> dataset = ds.FashionMnistDataset(dataset_dir=fashion_mnist_dataset_dir, num_samples=3)
>>>
>>> # Note: In FASHIONMNIST dataset, each dictionary has keys "image" and "label"
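To inspect what the dataset yields, one common pattern is a dict iterator; a minimal sketch (the shape assumes the raw 28x28 grayscale images):

>>> # Each item is a dict with "image" and "label" keys
>>> for item in dataset.create_dict_iterator(output_numpy=True):
...     print(item["image"].shape, item["label"])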

About Fashion-MNIST dataset:

Fashion-MNIST is a dataset of Zalando’s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── fashionmnist_dataset_dir
     ├── t10k-images-idx3-ubyte
     ├── t10k-labels-idx1-ubyte
     ├── train-images-idx3-ubyte
     └── train-labels-idx1-ubyte

Citation:

@online{xiao2017/online,
  author       = {Han Xiao and Kashif Rasul and Roland Vollgraf},
  title        = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
  date         = {2017-08-28},
  year         = {2017},
  eprintclass  = {cs.LG},
  eprinttype   = {arXiv},
  eprint       = {cs.LG/1708.07747},
}
class tinyms.data.FlickrDataset(dataset_dir, annotation_file, num_samples=None, num_parallel_workers=None, shuffle=None, decode=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Flickr8k and Flickr30k datasets.

The generated dataset has two columns [image, annotation] . The tensor of column image is of the uint8 type. The tensor of column annotation contains 5 annotation strings, such as [“a”, “b”, “c”, “d”, “e”].

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • annotation_file (str) – Path to the file that contains the annotations.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: None.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir is not valid or does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If annotation_file does not exist.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> flickr_dataset_dir = "/path/to/flickr_dataset_directory"
>>> annotation_file = "/path/to/flickr_annotation_file"
>>>
>>> # 1) Get all samples from FLICKR dataset in sequence
>>> dataset = ds.FlickrDataset(dataset_dir=flickr_dataset_dir,
...                            annotation_file=annotation_file,
...                            shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from FLICKR dataset
>>> dataset = ds.FlickrDataset(dataset_dir=flickr_dataset_dir,
...                            annotation_file=annotation_file,
...                            num_samples=350,
...                            shuffle=True)
>>>
>>> # 3) Get samples from FLICKR dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.FlickrDataset(dataset_dir=flickr_dataset_dir,
...                            annotation_file=annotation_file,
...                            num_shards=2,
...                            shard_id=0)
>>>
>>> # In FLICKR dataset, each dictionary has keys "image" and "annotation"
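A hedged sketch of reading the annotation column described above; each sample pairs one image with its 5 caption strings:

>>> for item in dataset.create_dict_iterator(output_numpy=True):
...     print(item["annotation"])  # an array of 5 annotation strings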

About Flickr8k dataset:

The Flickr8k dataset consists of 8092 color images. There are 40460 annotations in the Flickr8k.token.txt, each image has 5 annotations.

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── Flickr8k
     ├── Flickr8k_Dataset
     │    ├── 1000268201_693b08cb0e.jpg
     │    ├── 1001773457_577c3a7d70.jpg
     │    ├── ...
     └── Flickr8k.token.txt

Citation:

@article{DBLP:journals/jair/HodoshYH13,
author    = {Micah Hodosh and Peter Young and Julia Hockenmaier},
title     = {Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics},
journal   = {J. Artif. Intell. Res.},
volume    = {47},
pages     = {853--899},
year      = {2013},
url       = {https://doi.org/10.1613/jair.3994},
doi       = {10.1613/jair.3994},
timestamp = {Mon, 21 Jan 2019 15:01:17 +0100},
biburl    = {https://dblp.org/rec/journals/jair/HodoshYH13.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

About Flickr30k dataset:

The Flickr30k dataset consists of 31783 color images. There are 158915 annotations in the results_20130124.token, each image has 5 annotations.

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── Flickr30k
     ├── flickr30k-images
     │    ├── 1000092795.jpg
     │    ├── 10002456.jpg
     │    ├── ...
     └── results_20130124.token

Citation:

@article{DBLP:journals/tacl/YoungLHH14,
author    = {Peter Young and Alice Lai and Micah Hodosh and Julia Hockenmaier},
title     = {From image descriptions to visual denotations: New similarity metrics
             for semantic inference over event descriptions},
journal   = {Trans. Assoc. Comput. Linguistics},
volume    = {2},
pages     = {67--78},
year      = {2014},
url       = {https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/229},
timestamp = {Wed, 17 Feb 2021 21:55:25 +0100},
biburl    = {https://dblp.org/rec/journals/tacl/YoungLHH14.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
class tinyms.data.Flowers102Dataset(dataset_dir, task='Classification', usage='all', num_samples=None, num_parallel_workers=1, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None)[source]

Oxford 102 Flower dataset.

According to the given task configuration, the generated dataset has different output columns:

  • task = ‘Classification’, output columns: [image, dtype=uint8] , [label, dtype=uint32] .

  • task = ‘Segmentation’, output columns: [image, dtype=uint8] , [segmentation, dtype=uint8] , [label, dtype=uint32] .

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str, optional) – Specify the ‘Classification’ or ‘Segmentation’ task. Default: ‘Classification’.

  • usage (str, optional) – Specify the ‘train’, ‘valid’, ‘test’ part or ‘all’ parts of dataset. Default: ‘all’, will read all samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker subprocesses used to fetch the dataset in parallel. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Whether or not to decode the images and segmentations after reading. Default: False.

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument must be specified only when num_shards is also specified.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> flowers102_dataset_dir = "/path/to/flowers102_dataset_directory"
>>> dataset = ds.Flowers102Dataset(dataset_dir=flowers102_dataset_dir,
...                                task="Classification",
...                                usage="all",
...                                decode=True)

About Flowers102 dataset:

Flowers102 dataset consists of 102 flower categories commonly occurring in the United Kingdom. Each class consists of between 40 and 258 images.

Here is the original Flowers102 dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── flowers102_dataset_dir
     ├── imagelabels.mat
     ├── setid.mat
     ├── jpg
          ├── image_00001.jpg
          ├── image_00002.jpg
          ├── ...
     ├── segmim
          ├── segmim_00001.jpg
          ├── segmim_00002.jpg
          ├── ...

Citation:

@InProceedings{Nilsback08,
  author       = "Maria-Elena Nilsback and Andrew Zisserman",
  title        = "Automated Flower Classification over a Large Number of Classes",
  booktitle    = "Indian Conference on Computer Vision, Graphics and Image Processing",
  month        = "Dec",
  year         = "2008",
}
get_class_indexing()[source]

Get the class index.

Returns:

dict, a str-to-int mapping from label name to index.
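Examples

A brief usage sketch, mirroring ImageFolderDataset.get_class_indexing later on this page:

>>> flowers102_dataset_dir = "/path/to/flowers102_dataset_directory"
>>> dataset = ds.Flowers102Dataset(dataset_dir=flowers102_dataset_dir)
>>> class_indexing = dataset.get_class_indexing()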

class tinyms.data.Food101Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Food101 dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, or ‘all’. ‘train’ will read from 75,750 samples, ‘test’ will read from 25,250 samples, and ‘all’ will read all ‘train’ and ‘test’ samples. Default: None, will be set to ‘all’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. When this argument is specified, num_samples reflects the maximum sample number of per shard. Default: None.

  • shard_id (int, optional) – The shard ID within num_shards . This argument can only be specified when num_shards is also specified. Default: None.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If the value of usage is not ‘train’, ‘test’, or ‘all’.

  • ValueError – If dataset_dir does not exist.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> food101_dataset_dir = "/path/to/food101_dataset_directory"
>>>
>>> # Read 3 samples from Food101 dataset
>>> dataset = ds.Food101Dataset(dataset_dir=food101_dataset_dir, num_samples=3)
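Note that, unlike most datasets on this page, the label column here is a string; a minimal sketch of reading it:

>>> # Each "label" value is a class name string such as "apple_pie"
>>> for item in dataset.create_dict_iterator(output_numpy=True):
...     print(item["label"])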

About Food101 dataset:

Food101 is a dataset of 101 food categories, with 101,000 images in total. Each class has 250 test images and 750 training images. All images were rescaled to have a maximum side length of 512 pixels.

The following is the original Food101 dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── food101_dir
     ├── images
     │    ├── apple_pie
     │    │    ├── 1005649.jpg
     │    │    ├── 1014775.jpg
     │    │    ├──...
     │    ├── baby_back_ribs
     │    │    ├── 1005293.jpg
     │    │    ├── 1007102.jpg
     │    │    ├──...
     │    └──...
     └── meta
          ├── train.txt
          ├── test.txt
          ├── classes.txt
          ├── train.json
          ├── test.json
          └── labels.txt

Citation:

@inproceedings{bossard14,
title     = {Food-101 -- Mining Discriminative Components with Random Forests},
author    = {Bossard, Lukas and Guillaumin, Matthieu and Van Gool, Luc},
booktitle = {European Conference on Computer Vision},
year      = {2014}
}
class tinyms.data.ImageFolderDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, extensions=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None, decrypt=None)[source]

A source dataset that reads images from a tree of directories. All images within one folder have the same label.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • extensions (list[str], optional) – List of file extensions to be included in the dataset. Default: None.

  • class_indexing (dict, optional) – A str-to-int mapping from folder name to index. Default: None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

  • decrypt (callable, optional) – Image decryption function, which accepts the path of the encrypted image file and returns the decrypted bytes data. Default: None, no decryption.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> image_folder_dataset_dir = "/path/to/image_folder_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in image_folder_dataset_dir with 8 threads
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir,
...                                 num_parallel_workers=8)
>>>
>>> # 2) Read all samples (image files) from folder cat and folder dog with label 0 and 1
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir,
...                                 class_indexing={"cat":0, "dog":1})
>>>
>>> # 3) Read all samples (image files) in image_folder_dataset_dir with extensions .JPEG
>>> #    and .png (case sensitive)
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir,
...                                 extensions=[".JPEG", ".png"])
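A minimal sketch of reading the folder tree described below; labels follow the alphabetically sorted folder names unless class_indexing overrides them:

>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir, decode=True)
>>> print(dataset.get_dataset_size())  # total number of images found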

About ImageFolderDataset:

You can construct the following directory structure from your dataset files and read by MindSpore’s API.

.
└── image_folder_dataset_directory
     ├── class1
     │    ├── 000000000001.jpg
     │    ├── 000000000002.jpg
     │    ├── ...
     ├── class2
     │    ├── 000000000001.jpg
     │    ├── 000000000002.jpg
     │    ├── ...
     ├── class3
     │    ├── 000000000001.jpg
     │    ├── 000000000002.jpg
     │    ├── ...
     ├── classN
     ├── ...
get_class_indexing()[source]

Get the class index.

Returns:

dict, a str-to-int mapping from label name to index.

Examples

>>> image_folder_dataset_dir = "/path/to/image_folder_dataset_directory"
>>>
>>> dataset = ds.ImageFolderDataset(dataset_dir=image_folder_dataset_dir)
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.KITTIDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

KITTI dataset.

When usage is “train”, the generated dataset has multiple columns: [image, label, truncated, occluded, alpha, bbox, dimensions, location, rotation_y] ; when usage is “test”, the generated dataset has only one column: [image] . The tensor of column image is of the uint8 type. The tensors of columns label and occluded are of the uint32 type. The tensors of columns truncated , alpha , bbox , dimensions , location and rotation_y are of the float32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be train or test . train will read 7481 train samples, test will read from 7518 test samples without label. Default: None, will use train .

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will include all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> kitti_dataset_dir = "/path/to/kitti_dataset_directory"
>>>
>>> # 1) Read all KITTI train dataset samples in kitti_dataset_dir in sequence
>>> dataset = ds.KITTIDataset(dataset_dir=kitti_dataset_dir, usage="train")
>>>
>>> # 2) Read then decode all KITTI test dataset samples in kitti_dataset_dir in sequence
>>> dataset = ds.KITTIDataset(dataset_dir=kitti_dataset_dir, usage="test",
...                           decode=True, shuffle=False)
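A hedged sketch of reading the train-split annotation columns listed above:

>>> train_set = ds.KITTIDataset(dataset_dir=kitti_dataset_dir, usage="train")
>>> for item in train_set.create_dict_iterator(output_numpy=True):
...     print(item["label"], item["bbox"])  # per-object class ids and boxes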

About KITTI dataset:

KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. Despite its popularity, the dataset itself does not contain ground truth for semantic segmentation. However, various researchers have manually annotated parts of the dataset to fit their necessities. Álvarez et al. generated ground truth for 323 images from the road detection challenge with three classes: road, vehicles and sky. Zhang et al. annotated 252 (140 for training and 112 for testing) acquisitions – RGB and Velodyne scans – from the tracking challenge for ten object categories: building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence.

You can unzip the original KITTI dataset files into this directory structure and read by MindSpore’s API.

.
└── kitti_dataset_directory
    ├── data_object_image_2
    │    ├──training
    │    │    ├──image_2
    │    │    │    ├── 000000000001.jpg
    │    │    │    ├── 000000000002.jpg
    │    │    │    ├── ...
    │    ├──testing
    │    │    ├── image_2
    │    │    │    ├── 000000000001.jpg
    │    │    │    ├── 000000000002.jpg
    │    │    │    ├── ...
    ├── data_object_label_2
    │    ├──training
    │    │    ├──label_2
    │    │    │    ├── 000000000001.txt
    │    │    │    ├── 000000000002.txt
    │    │    │    ├── ...

Citation:

@INPROCEEDINGS{Geiger2012CVPR,
author={Andreas Geiger and Philip Lenz and Raquel Urtasun},
title={Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite},
booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2012}
}
class tinyms.data.KMnistDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

KMNIST (Kuzushiji-MNIST) dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’ . ‘train’ will read from 60,000 train samples, ‘test’ will read from 10,000 test samples, ‘all’ will read from all 70,000 samples. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> kmnist_dataset_dir = "/path/to/kmnist_dataset_directory"
>>>
>>> # Read 3 samples from KMNIST dataset
>>> dataset = ds.KMnistDataset(dataset_dir=kmnist_dataset_dir, num_samples=3)

About KMNIST dataset:

KMNIST is a dataset adapted from the Kuzushiji Dataset, intended as a drop-in replacement for MNIST, the most famous dataset in the machine learning community.

Here is the original KMNIST dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── kmnist_dataset_dir
     ├── t10k-images-idx3-ubyte
     ├── t10k-labels-idx1-ubyte
     ├── train-images-idx3-ubyte
     └── train-labels-idx1-ubyte

Citation:

@online{clanuwat2018deep,
  author       = {Tarin Clanuwat and Mikel Bober-Irizar and Asanobu Kitamoto and
                   Alex Lamb and Kazuaki Yamamoto and David Ha},
  title        = {Deep Learning for Classical Japanese Literature},
  date         = {2018-12-03},
  year         = {2018},
  eprintclass  = {cs.CV},
  eprinttype   = {arXiv},
  eprint       = {cs.CV/1812.01718},
}
class tinyms.data.LFWDataset(dataset_dir, task=None, usage=None, image_set=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

LFW (Labeled Faces in the Wild) dataset.

When task is ‘people’, the generated dataset has two columns: [image, label] ; when task is ‘pairs’, the generated dataset has three columns: [image1, image2, label] . The tensors of columns image , image1 and image2 are of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str, optional) – Set the task type of reading lfw data, support ‘people’ and ‘pairs’. Default: None, means ‘people’.

  • usage (str, optional) – The image split to use, support ‘10fold’, ‘train’, ‘test’ and ‘all’. Default: None, will read samples including train and test.

  • image_set (str, optional) – Type of image funneling to use, support ‘original’, ‘funneled’ or ‘deepfunneled’. Default: None, will use ‘funneled’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> # 1) Read LFW People dataset
>>> lfw_people_dataset_dir = "/path/to/lfw_people_dataset_directory"
>>> dataset = ds.LFWDataset(dataset_dir=lfw_people_dataset_dir, task="people", usage="10fold",
...                         image_set="original")
>>>
>>> # 2) Read LFW Pairs dataset
>>> lfw_pairs_dataset_dir = "/path/to/lfw_pairs_dataset_directory"
>>> dataset = ds.LFWDataset(dataset_dir=lfw_pairs_dataset_dir, task="pairs", usage="test", image_set="funneled")
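A hedged sketch of the three ‘pairs’ columns; without decode=True the image columns hold raw encoded bytes:

>>> for item in dataset.create_dict_iterator(output_numpy=True):
...     print(item["image1"].shape, item["image2"].shape, item["label"])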

About LFW dataset:

LFW (Labelled Faces in the Wild) dataset is one of the most commonly used open datasets in the field of face recognition. It was released by Gary B. Huang and his team at the University of Massachusetts, Amherst in 2007. The dataset includes 13,233 images of 5,749 individuals, which are sourced from various internet platforms and contain diverse environmental factors such as different poses, lighting conditions, and angles. Most of the images in the dataset are frontal and cover a wide range of ages, genders, and ethnicities.

You can unzip the original LFW dataset files into this directory structure and read by MindSpore’s API.

.
└── lfw_dataset_directory
    ├── lfw
    │    ├──Aaron_Eckhart
    │    │    ├──Aaron_Eckhart_0001.jpg
    │    │    ├──...
    │    ├──Abbas_Kiarostami
    │    │    ├── Abbas_Kiarostami_0001.jpg
    │    │    ├──...
    │    ├──...
    ├── lfw-deepfunneled
    │    ├──Aaron_Eckhart
    │    │    ├──Aaron_Eckhart_0001.jpg
    │    │    ├──...
    │    ├──Abbas_Kiarostami
    │    │    ├── Abbas_Kiarostami_0001.jpg
    │    │    ├──...
    │    ├──...
    ├── lfw_funneled
    │    ├──Aaron_Eckhart
    │    │    ├──Aaron_Eckhart_0001.jpg
    │    │    ├──...
    │    ├──Abbas_Kiarostami
    │    │    ├── Abbas_Kiarostami_0001.jpg
    │    │    ├──...
    │    ├──...
    ├── lfw-names.txt
    ├── pairs.txt
    ├── pairsDevTest.txt
    ├── pairsDevTrain.txt
    ├── people.txt
    ├── peopleDevTest.txt
    ├── peopleDevTrain.txt

Citation:

@TechReport{LFWTech,
    title={LFW: A Database for Studying Recognition in Unconstrained Environments},
    author={Gary B. Huang and Manu Ramesh and Tamara Berg and Erik Learned-Miller},
    institution ={University of Massachusetts, Amherst},
    year={2007},
    number={07-49},
    month={October},
    howpublished = {http://vis-www.cs.umass.edu/lfw}
}
class tinyms.data.LSUNDataset(dataset_dir, usage=None, classes=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

LSUN (Large-scale Scene UNderstanding) dataset.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, ‘valid’ or ‘all’. Default: None, will be set to ‘all’.

  • classes (Union[str, list[str]], optional) – Choose the specific classes to load. Default: None, means loading all classes in root directory.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is invalid (< 0 or >= num_shards ).

  • ValueError – If usage or classes is invalid (not within the accepted values).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> lsun_dataset_dir = "/path/to/lsun_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in lsun_dataset_dir with 8 threads
>>> dataset = ds.LSUNDataset(dataset_dir=lsun_dataset_dir,
...                          num_parallel_workers=8)
>>>
>>> # 2) Read all train samples (image files) from folder "bedroom" and "classroom"
>>> dataset = ds.LSUNDataset(dataset_dir=lsun_dataset_dir, usage="train",
...                          classes=["bedroom", "classroom"])

About LSUN dataset:

LSUN (Large-Scale Scene Understanding) is a large-scale dataset used for indoor scene understanding. It was originally launched by Princeton University in 2015 with the aim of providing a challenging and diverse dataset for research in computer vision and machine learning. The main application of this dataset in research is indoor scene analysis.

This dataset contains ten different categories of scenes, including bedrooms, living rooms, restaurants, lounges, studies, kitchens, bathrooms, corridors, children's rooms, and outdoors. Each category contains tens of thousands of images from different perspectives, and these images are high-quality, high-resolution real-world images.

You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── lsun_dataset_directory
    ├── test
    │    ├── ...
    ├── bedroom_train
    │    ├── 1_1.jpg
    │    ├── 1_2.jpg
    ├── bedroom_val
    │    ├── ...
    ├── classroom_train
    │    ├── ...
    ├── classroom_val
    │    ├── ...

Citation:

@article{yu15lsun,
    title={LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop},
    author={Yu, Fisher and Zhang, Yinda and Song, Shuran and Seff, Ari and Xiao, Jianxiong},
    journal={arXiv preprint arXiv:1506.03365},
    year={2015}
}
class tinyms.data.ManifestDataset(dataset_file, usage='train', num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, class_indexing=None, decode=False, num_shards=None, shard_id=None, cache=None)[source]

A source dataset for reading images from a Manifest file.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint64 type.

Parameters:
  • dataset_file (str) – File to be read.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘eval’ and ‘inference’. Default: ‘train’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will include all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index. Default: None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_files are not valid or do not exist.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • RuntimeError – If class_indexing is not a dictionary.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • The shape of the image column is [image_size] if decode flag is False, or [H,W,C] otherwise.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> manifest_dataset_dir = "/path/to/manifest_dataset_file"
>>>
>>> # 1) Read all samples specified in manifest_dataset_dir dataset with 8 threads for training
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, usage="train", num_parallel_workers=8)
>>>
>>> # 2) Read samples (specified in manifest_file.manifest) for shard 0 in a 2-way distributed training setup
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir, num_shards=2, shard_id=0)
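As with ImageFolderDataset, class_indexing can pin label names to fixed indices; a minimal sketch (the label names are hypothetical):

>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir,
...                              class_indexing={"cat": 0, "dog": 1})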

About Manifest dataset:

A Manifest file contains a list of files included in a dataset, including basic file information such as the file name and file ID, along with extended file metadata. Manifest is a data format supported by Huawei ModelArts. For details, see Specifications for Importing the Manifest File .

.
└── manifest_dataset_directory
    ├── train
    │    ├── 1.JPEG
    │    ├── 2.JPEG
    │    ├── ...
    ├── eval
    │    ├── 1.JPEG
    │    ├── 2.JPEG
    │    ├── ...
get_class_indexing()[source]

Get the class index.

Returns:

dict, a str-to-int mapping from label name to index.

Examples

>>> manifest_dataset_dir = "/path/to/manifest_dataset_file"
>>>
>>> dataset = ds.ManifestDataset(dataset_file=manifest_dataset_dir)
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.MnistDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

MNIST dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’ . ‘train’ will read from 60,000 train samples, ‘test’ will read from 10,000 test samples, ‘all’ will read from all 70,000 samples. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If usage is not ‘train’, ‘test’ or ‘all’.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> mnist_dataset_dir = "/path/to/mnist_dataset_directory"
>>>
>>> # Read 3 samples from MNIST dataset
>>> dataset = ds.MnistDataset(dataset_dir=mnist_dataset_dir, num_samples=3)
>>>
>>> # Note: In mnist_dataset dataset, each dictionary has keys "image" and "label"
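A minimal sharding sketch for distributed training; with num_shards specified, num_samples would cap each shard rather than the whole dataset:

>>> dataset = ds.MnistDataset(dataset_dir=mnist_dataset_dir, usage="train",
...                           num_shards=2, shard_id=0)
>>> print(dataset.get_dataset_size())  # samples in this shard, roughly 30000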

About MNIST dataset:

The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

Here is the original MNIST dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── mnist_dataset_dir
     ├── t10k-images-idx3-ubyte
     ├── t10k-labels-idx1-ubyte
     ├── train-images-idx3-ubyte
     └── train-labels-idx1-ubyte

Citation:

@article{lecun2010mnist,
title        = {MNIST handwritten digit database},
author       = {LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal      = {ATT Labs [Online]},
volume       = {2},
year         = {2010},
howpublished = {http://yann.lecun.com/exdb/mnist}
}
class tinyms.data.OmniglotDataset(dataset_dir, background=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Omniglot dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • background (bool, optional) – Whether to create dataset from the “background” set. Otherwise create from the “evaluation” set. Default: None, set to True.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and sharding are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter 'sampler'      Parameter 'shuffle'      Expected Order Behavior
None                     None                     random order
None                     True                     random order
None                     False                    sequential order
Sampler object           None                     order defined by sampler
Sampler object           True                     not allowed
Sampler object           False                    not allowed

Examples

>>> omniglot_dataset_dir = "/path/to/omniglot_dataset_directory"
>>> dataset = ds.OmniglotDataset(dataset_dir=omniglot_dataset_dir,
...                              num_parallel_workers=8)
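A short sketch of the background switch described above; False selects the images_evaluation subtree instead of images_background:

>>> eval_set = ds.OmniglotDataset(dataset_dir=omniglot_dataset_dir, background=False)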

About Omniglot dataset:

The Omniglot dataset is designed for developing more human-like learning algorithms. It contains 1623 different handwritten characters from 50 different alphabets. Each of the 1623 characters was drawn online via Amazon’s Mechanical Turk by 20 different people. Each image is paired with stroke data, a sequence of [x, y, t] coordinates with time in milliseconds.

You can unzip the original Omniglot dataset files into this directory structure and read by MindSpore’s API.

.
└── omniglot_dataset_directory
     ├── images_background
     │    ├── character_class1
     │    │    ├── 01.jpg
     │    │    ├── 02.jpg
     │    ├── character_class2
     │    │    ├── 01.jpg
     │    │    ├── 02.jpg
     │    ├── ...
     ├── images_evaluation
     │    ├── character_class1
     │    │    ├── 01.jpg
     │    │    ├── 02.jpg
     │    ├── character_class2
     │    │    ├── 01.jpg
     │    │    ├── 02.jpg
     │    ├── ...

Citation:

@article{lake2015human,
    title={Human-level concept learning through probabilistic program induction},
    author={Lake, Brenden M and Salakhutdinov, Ruslan and Tenenbaum, Joshua B},
    journal={Science},
    volume={350},
    number={6266},
    pages={1332--1338},
    year={2015},
    publisher={American Association for the Advancement of Science}
}
class tinyms.data.PhotoTourDataset(dataset_dir, name, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

PhotoTour dataset.

According to the given usage configuration, the generated dataset has different output columns:

  • usage = ‘train’, output columns: [image, dtype=uint8] .

  • usage ≠ ‘train’, output columns: [image1, dtype=uint8] , [image2, dtype=uint8] , [matches, dtype=uint32] .

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • name (str) – Name of the dataset to load, should be one of ‘notredame’, ‘yosemite’, ‘liberty’, ‘notredame_harris’, ‘yosemite_harris’ or ‘liberty_harris’.

  • usage (str, optional) – Usage of the dataset, can be ‘train’ or ‘test’. Default: None, will be set to ‘train’. When usage is ‘train’, number of samples for each name is {‘notredame’: 468159, ‘yosemite’: 633587, ‘liberty’: 450092, ‘liberty_harris’: 379587, ‘yosemite_harris’: 450912, ‘notredame_harris’: 325295}. When usage is ‘test’, will read 100,000 samples for testing.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If usage is not in [“train”, “test”].

  • ValueError – If name is not in [“notredame”, “yosemite”, “liberty”, “notredame_harris”, “yosemite_harris”, “liberty_harris”].

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> # Read 3 samples from PhotoTour dataset.
>>> dataset = ds.PhotoTourDataset(dataset_dir="/path/to/photo_tour_dataset_directory",
...                               name='liberty', usage='train', num_samples=3)
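
Since the output columns depend on usage, get_col_names() gives a quick check of the layout; a minimal sketch reusing the placeholder path above, with the expected results from the column description noted in comments:

>>> # usage='train' yields one column; usage='test' yields three
>>> train_ds = ds.PhotoTourDataset(dataset_dir="/path/to/photo_tour_dataset_directory",
...                                name='liberty', usage='train')
>>> print(train_ds.get_col_names())  # expected: ['image']
>>> test_ds = ds.PhotoTourDataset(dataset_dir="/path/to/photo_tour_dataset_directory",
...                               name='liberty', usage='test')
>>> print(test_ds.get_col_names())   # expected: ['image1', 'image2', 'matches']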

About PhotoTour dataset:

The data is taken from Photo Tourism reconstructions from Trevi Fountain (Rome), Notre Dame (Paris) and Half Dome (Yosemite). Each dataset consists of a series of corresponding patches, which are obtained by projecting 3D points from Photo Tourism reconstructions back into the original images.

The dataset consists of 1024 x 1024 bitmap (.bmp) images, each containing a 16 x 16 array of image patches. Each patch is sampled as 64 x 64 grayscale, with a canonical scale and orientation. For details of how the scale and orientation is established, please see the paper. An associated metadata file info.txt contains the match information. Each row of info.txt corresponds to a separate patch, with the patches ordered from left to right and top to bottom in each bitmap image. The first number on each row of info.txt is the 3D point ID from which that patch was sampled – patches with the same 3D point ID are projected from the same 3D point (into different images). The second number in info.txt corresponds to the image from which the patch was sampled, and is not used at present.

You can unzip the original PhotoTour dataset files into this directory structure and read by MindSpore’s API.

.
└── photo_tour_dataset_directory
    ├── liberty/
    │    ├── info.txt                 // two columns: 3D_point_ID, unused
    │    ├── m50_100000_100000_0.txt  // seven columns: patch_ID1, 3D_point_ID1, unused1,
    │    │                            // patch_ID2, 3D_point_ID2, unused2, unused3
    │    ├── patches0000.bmp          // 1024*1024 pixels, with 16 * 16 patches.
    │    ├── patches0001.bmp
    │    ├── ...
    ├── yosemite/
    │    ├── ...
    ├── notredame/
    │    ├── ...
    ├── liberty_harris/
    │    ├── ...
    ├── yosemite_harris/
    │    ├── ...
    ├── notredame_harris/
    │    ├── ...

Citation:

@INPROCEEDINGS{4269996,
    author={Winder, Simon A. J. and Brown, Matthew},
    booktitle={2007 IEEE Conference on Computer Vision and Pattern Recognition},
    title={Learning Local Image Descriptors},
    year={2007},
    volume={},
    number={},
    pages={1-8},
    doi={10.1109/CVPR.2007.382971}
}
class tinyms.data.Places365Dataset(dataset_dir, usage=None, small=True, decode=False, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Places365 dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train-standard’, ‘train-challenge’ or ‘val’. Default: None, will be set to ‘train-standard’.

  • small (bool, optional) – Use 256 * 256 images (True) or high resolution images (False). Default: True.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If usage is not in [“train-standard”, “train-challenge”, “val”].

Note

  • This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> place365_dataset_dir = "/path/to/place365_dataset_directory"
>>>
>>> # Read 3 samples from Places365 dataset
>>> dataset = ds.Places365Dataset(dataset_dir=place365_dataset_dir, usage='train-standard',
...                               small=True, decode=True, num_samples=3)

About Places365 dataset:

Convolutional neural networks (CNNs) trained on the Places2 Database can be used for scene recognition as well as generic deep scene features for visual recognition.

The author releases the data of Places365-Standard and the data of Places365-Challenge to the public. Places365-Standard is the core set of Places2 Database, which has been used to train the Places365-CNNs. The author will add other kinds of annotation on the Places365-Standard in the future. Places365-Challenge is the competition set of Places2 Database, which has 6.2 million extra images compared to the Places365-Standard. The Places365-Challenge will be used for the Places Challenge 2016.

You can unzip the original Places365 dataset files into this directory structure and read by MindSpore’s API.

.
└── categories_places365
    ├── places365_train-standard.txt
    ├── places365_train-challenge.txt
    ├── val_large/
    │    ├── Places365_val_00000001.jpg
    │    ├── Places365_val_00000002.jpg
    │    ├── Places365_val_00000003.jpg
    │    ├── ...
    ├── val_256/
    │    ├── ...
    ├── data_large_standard/
    │    ├── ...
    ├── data_256_standard/
    │    ├── ...
    ├── data_large_challenge/
    │    ├── ...
    ├── data_256_challenge/
    │    ├── ...

Citation:

@article{zhou2017places,
    title={Places: A 10 million Image Database for Scene Recognition},
    author={Zhou, Bolei and Lapedriza, Agata and Khosla, Aditya and Oliva, Aude and Torralba, Antonio},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year={2017},
    publisher={IEEE}
}
class tinyms.data.QMnistDataset(dataset_dir, usage=None, compat=True, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

QMNIST dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, ‘test10k’, ‘test50k’, ‘nist’ or ‘all’. Default: None, will read all samples.

  • compat (bool, optional) – Whether the label for each example is class number (compat=True) or the full QMNIST information (compat=False). Default: True.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> qmnist_dataset_dir = "/path/to/qmnist_dataset_directory"
>>>
>>> # Read 3 samples from QMNIST train dataset
>>> dataset = ds.QMnistDataset(dataset_dir=qmnist_dataset_dir, num_samples=3)
>>>
>>> # Note: In QMNIST dataset, each dictionary has keys "image" and "label"
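
With compat=False, the label column carries the full QMNIST record instead of a single class number; a minimal sketch under that setting, reusing the placeholder directory above:

>>> # Read labels as full QMNIST information rather than class numbers
>>> dataset = ds.QMnistDataset(dataset_dir=qmnist_dataset_dir,
...                            usage='train', compat=False, num_samples=3)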

About QMNIST dataset:

The QMNIST dataset was generated from the original data found in the NIST Special Database 19 with the goal of matching the MNIST preprocessing as closely as possible. Through an iterative process, researchers tried to generate an additional 50k images of MNIST-like data. They started with a reconstruction process given in the paper and used the Hungarian algorithm to find the best matches between the original MNIST samples and their reconstructed samples.

Here is the original QMNIST dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── qmnist_dataset_dir
     ├── qmnist-train-images-idx3-ubyte
     ├── qmnist-train-labels-idx2-int
     ├── qmnist-test-images-idx3-ubyte
     ├── qmnist-test-labels-idx2-int
     ├── xnist-images-idx3-ubyte
     └── xnist-labels-idx2-int

Citation:

@incollection{qmnist-2019,
   title = "Cold Case: The Lost MNIST Digits",
   author = "Chhavi Yadav and L'{e}on Bottou",           booktitle = {Advances in Neural Information Processing Systems 32},
   year = {2019},
   publisher = {Curran Associates, Inc.},
}
class tinyms.data.RandomDataset(total_rows=None, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, cache=None, shuffle=None, num_shards=None, shard_id=None)[source]

A source dataset that generates random data.

Parameters:
  • total_rows (int, optional) – Number of samples for the dataset to generate. Default: None, number of samples is random.

  • schema (Union[str, Schema], optional) – Data format policy, which specifies the data types and shapes of the data column to be read. Both JSON file path and objects constructed by mindspore.dataset.Schema are acceptable. Default: None.

  • columns_list (list[str], optional) – List of column names of the dataset. Default: None, the columns will be named like this “c0”, “c1”, “c2” etc.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

Raises:
  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • TypeError – If total_rows is not of type int.

  • TypeError – If num_shards is not of type int.

  • TypeError – If num_parallel_workers is not of type int.

  • TypeError – If shuffle is not of type bool.

  • TypeError – If columns_list is not of type list.

Examples

>>> from mindspore import dtype as mstype
>>> import mindspore.dataset as ds
>>>
>>> schema = ds.Schema()
>>> schema.add_column('image', de_type=mstype.uint8, shape=[2])
>>> schema.add_column('label', de_type=mstype.uint8, shape=[1])
>>> # apply dataset operations
>>> ds1 = ds.RandomDataset(schema=schema, total_rows=50, num_parallel_workers=4)
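
A minimal sketch of consuming the generated rows, assuming the schema defined above; each row comes back as a dict keyed by the schema's column names:

>>> # Iterate the random rows once, as dictionaries of NumPy arrays
>>> for row in ds1.create_dict_iterator(num_epochs=1, output_numpy=True):
...     image, label = row['image'], row['label']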
class tinyms.data.RenderedSST2Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

RenderedSST2(Rendered Stanford Sentiment Treebank v2) dataset.

The generated dataset has two columns: [image, label]. The tensor of column image is of the uint8 type. The tensor of column label is of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘val’, ‘test’ or ‘all’. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will include all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Whether or not to decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. When this argument is specified, num_samples reflects the maximum number of samples per shard. Default: None.

  • shard_id (int, optional) – The shard ID within num_shards . This argument can only be specified when num_shards is also specified. Default: None.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If usage is not ‘train’, ‘test’, ‘val’ or ‘all’.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> rendered_sst2_dataset_dir = "/path/to/rendered_sst2_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in rendered_sst2_dataset_dir with 8 threads
>>> dataset = ds.RenderedSST2Dataset(dataset_dir=rendered_sst2_dataset_dir,
...                                  usage="all", num_parallel_workers=8)

About RenderedSST2Dataset:

Rendered SST2 is an image classification dataset generated by rendering sentences from the Stanford Sentiment Treebank v2 dataset. The dataset has three splits, each containing two classes (positive and negative): a train split with 6920 images (3610 positive and 3310 negative), a validation split with 872 images (444 positive and 428 negative), and a test split with 1821 images (909 positive and 912 negative).

Here is the original RenderedSST2 dataset structure. You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── rendered_sst2_dataset_directory
     ├── train
     │    ├── negative
     │    │    ├── 0001.jpg
     │    │    ├── 0002.jpg
     │    │    ...
     │    └── positive
     │         ├── 0001.jpg
     │         ├── 0002.jpg
     │         ...
     ├── test
     │    ├── negative
     │    │    ├── 0001.jpg
     │    │    ├── 0002.jpg
     │    │    ...
     │    └── positive
     │         ├── 0001.jpg
     │         ├── 0002.jpg
     │         ...
     └── valid
          ├── negative
          │    ├── 0001.jpg
          │    ├── 0002.jpg
          │    ...
          └── positive
               ├── 0001.jpg
               ├── 0002.jpg
               ...

Citation:

@inproceedings{socher-etal-2013-recursive,
    title     = {Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank},
    author    = {Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning,
                  Christopher D. and Ng, Andrew and Potts, Christopher},
    booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
    month     = oct,
    year      = {2013},
    address   = {Seattle, Washington, USA},
    publisher = {Association for Computational Linguistics},
    url       = {https://www.aclweb.org/anthology/D13-1170},
    pages     = {1631--1642},
}
class tinyms.data.SBDataset(dataset_dir, task='Boundaries', usage='all', num_samples=None, num_parallel_workers=1, shuffle=None, decode=None, sampler=None, num_shards=None, shard_id=None)[source]

SB(Semantic Boundaries) Dataset.

By configuring the task parameter, the generated dataset has different output columns:

  • task = ‘Boundaries’, there are two output columns: the ‘image’ column has the data type uint8 and the ‘label’ column contains one image of the data type uint8.

  • task = ‘Segmentation’, there are two output columns: the ‘image’ column has the data type uint8 and the ‘label’ column contains 20 images of the data type uint8.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str, optional) – Acceptable tasks include ‘Boundaries’ or ‘Segmentation’. Default: ‘Boundaries’.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘val’, ‘train_noval’ and ‘all’. Default: ‘all’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker subprocesses to read the data. Default: 1.

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: None.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

Raises:
  • RuntimeError – If dataset_dir is not valid or does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If dataset_dir does not exist.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If task is not in [‘Boundaries’, ‘Segmentation’].

  • ValueError – If usage is not in [‘train’, ‘val’, ‘train_noval’, ‘all’].

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler. sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> sb_dataset_dir = "/path/to/sb_dataset_directory"
>>>
>>> # 1) Get all samples from Semantic Boundaries Dataset in sequence
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from Semantic Boundaries Dataset
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from Semantic Boundaries Dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In Semantic Boundaries Dataset, each dictionary has keys "image" and "label"
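
Because the layout of the ‘label’ column depends on task, here is a sketch (same placeholder path) requesting the ‘Segmentation’ output instead of the default ‘Boundaries’:

>>> # 4) Read Semantic Boundaries Dataset with segmentation-style labels
>>> dataset = ds.SBDataset(dataset_dir=sb_dataset_dir, task='Segmentation', usage='train')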

About Semantic Boundaries Dataset:

The Semantic Boundaries Dataset consists of 11355 color images. There are 8498 image names listed in train.txt, 2857 in val.txt and 5623 in train_noval.txt. The cls/ directory contains the category-level segmentation and boundaries results, and the inst/ directory contains the instance-level results.

You can unzip the dataset files into the following structure and read by MindSpore’s API:

.
└── benchmark_RELEASE
     ├── dataset
     ├── img
     │    ├── 2008_000002.jpg
     │    ├── 2008_000003.jpg
     │    ├── ...
     ├── cls
     │    ├── 2008_000002.mat
     │    ├── 2008_000003.mat
     │    ├── ...
     ├── inst
     │    ├── 2008_000002.mat
     │    ├── 2008_000003.mat
     │    ├── ...
     ├── train.txt
     └── val.txt
Citation:

@InProceedings{BharathICCV2011,
    author       = "Bharath Hariharan and Pablo Arbelaez and Lubomir Bourdev and
                    Subhransu Maji and Jitendra Malik",
    title        = "Semantic Contours from Inverse Detectors",
    booktitle    = "International Conference on Computer Vision (ICCV)",
    year         = "2011",
}
class tinyms.data.SBUDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

SBU(SBU Captioned Photo) dataset.

The generated dataset has two columns [image, caption] . The tensor of column image is of the uint8 type. The tensor of column caption is of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> sbu_dataset_dir = "/path/to/sbu_dataset_directory"
>>> # Read 3 samples from SBU dataset
>>> dataset = ds.SBUDataset(dataset_dir=sbu_dataset_dir, num_samples=3)

About SBU dataset:

SBU dataset is a large captioned photo collection. It contains one million images with associated visually relevant captions.

You should manually download the images using the official download.m script, replacing ‘urls{i}(24, end)’ with ‘urls{i}(24:1:end)’, and keep the directory structure as below.

.
└─ dataset_dir
   ├── SBU_captioned_photo_dataset_captions.txt
   ├── SBU_captioned_photo_dataset_urls.txt
   └── sbu_images
       ├── m_3326_3596303505_3ce4c20529.jpg
       ├── ......
       └── m_2522_4182181099_c3c23ab1cc.jpg

Citation:

@inproceedings{Ordonez:2011:im2text,
  Author    = {Vicente Ordonez and Girish Kulkarni and Tamara L. Berg},
  Title     = {Im2Text: Describing Images Using 1 Million Captioned Photographs},
  Booktitle = {Neural Information Processing Systems ({NIPS})},
  Year      = {2011},
}
class tinyms.data.SemeionDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Semeion dataset.

The generated dataset has two columns [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> semeion_dataset_dir = "/path/to/semeion_dataset_directory"
>>>
>>> # 1) Get all samples from SEMEION dataset in sequence
>>> dataset = ds.SemeionDataset(dataset_dir=semeion_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 10 samples from SEMEION dataset
>>> dataset = ds.SemeionDataset(dataset_dir=semeion_dataset_dir, num_samples=10, shuffle=True)
>>>
>>> # 3) Get samples from SEMEION dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.SemeionDataset(dataset_dir=semeion_dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In SEMEION dataset, each dictionary has keys: image, label.

About SEMEION dataset:

The dataset was created by Tactile Srl, Brescia, Italy (http://www.tattile.it) and donated in 1994 to Semeion Research Center of Sciences of Communication, Rome, Italy (http://www.semeion.it), for machine learning research.

This dataset consists of 1593 records (rows) and 256 attributes (columns). Each record represents a handwritten digit, originally scanned with a resolution of 256 grey levels. Each pixel of each original scanned image was first stretched and then scaled between 0 and 1, setting to 0 every pixel whose grey-scale value was 127 or below, and to 1 every pixel whose value was over 127. Finally, each binary image was scaled again into a 16x16 square box (the final 256 binary attributes).

.
└── semeion_dataset_dir
    ├── semeion.data
    └── semeion.names

Citation:

@article{
  title={The Theory of Independent Judges, in Substance Use & Misuse 33(2)1998, pp 439-461},
  author={M Buscema, MetaNet},
}
class tinyms.data.STL10Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

STL-10 dataset.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the int32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, ‘unlabeled’, ‘train+unlabeled’ or ‘all’. ‘train’ will read from 5,000 train samples, ‘test’ will read from 8,000 test samples, ‘unlabeled’ will read from all 100,000 unlabeled samples, and ‘train+unlabeled’ will read from 105,000 samples; ‘all’ will read all the samples. Default: None, all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir is not valid or does not exist or does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If usage is invalid.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> stl10_dataset_dir = "/path/to/stl10_dataset_directory"
>>>
>>> # 1) Get all samples from STL10 dataset in sequence
>>> dataset = ds.STL10Dataset(dataset_dir=stl10_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from STL10 dataset
>>> dataset = ds.STL10Dataset(dataset_dir=stl10_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from STL10 dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.STL10Dataset(dataset_dir=stl10_dataset_dir, num_shards=2, shard_id=0)
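
The usage values can also be combined; a sketch (same placeholder path) mixing the labeled train split with the unlabeled split, as is common in semi-supervised training:

>>> # 4) Read both the train split and the unlabeled split
>>> dataset = ds.STL10Dataset(dataset_dir=stl10_dataset_dir, usage="train+unlabeled")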

About STL10 dataset:

STL10 dataset consists of 10 classes: airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck. Images are 96x96 pixels in color, with 500 training images and 800 test images per class, plus 100,000 unlabeled images. Labels are 0-indexed, and unlabeled images have -1 as their labels.

Here is the original STL10 dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── stl10_dataset_dir
     ├── train_X.bin
     ├── train_y.bin
     ├── test_X.bin
     ├── test_y.bin
     └── unlabeled_X.bin

Citation of STL10 dataset:

@techreport{Coates10,
author       = {Adam Coates},
title        = {Learning multiple layers of features from tiny images},
year         = {2010},
howpublished = {https://cs.stanford.edu/~acoates/stl10/},
description  = {The STL-10 dataset consists of 96x96 RGB images in 10 classes,
                with 500 training images and 800 testing images per class.
                There are 5000 training images and 8000 test images.
                It also has 100000 unlabeled images for unsupervised learning.
                These examples are extracted from a similar but broader distribution of images.
                }
}
class tinyms.data.SUN397Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

SUN397(Scene UNderstanding) dataset.

The generated dataset has two columns: [image, label]. The tensor of column image is of the uint8 type. The tensor of column label is of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Whether or not to decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. When this argument is specified, num_samples reflects the maximum number of samples per shard. Default: None.

  • shard_id (int, optional) – The shard ID within num_shards . This argument can only be specified when num_shards is also specified. Default: None.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> sun397_dataset_dir = "/path/to/sun397_dataset_directory"
>>>
>>> # 1) Read all samples (image files) in sun397_dataset_dir with 8 threads
>>> dataset = ds.SUN397Dataset(dataset_dir=sun397_dataset_dir, num_parallel_workers=8)

About SUN397Dataset:

SUN397, or Scene UNderstanding (SUN), is a dataset for scene recognition consisting of 397 categories with 108,754 images. The number of images varies across categories, but there are at least 100 images per category. Images are in jpg, png, or gif format.

Here is the original SUN397 dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── sun397_dataset_directory
    ├── ClassName.txt
    ├── README.txt
    ├── a
    │   ├── abbey
    │   │   ├── sun_aaaulhwrhqgejnyt.jpg
    │   │   ├── sun_aacphuqehdodwawg.jpg
    │   │   ├── ...
    │   ├── apartment_building
    │   │   └── outdoor
    │   │       ├── sun_aamyhslnsnomjzue.jpg
    │   │       ├── sun_abbjzfrsalhqivis.jpg
    │   │       ├── ...
    │   ├── ...
    ├── b
    │   ├── badlands
    │   │   ├── sun_aabtemlmesogqbbp.jpg
    │   │   ├── sun_afbsfeexggdhzshd.jpg
    │   │   ├── ...
    │   ├── balcony
    │   │   ├── exterior
    │   │   │   ├── sun_aaxzaiuznwquburq.jpg
    │   │   │   ├── sun_baajuldidvlcyzhv.jpg
    │   │   │   ├── ...
    │   │   └── interior
    │   │       ├── sun_babkzjntjfarengi.jpg
    │   │       ├── sun_bagjvjynskmonnbv.jpg
    │   │       ├── ...
    │   └── ...
    ├── ...

Citation:

@inproceedings{xiao2010sun,
title        = {Sun database: Large-scale scene recognition from abbey to zoo},
author       = {Xiao, Jianxiong and Hays, James and Ehinger, Krista A and Oliva, Aude and Torralba, Antonio},
booktitle    = {2010 IEEE computer society conference on computer vision and pattern recognition},
pages        = {3485--3492},
year         = {2010},
organization = {IEEE}
}
class tinyms.data.SVHNDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

SVHN(Street View House Numbers) dataset.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Specify the ‘train’, ‘test’, ‘extra’ or ‘all’ parts of dataset. Default: None, will read all samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker subprocesses used to fetch the dataset in parallel. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Random accessible input is required. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument must be specified only when num_shards is also specified.

Raises:
  • RuntimeError – If dataset_dir is not valid or does not exist or does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If usage is invalid.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler. ‘sampler’ and ‘shuffle’ are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler | Parameter shuffle | Expected Order Behavior
------------------|-------------------|--------------------------
None              | None              | random order
None              | True              | random order
None              | False             | sequential order
Sampler object    | None              | order defined by sampler
Sampler object    | True              | not allowed
Sampler object    | False             | not allowed

Examples

>>> svhn_dataset_dir = "/path/to/svhn_dataset_directory"
>>> dataset = ds.SVHNDataset(dataset_dir=svhn_dataset_dir, usage="train")

About SVHN dataset:

SVHN dataset consists of 10 digit classes and is obtained from house numbers in Google Street View images.

Here is the original SVHN dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── svhn_dataset_dir
     ├── train_32x32.mat
     ├── test_32x32.mat
     └── extra_32x32.mat

Citation:

@article{
  title={Reading Digits in Natural Images with Unsupervised Feature Learning},
  author={Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng},
  conference={NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.},
  year={2011},
  publisher={NIPS},
  url={http://ufldl.stanford.edu/housenumbers}
}
class tinyms.data.USPSDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

USPS(U.S. Postal Service) dataset.

The generated dataset has two columns: [image, label] . The tensor of column image is of the uint8 type. The tensor of column label is of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. ‘train’ will read from 7,291 train samples, ‘test’ will read from 2,007 test samples, ‘all’ will read from all 9,298 samples. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir is not valid or does not exist or does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If usage is invalid.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> usps_dataset_dir = "/path/to/usps_dataset_directory"
>>>
>>> # Read 3 samples from USPS dataset
>>> dataset = ds.USPSDataset(dataset_dir=usps_dataset_dir, num_samples=3)
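
The shuffle parameter also accepts the Shuffle enum directly; a minimal sketch shuffling files only, which keeps the sample order inside each file:

>>> # Shuffle files only, instead of the default global shuffle
>>> dataset = ds.USPSDataset(dataset_dir=usps_dataset_dir, usage="train",
...                          shuffle=ds.Shuffle.FILES)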

About USPS dataset:

USPS is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9,298 16×16 pixel grayscale samples. The images are centered, normalized and show a broad range of font styles.

Here is the original USPS dataset structure. You can download and unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── usps_dataset_dir
     ├── usps
     └── usps.t

Citation:

@article{hull1994database,
  title={A database for handwritten text recognition research},
  author={Hull, Jonathan J.},
  journal={IEEE Transactions on pattern analysis and machine intelligence},
  volume={16},
  number={5},
  pages={550--554},
  year={1994},
  publisher={IEEE}
}
class tinyms.data.VOCDataset(dataset_dir, task='Segmentation', usage='train', class_indexing=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None, extra_metadata=False, decrypt=None)[source]

VOC(Visual Object Classes) dataset.

The generated dataset with different task setting has different output columns:

  • task = Detection , output columns: [image, dtype=uint8] , [bbox, dtype=float32] , [label, dtype=uint32] , [difficult, dtype=uint32] , [truncate, dtype=uint32] .

  • task = Segmentation , output columns: [image, dtype=uint8] , [target,dtype=uint8] .

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • task (str, optional) – Set the task type of reading voc data, now only support ‘Segmentation’ or ‘Detection’. Default: ‘Segmentation’.

  • usage (str, optional) – Set the usage of ImageSets. Default: ‘train’. If task is ‘Segmentation’, the image and annotation list will be loaded from ./ImageSets/Segmentation/usage + “.txt”; if task is ‘Detection’, the image and annotation list will be loaded from ./ImageSets/Main/usage + “.txt”; if task and usage are not set, the image and annotation list will be loaded from ./ImageSets/Segmentation/train.txt by default.

  • class_indexing (dict, optional) – A str-to-int mapping from label name to index, only valid in ‘Detection’ task. Default: None, the folder names will be sorted alphabetically and each class will be given a unique index starting from 0.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

  • extra_metadata (bool, optional) – Flag to add extra meta-data to row. If True, an additional column named [_meta-filename, dtype=string] will be output at the end. Default: False.

  • decrypt (callable, optional) – Image decryption function, which accepts the path of the encrypted image file and returns the decrypted bytes data. Default: None, no decryption.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If the XML file of Annotations has an invalid format.

  • RuntimeError – If the XML file of Annotations is missing the object attribute.

  • RuntimeError – If the XML file of Annotations is missing the bndbox attribute.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If task is not equal ‘Segmentation’ or ‘Detection’.

  • ValueError – If task equal ‘Segmentation’ but class_indexing is not None.

  • ValueError – If the txt file related to usage does not exist.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • Column ‘[_meta-filename, dtype=string]’ won’t be output unless an explicit rename dataset op is added to remove the prefix (‘_meta-‘); see the sketch after the examples below.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Examples

>>> voc_dataset_dir = "/path/to/voc_dataset_directory"
>>>
>>> # 1) Read VOC data for segmentation training
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Segmentation", usage="train")
>>>
>>> # 2) Read VOC data for detection training
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train")
>>>
>>> # 3) Read all VOC dataset samples in voc_dataset_dir with 8 threads in random order
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train",
...                         num_parallel_workers=8)
>>>
>>> # 4) Read then decode all VOC dataset samples in voc_dataset_dir in sequence
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train",
...                         decode=True, shuffle=False)
>>>
>>> # In VOC dataset, if task='Segmentation', each dictionary has keys "image" and "target"
>>> # In VOC dataset, if task='Detection', each dictionary has keys "image" and "annotation"
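
A minimal sketch of the extra_metadata note above, assuming the standard rename operation is available on the constructed pipeline:

>>> # Hedged sketch: expose the '_meta-filename' column added by extra_metadata=True.
>>> # Renaming strips the '_meta-' prefix so the column appears in the output rows.
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection", usage="train",
...                         extra_metadata=True)
>>> dataset = dataset.rename(input_columns=["_meta-filename"], output_columns=["filename"])
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(row["filename"])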

About VOC dataset:

The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures.

You can unzip the original VOC-2012 dataset files into this directory structure and read by MindSpore’s API.

.
└── voc2012_dataset_dir
    ├── Annotations
    │    ├── 2007_000027.xml
    │    ├── 2007_000032.xml
    │    ├── ...
    ├── ImageSets
    │    ├── Action
    │    ├── Layout
    │    ├── Main
    │    └── Segmentation
    ├── JPEGImages
    │    ├── 2007_000027.jpg
    │    ├── 2007_000032.jpg
    │    ├── ...
    ├── SegmentationClass
    │    ├── 2007_000032.png
    │    ├── 2007_000033.png
    │    ├── ...
    └── SegmentationObject
         ├── 2007_000032.png
         ├── 2007_000033.png
         ├── ...

Citation:

@article{Everingham10,
author       = {Everingham, M. and Van~Gool, L. and Williams, C. K. I. and Winn, J. and Zisserman, A.},
title        = {The Pascal Visual Object Classes (VOC) Challenge},
journal      = {International Journal of Computer Vision},
volume       = {88},
year         = {2010},
number       = {2},
month        = {jun},
pages        = {303--338},
biburl       = {http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.html#bibtex},
howpublished = {http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html}
}
get_class_indexing()[source]

Get the class index.

Returns:

dict, a str-to-int mapping from label name to index.

Examples

>>> voc_dataset_dir = "/path/to/voc_dataset_directory"
>>>
>>> dataset = ds.VOCDataset(dataset_dir=voc_dataset_dir, task="Detection")
>>> class_indexing = dataset.get_class_indexing()
class tinyms.data.WIDERFaceDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, decode=False, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

WIDERFace dataset.

When usage is “train”, “valid” or “all”, the generated dataset has eight columns [“image”, “bbox”, “blur”, “expression”, “illumination”, “occlusion”, “pose”, “invalid”]. The data type of the image column is uint8, and all other columns are uint32. When usage is “test”, it only has one column [“image”], with uint8 data type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, ‘valid’ or ‘all’. ‘train’ will read from 12,880 samples, ‘test’ will read from 16,097 samples, ‘valid’ will read from 3,226 validation samples and ‘all’ will read all ‘train’ and ‘valid’ samples. Default: None, will be set to ‘all’.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all images.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • decode (bool, optional) – Decode the images after reading. Default: False.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If usage is not in [‘train’, ‘test’, ‘valid’, ‘all’].

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If annotation_file does not exist.

  • ValueError – If dataset_dir does not exist.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Examples

>>> wider_face_dir = "/path/to/wider_face_dataset"
>>>
>>> # Read 3 samples from WIDERFace dataset
>>> dataset = ds.WIDERFaceDataset(dataset_dir=wider_face_dir, num_samples=3)
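
A minimal sketch of consuming the annotation columns listed above, assuming the standard create_dict_iterator interface:

>>> # Hedged sketch: iterate a few training samples and inspect their columns.
>>> dataset = ds.WIDERFaceDataset(dataset_dir=wider_face_dir, usage="train", num_samples=3)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(row["image"].shape, row["bbox"].shape)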

About WIDERFace dataset:

The WIDERFace database has a training set of 12,880 samples, a testing set of 16,097 samples and a validation set of 3,226 samples. Its images are selected from the publicly available WIDER dataset and exhibit a high degree of variability in scale, pose and occlusion.

The following is the original WIDERFace dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── wider_face_dir
     ├── WIDER_test
     │    └── images
     │         ├── 0--Parade
     │         │     ├── 0_Parade_marchingband_1_9.jpg
     │         │     ├── ...
     │         ├──1--Handshaking
     │         ├──...
     ├── WIDER_train
     │    └── images
     │         ├── 0--Parade
     │         │     ├── 0_Parade_marchingband_1_11.jpg
     │         │     ├── ...
     │         ├──1--Handshaking
     │         ├──...
     ├── WIDER_val
     │    └── images
     │         ├── 0--Parade
     │         │     ├── 0_Parade_marchingband_1_102.jpg
     │         │     ├── ...
     │         ├──1--Handshaking
     │         ├──...
     └── wider_face_split
          ├── wider_face_test_filelist.txt
          ├── wider_face_train_bbx_gt.txt
          └── wider_face_val_bbx_gt.txt

Citation:

@inproceedings{2016WIDER,
  title={WIDER FACE: A Face Detection Benchmark},
  author={Yang, S. and Luo, P. and Loy, C. C. and Tang, X.},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={5525-5533},
  year={2016},
}
class tinyms.data.AGNewsDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

AG News dataset.

The generated dataset has three columns: [index, title, description], and the data type of all three columns is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘test’ and ‘all’. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads the full dataset.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . This argument can only be specified when num_shards is also specified. Default: None.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> ag_news_dataset_dir = "/path/to/ag_news_dataset_file"
>>> dataset = ds.AGNewsDataset(dataset_dir=ag_news_dataset_dir, usage='all')
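
A minimal sketch of reading a few rows, assuming the standard create_dict_iterator interface; all three columns come back as strings:

>>> # Hedged sketch: print the index and title of the first two rows.
>>> dataset = ds.AGNewsDataset(dataset_dir=ag_news_dataset_dir, usage='train', num_samples=2)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(row["index"], row["title"])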

About AGNews dataset:

AG is a collection of over 1 million news articles. The news articles were collected by ComeToMyHead from over 2,000 news sources in over 1 year of activity. ComeToMyHead is an academic news search engine that has been in operation since July 2004. The dataset is provided by academics for research purposes such as data mining (clustering, classification, etc.), information retrieval (ranking, searching, etc.), xml, data compression, data streaming, and any other non-commercial activities. AG’s news topic classification dataset was constructed by selecting the four largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 test samples. The total number of training samples in train.csv is 120,000 and the number of test samples in test.csv is 7,600.

You can unzip the dataset files into the following structure and read by MindSpore’s API:

.
└── ag_news_dataset_dir
    ├── classes.txt
    ├── train.csv
    ├── test.csv
    └── readme.txt

Citation:

@misc{zhang2015characterlevel,
title={Character-level Convolutional Networks for Text Classification},
author={Xiang Zhang and Junbo Zhao and Yann LeCun},
year={2015},
eprint={1509.01626},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
class tinyms.data.AmazonReviewDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

Amazon Review Polarity and Amazon Review Full datasets.

The generated dataset has three columns: [label, title, content], and the data type of all three columns is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the Amazon Review Polarity dataset or the Amazon Review Full dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. For Polarity dataset, ‘train’ will read from 3,600,000 train samples, ‘test’ will read from 400,000 test samples, ‘all’ will read from all 4,000,000 samples. For Full dataset, ‘train’ will read from 3,000,000 train samples, ‘test’ will read from 650,000 test samples, ‘all’ will read from all 3,650,000 samples. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to be read. Default: None, reads the full dataset.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> amazon_review_dataset_dir = "/path/to/amazon_review_dataset_dir"
>>> dataset = ds.AmazonReviewDataset(dataset_dir=amazon_review_dataset_dir, usage='all')
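
A minimal sketch of the num_shards / shard_id parameters described above, for one hypothetical worker of a 4-device data-parallel job (this setup is an assumption, not part of the original examples):

>>> # Hedged sketch: read only shard 0 of 4 non-overlapping shards.
>>> shard0 = ds.AmazonReviewDataset(dataset_dir=amazon_review_dataset_dir, usage='train',
...                                 num_shards=4, shard_id=0)
>>> print(shard0.get_dataset_size())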

About AmazonReview Dataset:

The Amazon reviews full dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. The dataset is mainly used for text classification: given the content and title, predict the correct star rating.

The Amazon reviews polarity dataset is constructed by taking review scores 1 and 2 as negative, and 4 and 5 as positive. Samples with score 3 are ignored.

The Amazon Reviews Polarity and Amazon Reviews Full datasets have the same directory structures. You can unzip the dataset files into the following structure and read by MindSpore’s API:

.
└── amazon_review_dir
     ├── train.csv
     ├── test.csv
     └── readme.txt

Citation:

@article{zhang2015character,
  title={Character-level convolutional networks for text classification},
  author={Zhang, Xiang and Zhao, Junbo and LeCun, Yann},
  journal={Advances in neural information processing systems},
  volume={28},
  pages={649--657},
  year={2015}
}
class tinyms.data.CLUEDataset(dataset_files, task='AFQMC', usage='train', num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

CLUE(Chinese Language Understanding Evaluation) dataset. Supported CLUE classification tasks: ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ and ‘CSL’.

Parameters:
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • task (str, optional) – The kind of task, one of ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ and ‘CSL’. Default: ‘AFQMC’.

  • usage (str, optional) – Specify the ‘train’, ‘test’ or ‘eval’ part of dataset. Default: ‘train’.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Default: Shuffle.GLOBAL. Bool type and Shuffle enum are both supported to pass in. If shuffle is False, no shuffling will be performed. If shuffle is True, performs global shuffle. There are two levels of shuffling, with the desired shuffle enum defined by mindspore.dataset.Shuffle.

    • Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

The generated dataset with different task settings has different output columns:

 task      usage   Output columns
 AFQMC     train   [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string]
 AFQMC     test    [id, dtype=uint32], [sentence1, dtype=string], [sentence2, dtype=string]
 AFQMC     eval    [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string]
 TNEWS     train   [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string], [keywords, dtype=string]
 TNEWS     test    [label, dtype=uint32], [keywords, dtype=string], [sentence, dtype=string]
 TNEWS     eval    [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string], [keywords, dtype=string]
 IFLYTEK   train   [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string]
 IFLYTEK   test    [id, dtype=uint32], [sentence, dtype=string]
 IFLYTEK   eval    [label, dtype=string], [label_des, dtype=string], [sentence, dtype=string]
 CMNLI     train   [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string]
 CMNLI     test    [id, dtype=uint32], [sentence1, dtype=string], [sentence2, dtype=string]
 CMNLI     eval    [sentence1, dtype=string], [sentence2, dtype=string], [label, dtype=string]
 WSC       train   [span1_index, dtype=uint32], [span2_index, dtype=uint32], [span1_text, dtype=string], [span2_text, dtype=string], [idx, dtype=uint32], [text, dtype=string], [label, dtype=string]
 WSC       test    [span1_index, dtype=uint32], [span2_index, dtype=uint32], [span1_text, dtype=string], [span2_text, dtype=string], [idx, dtype=uint32], [text, dtype=string]
 WSC       eval    [span1_index, dtype=uint32], [span2_index, dtype=uint32], [span1_text, dtype=string], [span2_text, dtype=string], [idx, dtype=uint32], [text, dtype=string], [label, dtype=string]
 CSL       train   [id, dtype=uint32], [abst, dtype=string], [keyword, dtype=string], [label, dtype=string]
 CSL       test    [id, dtype=uint32], [abst, dtype=string], [keyword, dtype=string]
 CSL       eval    [id, dtype=uint32], [abst, dtype=string], [keyword, dtype=string], [label, dtype=string]

Raises:
  • ValueError – If dataset_files are not valid or do not exist.

  • ValueError – If task is not in ‘AFQMC’, ‘TNEWS’, ‘IFLYTEK’, ‘CMNLI’, ‘WSC’ or ‘CSL’.

  • ValueError – If usage is not in ‘train’, ‘test’ or ‘eval’.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

Examples

>>> clue_dataset_dir = ["/path/to/clue_dataset_file"] # contains 1 or multiple clue files
>>> dataset = ds.CLUEDataset(dataset_files=clue_dataset_dir, task='AFQMC', usage='train')
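
Since the output columns depend on task and usage (see the table above), here is a minimal sketch, assuming the standard get_col_names helper, for checking what the constructed pipeline produces:

>>> # Hedged sketch: inspect the columns generated for the chosen task/usage.
>>> dataset = ds.CLUEDataset(dataset_files=clue_dataset_dir, task='AFQMC', usage='train')
>>> print(dataset.get_col_names())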

About CLUE dataset:

CLUE is a Chinese Language Understanding Evaluation benchmark. It contains multiple tasks, including single-sentence classification, sentence pair classification, and machine reading comprehension.

You can unzip the dataset files into the following structure and read by MindSpore’s API, such as afqmc dataset:

.
└── afqmc_public
     ├── train.json
     ├── test.json
     └── dev.json

Citation:

@article{CLUEbenchmark,
title   = {CLUE: A Chinese Language Understanding Evaluation Benchmark},
author  = {Liang Xu, Xuanwei Zhang, Lu Li, Hai Hu, Chenjie Cao, Weitang Liu, Junyi Li, Yudong Li,
        Kai Sun, Yechen Xu, Yiming Cui, Cong Yu, Qianqian Dong, Yin Tian, Dian Yu, Bo Shi, Jun Zeng,
        Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou,
        Shaoweihua Liu, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Zhenzhong Lan},
journal = {arXiv preprint arXiv:2004.05986},
year    = {2020},
howpublished = {https://github.com/CLUEbenchmark/CLUE}
}
class tinyms.data.CoNLL2000Dataset(dataset_dir, usage=None, num_samples=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]

CoNLL-2000(Conference on Computational Natural Language Learning) chunking dataset.

The generated dataset has three columns: [word, pos_tag, chunk_tag] . The tensors of column word , column pos_tag , and column chunk_tag are of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the CoNLL2000 chunking dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. ‘train’ will read from 8,936 train samples, ‘test’ will read from 2,012 test samples, ‘all’ will read from all 10,948 samples. Default: None, read all samples.

  • num_samples (int, optional) – Number of samples (rows) to be read. Default: None, read the full dataset.

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Default: mindspore.dataset.Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, performs global shuffle. There are two levels of shuffling, with the desired shuffle enum defined by mindspore.dataset.Shuffle.

    • Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. When this argument is specified, num_samples reflects the maximum number of samples per shard. Default: None.

  • shard_id (int, optional) – The shard ID within num_shards . This argument can only be specified when num_shards is also specified. Default: None.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

Examples

>>> conll2000_dataset_dir = "/path/to/conll2000_dataset_dir"
>>> dataset = ds.CoNLL2000Dataset(dataset_dir=conll2000_dataset_dir, usage='all')
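
A minimal sketch of reading the three parallel columns, assuming the standard create_dict_iterator interface:

>>> # Hedged sketch: each row holds aligned word / pos_tag / chunk_tag sequences.
>>> dataset = ds.CoNLL2000Dataset(dataset_dir=conll2000_dataset_dir, usage='test', num_samples=1)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(row["word"], row["pos_tag"], row["chunk_tag"])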

About CoNLL2000 Dataset:

The CoNLL2000 chunking dataset consists of the text from sections 15-20 of the Wall Street Journal corpus. Texts are chunked using IOB notation, and the chunk types include NP, VP, PP, ADJP and ADVP. The dataset consists of three columns separated by spaces. The first column contains the current word, the second is the part-of-speech tag as derived by the Brill tagger and the third is the chunk tag as derived from the WSJ corpus. Text chunking consists of dividing a text into syntactically correlated groups of words.

You can unzip the dataset files into the following structure and read by MindSpore’s API:

.
└── conll2000_dataset_dir
     ├── train.txt
     ├── test.txt
     └── readme.txt

Citation:

@inproceedings{tksbuchholz2000conll,
author     = {Tjong Kim Sang, Erik F. and Sabine Buchholz},
title      = {Introduction to the CoNLL-2000 Shared Task: Chunking},
editor     = {Claire Cardie and Walter Daelemans and Claire Nedellec and Tjong Kim Sang, Erik},
booktitle  = {Proceedings of CoNLL-2000 and LLL-2000},
publisher  = {Lisbon, Portugal},
pages      = {127--132},
year       = {2000}
}
class tinyms.data.DBpediaDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

DBpedia dataset.

The generated dataset has three columns: [class, title, content], and the data type of all three columns is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. ‘train’ will read from 560,000 train samples, ‘test’ will read from 70,000 test samples, ‘all’ will read from all 630,000 samples. Default: None, all samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all text.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> dbpedia_dataset_dir = "/path/to/dbpedia_dataset_directory"
>>>
>>> # 1) Read 3 samples from DBpedia dataset
>>> dataset = ds.DBpediaDataset(dataset_dir=dbpedia_dataset_dir, num_samples=3)
>>>
>>> # 2) Read train samples from DBpedia dataset
>>> dataset = ds.DBpediaDataset(dataset_dir=dbpedia_dataset_dir, usage="train")
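
A minimal sketch of passing the Shuffle enum instead of a bool, as described for the shuffle parameter above:

>>> # Hedged sketch: shuffle the files only, keeping samples within a file in order.
>>> dataset = ds.DBpediaDataset(dataset_dir=dbpedia_dataset_dir, usage="train",
...                             shuffle=ds.Shuffle.FILES)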

About DBpedia dataset:

The DBpedia dataset consists of 630,000 text samples in 14 classes: there are 560,000 samples in train.csv and 70,000 samples in test.csv. The 14 classes are Company, EducationalInstitution, Artist, Athlete, OfficeHolder, MeanOfTransportation, Building, NaturalPlace, Village, Animal, Plant, Album, Film and WrittenWork.

Here is the original DBpedia dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── dbpedia_dataset_dir
    ├── train.csv
    ├── test.csv
    ├── classes.txt
    └── readme.txt

Citation:

@article{DBpedia,
title   = {DBPedia Ontology Classification Dataset},
author  = {Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas,
        Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef,
            Sören Auer, Christian Bizer},
year    = {2015},
howpublished = {http://dbpedia.org}
}
class tinyms.data.EnWik9Dataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=True, num_shards=None, shard_id=None, cache=None)[source]

EnWik9 dataset.

The generated dataset has one column [text] with type string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: True. If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> en_wik9_dataset_dir = "/path/to/en_wik9_dataset"
>>> dataset2 = ds.EnWik9Dataset(dataset_dir=en_wik9_dataset_dir, num_samples=2,
...                             shuffle=True)
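
A minimal sketch of a sequential read, assuming the standard create_dict_iterator interface, to inspect the raw text column:

>>> # Hedged sketch: disable shuffling to read the file in its original order.
>>> dataset = ds.EnWik9Dataset(dataset_dir=en_wik9_dataset_dir, num_samples=1, shuffle=False)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(row["text"])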

About EnWik9 dataset:

The data of EnWik9 is UTF-8 encoded XML consisting primarily of English text. It contains 243,426 article titles, of which 85,560 are #REDIRECT to fix broken links, and the rest are regular articles.

The data is UTF-8 clean. All characters are in the range U+0000 to U+10FFFF with valid encodings of 1 to 4 bytes. The byte values 0xC0, 0xC1, and 0xF5-0xFF never occur. Also, in the Wikipedia dumps, there are no control characters in the range 0x00-0x1F except for 0x09 (tab) and 0x0A (linefeed). Line breaks occur only on paragraph boundaries, so they always have a semantic purpose.

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── EnWik9
     ├── enwik9

Citation:

@NetworkResource{Hutter_prize,
author    = {English Wikipedia},
url       = "https://cs.fit.edu/~mmahoney/compression/textdata.html",
month     = {March},
year      = {2006}
}
class tinyms.data.IMDBDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

IMDb(Internet Movie Database) dataset.

The generated dataset has two columns: [text, label] . The tensor of column text is of the string type. The column label is a scalar of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. Default: None, will read all samples.

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will include all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Examples

>>> imdb_dataset_dir = "/path/to/imdb_dataset_directory"
>>>
>>> # 1) Read all samples (text files) in imdb_dataset_dir with 8 threads
>>> dataset = ds.IMDBDataset(dataset_dir=imdb_dataset_dir, num_parallel_workers=8)
>>>
>>> # 2) Read train samples (text files).
>>> dataset = ds.IMDBDataset(dataset_dir=imdb_dataset_dir, usage="train")
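
A minimal sketch of reading the [text, label] pair, assuming the standard create_dict_iterator interface; per the description below, 0 stands for negative and 1 for positive:

>>> # Hedged sketch: print each label with a short prefix of its review text.
>>> dataset = ds.IMDBDataset(dataset_dir=imdb_dataset_dir, usage="test", num_samples=2)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(row["label"], str(row["text"])[:50])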

About IMDB dataset:

The IMDB dataset contains 50,000 highly polarized reviews from the Internet Movie Database (IMDB). The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing, with both the training set and test set containing 50% positive and 50% negative reviews. Train labels and test labels are all lists of 0 and 1, where 0 stands for negative and 1 for positive.

You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── imdb_dataset_directory
     ├── train
     │    ├── pos
     │    │    ├── 0_9.txt
     │    │    ├── 1_7.txt
     │    │    ├── ...
     │    ├── neg
     │    │    ├── 0_3.txt
     │    │    ├── 1_1.txt
     │    │    ├── ...
     ├── test
     │    ├── pos
     │    │    ├── 0_10.txt
     │    │    ├── 1_10.txt
     │    │    ├── ...
     │    ├── neg
     │    │    ├── 0_2.txt
     │    │    ├── 1_3.txt
     │    │    ├── ...

Citation:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan
                and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
                Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
class tinyms.data.IWSLT2016Dataset(dataset_dir, usage=None, language_pair=None, valid_set=None, test_set=None, num_samples=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]

IWSLT2016(International Workshop on Spoken Language Translation) dataset.

The generated dataset has two columns: [text, translation] . The tensor of column text is of the string type. The tensor of column translation is of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘valid’, ‘test’ and ‘all’. Default: None, all samples.

  • language_pair (sequence, optional) – Sequence containing source and target language, supported values are (‘en’, ‘fr’), (‘en’, ‘de’), (‘en’, ‘cs’), (‘en’, ‘ar’), (‘fr’, ‘en’), (‘de’, ‘en’), (‘cs’, ‘en’), (‘ar’, ‘en’). Default: (‘de’, ‘en’).

  • valid_set (str, optional) – A string to identify validation set, when usage is valid or all, the validation set of valid_set type will be read, supported values are ‘dev2010’, ‘tst2010’, ‘tst2011’, ‘tst2012’, ‘tst2013’ and ‘tst2014’. Default: ‘tst2013’.

  • test_set (str, optional) – A string to identify test set, when usage is test or all, the test set of test_set type will be read, supported values are ‘dev2010’, ‘tst2010’, ‘tst2011’, ‘tst2012’, ‘tst2013’ and ‘tst2014’. Default: ‘tst2014’.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads the full dataset.

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> iwslt2016_dataset_dir = "/path/to/iwslt2016_dataset_dir"
>>> dataset = ds.IWSLT2016Dataset(dataset_dir=iwslt2016_dataset_dir, usage='all',
...                               language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')

About IWSLT2016 dataset:

IWSLT is an international conference on spoken language translation, a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign constitutes a dataset, which is publicly available through the WIT3 website. The IWSLT2016 dataset includes translations from English to Arabic, Czech, French, and German, and translations from Arabic, Czech, French, and German to English.

You can unzip the original IWSLT2016 dataset files into this directory structure and read by MindSpore’s API. After decompressing the archive, you also need to decompress the inner dataset package you want to read. For example, to read the de-en dataset, unzip the tgz file under the de/en directory; the dataset is in the unzipped folder.

.
└── iwslt2016_dataset_directory
     ├── subeval_files
     └── texts
          ├── ar
          │    └── en
          │        └── ar-en
          ├── cs
          │    └── en
          │        └── cs-en
          ├── de
          │    └── en
          │        └── de-en
          │            ├── IWSLT16.TED.dev2010.de-en.de.xml
          │            ├── train.tags.de-en.de
          │            ├── ...
          ├── en
          │    ├── ar
          │    │   └── en-ar
          │    ├── cs
          │    │   └── en-cs
          │    ├── de
          │    │   └── en-de
          │    └── fr
          │        └── en-fr
          └── fr
               └── en
                   └── fr-en

Citation:

@inproceedings{cettoloEtAl:EAMT2012,
Address = {Trento, Italy},
Author = {Mauro Cettolo and Christian Girardi and Marcello Federico},
Booktitle = {Proceedings of the 16$^{th}$ Conference of the European Association for Machine Translation
             (EAMT)},
Date = {28-30},
Month = {May},
Pages = {261--268},
Title = {WIT$^3$: Web Inventory of Transcribed and Translated Talks},
Year = {2012}}
class tinyms.data.IWSLT2017Dataset(dataset_dir, usage=None, language_pair=None, num_samples=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]

IWSLT2017(International Workshop on Spoken Language Translation) dataset.

The generated dataset has two columns: [text, translation] . The tensors of columns text and translation are of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘valid’, ‘test’ and ‘all’. Default: None, all samples.

  • language_pair (sequence, optional) – List containing src and tgt language, supported values are (‘en’, ‘nl’), (‘en’, ‘de’), (‘en’, ‘it’), (‘en’, ‘ro’), (‘nl’, ‘en’), (‘nl’, ‘de’), (‘nl’, ‘it’), (‘nl’, ‘ro’), (‘de’, ‘en’), (‘de’, ‘nl’), (‘de’, ‘it’), (‘de’, ‘ro’), (‘it’, ‘en’), (‘it’, ‘nl’), (‘it’, ‘de’), (‘it’, ‘ro’), (‘ro’, ‘en’), (‘ro’, ‘nl’), (‘ro’, ‘de’), (‘ro’, ‘it’). Default: (‘de’, ‘en’).

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads the full dataset.

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> iwslt2017_dataset_dir = "/path/to/iwslt2017_dataset_dir"
>>> dataset = ds.IWSLT2017Dataset(dataset_dir=iwslt2017_dataset_dir, usage='all', language_pair=('de', 'en'))

About IWSLT2017 dataset:

IWSLT is an international conference on spoken language translation, a major annual scientific conference dedicated to all aspects of spoken language translation. The MT task of the IWSLT evaluation campaign constitutes a dataset, which is publicly available through the WIT3 website. The IWSLT2017 dataset involves German, English, Italian, Dutch, and Romanian, and includes translations between any two of these languages.

You can unzip the original IWSLT2017 dataset files into this directory structure and read by MindSpore’s API. You need to decompress the dataset package in the texts/DeEnItNlRo/DeEnItNlRo directory to get the DeEnItNlRo-DeEnItNlRo subdirectory.

.
└── iwslt2017_dataset_directory
    └── DeEnItNlRo
        └── DeEnItNlRo
            └── DeEnItNlRo-DeEnItNlRo
                ├── IWSLT17.TED.dev2010.de-en.de.xml
                ├── train.tags.de-en.de
                ├── ...

Citation:

@inproceedings{cettoloEtAl:EAMT2012,
Address = {Trento, Italy},
Author = {Mauro Cettolo and Christian Girardi and Marcello Federico},
Booktitle = {Proceedings of the 16$^{th}$ Conference of the European Association for Machine Translation
             (EAMT)},
Date = {28-30},
Month = {May},
Pages = {261--268},
Title = {WIT$^3$: Web Inventory of Transcribed and Translated Talks},
Year = {2012}}
class tinyms.data.Multi30kDataset(dataset_dir, usage=None, language_pair=None, num_samples=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, cache=None)[source]

Multi30k dataset.

The generated dataset has two columns [text, translation] . The tensor of column text is of the string type. The tensor of column translation is of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘test’, ‘valid’ or ‘all’. Default: None, will read all samples.

  • language_pair (Sequence[str, str], optional) – Acceptable language_pair include [‘en’, ‘de’], [‘de’, ‘en’]. Default: None, means [‘en’, ‘de’].

  • num_samples (int, optional) – The number of images to be included in the dataset. Default: None, will read all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Whether to shuffle the dataset. Default: None, means Shuffle.GLOBAL. If False is provided, no shuffling will be performed. If True is provided, it is the same as setting to mindspore.dataset.Shuffle.GLOBAL. If Shuffle is provided, the effect is as follows:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If usage is not ‘train’, ‘test’, ‘valid’ or ‘all’.

  • TypeError – If language_pair is not of type Sequence[str, str].

  • RuntimeError – If num_samples is less than 0.

  • RuntimeError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> multi30k_dataset_dir = "/path/to/multi30k_dataset_directory"
>>> data = ds.Multi30kDataset(dataset_dir=multi30k_dataset_dir, usage='all', language_pair=['de', 'en'])
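
A minimal sketch of iterating source/target pairs, assuming the standard create_tuple_iterator interface:

>>> # Hedged sketch: columns arrive in (text, translation) order.
>>> data = ds.Multi30kDataset(dataset_dir=multi30k_dataset_dir, usage='test',
...                           language_pair=['de', 'en'], num_samples=2)
>>> for text, translation in data.create_tuple_iterator(num_epochs=1, output_numpy=True):
...     print(text, translation)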

About Multi30k dataset:

Multi30K is a multilingual dataset that features approximately 31,000 standardized images described in multiple languages. The images are sourced from Flickr and each image comes with sentence descriptions in both English and German, as well as descriptions in other languages. Multi30K is used primarily for training and testing in tasks such as image captioning, machine translation, and visual question answering.

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── multi30k_dataset_directory
      ├── training
      │    ├── train.de
      │    └── train.en
      ├── validation
      │    ├── val.de
      │    └── val.en
      └── mmt16_task1_test
           ├── val.de
           └── val.en

Citation:

@article{elliott-EtAl:2016:VL16,
author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
title     = {Multi30K: Multilingual English-German Image Descriptions},
booktitle = {Proceedings of the 5th Workshop on Vision and Language},
year      = {2016},
pages     = {70--74}
}
class tinyms.data.PennTreebankDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

PennTreebank dataset.

The generated dataset has one column [text] . The tensor of column text is of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘test’, ‘valid’ and ‘all’. ‘train’ will read from 42,068 train samples of string type, ‘test’ will read from 3,370 test samples of string type, ‘valid’ will read from 3,761 validation samples of string type, ‘all’ will read from all 49,199 samples of string type. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads the full dataset.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> penn_treebank_dataset_dir = "/path/to/penn_treebank_dataset_directory"
>>> dataset = ds.PennTreebankDataset(dataset_dir=penn_treebank_dataset_dir, usage='all')

About PennTreebank dataset:

The Penn Treebank (PTB) dataset is widely used in machine learning research for NLP (Natural Language Processing). Word-level PTB does not contain capital letters, numbers, or punctuation, and the vocabulary is capped at 10k unique words, which is relatively small compared to most modern datasets and can result in a larger number of out-of-vocabulary tokens.

Here is the original PennTreebank dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── PennTreebank_dataset_dir
     ├── ptb.test.txt
     ├── ptb.train.txt
     └── ptb.valid.txt

Citation:

@techreport{Santorini1990,
  added-at = {2014-03-26T23:25:56.000+0100},
  author = {Santorini, Beatrice},
  biburl = {https://www.bibsonomy.org/bibtex/234cdf6ddadd89376090e7dada2fc18ec/butonic},
  file = {:Santorini - Penn Treebank tag definitions.pdf:PDF},
  institution = {Department of Computer and Information Science, University of Pennsylvania},
  interhash = {818e72efd9e4b5fae3e51e88848100a0},
  intrahash = {34cdf6ddadd89376090e7dada2fc18ec},
  keywords = {dis pos tagging treebank},
  number = {MS-CIS-90-47},
  timestamp = {2014-03-26T23:25:56.000+0100},
  title = {Part-of-speech tagging guidelines for the {P}enn {T}reebank {P}roject},
  url = {ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz},
  year = 1990
}
class tinyms.data.SogouNewsDataset(dataset_dir, usage=None, num_samples=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]

Sogou News dataset.

The generated dataset has three columns: [index, title, content] , and the data type of three columns is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’ . ‘train’ will read from 450,000 train samples, ‘test’ will read from 60,000 test samples, ‘all’ will read from all 510,000 samples. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, read all samples.

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> sogou_news_dataset_dir = "/path/to/sogou_news_dataset_dir"
>>> dataset = ds.SogouNewsDataset(dataset_dir=sogou_news_dataset_dir, usage='all')
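
A minimal sketch of batching the string rows before further text processing, assuming the standard batch operation:

>>> # Hedged sketch: 8 rows batched by 4 yields 2 batches.
>>> dataset = ds.SogouNewsDataset(dataset_dir=sogou_news_dataset_dir, usage='test', num_samples=8)
>>> dataset = dataset.batch(batch_size=4)
>>> print(dataset.get_dataset_size())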

About SogouNews Dataset:

SogouNews dataset includes 3 columns, corresponding to class index (1 to 5), title and content. The title and content are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is, "\n".

You can unzip the dataset files into the following structure and read by MindSpore’s API:

.
└── sogou_news_dir
     ├── classes.txt
     ├── readme.txt
     ├── test.csv
     └── train.csv

Citation:

@misc{zhang2015characterlevel,
    title={Character-level Convolutional Networks for Text Classification},
    author={Xiang Zhang and Junbo Zhao and Yann LeCun},
    year={2015},
    eprint={1509.01626},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
class tinyms.data.SQuADDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

SQuAD 1.1 and SQuAD 2.0 datasets.

The generated dataset with different versions and usages has the same output columns: [context, question, text, answer_start] . The tensors of columns context and question are of the string type. The tensor of column text is the answer span taken from context, of the string type. The tensor of column answer_start is the start index of the answer in context, of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Specify the ‘train’, ‘dev’ or ‘all’ part of dataset. Default: None, all samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Whether to shuffle the dataset. Default: Shuffle.GLOBAL. If False is provided, no shuffling will be performed. If True is provided, it is the same as setting to mindspore.dataset.Shuffle.GLOBAL. If Shuffle is provided, the effect is as follows:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> squad_dataset_dir = "/path/to/squad_dataset_file"
>>> dataset = ds.SQuADDataset(dataset_dir=squad_dataset_dir, usage='all')
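Because the output columns are fixed across versions, the answer span can be recovered directly from each row. A minimal sketch using the generic dict iterator, under the assumption that the placeholder path above points at a real SQuAD directory:

>>> dataset = ds.SQuADDataset(dataset_dir=squad_dataset_dir, usage='dev', shuffle=False)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     # column "text" is the answer span taken from "context" starting at "answer_start"
...     context, answer, start = row["context"], row["text"], int(row["answer_start"])
...     break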

About SQuAD dataset:

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD 1.1, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on 500+ articles. SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

You can organize the dataset files into the following structure and read by MindSpore’s API:

For SQuAD 1.1:

.
└── SQuAD1
     ├── train-v1.1.json
     └── dev-v1.1.json

For SQuAD 2.0:

.
└── SQuAD2
     ├── train-v2.0.json
     └── dev-v2.0.json

Citation:

@misc{rajpurkar2016squad,
    title         = {SQuAD: 100,000+ Questions for Machine Comprehension of Text},
    author        = {Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
    year          = {2016},
    eprint        = {1606.05250},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}

@misc{rajpurkar2018know,
    title         = {Know What You Don't Know: Unanswerable Questions for SQuAD},
    author        = {Pranav Rajpurkar and Robin Jia and Percy Liang},
    year          = {2018},
    eprint        = {1806.03822},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
class tinyms.data.SST2Dataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

SST2(Stanford Sentiment Treebank v2) dataset.

The generated dataset’s train.tsv and dev.tsv have two columns [sentence, label] . The generated dataset’s test.tsv has one column [sentence] . The tensors of columns sentence and label are of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘dev’. ‘train’ will read from 67,349 train samples, ‘test’ will read from 1,821 test samples, ‘dev’ will read from all 872 dev samples. Default: None, will read train samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all text.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed; If shuffle is True, the behavior is the same as setting shuffle to be Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle the samples.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards. This argument can only be specified when num_shards is also specified. Default: None.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> sst2_dataset_dir = "/path/to/sst2_dataset_directory"
>>>
>>> # 1) Read 3 samples from SST2 dataset
>>> dataset = ds.SST2Dataset(dataset_dir=sst2_dataset_dir, num_samples=3)
>>>
>>> # 2) Read train samples from SST2 dataset
>>> dataset = ds.SST2Dataset(dataset_dir=sst2_dataset_dir, usage="train")
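For training, the loaded split can be fed through the usual dataset operations. A minimal sketch with the standard batch operation; the batch size of 32 is an arbitrary choice for illustration:

>>> # 3) Batch the train split (sentence/label rows) for model training
>>> dataset = ds.SST2Dataset(dataset_dir=sst2_dataset_dir, usage="train")
>>> dataset = dataset.batch(32, drop_remainder=True)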

About SST2 dataset: The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus is based on the dataset introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie reviews. It was parsed with the Stanford parser and includes a total of 215,154 unique phrases from those parse trees, each annotated by 3 human judges.

Here is the original SST2 dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── sst2_dataset_dir
    ├── train.tsv
    ├── test.tsv
    ├── dev.tsv
    └── original

Citation:

@inproceedings{socher-etal-2013-recursive,
    title     = {Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank},
    author    = {Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning,
                  Christopher D. and Ng, Andrew and Potts, Christopher},
    booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
    month     = oct,
    year      = {2013},
    address   = {Seattle, Washington, USA},
    publisher = {Association for Computational Linguistics},
    url       = {https://www.aclweb.org/anthology/D13-1170},
    pages     = {1631--1642},
}
class tinyms.data.TextFileDataset(dataset_files, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses datasets stored on disk in text format. The generated dataset has one column [text] with type string.

Parameters:
  • dataset_files (Union[str, list[str]]) – String or list of files to be read, or glob strings to search for a pattern of files. The list will be sorted in lexicographical order.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Default: Shuffle.GLOBAL . Bool type and Shuffle enum are both supported to pass in. If shuffle is False, no shuffling will be performed. If shuffle is True, performs global shuffle. The desired shuffle level is set with the mindspore.dataset.Shuffle enum:

    • Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • ValueError – If dataset_files are not valid or do not exist.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> text_file_dataset_dir = ["/path/to/text_file_dataset_file"] # contains 1 or multiple text files
>>> dataset = ds.TextFileDataset(dataset_files=text_file_dataset_dir)
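Since dataset_files also accepts glob patterns and explicit lists, several files can be read as one stream. A minimal sketch; the paths are placeholders, and the Shuffle enum is assumed to be exposed on ds as in mindspore.dataset:

>>> # Read all matching files, shuffling the file order but not the rows within files
>>> dataset = ds.TextFileDataset(dataset_files="/path/to/text_files/*.txt",
...                              shuffle=ds.Shuffle.FILES)
>>> # Or pass an explicit list; it will be sorted lexicographically before reading
>>> dataset = ds.TextFileDataset(dataset_files=["/path/to/a.txt", "/path/to/b.txt"])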
class tinyms.data.UDPOSDataset(dataset_dir, usage=None, num_samples=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]

UDPOS(Universal Dependencies dataset for Part of Speech) dataset.

The generated dataset has three columns: [word, universal, stanford] , and the data type of the three columns is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, ‘valid’ or ‘all’. ‘train’ will read from 12,543 train samples, ‘test’ will read from 2,077 test samples, ‘valid’ will read from 2,002 validation samples, ‘all’ will read from all 16,622 samples. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads the full dataset.

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> udpos_dataset_dir = "/path/to/udpos_dataset_dir"
>>> dataset = ds.UDPOSDataset(dataset_dir=udpos_dataset_dir, usage='all')
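Each row carries a token sequence with its two tag sets, so the three columns can be consumed together. A minimal sketch with the dict iterator; the path is the placeholder from the example above:

>>> dataset = ds.UDPOSDataset(dataset_dir=udpos_dataset_dir, usage='valid', shuffle=False)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     words, universal_tags, stanford_tags = row["word"], row["universal"], row["stanford"]
...     break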

About UDPOS dataset:

A text corpus dataset annotated to clarify syntactic and semantic sentence structure. The corpus comprises 254,830 words and 16,622 sentences, taken from various web media including weblogs, newsgroups, emails and reviews.

Citation:

@inproceedings{silveira14gold,
  year = {2014},
  author = {Natalia Silveira and Timothy Dozat and Marie-Catherine de Marneffe and Samuel Bowman
    and Miriam Connor and John Bauer and Christopher D. Manning},
  title = {A Gold Standard Dependency Corpus for {E}nglish},
  booktitle = {Proceedings of the Ninth International Conference on Language
    Resources and Evaluation (LREC-2014)}
}
class tinyms.data.WikiTextDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

WikiText2 and WikiText103 datasets.

The generated dataset has one column [text] , and the tensor of column text is of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Acceptable usages include ‘train’, ‘test’, ‘valid’ and ‘all’. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads the full dataset.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files or is invalid.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_samples is invalid (< 0).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

About WikiText dataset:

The WikiText Long Term Dependency Language Modeling Dataset is an English corpus containing over 100 million words, drawn from Wikipedia’s verified Good and Featured articles, and released in WikiText2 and WikiText103 versions. WikiText2 has 36,718 lines in wiki.train.tokens, 4,358 lines in wiki.test.tokens and 3,760 lines in wiki.valid.tokens. WikiText103 has 1,801,350 lines in wiki.train.tokens, 4,358 lines in wiki.test.tokens and 3,760 lines in wiki.valid.tokens.

Here is the original WikiText dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── WikiText2/WikiText103
     ├── wiki.train.tokens
     ├── wiki.test.tokens
     └── wiki.valid.tokens

Citation:

@article{merity2016pointer,
  title={Pointer sentinel mixture models},
  author={Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard},
  journal={arXiv preprint arXiv:1609.07843},
  year={2016}
}

Examples

>>> wiki_text_dataset_dir = "/path/to/wiki_text_dataset_directory"
>>> dataset = ds.WikiTextDataset(dataset_dir=wiki_text_dataset_dir, usage='all')
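When sharding is requested, num_samples is interpreted per shard, which can be checked with the generic size query. A minimal sketch under the placeholder path above:

>>> # Shard 0 of 2; get_dataset_size() then reports the per-shard row count
>>> dataset = ds.WikiTextDataset(dataset_dir=wiki_text_dataset_dir, usage='train',
...                              num_shards=2, shard_id=0)
>>> per_shard_rows = dataset.get_dataset_size()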
class tinyms.data.YahooAnswersDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

YahooAnswers dataset.

The generated dataset has four columns [class, title, content, answer] , whose data type is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. ‘train’ will read from 1,400,000 train samples, ‘test’ will read from 60,000 test samples, ‘all’ will read from all 1,460,000 samples. Default: None, all samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all text.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> yahoo_answers_dataset_dir = "/path/to/yahoo_answers_dataset_directory"
>>>
>>> # 1) Read 3 samples from YahooAnswers dataset
>>> dataset = ds.YahooAnswersDataset(dataset_dir=yahoo_answers_dataset_dir, num_samples=3)
>>>
>>> # 2) Read train samples from YahooAnswers dataset
>>> dataset = ds.YahooAnswersDataset(dataset_dir=yahoo_answers_dataset_dir, usage="train")

About YahooAnswers dataset:

The YahooAnswers dataset consists of 1,460,000 text samples in 10 classes. There are 1,400,000 samples in the train.csv and 60,000 samples in the test.csv. The 10 different classes represent Society & Culture, Science & Mathematics, Health, Education & Reference, Computers & Internet, Sports, Business & Finance, Entertainment & Music, Family & Relationships, Politics & Government.

Here is the original YahooAnswers dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── yahoo_answers_dataset_dir
    ├── train.csv
    ├── test.csv
    ├── classes.txt
    └── readme.txt

Citation:

@article{YahooAnswers,
title   = {Yahoo! Answers Topic Classification Dataset},
author  = {Xiang Zhang},
year    = {2015},
howpublished = {}
}
class tinyms.data.YelpReviewDataset(dataset_dir, usage=None, num_samples=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, num_parallel_workers=None, cache=None)[source]

Yelp Review Polarity and Yelp Review Full datasets.

The generated dataset has two columns: [label, text] , and the data type of two columns is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’ or ‘all’. For Polarity, ‘train’ will read from 560,000 train samples, ‘test’ will read from 38,000 test samples, ‘all’ will read from all 598,000 samples. For Full, ‘train’ will read from 650,000 train samples, ‘test’ will read from 50,000 test samples, ‘all’ will read from all 700,000 samples. Default: None, all samples.

  • num_samples (int, optional) – Number of samples (rows) to read. Default: None, reads all samples.

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Bool type and Shuffle enum are both supported to pass in. Default: Shuffle.GLOBAL . If shuffle is False, no shuffling will be performed. If shuffle is True, it is equivalent to setting shuffle to mindspore.dataset.Shuffle.GLOBAL. Set the mode of data shuffling by passing in enumeration variables:

    • Shuffle.GLOBAL: Shuffle both the files and samples.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> yelp_review_dataset_dir = "/path/to/yelp_review_dataset_dir"
>>> dataset = ds.YelpReviewDataset(dataset_dir=yelp_review_dataset_dir, usage='all')

About YelpReview Dataset:

The Yelp Review Full dataset consists of reviews from Yelp. It is extracted from the Yelp Dataset Challenge 2015 data, and it is mainly used for text classification.

The Yelp Review Polarity dataset is constructed from the above dataset, by considering stars 1 and 2 negative, and 3 and 4 positive.

The directory structures of these two datasets are the same. You can unzip the dataset files into the following structure and read by MindSpore’s API:

.
└── yelp_review_dir
     ├── train.csv
     ├── test.csv
     └── readme.txt

Citation:

For both Yelp Review Polarity and Yelp Review Full:

@article{zhangCharacterlevelConvolutionalNetworks2015,
  archivePrefix = {arXiv},
  eprinttype = {arxiv},
  eprint = {1509.01626},
  primaryClass = {cs},
  title = {Character-Level {{Convolutional Networks}} for {{Text Classification}}},
  abstract = {This article offers an empirical exploration on the use of character-level convolutional networks
              (ConvNets) for text classification. We constructed several large-scale datasets to show that
              character-level convolutional networks could achieve state-of-the-art or competitive results.
              Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF
              variants, and deep learning models such as word-based ConvNets and recurrent neural networks.},
  journal = {arXiv:1509.01626 [cs]},
  author = {Zhang, Xiang and Zhao, Junbo and LeCun, Yann},
  month = sep,
  year = {2015},
}

class tinyms.data.CMUArcticDataset(dataset_dir, name=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

CMU Arctic dataset.

The generated dataset has four columns: [waveform, sample_rate, transcript, utterance_id] . The tensor of column waveform is of the float32 type. The tensor of column sample_rate is a scalar of the uint32 type. The tensors of columns transcript and utterance_id are scalars of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • name (str, optional) – Part of this dataset, can be ‘aew’, ‘ahw’, ‘aup’, ‘awb’, ‘axb’, ‘bdl’, ‘clb’, ‘eey’, ‘fem’, ‘gka’, ‘jmk’, ‘ksp’, ‘ljm’, ‘lnh’, ‘rms’, ‘rxr’, ‘slp’ or ‘slt’. Default: None, means ‘aew’.

  • num_samples (int, optional) – The number of audio to be included in the dataset. Default: None, will read all audio.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None, no dividing. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None, will use 0. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • mindspore.dataset.PKSampler is not supported for the sampler parameter yet.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler     Parameter shuffle     Expected Order Behavior
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Examples

>>> cmu_arctic_dataset_directory = "/path/to/cmu_arctic_dataset_directory"
>>>
>>> # 1) Read 500 samples (audio files) in cmu_arctic_dataset_directory
>>> dataset = ds.CMUArcticDataset(cmu_arctic_dataset_directory, name="ahw", num_samples=500)
>>>
>>> # 2) Read all samples (audio files) in cmu_arctic_dataset_directory
>>> dataset = ds.CMUArcticDataset(cmu_arctic_dataset_directory)
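As the table above notes, passing a sampler replaces the shuffle argument entirely. A minimal sketch, assuming the samplers are exposed on ds as in mindspore.dataset:

>>> # 3) Read the first 100 utterances of speaker "aew" in sequential order via a sampler
>>> sampler = ds.SequentialSampler(start_index=0, num_samples=100)
>>> dataset = ds.CMUArcticDataset(cmu_arctic_dataset_directory, name="aew", sampler=sampler)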

About CMUArctic dataset:

The CMU Arctic databases are designed for the purpose of speech synthesis research. These single speaker speech databases have been carefully recorded under studio conditions and consist of approximately 1200 phonetically balanced English utterances. In addition to wavefiles, the databases provide complete support for the Festival Speech Synthesis System, including pre-built voices that may be used as is. The entire package is distributed as free software, without restriction on commercial or non-commercial use.

You can construct the following directory structure from CMUArctic dataset and read by MindSpore’s API.

.
└── cmu_arctic_dataset_directory
    ├── cmu_us_aew_arctic
    │    ├── wav
    │    │    ├──arctic_a0001.wav
    │    │    ├──arctic_a0002.wav
    │    │    ├──...
    │    ├── etc
    │    │    └── txt.done.data
    ├── cmu_us_ahw_arctic
    │    ├── wav
    │    │    ├──arctic_a0001.wav
    │    │    ├──arctic_a0002.wav
    │    │    ├──...
    │    └── etc
    │         └── txt.done.data
    └──...

Citation:

@article{LTI2003CMUArctic,
title        = {CMU ARCTIC databases for speech synthesis},
author       = {John Kominek and Alan W Black},
journal      = {Language Technologies Institute [Online]},
year         = {2003},
howpublished = {http://www.festvox.org/cmu_arctic/}
}
class tinyms.data.GTZANDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

GTZAN dataset.

The generated dataset has three columns: [waveform, sample_rate, label] . The tensor of column waveform is of the float32 type. The tensor of column sample_rate is a scalar of the uint32 type. The tensor of column label is a scalar of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘valid’, ‘test’ or ‘all’. Default: None, will read all samples.

  • num_samples (int, optional) – The number of audio to be included in the dataset. Default: None, will read all audio.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • mindspore.dataset.PKSampler is not supported for the sampler parameter yet.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler     Parameter shuffle     Expected Order Behavior
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Examples

>>> gtzan_dataset_directory = "/path/to/gtzan_dataset_directory"
>>>
>>> # 1) Read 500 samples (audio files) in gtzan_dataset_directory
>>> dataset = ds.GTZANDataset(gtzan_dataset_directory, usage="all", num_samples=500)
>>>
>>> # 2) Read all samples (audio files) in gtzan_dataset_directory
>>> dataset = ds.GTZANDataset(gtzan_dataset_directory)
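Because sampler and num_shards/shard_id are mutually exclusive, distributed reads can instead be expressed through a sampler object. A minimal sketch, assuming DistributedSampler is exposed on ds as in mindspore.dataset:

>>> # 3) Shard 0 of 4 with shuffling handled by the sampler itself
>>> sampler = ds.DistributedSampler(num_shards=4, shard_id=0, shuffle=True)
>>> dataset = ds.GTZANDataset(gtzan_dataset_directory, usage="train", sampler=sampler)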

About GTZAN dataset:

The GTZAN dataset appears in at least 100 published works and is the most commonly used public dataset for evaluation in machine listening research for music genre recognition. It consists of 1000 audio tracks, each of which is 30 seconds long. It contains 10 genres (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock), each of which is represented by 100 tracks. The tracks are all 22050Hz Mono 16-bit audio files in .wav format.

You can construct the following directory structure from GTZAN dataset and read by MindSpore’s API.

.
└── gtzan_dataset_directory
    ├── blues
    │    ├──blues.00000.wav
    │    ├──blues.00001.wav
    │    ├──blues.00002.wav
    │    ├──...
    ├── disco
    │    ├──disco.00000.wav
    │    ├──disco.00001.wav
    │    ├──disco.00002.wav
    │    └──...
    └──...

Citation:

@misc{tzanetakis_essl_cook_2001,
author    = "Tzanetakis, George and Essl, Georg and Cook, Perry",
title     = "Automatic Musical Genre Classification Of Audio Signals",
url       = "http://ismir2001.ismir.net/pdf/tzanetakis.pdf",
publisher = "The International Society for Music Information Retrieval",
year      = "2001"
}
class tinyms.data.LibriTTSDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

LibriTTS dataset.

The generated dataset has seven columns [waveform, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id] . The tensor of column waveform is of the float32 type. The tensor of column sample_rate is a scalar of the uint32 type. The tensors of columns original_text, normalized_text and utterance_id are scalars of the string type. The tensors of columns speaker_id and chapter_id are scalars of the uint32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Part of this dataset, can be ‘dev-clean’, ‘dev-other’, ‘test-clean’, ‘test-other’, ‘train-clean-100’, ‘train-clean-360’, ‘train-other-500’, or ‘all’. Default: None, means ‘all’.

  • num_samples (int, optional) – The number of audio samples to be included in the dataset. Default: None, will read all audio.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • mindspore.dataset.PKSampler is not supported for the sampler parameter yet.

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler     Parameter shuffle     Expected Order Behavior
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Examples

>>> libri_tts_dataset_dir = "/path/to/libri_tts_dataset_directory"
>>>
>>> # 1) Read 500 samples (audio files) in libri_tts_dataset_directory
>>> dataset = ds.LibriTTSDataset(libri_tts_dataset_dir, usage="train-clean-100", num_samples=500)
>>>
>>> # 2) Read all samples (audio files) in libri_tts_dataset_directory
>>> dataset = ds.LibriTTSDataset(libri_tts_dataset_dir)
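The seven output columns can be unpacked per utterance with the dict iterator. A minimal sketch under the placeholder path above:

>>> # 3) Inspect one utterance from the dev-clean subset
>>> dataset = ds.LibriTTSDataset(libri_tts_dataset_dir, usage="dev-clean", num_samples=1)
>>> for row in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     waveform, rate = row["waveform"], int(row["sample_rate"])
...     text = str(row["normalized_text"])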

About LibriTTS dataset:

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at 24kHz sampling rate, prepared by Heiga Zen with the assistance of Google Speech and Google Brain team members. The LibriTTS corpus is designed for TTS research. It is derived from the original materials (mp3 audio files from LibriVox and text files from Project Gutenberg) of the LibriSpeech corpus.

You can construct the following directory structure from LibriTTS dataset and read by MindSpore’s API.

.
└── libri_tts_dataset_directory
    ├── dev-clean
    │    ├── 116
    │    │    ├── 288045
    |    |    |    ├── 116_288045.trans.tsv
    │    │    │    ├── 116_288045_000003_000000.wav
    │    │    │    └──...
    │    │    ├── 288046
    |    |    |    ├── 116_288046.trans.tsv
    |    |    |    ├── 116_288046_000003_000000.wav
    │    |    |    └── ...
    |    |    └── ...
    │    ├── 1255
    │    │    ├── 138279
    |    |    |    ├── 1255_138279.trans.tsv
    │    │    │    ├── 1255_138279_000001_000000.wav
    │    │    │    └── ...
    │    │    ├── 74899
    |    |    |    ├── 1255_74899.trans.tsv
    |    |    |    ├── 1255_74899_000001_000000.wav
    │    |    |    └── ...
    |    |    └── ...
    |    └── ...
    └── ...

Citation:

@article{zen2019libritts,
title        = {LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
author       = {Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss
                and Ye Jia and Zhifeng Chen and Yonghui Wu},
journal      = {arXiv preprint arXiv:1904.02882},
year         = {2019},
howpublished = {http://www.openslr.org/resources/60/},
description  = {The LibriSpeech ASR corpus (http://www.openslr.org/12/) has been used in
                various research projects. However, as it was originally designed for ASR research,
                there are some undesired properties when using for TTS research}
}
class tinyms.data.LJSpeechDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

LJSpeech dataset.

The generated dataset has four columns [waveform, sample_rate, transcription, normalized_transcript] . The column waveform is a tensor of the float32 type. The column sample_rate is a scalar of the int32 type. The column transcription is a scalar of the string type. The column normalized_transcript is a scalar of the string type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of audios to be included in the dataset. Default: None, all audios.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler     Parameter shuffle     Expected Order Behavior
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Examples

>>> lj_speech_dataset_dir = "/path/to/lj_speech_dataset_directory"
>>>
>>> # 1) Get all samples from LJSPEECH dataset in sequence
>>> dataset = ds.LJSpeechDataset(dataset_dir=lj_speech_dataset_dir, shuffle=False)
>>>
>>> # 2) Randomly select 350 samples from LJSPEECH dataset
>>> dataset = ds.LJSpeechDataset(dataset_dir=lj_speech_dataset_dir, num_samples=350, shuffle=True)
>>>
>>> # 3) Get samples from LJSPEECH dataset for shard 0 in a 2-way distributed training
>>> dataset = ds.LJSpeechDataset(dataset_dir=lj_speech_dataset_dir, num_shards=2, shard_id=0)
>>>
>>> # In LJSPEECH dataset, each dictionary has keys "waveform", "sample_rate", "transcription"
>>> # and "normalized_transcript"

About LJSPEECH dataset:

This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.

Here is the original LJSPEECH dataset structure. You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

.
└── LJSpeech-1.1
    ├── README
    ├── metadata.csv
    └── wavs
        ├── LJ001-0001.wav
        ├── LJ001-0002.wav
        ├── LJ001-0003.wav
        ├── LJ001-0004.wav
        ├── LJ001-0005.wav
        ├── LJ001-0006.wav
        ├── LJ001-0007.wav
        ├── LJ001-0008.wav
        ...
        ├── LJ050-0277.wav
        └── LJ050-0278.wav

Citation:

@misc{lj_speech17,
author       = {Keith Ito and Linda Johnson},
title        = {The LJ Speech Dataset},
howpublished = {\url{https://keithito.com/LJ-Speech-Dataset}},
year         = 2017
}
class tinyms.data.SpeechCommandsDataset(dataset_dir, usage=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Speech Commands dataset.

The generated dataset has five columns [waveform, sample_rate, label, speaker_id, utterance_number] . The tensor of column waveform is a vector of the float32 type. The tensor of column sample_rate is a scalar of the int32 type. The tensor of column label is a scalar of the string type. The tensor of column speaker_id is a scalar of the string type. The tensor of column utterance_number is a scalar of the int32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • usage (str, optional) – Usage of this dataset, can be ‘train’, ‘test’, ‘valid’ or ‘all’. ‘train’ will read from 84,843 samples, ‘test’ will read from 11,005 samples, ‘valid’ will read from 9,981 test samples and ‘all’ will read from all 105,829 samples. Default: None, will read all samples.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will read all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler     Parameter shuffle     Expected Order Behavior
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Examples

>>> speech_commands_dataset_dir = "/path/to/speech_commands_dataset_directory"
>>>
>>> # Read 3 samples from SpeechCommands dataset
>>> dataset = ds.SpeechCommandsDataset(dataset_dir=speech_commands_dataset_dir, num_samples=3)
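The split sizes quoted for the usage parameter can be verified after loading. A minimal sketch:

>>> # get_dataset_size() should report 9,981 rows for the validation split
>>> dataset = ds.SpeechCommandsDataset(dataset_dir=speech_commands_dataset_dir, usage="valid")
>>> size = dataset.get_dataset_size()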

About SpeechCommands dataset:

The SpeechCommands dataset is a database for limited-vocabulary speech recognition, containing 105,829 audio samples in ‘.wav’ format.

Here is the original SpeechCommands dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── speech_commands_dataset_dir
     ├── cat
          ├── b433eff_nohash_0.wav
          ├── 5a33edf_nohash_1.wav
          └──....
     ├── dog
          ├── b433w2w_nohash_0.wav
          └──....
     ├── four
     └── ....

Citation:

@article{2018Speech,
title={Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition},
author={Warden, P.},
year={2018}
}
class tinyms.data.TedliumDataset(dataset_dir, release, usage=None, extensions=None, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

Tedlium dataset. The columns of generated dataset depend on the source SPH files and the corresponding STM files.

The generated dataset has six columns [waveform, sample_rate, transcript, talk_id, speaker_id, identifier] .

The data type of column waveform is float32, the data type of column sample_rate is int32, and the data type of columns transcript , talk_id , speaker_id and identifier is string.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • release (str) – Release of the dataset, can be ‘release1’, ‘release2’, ‘release3’.

  • usage (str, optional) – Usage of this dataset. For release1 or release2, can be ‘train’, ‘test’, ‘dev’ or ‘all’. ‘train’ will read from train samples, ‘test’ will read from test samples, ‘dev’ will read from dev samples, ‘all’ will read from all samples. For release3, can only be ‘all’, it will read from data samples. Default: None, all samples.

  • extensions (str, optional) – Extensions of the SPH files, only ‘.sph’ is valid. Default: None, “.sph”.

  • num_samples (int, optional) – The number of audio samples to be included in the dataset. Default: None, all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum number of samples per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain stm files.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

Parameter sampler     Parameter shuffle     Expected Order Behavior
None                  None                  random order
None                  True                  random order
None                  False                 sequential order
Sampler object        None                  order defined by sampler
Sampler object        True                  not allowed
Sampler object        False                 not allowed

Examples

>>> # 1) Get all train samples from TEDLIUM_release1 dataset in sequence.
>>> dataset = ds.TedliumDataset(dataset_dir="/path/to/tedlium1_dataset_directory",
...                             release="release1", shuffle=False)
>>>
>>> # 2) Randomly select 10 samples from TEDLIUM_release2 dataset.
>>> dataset = ds.TedliumDataset(dataset_dir="/path/to/tedlium2_dataset_directory",
...                             release="release2", num_samples=10, shuffle=True)
>>>
>>> # 3) Get samples from TEDLIUM_release-3 dataset for shard 0 in a 2-way distributed training.
>>> dataset = ds.TedliumDataset(dataset_dir="/path/to/tedlium3_dataset_directory",
...                             release="release3", num_shards=2, shard_id=0)
>>>
>>> # In TEDLIUM dataset, each dictionary has keys : waveform, sample_rate, transcript, talk_id,
>>> # speaker_id and identifier.

About TEDLIUM_release1 dataset:

The TED-LIUM corpus is English-language TED talks, with transcriptions, sampled at 16kHz. It contains about 118 hours of speech.

About TEDLIUM_release2 dataset:

This is the TED-LIUM corpus release 2, licensed under Creative Commons BY-NC-ND 3.0. All talks and text are property of TED Conferences LLC. The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website. We have prepared and filtered these data in order to train acoustic models to participate to the International Workshop on Spoken Language Translation 2011 (the LIUM English/French SLT system reached the first rank in the SLT task).

About TEDLIUM_release-3 dataset:

This is the TED-LIUM corpus release 3, licensed under Creative Commons BY-NC-ND 3.0. All talks and text are property of TED Conferences LLC. This new TED-LIUM release was made through a collaboration between the Ubiqus company and the LIUM (University of Le Mans, France).

You can unzip the dataset files into the following directory structure and read by MindSpore’s API.

The structure of TEDLIUM release2 is the same as TEDLIUM release1, only the data is different.

.
└──TEDLIUM_release1
    └── dev
        ├── sph
            ├── AlGore_2009.sph
            ├── BarrySchwartz_2005G.sph
        ├── stm
            ├── AlGore_2009.stm
            ├── BarrySchwartz_2005G.stm
    └── test
        ├── sph
            ├── AimeeMullins_2009P.sph
            ├── BillGates_2010.sph
        ├── stm
            ├── AimeeMullins_2009P.stm
            ├── BillGates_2010.stm
    └── train
        ├── sph
            ├── AaronHuey_2010X.sph
            ├── AdamGrosser_2007.sph
        ├── stm
            ├── AaronHuey_2010X.stm
            ├── AdamGrosser_2007.stm
    └── readme
    └── TEDLIUM.150k.dic

The directory structure of TEDLIUM release3 is slightly different.

.
└──TEDLIUM_release-3
    └── data
        ├── ctl
        ├── sph
            ├── 911Mothers_2010W.sph
            ├── AalaElKhani.sph
        ├── stm
            ├── 911Mothers_2010W.stm
            ├── AalaElKhani.stm
    └── doc
    └── legacy
    └── LM
    └── speaker-adaptation
    └── readme
    └── TEDLIUM.150k.dic

Citation:

@article{rousseau2012tedlium,
  title={TED-LIUM: an automatic speech recognition dedicated corpus},
  author={A. Rousseau, P. Deléglise, Y. Estève},
  journal={Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year={May 2012},
  biburl={https://www.openslr.org/7/}
}

@article{rousseau2014tedlium,
  title={Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks},
  author={A. Rousseau, P. Deléglise, and Y. Estève},
  journal={Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
  year={May 2014},
  biburl={https://www.openslr.org/19/}
}

@article{hernandez2018tedlium,
  title={TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation},
  author={François Hernandez, Vincent Nguyen, Sahar Ghannay, Natalia Tomashenko, and Yannick Estève},
  journal={the 20th International Conference on Speech and Computer (SPECOM 2018)},
  year={September 2018},
  biburl={https://www.openslr.org/51/}
}
class tinyms.data.YesNoDataset(dataset_dir, num_samples=None, num_parallel_workers=None, shuffle=None, sampler=None, num_shards=None, shard_id=None, cache=None)[source]

YesNo dataset.

The generated dataset has three columns [waveform, sample_rate, labels]. The tensor of column waveform is a vector of the float32 type. The tensor of column sample_rate is a scalar of the int32 type. The tensor of column labels is a scalar of the int32 type.

Parameters:
  • dataset_dir (str) – Path to the root directory that contains the dataset.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will read all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_dir does not contain data files.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If sampler and shuffle are specified at the same time.

  • RuntimeError – If sampler and num_shards/shard_id are specified at the same time.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Examples

>>> yes_no_dataset_dir = "/path/to/yes_no_dataset_directory"
>>>
>>> # Read 3 samples from YesNo dataset
>>> dataset = ds.YesNoDataset(dataset_dir=yes_no_dataset_dir, num_samples=3)
>>>
>>> # Note: In YesNo dataset, each dictionary has keys "waveform", "sample_rate", "labels"
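>>>
>>> # Since sampler and shuffle are mutually exclusive (see the table above), a sampler can be
>>> # passed instead of shuffle. A minimal sketch, assuming the same yes_no_dataset_dir:
>>> sampler = ds.RandomSampler(num_samples=3)
>>> dataset = ds.YesNoDataset(dataset_dir=yes_no_dataset_dir, sampler=sampler)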

About YesNo dataset:

Yesno is an audio dataset consisting of 60 recordings of one individual saying yes or no in Hebrew; each recording is eight words long.

Here is the original YesNo dataset structure. You can unzip the dataset files into this directory structure and read by MindSpore’s API.

.
└── yes_no_dataset_dir
    ├── 1_1_0_0_1_1_0_0.wav
    ├── 1_0_0_0_1_1_0_0.wav
    └── ...

Citation:

@NetworkResource{Kaldi_audio_project,
author    = {anonymous},
url       = "http://www.openslr.org/1/"
}
class tinyms.data.CSVDataset(dataset_files, field_delim=',', column_defaults=None, column_names=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, cache=None)[source]

A source dataset that reads and parses comma-separated values (CSV) files as a dataset.

The columns of generated dataset depend on the source CSV files.

Parameters:
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in a lexicographical order.

  • field_delim (str, optional) – A string that indicates the char delimiter to separate fields. Default: ‘,’.

  • column_defaults (list, optional) – List of default values for the CSV fields. Default: None. Each item in the list is a value of a valid type (float, int, or string). If this is not provided, all columns are treated as string type.

  • column_names (list[str], optional) – List of column names of the dataset. Default: None. If this is not provided, the column_names are inferred from the first row of the CSV file.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, will include all samples.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

Perform reshuffling of the data every epoch. Default: Shuffle.GLOBAL. Both bool and Shuffle enum values are supported. If shuffle is False, no shuffling will be performed. If shuffle is True, a global shuffle is performed. There are two levels of shuffling, with the desired shuffle enum defined by mindspore.dataset.Shuffle.

    • Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • RuntimeError – If dataset_files are not valid or do not exist.

  • ValueError – If field_delim is invalid.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> csv_dataset_dir = ["/path/to/csv_dataset_file"] # contains 1 or multiple csv files
>>> dataset = ds.CSVDataset(dataset_files=csv_dataset_dir, column_names=['col1', 'col2', 'col3', 'col4'])
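>>>
>>> # A minimal sketch (not from the original docs) of parsing a headerless, tab-separated file
>>> # at a hypothetical path. column_defaults gives one default value per column, and each
>>> # value's type determines how that column is parsed (here: string, float, string).
>>> dataset = ds.CSVDataset(dataset_files=["/path/to/data.tsv"],
...                         field_delim='\t',
...                         column_names=['col1', 'col2', 'col3'],
...                         column_defaults=["", 0.0, ""])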
class tinyms.data.MindDataset(dataset_files, columns_list=None, num_parallel_workers=None, shuffle=None, num_shards=None, shard_id=None, sampler=None, padded_sample=None, num_padded=None, num_samples=None, cache=None)[source]

A source dataset that reads and parses MindRecord dataset.

The columns of generated dataset depend on the source MindRecord files.

Parameters:
  • dataset_files (Union[str, list[str]]) – If dataset_files is a str, it represents a file name of one component of a MindRecord source; other files with an identical source in the same path will be found and loaded automatically. If dataset_files is a list, it represents a list of dataset files to be read directly.

  • columns_list (list[str], optional) – List of columns to be read. Default: None, read all columns.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

    Perform reshuffling of the data every epoch. Default: None, performs global shuffle. Bool type and Shuffle enum are both supported to pass in. If shuffle is False, no shuffling will be performed. If shuffle is True, performs global shuffle. There are three levels of shuffling, desired shuffle enum defined by mindspore.dataset.Shuffle.

    • Shuffle.GLOBAL: Global shuffle of all rows of data in dataset, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle the file sequence but keep the order of data within each file.

    • Shuffle.INFILE: Keep the file sequence the same but shuffle the data within each file.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • sampler (Sampler, optional) – Object used to choose samples from the dataset. Default: None. sampler is mutually exclusive with shuffle and block_reader. Supported samplers: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.

  • padded_sample (dict, optional) – Samples to be appended to the dataset, where the keys are the same as columns_list.

  • num_padded (int, optional) – Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

Raises:
  • ValueError – If dataset_files are not valid or do not exist.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Examples

>>> mind_dataset_dir = ["/path/to/mind_dataset_file"] # contains 1 or multiple MindRecord files
>>> dataset = ds.MindDataset(dataset_files=mind_dataset_dir)
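>>>
>>> # A minimal sketch (not from the original docs) of evening out shards with padded samples,
>>> # assuming the MindRecord files have columns "data" and "label" (hypothetical names):
>>> import numpy as np
>>> padded = {'data': np.zeros(10, np.float32), 'label': np.array(-1, np.int32)}
>>> dataset = ds.MindDataset(dataset_files=mind_dataset_dir, columns_list=['data', 'label'],
...                          padded_sample=padded, num_padded=2, shuffle=False,
...                          num_shards=2, shard_id=0)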
class tinyms.data.OBSMindDataset(dataset_files, server, ak, sk, sync_obs_path, columns_list=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, shard_equal_rows=True)[source]

A source dataset that reads and parses a MindRecord dataset stored in cloud storage such as OBS, Minio or AWS S3.

The columns of generated dataset depend on the source MindRecord files.

Parameters:
  • dataset_files (list[str]) – List of files in cloud storage to be read and file path is in the format of s3://bucketName/objectKey.

  • server (str) – Endpoint for accessing cloud storage. For the OBS service of Huawei Cloud, the endpoint is like <obs.cn-north-4.myhuaweicloud.com> (Region cn-north-4). For a locally started Minio instance, the endpoint is like <https://127.0.0.1:9000>.

  • ak (str) – Access key ID of cloud storage.

  • sk (str) – Secret key ID of cloud storage.

  • sync_obs_path (str) – Remote dir path used for synchronization, users need to create it on cloud storage in advance. Path is in the format of s3://bucketName/objectKey.

  • columns_list (list[str], optional) – List of columns to be read. Default: None, read all columns.

  • shuffle (Union[bool, Shuffle], optional) –

Perform reshuffling of the data every epoch. Default: Shuffle.GLOBAL. Both bool and Shuffle enum values are supported. If shuffle is False, no shuffling will be performed. If shuffle is True, a global shuffle is performed. There are three levels of shuffling, with the desired shuffle enum defined by mindspore.dataset.Shuffle.

    • Shuffle.GLOBAL: Global shuffle of all rows of data in dataset, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle the file sequence but keep the order of data within each file.

    • Shuffle.INFILE: Keep the file sequence the same but shuffle the data within each file.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None.

  • shard_id (int, optional) – The shard ID within num_shards. Default: None. This argument can only be specified when num_shards is also specified.

  • shard_equal_rows (bool, optional) – Get equal rows for all shards. Default: True. If shard_equal_rows is False, the number of rows of each shard may not be equal, which may lead to a failure in distributed training. When the number of samples per MindRecord file is not equal, it is suggested to set it to True. This argument should only be specified when num_shards is also specified.

Raises:
  • RuntimeError – If sync_obs_path does not exist.

  • ValueError – If columns_list is invalid.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • It is necessary to create a synchronization directory on cloud storage in advance, which is defined by the parameter sync_obs_path .

  • If training is offline (no cloud), it is recommended to set the environment variable BATCH_JOB_ID (see the sketch after the example below).

  • In distributed training, if there are multiple nodes (servers), all 8 devices must be used in each node (server). If there is only one node (server), there is no such restriction.

Examples

>>> # OBS
>>> bucket = "iris"  # your obs bucket name
>>> # the bucket directory structure is similar to the following:
>>> #  - imagenet21k
>>> #        | - mr_imagenet21k_01
>>> #        | - mr_imagenet21k_02
>>> #  - sync_node
>>> dataset_obs_dir = ["s3://" + bucket + "/imagenet21k/mr_imagenet21k_01",
...                    "s3://" + bucket + "/imagenet21k/mr_imagenet21k_02"]
>>> sync_obs_dir = "s3://" + bucket + "/sync_node"
>>> num_shards = 8
>>> shard_id = 0
>>> dataset = ds.OBSMindDataset(dataset_obs_dir, "obs.cn-north-4.myhuaweicloud.com",
...                             "AK of OBS", "SK of OBS",
...                             sync_obs_dir, shuffle=True, num_shards=num_shards, shard_id=shard_id)
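>>>
>>> # A minimal sketch for the offline (no cloud) case mentioned in the note above; the exact
>>> # value of BATCH_JOB_ID is an assumption (any stable, job-unique string should do):
>>> import os
>>> os.environ['BATCH_JOB_ID'] = 'job_001'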
class tinyms.data.TFRecordDataset(dataset_files, schema=None, columns_list=None, num_samples=None, num_parallel_workers=None, shuffle=<Shuffle.GLOBAL: 'global'>, num_shards=None, shard_id=None, shard_equal_rows=False, cache=None, compression_type=None)[source]

A source dataset that reads and parses datasets stored on disk in TFData format.

The columns of generated dataset depend on the source TFRecord files.

Parameters:
  • dataset_files (Union[str, list[str]]) – String or list of files to be read or glob strings to search for a pattern of files. The list will be sorted in lexicographical order.

  • schema (Union[str, Schema], optional) – Data format policy, which specifies the data types and shapes of the data column to be read. Both JSON file path and objects constructed by mindspore.dataset.Schema are acceptable. Default: None.

  • columns_list (list[str], optional) – List of columns to be read. Default: None, read all columns.

  • num_samples (int, optional) – The number of samples (rows) to be included in the dataset. Default: None. The processing priority for num_samples is as follows: 1. If num_samples is greater than 0, read num_samples rows. 2. Otherwise, if numRows (parsed from schema ) is greater than 0, read numRows rows. 3. Otherwise, read the full dataset. num_samples or numRows (parsed from schema ) will be interpreted as the number of rows per shard. It is highly recommended to provide num_samples or numRows (parsed from schema ) when compression_type is ‘GZIP’ or ‘ZLIB’, to avoid performance degradation caused by decompressing the same file multiple times to obtain the file size.

  • num_parallel_workers (int, optional) – Number of worker threads to read the data. Default: None, will use global default workers(8), it can be set by mindspore.dataset.config.set_num_parallel_workers .

  • shuffle (Union[bool, Shuffle], optional) –

Perform reshuffling of the data every epoch. Default: Shuffle.GLOBAL. Both bool and Shuffle enum values are supported. If shuffle is False, no shuffling will be performed. If shuffle is True, a global shuffle is performed. There are two levels of shuffling, with the desired shuffle enum defined by mindspore.dataset.Shuffle.

    • Shuffle.GLOBAL: Shuffle both the files and samples, same as setting shuffle to True.

    • Shuffle.FILES: Shuffle files only.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the maximum sample number per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument can only be specified when num_shards is also specified.

  • shard_equal_rows (bool, optional) – Get equal rows for all shards. Default: False. If shard_equal_rows is False, the number of rows of each shard may not be equal, which may lead to a failure in distributed training. When the number of samples per TFRecord file is not equal, it is suggested to set it to True. This argument should only be specified when num_shards is also specified. When compression_type is not None and num_samples or numRows (parsed from schema ) is provided, shard_equal_rows is implied to be True.

  • cache (DatasetCache, optional) –

    Use tensor caching service to speed up dataset processing. More details: Single-Node Data Cache . Default: None, which means no cache is used.

  • compression_type (str, optional) – The type of compression used for all files, must be either ‘’, ‘GZIP’, or ‘ZLIB’. Default: None, which is equivalent to the empty string ‘’.

Raises:
  • ValueError – If dataset_files are not valid or do not exist.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

  • ValueError – If compression_type is invalid (other than ‘’, ‘GZIP’, or ‘ZLIB’).

  • ValueError – If compression_type is provided, but the number of dataset files < num_shards .

  • ValueError – If num_samples < 0.

Examples

>>> from mindspore import dtype as mstype
>>>
>>> tfrecord_dataset_dir = ["/path/to/tfrecord_dataset_file"] # contains 1 or multiple TFRecord files
>>> tfrecord_schema_file = "/path/to/tfrecord_schema_file"
>>>
>>> # 1) Get all rows from tfrecord_dataset_dir with no explicit schema.
>>> # The meta-data in the first row will be used as a schema.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir)
>>>
>>> # 2) Get all rows from tfrecord_dataset_dir with user-defined schema.
>>> schema = ds.Schema()
>>> schema.add_column(name='col_1d', de_type=mstype.int64, shape=[2])
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir, schema=schema)
>>>
>>> # 3) Get all rows from tfrecord_dataset_dir with the schema file.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir, schema=tfrecord_schema_file)
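>>>
>>> # 4) A minimal sketch (assumption: the files are GZIP-compressed TFRecords). Providing
>>> #    num_samples avoids repeated decompression to determine file sizes, per the note above.
>>> dataset = ds.TFRecordDataset(dataset_files=tfrecord_dataset_dir,
...                              compression_type='GZIP', num_samples=100)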
class tinyms.data.GeneratorDataset(source, column_names=None, column_types=None, schema=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None, python_multiprocessing=True, max_rowsize=6)[source]

A source dataset that generates data from Python by invoking Python data source each epoch.

The column names and column types of generated dataset depend on Python data defined by users.

Parameters:
  • source (Union[Callable, Iterable, Random Accessible]) – A generator callable object, an iterable Python object or a random accessible Python object. Callable source is required to return a tuple of NumPy arrays as a row of the dataset on source().next(). Iterable source is required to return a tuple of NumPy arrays as a row of the dataset on iter(source).next(). Random accessible source is required to return a tuple of NumPy arrays as a row of the dataset on source[idx].

  • column_names (Union[str, list[str]], optional) – List of column names of the dataset. Default: None. Users are required to provide either column_names or schema.

  • column_types (list[mindspore.dtype], optional) – List of column data types of the dataset. Default: None. If provided, sanity check will be performed on generator output.

  • schema (Union[str, Schema], optional) – Data format policy, which specifies the data types and shapes of the data column to be read. Both JSON file path and objects constructed by mindspore.dataset.Schema are acceptable. Default: None.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all images.

  • num_parallel_workers (int, optional) – Number of worker threads/subprocesses used to fetch the dataset in parallel. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Random accessible input is required. Default: None, expected order behavior shown in the table below.

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Random accessible input is required. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. Random accessible input is required. When this argument is specified, num_samples reflects the maximum sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument must be specified only when num_shards is also specified. Random accessible input is required.

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker process. This option could be beneficial if the Python operation is computational heavy. Default: True.

  • max_rowsize (int, optional) – Maximum size of row in MB that is used for shared memory allocation to copy data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB.

Raises:
  • RuntimeError – If source raises an exception during execution.

  • RuntimeError – If len of column_names does not match output len of source.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If sampler and shuffle are specified at the same time.

  • ValueError – If sampler and sharding are specified at the same time.

  • ValueError – If num_shards is specified but shard_id is None.

  • ValueError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Note

  • If you configure python_multiprocessing=True (default: True) and num_parallel_workers>1 (default: 1), the multi-process mode is started to accelerate data loading. As the dataset iterates, the memory consumption of the subprocesses will gradually increase, mainly because the subprocesses of the user-defined dataset obtain member variables from the main process in a copy-on-write way. For example, if you define a dataset whose __init__ function holds a large amount of member variable data (e.g., a very large file name list loaded during dataset construction) and use the multi-process mode, this may cause OOM (the estimated total memory usage is (num_parallel_workers+1) * size of the parent process ). The simplest solution is to replace Python objects (such as list/dict/int/float/string) with non-reference-counted data types (such as Pandas, NumPy or PyArrow objects) for member variables, to load less metadata in member variables, or to configure python_multiprocessing=False to use the multi-threading mode. A sketch of this pattern follows the examples below.

  • Input source accepts user-defined Python functions (PyFuncs). Do not add network computing operators from mindspore.nn, mindspore.ops or elsewhere into this source .

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Examples

>>> import numpy as np
>>>
>>> # 1) Multidimensional generator function as callable input.
>>> def generator_multidimensional():
...     for i in range(64):
...         yield (np.array([[i, i + 1], [i + 2, i + 3]]),)
>>>
>>> dataset = ds.GeneratorDataset(source=generator_multidimensional, column_names=["multi_dimensional_data"])
>>>
>>> # 2) Multi-column generator function as callable input.
>>> def generator_multi_column():
...     for i in range(64):
...         yield np.array([i]), np.array([[i, i + 1], [i + 2, i + 3]])
>>>
>>> dataset = ds.GeneratorDataset(source=generator_multi_column, column_names=["col1", "col2"])
>>>
>>> # 3) Iterable dataset as iterable input.
>>> class MyIterable:
...     def __init__(self):
...         self._index = 0
...         self._data = np.random.sample((5, 2))
...         self._label = np.random.sample((5, 1))
...
...     def __next__(self):
...         if self._index >= len(self._data):
...             raise StopIteration
...         else:
...             item = (self._data[self._index], self._label[self._index])
...             self._index += 1
...             return item
...
...     def __iter__(self):
...         self._index = 0
...         return self
...
...     def __len__(self):
...         return len(self._data)
>>>
>>> dataset = ds.GeneratorDataset(source=MyIterable(), column_names=["data", "label"])
>>>
>>> # 4) Random accessible dataset as random accessible input.
>>> class MyAccessible:
...     def __init__(self):
...         self._data = np.random.sample((5, 2))
...         self._label = np.random.sample((5, 1))
...
...     def __getitem__(self, index):
...         return self._data[index], self._label[index]
...
...     def __len__(self):
...         return len(self._data)
>>>
>>> dataset = ds.GeneratorDataset(source=MyAccessible(), column_names=["data", "label"])
>>>
>>> # list, dict, tuple of Python is also random accessible
>>> dataset = ds.GeneratorDataset(source=[(np.array(0),), (np.array(1),), (np.array(2),)], column_names=["col"])
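>>>
>>> # 5) A minimal sketch (not from the original docs) of the memory note above: keep heavy
>>> #    metadata in NumPy arrays so multi-process workers do not grow it via copy-on-write.
>>> class MyMemoryFriendly:
...     def __init__(self, file_names):
...         # np.array instead of a Python list: a contiguous buffer, cheap to share read-only
...         self._file_names = np.array(file_names)
...
...     def __getitem__(self, index):
...         # a real dataset would load the file at self._file_names[index]; return the index here
...         return (np.array(index),)
...
...     def __len__(self):
...         return len(self._file_names)
>>>
>>> dataset = ds.GeneratorDataset(source=MyMemoryFriendly(["a.bin", "b.bin"]), column_names=["data"],
...                               num_parallel_workers=2, python_multiprocessing=True)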
class tinyms.data.NumpySlicesDataset(data, column_names=None, num_samples=None, num_parallel_workers=1, shuffle=None, sampler=None, num_shards=None, shard_id=None)[source]

Creates a dataset with given data slices, mainly for loading Python data into dataset.

The column names and column types of generated dataset depend on Python data defined by users.

Parameters:
  • data (Union[list, tuple, dict]) – list, tuple, dict and other NumPy formats. The input data will be sliced along the first dimension to generate rows. If the input is a list, each row will have a single column; otherwise, there will usually be multiple columns. Loading large data this way is not recommended, since the data is loaded into memory.

  • column_names (list[str], optional) – List of column names of the dataset. Default: None. If column_names is not provided, the output column names will be named as the keys of dict when the input data is a dict, otherwise they will be named like column_0, column_1 …

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.

  • num_parallel_workers (int, optional) – Number of worker subprocesses used to fetch the dataset in parallel. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. Default: None, expected order behavior shown in the table below.

  • sampler (Union[Sampler, Iterable], optional) – Object used to choose samples from the dataset. Default: None, expected order behavior shown in the table below.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument must be specified only when num_shards is also specified.

Note

  • This dataset can take in a sampler . sampler and shuffle are mutually exclusive. The table below shows what input arguments are allowed and their expected behavior.

Expected Order Behavior of Using sampler and shuffle

 Parameter sampler   Parameter shuffle   Expected Order Behavior
 None                None                random order
 None                True                random order
 None                False               sequential order
 Sampler object      None                order defined by sampler
 Sampler object      True                not allowed
 Sampler object      False               not allowed

Raises:
  • RuntimeError – If len of column_names does not match output len of data.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If sampler and shuffle are specified at the same time.

  • ValueError – If sampler and sharding are specified at the same time.

  • ValueError – If num_shards is specified but shard_id is None.

  • ValueError – If shard_id is specified but num_shards is None.

  • ValueError – If shard_id is not in range of [0, num_shards ).

Examples

>>> # 1) Input data can be a list
>>> data = [1, 2, 3]
>>> dataset = ds.NumpySlicesDataset(data=data, column_names=["column_1"])
>>>
>>> # 2) Input data can be a dictionary, and column_names will be its keys
>>> data = {"a": [1, 2], "b": [3, 4]}
>>> dataset = ds.NumpySlicesDataset(data=data)
>>>
>>> # 3) Input data can be a tuple of lists (or NumPy arrays), each tuple element refers to data in each column
>>> data = ([1, 2], [3, 4], [5, 6])
>>> dataset = ds.NumpySlicesDataset(data=data, column_names=["column_1", "column_2", "column_3"])
>>>
>>> # 4) Load data from a CSV file (csv_dataset_dir as defined in the CSVDataset example above)
>>> import pandas as pd
>>> df = pd.read_csv(filepath_or_buffer=csv_dataset_dir[0])
>>> dataset = ds.NumpySlicesDataset(data=dict(df), shuffle=False)
class tinyms.data.PaddedDataset(padded_samples)[source]

Creates a dataset with filler data provided by the user.

Mainly used to append filler samples to the original dataset so that the combined dataset can be evenly assigned to the corresponding shards.

Parameters:

padded_samples (list(dict)) – Samples provided by user.

Raises:
  • TypeError – If padded_samples is not an instance of list.

  • TypeError – If the element of padded_samples is not an instance of dict.

  • ValueError – If the padded_samples is empty.

Examples

>>> import numpy as np
>>> data = [{'image': np.zeros(1, np.uint8)}, {'image': np.zeros(2, np.uint8)}]
>>> dataset = ds.PaddedDataset(padded_samples=data)
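>>>
>>> # A minimal sketch (not from the original docs) of the intended use: pad a dataset so its
>>> # size divides evenly across shards. Assumes concatenation via + and Dataset.use_sampler.
>>> original = ds.NumpySlicesDataset({'image': [np.zeros(1, np.uint8)] * 3}, shuffle=False)
>>> padded = ds.PaddedDataset(padded_samples=[{'image': np.zeros(1, np.uint8)}])
>>> combined = original + padded  # 4 samples in total, divisible between 2 shards
>>> sampler = ds.DistributedSampler(num_shards=2, shard_id=0, shuffle=False)
>>> combined.use_sampler(sampler)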
class tinyms.data.GraphData(dataset_file, num_parallel_workers=None, working_mode='local', hostname='127.0.0.1', port=50051, num_client=1, auto_shutdown=True)[source]

Reads the graph dataset used for GNN training from a shared file or database. Supports reading graph datasets like Cora, Citeseer and PubMed.

About how to load raw graph dataset into MindSpore please refer to Loading Graph Dataset .

Parameters:
  • dataset_file (str) – One of file names in the dataset.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel. Default: None.

  • working_mode (str, optional) –

    Set working mode, now supports ‘local’/’client’/’server’. Default: ‘local’.

    • ’local’, used in non-distributed training scenarios.

    • ’client’, used in distributed training scenarios. The client does not load data, but obtains data from the server.

    • ’server’, used in distributed training scenarios. The server loads the data and is available to the client.

  • hostname (str, optional) – Hostname of the graph data server. This parameter is only valid when working_mode is set to ‘client’ or ‘server’. Default: ‘127.0.0.1’.

  • port (int, optional) – Port of the graph data server. The range is 1024-65535. This parameter is only valid when working_mode is set to ‘client’ or ‘server’. Default: 50051.

  • num_client (int, optional) – Maximum number of clients expected to connect to the server. The server will allocate resources according to this parameter. This parameter is only valid when working_mode is set to ‘server’. Default: 1.

  • auto_shutdown (bool, optional) – Valid when working_mode is set to ‘server’, when the number of connected clients reaches num_client and no client is being connected, the server automatically exits. Default: True.

Raises:
  • ValueError – If dataset_file does not exist or permission denied.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If working_mode is not ‘local’, ‘client’ or ‘server’.

  • TypeError – If hostname is illegal.

  • ValueError – If port is not in range [1024, 65535].

  • ValueError – If num_client is not in range [1, 255].

Supported Platforms:

CPU

Examples

>>> graph_dataset_dir = "/path/to/graph_dataset_file"
>>> graph_data = ds.GraphData(dataset_file=graph_dataset_dir, num_parallel_workers=2)
>>> nodes = graph_data.get_all_nodes(node_type=1)
>>> features = graph_data.get_node_feature(node_list=nodes, feature_types=[1])
get_all_edges(edge_type)[source]

Get all edges in the graph.

Parameters:

edge_type (int) – Specify the type of edge.

Returns:

numpy.ndarray, array of edges.

Examples

>>> edges = graph_data.get_all_edges(edge_type=0)
Raises:

TypeError – If edge_type is not integer.

get_all_neighbors(node_list, neighbor_type, output_format=<OutputFormat.NORMAL: 0>)[source]

Get neighbor_type neighbors of the nodes in node_list . The following example illustrates the output formats: 1 represents a connection between two nodes, and 0 represents no connection.

Adjacent Matrix

      0   1   2   3
 0    0   1   0   0
 1    0   0   1   0
 2    1   0   0   1
 3    1   0   0   0

Normal Format

 src    0    1    2    3
 dst_0  1    2    0    1
 dst_1  -1   -1   3    -1

COO Format

 src  0   1   2   2   3
 dst  1   2   0   3   1

CSR Format

 offsetTable  0   1   2   4
 dstTable     1   2   0   3   1

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_type (int) – Specify the type of neighbor node.

  • output_format (OutputFormat, optional) – Output storage format. Default: OutputFormat.NORMAL. It can be any of [OutputFormat.NORMAL, OutputFormat.COO, OutputFormat.CSR].

Returns:

For the NORMAL or COO format, a numpy.ndarray representing the array of neighbors is returned. If the CSR format is specified, two numpy.ndarrays are returned: the first is the offset table, the second is the neighbors.

Examples

>>> from mindspore.dataset.engine import OutputFormat
>>> nodes = graph_data.get_all_nodes(node_type=1)
>>> neighbors = graph_data.get_all_neighbors(node_list=nodes, neighbor_type=2)
>>> neighbors_coo = graph_data.get_all_neighbors(node_list=nodes, neighbor_type=2,
...                                              output_format=OutputFormat.COO)
>>> offset_table, neighbors_csr = graph_data.get_all_neighbors(node_list=nodes, neighbor_type=2,
...                                                            output_format=OutputFormat.CSR)
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_type is not integer.

get_all_nodes(node_type)[source]

Get all nodes in the graph.

Parameters:

node_type (int) – Specify the type of node.

Returns:

numpy.ndarray, array of nodes.

Examples

>>> nodes = graph_data.get_all_nodes(node_type=1)
Raises:

TypeError – If node_type is not integer.

get_edge_feature(edge_list, feature_types)[source]

Get feature_types feature of the edges in edge_list .

Parameters:
  • edge_list (Union[list, numpy.ndarray]) – The given list of edges.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types, each element should be int.

Returns:

numpy.ndarray, array of features.

Examples

>>> edges = graph_data.get_all_edges(edge_type=0)
>>> features = graph_data.get_edge_feature(edge_list=edges, feature_types=[1])
Raises:
  • TypeError – If edge_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_edges_from_nodes(node_list)[source]

Get edges from the nodes.

Parameters:

node_list (Union[list[tuple], numpy.ndarray]) – The given list of node ID pairs.

Returns:

numpy.ndarray, array of edges ID.

Examples

>>> edges = graph_data.get_edges_from_nodes(node_list=[(101, 201), (103, 207)])
Raises:

TypeError – If node_list is not list or ndarray.

get_neg_sampled_neighbors(node_list, neg_neighbor_num, neg_neighbor_type)[source]

Get neg_neighbor_type negative sampled neighbors of the nodes in node_list .

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neg_neighbor_num (int) – Number of neighbors sampled.

  • neg_neighbor_type (int) – Specify the type of negative neighbor.

Returns:

numpy.ndarray, array of neighbors.

Examples

>>> nodes = graph_data.get_all_nodes(node_type=1)
>>> neg_neighbors = graph_data.get_neg_sampled_neighbors(node_list=nodes, neg_neighbor_num=5,
...                                                      neg_neighbor_type=2)
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neg_neighbor_num is not integer.

  • TypeError – If neg_neighbor_type is not integer.

get_node_feature(node_list, feature_types)[source]

Get feature_types feature of the nodes in node_list .

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types, each element should be int.

Returns:

numpy.ndarray, array of features.

Examples

>>> nodes = graph_data.get_all_nodes(node_type=1)
>>> features = graph_data.get_node_feature(node_list=nodes, feature_types=[2, 3])
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_nodes_from_edges(edge_list)[source]

Get nodes from the edges.

Parameters:

edge_list (Union[list, numpy.ndarray]) – The given list of edges.

Returns:

numpy.ndarray, array of nodes.

Examples

>>> from mindspore.dataset import GraphData
>>>
>>> g = GraphData("/path/to/testdata", 1)
>>> edges = g.get_all_edges(0)
>>> nodes = g.get_nodes_from_edges(edges)
Raises:

TypeError – If edge_list is not list or ndarray.

get_sampled_neighbors(node_list, neighbor_nums, neighbor_types, strategy=<SamplingStrategy.RANDOM: 0>)[source]

Get sampled neighbor information.

The API supports multi-hop neighbor sampling: the previous sampling result is used as the input of the next-hop sampling. A maximum of 6 hops is allowed.

The sampling result is tiled into a list in the format of [input node, 1-hop sampling result, 2-hop sampling result …].

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_nums (Union[list, numpy.ndarray]) – Number of neighbors sampled per hop.

  • neighbor_types (Union[list, numpy.ndarray]) – Neighbor type sampled per hop, type of each element in neighbor_types should be int.

  • strategy (SamplingStrategy, optional) –

    Sampling strategy. Default: SamplingStrategy.RANDOM. It can be any of [SamplingStrategy.RANDOM, SamplingStrategy.EDGE_WEIGHT].

    • SamplingStrategy.RANDOM, random sampling with replacement.

    • SamplingStrategy.EDGE_WEIGHT, sampling with edge weight as probability.

Returns:

numpy.ndarray, array of neighbors.

Examples

>>> nodes = graph_data.get_all_nodes(node_type=1)
>>> neighbors = graph_data.get_sampled_neighbors(node_list=nodes, neighbor_nums=[2, 2],
...                                              neighbor_types=[2, 1])
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_nums is not list or ndarray.

  • TypeError – If neighbor_types is not list or ndarray.

graph_info()[source]

Get the meta information of the graph, including the number of nodes, the type of nodes, the feature information of nodes, the number of edges, the type of edges, and the feature information of edges.

Returns:

dict, meta information of the graph. The key is node_type, edge_type, node_num, edge_num, node_feature_type and edge_feature_type.

Examples

>>> from mindspore.dataset import GraphData
>>>
>>> g = GraphData("/path/to/testdata", 2)
>>> graph_info = g.graph_info()

random_walk(target_nodes, meta_path, step_home_param=1.0, step_away_param=1.0, default_node=-1)[source]

Random walk in nodes.

Parameters:
  • target_nodes (list[int]) – Start node list for the random walk.

  • meta_path (list[int]) – Node type for each walk step.

  • step_home_param (float, optional) – Return hyperparameter in the node2vec algorithm. Default: 1.0.

  • step_away_param (float, optional) – In-out hyperparameter in the node2vec algorithm. Default: 1.0.

  • default_node (int, optional) – default node if no more neighbors found. Default: -1. A default value of -1 indicates that no node is given.

Returns:

numpy.ndarray, array of nodes.

Examples

>>> nodes = graph_data.get_all_nodes(node_type=1)
>>> walks = graph_data.random_walk(target_nodes=nodes, meta_path=[2, 1, 2])
Raises:
  • TypeError – If target_nodes is not list or ndarray.

  • TypeError – If meta_path is not list or ndarray.

class tinyms.data.Graph(edges, node_feat=None, edge_feat=None, graph_feat=None, node_type=None, edge_type=None, num_parallel_workers=None, working_mode='local', hostname='127.0.0.1', port=50051, num_client=1, auto_shutdown=True)[source]

A graph object for storing Graph structure and feature data, and provide capabilities such as graph sampling.

This class supports initializing a graph with NumPy array data representing nodes, edges and their features. If the working mode is local , there is no need to specify input arguments like working_mode , hostname , port , num_client , auto_shutdown .

Parameters:
  • edges (Union[list, numpy.ndarray]) – Edges of the graph in COO format, with shape [2, num_edges].

  • node_feat (dict, optional) – Features of the nodes. The input data format should be a dict, where the key is the feature type, represented by a string like ‘weight’, and the value should be a numpy.array with shape [num_nodes, num_node_features].

  • edge_feat (dict, optional) – Features of the edges. The input data format should be a dict, where the key is the feature type, represented by a string like ‘weight’, and the value should be a numpy.array with shape [num_edges, num_edge_features].

  • graph_feat (dict, optional) – Additional features that cannot be assigned to node_feat or edge_feat. The input data format should be a dict, where the key is the feature type, represented by a string, and the value should be a numpy.array whose shape is not restricted.

  • node_type (Union[list, numpy.ndarray], optional) – Type of the nodes; each element should be a string representing the type of the corresponding node. If not provided, the default type for each node is “0”.

  • edge_type (Union[list, numpy.ndarray], optional) – Type of the edges; each element should be a string representing the type of the corresponding edge. If not provided, the default type for each edge is “0”.

  • num_parallel_workers (int, optional) – Number of workers to process the dataset in parallel. Default: None.

  • working_mode (str, optional) –

    Set working mode, now supports ‘local’/’client’/’server’. Default: ‘local’.

    • ’local’, used in non-distributed training scenarios.

    • ’client’, used in distributed training scenarios. The client does not load data, but obtains data from the server.

    • ’server’, used in distributed training scenarios. The server loads the data and is available to the client.

  • hostname (str, optional) – Hostname of the graph data server. This parameter is only valid when working_mode is set to ‘client’ or ‘server’. Default: ‘127.0.0.1’.

  • port (int, optional) – Port of the graph data server. The range is 1024-65535. This parameter is only valid when working_mode is set to ‘client’ or ‘server’. Default: 50051.

  • num_client (int, optional) – Maximum number of clients expected to connect to the server. The server will allocate resources according to this parameter. This parameter is only valid when working_mode is set to ‘server’. Default: 1.

  • auto_shutdown (bool, optional) – Valid when working_mode is set to ‘server’, when the number of connected clients reaches num_client and no client is being connected, the server automatically exits. Default: True.

Raises:
  • TypeError – If edges is not a list or NumPy array.

  • TypeError – If node_feat is provided but not a dict, or a key in the dict is not string type, or a value in the dict is not a NumPy array.

  • TypeError – If edge_feat is provided but not a dict, or a key in the dict is not string type, or a value in the dict is not a NumPy array.

  • TypeError – If graph_feat is provided but not a dict, or a key in the dict is not string type, or a value in the dict is not a NumPy array.

  • TypeError – If node_type is provided but its type is not list or NumPy array.

  • TypeError – If edge_type is provided but its type is not list or NumPy array.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

  • ValueError – If working_mode is not ‘local’, ‘client’ or ‘server’.

  • TypeError – If hostname is illegal.

  • ValueError – If port is not in range [1024, 65535].

  • ValueError – If num_client is not in range [1, 255].

Examples

>>> import numpy as np
>>> from mindspore.dataset import Graph
>>>
>>> # 1) Only provide edges for creating graph, as this is the only required input parameter
>>> edges = np.array([[1, 2], [0, 1]], dtype=np.int32)
>>> graph = Graph(edges)
>>> graph_info = graph.graph_info()
>>>
>>> # 2) Setting node_feat and edge_feat for corresponding node and edge
>>> #    first dimension of feature shape should be corresponding node num or edge num.
>>> edges = np.array([[1, 2], [0, 1]], dtype=np.int32)
>>> node_feat = {"node_feature_1": np.array([[0], [1], [2]], dtype=np.int32)}
>>> edge_feat = {"edge_feature_1": np.array([[1, 2], [3, 4]], dtype=np.int32)}
>>> graph = Graph(edges, node_feat, edge_feat)
>>>
>>> # 3) Setting graph feature for graph, there is no shape limit for graph feature
>>> edges = np.array([[1, 2], [0, 1]], dtype=np.int32)
>>> graph_feature = {"graph_feature_1": np.array([1, 2, 3, 4, 5, 6], dtype=np.int32)}
>>> graph = Graph(edges, graph_feat=graph_feature)
get_all_edges(edge_type)[source]

Get all edges in the graph.

Parameters:

edge_type (str) – Specify the type of edge. The default edge_type is “0” when the graph is initialized without specifying edge_type.

Returns:

numpy.ndarray, array of edges.

Examples

>>> edges = graph.get_all_edges(edge_type="0")
Raises:

TypeError – If edge_type is not string.

get_all_neighbors(node_list, neighbor_type, output_format=<OutputFormat.NORMAL: 0>)[source]

Get neighbor_type neighbors of the nodes in node_list . The following example illustrates the output formats: 1 represents a connection between two nodes, and 0 represents no connection.

Adjacent Matrix

      0   1   2   3
 0    0   1   0   0
 1    0   0   1   0
 2    1   0   0   1
 3    1   0   0   0

Normal Format

 src    0    1    2    3
 dst_0  1    2    0    1
 dst_1  -1   -1   3    -1

COO Format

 src  0   1   2   2   3
 dst  1   2   0   3   1

CSR Format

 offsetTable  0   1   2   4
 dstTable     1   2   0   3   1

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_type (str) – Specify the type of neighbor node.

  • output_format (OutputFormat, optional) – Output storage format. Default: OutputFormat.NORMAL. It can be any of [OutputFormat.NORMAL, OutputFormat.COO, OutputFormat.CSR].

Returns:

For the NORMAL or COO format, a numpy.ndarray representing the array of neighbors is returned. If the CSR format is specified, two numpy.ndarrays are returned: the first is the offset table, the second is the neighbors.

Examples

>>> from mindspore.dataset.engine import OutputFormat
>>> nodes = graph.get_all_nodes(node_type="0")
>>> neighbors = graph.get_all_neighbors(node_list=nodes, neighbor_type="0")
>>> neighbors_coo = graph.get_all_neighbors(node_list=nodes, neighbor_type="0",
...                                         output_format=OutputFormat.COO)
>>> offset_table, neighbors_csr = graph.get_all_neighbors(node_list=nodes, neighbor_type="0",
...                                                       output_format=OutputFormat.CSR)
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_type is not string.

get_all_nodes(node_type)[source]

Get all nodes in the graph.

Parameters:

node_type (str) – Specify the type of node.

Returns:

numpy.ndarray, array of nodes.

Examples

>>> nodes = graph.get_all_nodes(node_type="0")
Raises:

TypeError – If node_type is not string.

get_edge_feature(edge_list, feature_types)[source]

Get feature_types feature of the edges in edge_list .

Parameters:
  • edge_list (Union[list, numpy.ndarray]) – The given list of edges.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types, each element should be string.

Returns:

numpy.ndarray, array of features.

Examples

>>> edges = graph.get_all_edges(edge_type="0")
>>> features = graph.get_edge_feature(edge_list=edges, feature_types=["edge_feature_1"])
Raises:
  • TypeError – If edge_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_graph_feature(feature_types)[source]

Get the feature_types features stored at the graph feature level.

Parameters:

feature_types (Union[list, numpy.ndarray]) – The given list of feature types, each element should be string.

Returns:

numpy.ndarray, array of features.

Examples

>>> features = graph.get_graph_feature(feature_types=['graph_feature_1'])
Raises:

TypeError – If feature_types is not list or ndarray.

get_neg_sampled_neighbors(node_list, neg_neighbor_num, neg_neighbor_type)[source]

Get neg_neighbor_type negative sampled neighbors of the nodes in node_list .

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neg_neighbor_num (int) – Number of neighbors sampled.

  • neg_neighbor_type (str) – Specify the type of negative neighbor.

Returns:

numpy.ndarray, array of neighbors.

Examples

>>> nodes = graph.get_all_nodes(node_type="0")
>>> neg_neighbors = graph.get_neg_sampled_neighbors(node_list=nodes, neg_neighbor_num=3,
...                                                 neg_neighbor_type="0")
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neg_neighbor_num is not integer.

  • TypeError – If neg_neighbor_type is not string.

get_node_feature(node_list, feature_types)[source]

Get feature_types feature of the nodes in node_list .

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • feature_types (Union[list, numpy.ndarray]) – The given list of feature types, each element should be string.

Returns:

numpy.ndarray, array of features.

Examples

>>> nodes = graph.get_all_nodes(node_type="0")
>>> features = graph.get_node_feature(node_list=nodes, feature_types=["node_feature_1"])
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If feature_types is not list or ndarray.

get_sampled_neighbors(node_list, neighbor_nums, neighbor_types, strategy=<SamplingStrategy.RANDOM: 0>)[source]

Get sampled neighbor information.

The API supports multi-hop neighbor sampling: the previous sampling result is used as the input of the next-hop sampling. A maximum of 6 hops is allowed.

The sampling result is tiled into a list in the format of [input node, 1-hop sampling result, 2-hop sampling result …].

Parameters:
  • node_list (Union[list, numpy.ndarray]) – The given list of nodes.

  • neighbor_nums (Union[list, numpy.ndarray]) – Number of neighbors sampled per hop.

  • neighbor_types (Union[list, numpy.ndarray]) – Neighbor type sampled per hop, type of each element in neighbor_types should be str.

  • strategy (SamplingStrategy, optional) –

    Sampling strategy. Default: SamplingStrategy.RANDOM. It can be any of [SamplingStrategy.RANDOM, SamplingStrategy.EDGE_WEIGHT].

    • SamplingStrategy.RANDOM, random sampling with replacement.

    • SamplingStrategy.EDGE_WEIGHT, sampling with edge weight as probability.

Returns:

numpy.ndarray, array of neighbors.

Examples

>>> nodes = graph.get_all_nodes(node_type="0")
>>> neighbors = graph.get_sampled_neighbors(node_list=nodes, neighbor_nums=[2, 2],
...                                         neighbor_types=["0", "0"])
Raises:
  • TypeError – If node_list is not list or ndarray.

  • TypeError – If neighbor_nums is not list or ndarray.

  • TypeError – If neighbor_types is not list or ndarray.

graph_info()[source]

Get the meta information of the graph, including the number of nodes, the type of nodes, the feature information of nodes, the number of edges, the type of edges, and the feature information of edges.

Returns:

dict, meta information of the graph. The key is node_type, edge_type, node_num, edge_num, node_feature_type, edge_feature_type and graph_feature_type.

class tinyms.data.InMemoryGraphDataset(data_dir, save_dir='./processed', column_names='graph', num_samples=None, num_parallel_workers=1, shuffle=None, num_shards=None, shard_id=None, python_multiprocessing=True, max_rowsize=6)[source]

Basic dataset for loading graphs into memory.

It is recommended to implement your own dataset by inheriting this class and implementing your own methods like process , save and load ; refer to the source code of ArgoverseDataset for how to implement your own dataset. When a dataset like ArgoverseDataset is initialized, the execution flow is as follows: check whether there is already processed data under the given data_dir ; if so, call the load method to load it directly, otherwise call the process method to create graphs and call the save method to save the graphs into save_dir .

You can access the graphs in the created dataset using graphs = my_dataset.graphs , and you can also iterate the dataset and get data using my_dataset.create_tuple_iterator() (in this case you need to implement methods like __getitem__ and __len__ ); refer to the following example for details. Note: the __new__ method has been overwritten to reinitialize __init__ internally, which means a user-defined __new__ method will not work.

Parameters:
  • data_dir (str) – Directory for loading the dataset; it contains the original-format data that is loaded in the process method.

  • save_dir (str) – Relative directory for saving the processed dataset; this directory is under data_dir . Default: ‘./processed’.

  • column_names (Union[str, list[str]], optional) – Single column name or list of column names of the dataset. The number of column names should equal the number of items in the data returned by methods like __getitem__ . Default: ‘graph’.

  • num_samples (int, optional) – The number of samples to be included in the dataset. Default: None, all samples.

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. This parameter can only be specified when the implemented dataset has a random access attribute ( __getitem__ ). Default: None.

  • num_shards (int, optional) – Number of shards that the dataset will be divided into. Default: None. When this argument is specified, num_samples reflects the max sample number of per shard.

  • shard_id (int, optional) – The shard ID within num_shards . Default: None. This argument must be specified only when num_shards is also specified.

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker process. This option could be beneficial if the Python operation is computational heavy. Default: True.

  • max_rowsize (int, optional) – Maximum size of row in MB that is used for shared memory allocation to copy data between processes. This is only used if python_multiprocessing is set to True. Default: 6 MB.

Raises:
  • TypeError – If data_dir is not of type str.

  • TypeError – If save_dir is not of type str.

  • TypeError – If num_parallel_workers is not of type int.

  • TypeError – If shuffle is not of type bool.

  • TypeError – If python_multiprocessing is not of type bool.

  • RuntimeError – If data_dir is not valid or does not exist.

  • RuntimeError – If num_shards is specified but shard_id is None.

  • RuntimeError – If shard_id is specified but num_shards is None.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> import numpy as np
>>> from mindspore.dataset import InMemoryGraphDataset, Graph
>>>
>>> class MyDataset(InMemoryGraphDataset):
...     def __init__(self, data_dir):
...         super().__init__(data_dir)
...
...     def process(self):
...         # create graph with loading data in given data_dir
...         # here create graph with numpy array directly instead
...         edges = np.array([[0, 1], [1, 2]])
...         graph = Graph(edges=edges)
...         self.graphs.append(graph)
...
...     def __getitem__(self, index):
...         # this method and '__len__' method are required when iterating created dataset
...         graph = self.graphs[index]
...         return graph.get_all_edges('0')
...
...     def __len__(self):
...         return len(self.graphs)
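
A minimal follow-up sketch (the directory path is a placeholder): instantiating the subclass triggers process and save on the first run (or load on later runs), after which the graphs can be accessed or iterated.

>>> my_dataset = MyDataset("/path/to/graph_data_dir")
>>> graphs = my_dataset.graphs
>>> for item in my_dataset.create_tuple_iterator(output_numpy=True, num_epochs=1):
...     pass
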
load()[source]

Load data from the given (processed) path. You can also override this method in your dataset class.

process()[source]

Process the origin dataset; override this method in your own dataset class.

save()[source]

Save the processed data to disk in numpy.npz format. You can also override this method in your dataset class.

class tinyms.data.ArgoverseDataset(data_dir, column_names='graph', num_parallel_workers=1, shuffle=None, python_multiprocessing=True, perf_mode=True)[source]

Load argoverse dataset and create graph.

Argoverse is a public dataset for autonomous driving. The current ArgoverseDataset implementation mainly loads the Motion Forecasting Dataset within Argoverse; it is recommended to visit the official website for more detail: https://www.argoverse.org/av1.html#download-link.

Parameters:
  • data_dir (str) – Directory for loading the dataset; it contains the origin-format data that will be loaded in the process method.

  • column_names (Union[str, list[str]], optional) – Single column name or list of column names of the dataset. Default: “graph”. The number of column names should equal the number of items returned by methods such as __getitem__ ; it is recommended to specify column_names=[“edge_index”, “x”, “y”, “cluster”, “valid_len”, “time_step_len”] as in the following example.

  • num_parallel_workers (int, optional) – Number of subprocesses used to fetch the dataset in parallel. Default: 1.

  • shuffle (bool, optional) – Whether or not to perform shuffle on the dataset. This parameter can only be specified when the implemented dataset has a random access attribute ( __getitem__ ). Default: None.

  • python_multiprocessing (bool, optional) – Parallelize Python operations with multiple worker processes. This option can be beneficial if the Python operations are computationally heavy. Default: True.

  • perf_mode (bool, optional) – Mode for obtaining higher performance when iterating the created dataset (the __getitem__ method is called in this process). Default: True, which saves all the data in the graph (such as edge index, node feature and graph feature) into the graph feature.

Raises:
  • TypeError – If data_dir is not of type str.

  • TypeError – If num_parallel_workers is not of type int.

  • TypeError – If shuffle is not of type bool.

  • TypeError – If python_multiprocessing is not of type bool.

  • TypeError – If perf_mode is not of type bool.

  • RuntimeError – If data_dir is not valid or does not exist.

  • ValueError – If num_parallel_workers exceeds the max thread numbers.

Examples

>>> from mindspore.dataset import ArgoverseDataset
>>>
>>> argoverse_dataset_dir = "/path/to/argoverse_dataset_directory"
>>> graph_dataset = ArgoverseDataset(data_dir=argoverse_dataset_dir,
...                                  column_names=["edge_index", "x", "y", "cluster", "valid_len",
...                                                "time_step_len"])
>>> for item in graph_dataset.create_dict_iterator(output_numpy=True, num_epochs=1):
...     pass

About Argoverse Dataset:

Argoverse is the first dataset containing high-precision maps; it contains 290 km of high-precision map data with geometric shapes and semantic information.

You can unzip the dataset files into the following structure and read them with MindSpore’s API:

.
└── argoverse_dataset_dir
    ├── train
    │    ├──...
    ├── val
    │    └──...
    ├── test
    │    └──...

Citation:

@inproceedings{Argoverse,
author     = {Ming-Fang Chang and John W Lambert and Patsorn Sangkloy and Jagjeet Singh
           and Slawomir Bak and Andrew Hartnett and De Wang and Peter Carr
           and Simon Lucey and Deva Ramanan and James Hays},
title      = {Argoverse: 3D Tracking and Forecasting with Rich Maps},
booktitle  = {Conference on Computer Vision and Pattern Recognition (CVPR)},
year       = {2019}
}
process()[source]

Process method for the Argoverse dataset: it loads the original dataset and creates many graphs based on it. The pre-processing method mainly refers to: https://github.com/xk-huang/yet-another-vectornet/blob/master/dataset.py.

class tinyms.data.DistributedSampler(dataset_size, num_replicas=None, rank=None, shuffle=True)[source]

Distributed sampler.

Parameters:
  • dataset_size (int) – Length of the dataset.

  • num_replicas (int) – Number of replicas. Default: None.

  • rank (int) – Rank of the current device. Default: None.

  • shuffle (bool) – Whether the dataset needs to be shuffled. Default: True.

Returns:

DistributedSampler instance.
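
Examples

A minimal sketch, assuming the sampler can be iterated to obtain the sample indices assigned to one device:

>>> from tinyms.data import DistributedSampler
>>>
>>> # shard a 10-sample dataset across 2 replicas; this process is rank 0
>>> sampler = DistributedSampler(10, num_replicas=2, rank=0, shuffle=False)
>>> indices = list(sampler)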

class tinyms.data.RandomSampler(replacement=False, num_samples=None)[source]

Samples the elements randomly.

Parameters:
  • replacement (bool, optional) – If True, put the sample ID back for the next draw. Default: False.

  • num_samples (int, optional) – Number of elements to sample. Default: None, which means sample all elements.

Raises:
  • TypeError – If replacement is not of type bool.

  • TypeError – If num_samples is not of type int.

  • ValueError – If num_samples is a negative value.

Examples

>>> # creates a RandomSampler
>>> sampler = ds.RandomSampler()
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
parse()[source]

Parse the sampler.

parse_for_minddataset()[source]

Parse the sampler for MindRecord.

class tinyms.data.SequentialSampler(start_index=None, num_samples=None)[source]

Samples the dataset elements sequentially, which is equivalent to not using a sampler.

Parameters:
  • start_index (int, optional) – Index to start sampling at. Default: None, start at first ID.

  • num_samples (int, optional) – Number of elements to sample. Default: None, which means sample all elements.

Raises:
  • TypeError – If start_index is not of type int.

  • TypeError – If num_samples is not of type int.

  • RuntimeError – If start_index is a negative value.

  • ValueError – If num_samples is a negative value.

Examples

>>> # creates a SequentialSampler
>>> sampler = ds.SequentialSampler()
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
parse()[source]

Parse the sampler.

parse_for_minddataset()[source]

Parse the sampler for MindRecord.

class tinyms.data.SubsetRandomSampler(indices, num_samples=None)[source]

Samples the elements randomly from a sequence of indices.

Parameters:
  • indices (Iterable) – A sequence of indices (any iterable Python object except string).

  • num_samples (int, optional) – Number of elements to sample. Default: None, which means sample all elements.

Raises:
  • TypeError – If elements of indices are not of type number.

  • TypeError – If num_samples is not of type int.

  • ValueError – If num_samples is a negative value.

Examples

>>> indices = [0, 1, 2, 3, 7, 88, 119]
>>>
>>> # create a SubsetRandomSampler, will sample from the provided indices
>>> sampler = ds.SubsetRandomSampler(indices)
>>> data = ds.ImageFolderDataset(image_folder_dataset_dir, num_parallel_workers=8, sampler=sampler)
parse()[source]

Parse the sampler.

parse_for_minddataset()[source]

Parse the sampler for MindRecord.

class tinyms.data.SubsetSampler(indices, num_samples=None)[source]

Samples the elements from a sequence of indices.

Parameters:
  • indices (Iterable) – A sequence of indices (any iterable Python object except string).

  • num_samples (int, optional) – Number of elements to sample. Default: None, which means sample all elements.

Raises:
  • TypeError – If elements of indices are not of type number.

  • TypeError – If num_samples is not of type int.

  • ValueError – If num_samples is a negative value.

Examples

>>> indices = [0, 1, 2, 3, 4, 5]
>>>
>>> # creates a SubsetSampler, will sample from the provided indices
>>> sampler = ds.SubsetSampler(indices)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
parse()[source]

Parse the sampler.

parse_for_minddataset()[source]

Parse the sampler for MindRecord.

class tinyms.data.PKSampler(num_val, num_class=None, shuffle=False, class_column='label', num_samples=None)[source]

Samples K elements for each P class in the dataset.

Parameters:
  • num_val (int) – Number of elements to sample for each class.

  • num_class (int, optional) – Number of classes to sample. Default: None, sample all classes. Specifying this parameter is currently not supported.

  • shuffle (bool, optional) – If True, the class IDs are shuffled, otherwise it will not be shuffled. Default: False.

  • class_column (str, optional) – Name of column with class labels for MindDataset. Default: ‘label’.

  • num_samples (int, optional) – The number of samples to draw. Default: None, which means sample all elements.

Examples

>>> # creates a PKSampler that will get 3 samples from every class.
>>> sampler = ds.PKSampler(3)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
parse()[source]

Parse the sampler.

parse_for_minddataset()[source]

Parse the sampler for MindRecord.

class tinyms.data.WeightedRandomSampler(weights, num_samples=None, replacement=True)[source]

Samples the elements from [0, len(weights) - 1] randomly with the given weights (probabilities).

Parameters:
  • weights (list[float, int]) – A sequence of weights, not necessarily summing up to 1.

  • num_samples (int, optional) – Number of elements to sample. Default: None, which means sample all elements.

  • replacement (bool) – If True, put the sample ID back for the next draw. Default: True.

Raises:
  • TypeError – If elements of weights are not of type number.

  • TypeError – If num_samples is not of type int.

  • TypeError – If replacement is not of type bool.

  • RuntimeError – If weights is empty or all zero.

  • ValueError – If num_samples is a negative value.

Examples

>>> weights = [0.9, 0.01, 0.4, 0.8, 0.1, 0.1, 0.3]
>>>
>>> # creates a WeightedRandomSampler that will sample 4 elements with replacement
>>> sampler = ds.WeightedRandomSampler(weights, 4)
>>> dataset = ds.ImageFolderDataset(image_folder_dataset_dir,
...                                 num_parallel_workers=8,
...                                 sampler=sampler)
parse()[source]

Parse the sampler.

class tinyms.data.DatasetCache(session_id, size=0, spilling=False, hostname=None, port=None, num_connections=None, prefetch_size=None)[source]

A client to interface with tensor caching service.

For details, please check Tutorial .

Parameters:
  • session_id (int) – A user assigned session id for the current pipeline.

  • size (int, optional) – Size of the memory set aside for the row caching. Default: 0, which means unlimited; note that it may bring the risk of running out of memory on the machine.

  • spilling (bool, optional) – Whether or not spilling to disk if out of memory. Default: False.

  • hostname (str, optional) – Host name. Default: None, use default hostname ‘127.0.0.1’.

  • port (int, optional) – Port to connect to server. Default: None, use default port 50052.

  • num_connections (int, optional) – Number of tcp/ip connections. Default: None, use default value 12.

  • prefetch_size (int, optional) – The size of the cache queue between operations. Default: None, use default value 20.

Examples

>>> import mindspore.dataset as ds
>>>
>>> # Create a cache instance, in which session_id is generated from command line `cache_admin -g`
>>> # In the following code, suppose the session_id is 780643335
>>> some_cache = ds.DatasetCache(session_id=780643335, size=0)
>>>
>>> dataset_dir = "/path/to/image_folder_dataset_directory"
>>> ds1 = ds.ImageFolderDataset(dataset_dir, cache=some_cache)
get_stat()[source]

Get the statistics from a cache. After the data pipeline has run, three types of statistics can be obtained: average cache size (avg_cache_sz), number of rows cached in memory (num_mem_cached) and number of rows cached on disk (num_disk_cached).
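
A minimal sketch of reading these statistics after the pipeline has run once; the session_id and the dataset path are placeholders, and the attribute names follow the description above:

>>> import mindspore.dataset as ds
>>>
>>> some_cache = ds.DatasetCache(session_id=780643335, size=0)
>>> ds1 = ds.ImageFolderDataset("/path/to/image_folder_dataset_directory", cache=some_cache)
>>> for _ in ds1.create_dict_iterator(num_epochs=1):
...     pass
>>> stat = some_cache.get_stat()
>>> mem_cached = stat.num_mem_cached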

class tinyms.data.DSCallback(step_size=1)[source]

Abstract base class used to build dataset callback classes.

Users can obtain the dataset pipeline context through ds_run_context , including cur_epoch_num , cur_step_num_in_epoch and cur_step_num .

Parameters:

step_size (int, optional) – The number of steps between adjacent ds_step_begin/ds_step_end calls. Default: 1, will be called at each step.

Examples

>>> import mindspore.dataset as ds
>>> from mindspore.dataset import DSCallback
>>> from mindspore.dataset.transforms import transforms
>>>
>>> class PrintInfo(DSCallback):
...     def ds_epoch_end(self, ds_run_context):
...         print(ds_run_context.cur_epoch_num)
...         print(ds_run_context.cur_step_num)
>>>
>>> dataset = ds.MnistDataset(mnist_dataset_dir, num_samples=100)
>>> op = transforms.OneHot(10)
>>> dataset = dataset.map(operations=op, callbacks=PrintInfo())
create_runtime_obj()[source]

Internal method, creates a runtime (C++) object from the callback methods defined by the user.

Returns:

_c_dataengine.PyDSCallback.

ds_begin(ds_run_context)[source]

Called before the data pipeline is started.

Parameters:

ds_run_context (RunContext) – Include some information of the data pipeline.

ds_epoch_begin(ds_run_context)[source]

Called before a new epoch is started.

Parameters:

ds_run_context (RunContext) – Include some information of the data pipeline.

ds_epoch_end(ds_run_context)[source]

Called after an epoch is finished.

Parameters:

ds_run_context (RunContext) – Include some information of the data pipeline.

ds_step_begin(ds_run_context)[source]

Called before a step starts.

Parameters:

ds_run_context (RunContext) – Include some information of the data pipeline.

ds_step_end(ds_run_context)[source]

Called after a step has finished.

Parameters:

ds_run_context (RunContext) – Include some information of the data pipeline.

class tinyms.data.WaitedDSCallback(step_size=1)[source]

Abstract base class used to build dataset callback classes that are synchronized with the training callback class mindspore.train.Callback .

It can be used to execute a custom callback method before a step or an epoch, such as updating the parameters of operations according to the loss of the previous training epoch in auto augmentation.

Users can obtain the network training context through train_run_context , such as network , train_network , epoch_num , batch_num , loss_fn , optimizer , parallel_mode , device_number , list_callback , cur_epoch_num , cur_step_num , dataset_sink_mode , net_outputs , etc., see mindspore.train.Callback .

Users can obtain the dataset pipeline context through ds_run_context , including cur_epoch_num , cur_step_num_in_epoch and cur_step_num .

Note

Note that the call is triggered only at the beginning of the second step or epoch.

Parameters:

step_size (int, optional) – The number of rows in each step, usually set equal to the batch size. Default: 1.

Examples

>>> import mindspore.nn as nn
>>> import mindspore as ms
>>> from mindspore.dataset import WaitedDSCallback
>>> import mindspore.dataset as ds
>>>
>>> ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU")
>>>
>>> # custom callback class for data synchronization in data pipeline
>>> class MyWaitedCallback(WaitedDSCallback):
...     def __init__(self, events, step_size=1):
...         super().__init__(step_size)
...         self.events = events
...
...     # callback method to be executed by data pipeline before the epoch starts
...     def sync_epoch_begin(self, train_run_context, ds_run_context):
...         event = f"ds_epoch_begin_{ds_run_context.cur_epoch_num}_{ds_run_context.cur_step_num}"
...         self.events.append(event)
...
...     # callback method to be executed by data pipeline before the step starts
...     def sync_step_begin(self, train_run_context, ds_run_context):
...         event = f"ds_step_begin_{ds_run_context.cur_epoch_num}_{ds_run_context.cur_step_num}"
...         self.events.append(event)
>>>
>>> # custom callback class for data synchronization in network training
>>> class MyMSCallback(ms.Callback):
...     def __init__(self, events):
...         self.events = events
...
...     # callback method to be executed by network training after the epoch ends
...     def epoch_end(self, run_context):
...         cb_params = run_context.original_args()
...         event = f"ms_epoch_end_{cb_params.cur_epoch_num}_{cb_params.cur_step_num}"
...         self.events.append(event)
...
...     # callback method to be executed by network training after the step ends
...     def step_end(self, run_context):
...         cb_params = run_context.original_args()
...         event = f"ms_step_end_{cb_params.cur_epoch_num}_{cb_params.cur_step_num}"
...         self.events.append(event)
>>>
>>> # custom network
>>> class Net(nn.Cell):
...     def construct(self, x, y):
...         return x
>>>
>>> # define a parameter that needs to be synchronized between data pipeline and network training
>>> events = []
>>>
>>> # define callback classes of data pipeline and network training
>>> my_cb1 = MyWaitedCallback(events, 1)
>>> my_cb2 = MyMSCallback(events)
>>> arr = [1, 2, 3, 4]
>>>
>>> # construct data pipeline
>>> data = ds.NumpySlicesDataset((arr, arr), column_names=["c1", "c2"], shuffle=False)
>>> # map the data callback object into the pipeline
>>> data = data.map(operations=(lambda x: x), callbacks=my_cb1)
>>>
>>> net = Net()
>>> model = ms.Model(net)
>>>
>>> # add the data and network callback objects to the model training callback list
>>> model.train(2, data, dataset_sink_mode=False, callbacks=[my_cb2, my_cb1])
create_runtime_obj()[source]

Internal method, creates a runtime (C++) object from the callback methods defined by the user.

Returns:

_c_dataengine.PyDSCallback.

ds_epoch_begin(ds_run_context)[source]

Internal method, do not call/override. Define mindspore.dataset.DSCallback.ds_epoch_begin to wait for mindspore.train.callback.Callback.epoch_end.

Parameters:

ds_run_context – Include some information of the data pipeline.

ds_step_begin(ds_run_context)[source]

Internal method, do not call/override. Define mindspore.dataset.DSCallback.ds_step_begin to wait for mindspore.train.callback.Callback.step_end.

Parameters:

ds_run_context – Include some information of the data pipeline.

end(run_context)[source]

Internal method, release wait when the network training ends.

Parameters:

run_context – Include some information of the model.

epoch_end(run_context)[source]

Internal method, do not call/override. Defines epoch_end of Callback to release the wait in ds_epoch_begin.

Parameters:

run_context – Include some information of the model.

step_end(run_context)[source]

Internal method, do not call/override. Defines step_end of Callback to release the wait in ds_step_begin.

Parameters:

run_context – Include some information of the model.

sync_epoch_begin(train_run_context, ds_run_context)[source]

Called before a new dataset epoch is started and after the previous training epoch is ended.

Parameters:
  • train_run_context – Include some information of the model with feedback from the previous epoch.

  • ds_run_context – Include some information of the data pipeline.

sync_step_begin(train_run_context, ds_run_context)[source]

Called before a new dataset step is started and after the previous training step is ended.

Parameters:
  • train_run_context – Include some information of the model with feedback from the previous step.

  • ds_run_context – Include some information of the data pipeline.

class tinyms.data.Schema(schema_file=None)[source]

Class to represent a schema of a dataset.

Parameters:

schema_file (str) – Path of the schema file. Default: None.

Returns:

Schema object, schema info about dataset.

Raises:

RuntimeError – If schema file failed to load.

Examples

>>> from mindspore import dtype as mstype
>>>
>>> # Create schema; specify column name, mindspore.dtype and shape of the column
>>> schema = ds.Schema()
>>> schema.add_column(name='col1', de_type=mstype.int64, shape=[2])
add_column(name, de_type, shape=None)[source]

Add new column to the schema.

Parameters:
  • name (str) – The name of the new column.

  • de_type (str) – Data type of the column.

  • shape (list[int], optional) – Shape of the column. Default: None, [-1] which is an unknown shape of rank 1.

Raises:

ValueError – If column type is unknown.

Examples

>>> from mindspore import dtype as mstype
>>>
>>> schema = ds.Schema()
>>> schema.add_column('col_1d', de_type=mstype.int64, shape=[2])

from_json(json_obj)[source]

Get the schema from a JSON object.

Parameters:

json_obj (dictionary) – The parsed JSON object.

Examples

>>> import json
>>>
>>> from mindspore.dataset import Schema
>>>
>>> with open("/path/to/schema_file") as file:
...     json_obj = json.load(file)
...     schema = Schema()
...     schema.from_json(json_obj)
parse_columns(columns)[source]

Parse the columns and add them to self.

Parameters:

columns (Union[dict, list[dict], tuple[dict]]) –

Dataset attribute information, decoded from schema file.

  • list[dict] – each dict must contain the keys ‘name’ and ‘type’; ‘shape’ is optional.

  • dict – columns.keys() are used as column names; each value is a dict that contains ‘type’ and, optionally, ‘shape’.

Examples

>>> from mindspore.dataset import Schema
>>> schema = Schema()
>>> columns1 = [{'name': 'image', 'type': 'int8', 'shape': [3, 3]},
...             {'name': 'label', 'type': 'int8', 'shape': [1]}]
>>> schema.parse_columns(columns1)
>>> columns2 = {'image': {'shape': [3, 3], 'type': 'int8'}, 'label': {'shape': [1], 'type': 'int8'}}
>>> schema.parse_columns(columns2)
to_json()[source]

Get a JSON string of the schema.

Returns:

str, JSON string of the schema.

Examples

>>> from mindspore.dataset import Schema
>>>
>>> schema1 = Schema()
>>> schema2 = schema1.to_json()
tinyms.data.compare(pipeline1, pipeline2)[source]

Compare if two dataset pipelines are the same.

Parameters:
  • pipeline1 (Dataset) – a dataset pipeline.

  • pipeline2 (Dataset) – a dataset pipeline.

Returns:

bool, whether pipeline1 is equal to pipeline2.

Examples

>>> pipeline1 = ds.MnistDataset(mnist_dataset_dir, num_samples=100)
>>> pipeline2 = ds.Cifar10Dataset(cifar10_dataset_dir, num_samples=100)
>>> res = ds.compare(pipeline1, pipeline2)
tinyms.data.deserialize(input_dict=None, json_filepath=None)[source]

Construct dataset pipeline from a JSON file produced by dataset serialize function.

Parameters:
  • input_dict (dict) – A Python dictionary containing a serialized dataset graph. Default: None.

  • json_filepath (str) – A path to the JSON file containing dataset graph. User can obtain this file by calling API mindspore.dataset.serialize() . Default: None.

Returns:

de.Dataset, or None if an error occurs.

Raises:

OSError – Cannot open the JSON file.

Examples

>>> dataset = ds.MnistDataset(mnist_dataset_dir, num_samples=100)
>>> one_hot_encode = transforms.OneHot(10)  # num_classes is input argument
>>> dataset = dataset.map(operations=one_hot_encode, input_columns="label")
>>> dataset = dataset.batch(batch_size=10, drop_remainder=True)
>>> # Case 1: to/from JSON file
>>> serialized_data = ds.serialize(dataset, json_filepath="/path/to/mnist_dataset_pipeline.json")
>>> deserialized_dataset = ds.deserialize(json_filepath="/path/to/mnist_dataset_pipeline.json")
>>> # Case 2: to/from Python dictionary
>>> serialized_data = ds.serialize(dataset)
>>> deserialized_dataset = ds.deserialize(input_dict=serialized_data)
tinyms.data.serialize(dataset, json_filepath='')[source]

Serialize dataset pipeline into a JSON file.

Note

Complete serialization of Python objects is not currently supported. Scenarios that are not supported include data pipelines that use GeneratorDataset or map or batch operations that contain custom Python functions. For Python objects, serialization operations do not yield the full object content, which means that deserialization of the JSON file obtained by serialization may result in errors. For example, when serializing the data pipeline of Python user-defined functions, a related warning message is reported and the obtained JSON file cannot be deserialized into a usable data pipeline.

Parameters:
  • dataset (Dataset) – The starting node.

  • json_filepath (str) – The filepath where a serialized JSON file will be generated. Default: ‘’.

Returns:

Dict, the dictionary contains the serialized dataset graph.

Raises:

OSError – Cannot open a file.

Examples

>>> dataset = ds.MnistDataset(mnist_dataset_dir, num_samples=100)
>>> one_hot_encode = transforms.OneHot(10)  # num_classes is input argument
>>> dataset = dataset.map(operations=one_hot_encode, input_columns="label")
>>> dataset = dataset.batch(batch_size=10, drop_remainder=True)
>>> # serialize it to JSON file
>>> serialized_data = ds.serialize(dataset, json_filepath="/path/to/mnist_dataset_pipeline.json")
tinyms.data.show(dataset, indentation=2)[source]

Write the dataset pipeline graph to logger.info.

Parameters:
  • dataset (Dataset) – The starting node.

  • indentation (int, optional) – The indentation used by the JSON print. Do not indent if indentation is None. Default: 2.

Examples

>>> dataset = ds.MnistDataset(mnist_dataset_dir, num_samples=100)
>>> one_hot_encode = transforms.OneHot(10)
>>> dataset = dataset.map(operations=one_hot_encode, input_columns="label")
>>> dataset = dataset.batch(batch_size=10, drop_remainder=True)
>>> ds.show(dataset)
tinyms.data.sync_wait_for_dataset(rank_id, rank_size, current_epoch)[source]

Wait until the dataset files required by all devices are downloaded.

Note

It should be used together with mindspore.dataset.OBSMindDataset and be called before each epoch.

Parameters:
  • rank_id (int) – Rank ID of the device.

  • rank_size (int) – Rank size.

  • current_epoch (int) – Number of the current epoch.

Examples

>>> # Create a synchronization callback
>>> import mindspore as ms
>>> from mindspore.dataset import sync_wait_for_dataset
>>>
>>> class SyncForDataset(ms.Callback):
...     def __init__(self):
...         super(SyncForDataset, self).__init__()
...     def epoch_begin(self, run_context):
...         cb_params = run_context.original_args()
...         epoch_num = cb_params.cur_epoch_num
...         sync_wait_for_dataset(rank_id, rank_size, epoch_num)
tinyms.data.zip(datasets)[source]

Zip the datasets in the input tuple of datasets.

Parameters:

datasets (tuple[Dataset]) – A tuple of datasets to be zipped together. The number of datasets must be more than 1.

Returns:

Dataset, dataset zipped.

Examples

>>> # Create a dataset which is the combination of dataset_1 and dataset_2
>>> dataset = ds.zip((dataset_1, dataset_2))
class tinyms.data.FileWriter(file_name, shard_num=1, overwrite=False)[source]

Class to write user defined raw data into MindRecord files.

Note

After the MindRecord file is generated, if the file name is changed, the file may fail to be read.

Parameters:
  • file_name (str) – File name of MindRecord file.

  • shard_num (int, optional) – The number of MindRecord files. It should be within [1, 1000]. Default: 1.

  • overwrite (bool, optional) – Whether to overwrite if the file already exists. Default: False.

Raises:

ParamValueError – If file_name or shard_num or overwrite is invalid.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> indexes = ["file_name", "label"]
>>> data = [{"file_name": "1.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"},
...         {"file_name": "2.jpg", "label": 56,
...          "data": b"\xe6\xda\xd1\xae\x07\xb8>\xd4\x00\xf8\x129\x15\xd9\xf2q\xc0\xa2\x91YFUO\x1dsE1"},
...         {"file_name": "3.jpg", "label": 99,
...          "data": b"\xaf\xafU<\xb8|6\xbd}\xc1\x99[\xeaj+\x8f\x84\xd3\xcc\xa0,i\xbb\xb9-\xcdz\xecp{T\xb1"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1, overwrite=True)
>>> schema_id = writer.add_schema(schema_json, "test_schema")
>>> status = writer.add_index(indexes)
>>> status = writer.write_raw_data(data)
>>> status = writer.commit()
add_index(index_fields)[source]

Select index fields from the schema to accelerate reading. The schema is added through add_schema .

Note

The index fields should be of primitive type, e.g. int/float/str. If this function is not called, the fields of primitive type in the schema are set as indexes by default.

Please refer to the Examples of class: mindspore.mindrecord.FileWriter .

Parameters:

index_fields (list[str]) – fields from schema.

Returns:

MSRStatus, SUCCESS or FAILED.

Raises:
  • ParamTypeError – If index field is invalid.

  • MRMDefineIndexError – If index field is not primitive type.

  • MRMAddIndexError – If failed to add index field.

  • MRMGetMetaError – If the schema is not set or failed to get meta.

add_schema(content, desc=None)[source]

The schema is added to describe the raw data to be written.

Note

Please refer to the Examples of class: mindspore.mindrecord.FileWriter .

Parameters:
  • content (dict) – Dictionary of schema content.

  • desc (str, optional) – String of schema description. Default: None.

Returns:

int, schema id.

Raises:
  • MRMInvalidSchemaError – If schema is invalid.

  • MRMBuildSchemaError – If failed to build schema.

  • MRMAddSchemaError – If failed to add schema.

commit()[source]

Flush data in memory to disk and generate the corresponding database files.

Note

Please refer to the Examples of class: mindspore.mindrecord.FileWriter .

Returns:

MSRStatus, SUCCESS or FAILED.

Raises:
  • MRMOpenError – If failed to open MindRecord file.

  • MRMSetHeaderError – If failed to set header.

  • MRMIndexGeneratorError – If failed to create index generator.

  • MRMGenerateIndexError – If failed to write to database.

  • MRMCommitError – If failed to flush data to disk.

  • RuntimeError – Parallel write failed.

open_and_set_header()[source]

Open writer and set header which stores meta information. The function is only used for parallel writing and is called before the write_raw_data .

Returns:

MSRStatus, SUCCESS or FAILED.

Raises:
  • MRMOpenError – If failed to open MindRecord file.

  • MRMSetHeaderError – If failed to set header.

classmethod open_for_append(file_name)[source]

Open MindRecord file and get ready to append data.

Parameters:

file_name (str) – String of MindRecord file name.

Returns:

FileWriter, file writer object for the opened MindRecord file.

Raises:
  • ParamValueError – If file_name is invalid.

  • FileNameError – If path contains invalid characters.

  • MRMOpenError – If failed to open MindRecord file.

  • MRMOpenForAppendError – If failed to open file for appending data.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> schema_json = {"file_name": {"type": "string"}, "label": {"type": "int32"}, "data": {"type": "bytes"}}
>>> data = [{"file_name": "1.jpg", "label": 0,
...          "data": b"\x10c\xb3w\xa8\xee$o&<q\x8c\x8e(\xa2\x90\x90\x96\xbc\xb1\x1e\xd4QER\x13?\xff"}]
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1, overwrite=True)
>>> schema_id = writer.add_schema(schema_json, "test_schema")
>>> status = writer.write_raw_data(data)
>>> status = writer.commit()
>>> write_append = FileWriter.open_for_append("test.mindrecord")
>>> status = write_append.write_raw_data(data)
>>> status = write_append.commit()
set_header_size(header_size)[source]

Set the size of the header, which contains shard information, schema information, page meta information, etc. The larger the header, the more data the MindRecord file can store. If the required header size exceeds the default size (16MB), users need to call this API to set a proper size.

Parameters:

header_size (int) – Size of header, between 16*1024(16KB) and 128*1024*1024(128MB).

Returns:

MSRStatus, SUCCESS or FAILED.

Raises:

MRMInvalidHeaderSizeError – If failed to set header size.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> status = writer.set_header_size(1 << 25) # 32MB
set_page_size(page_size)[source]

Set the size of a page, which represents the area where data is stored; pages are divided into two types: raw page and blob page. The larger a page, the more data it can store. If the size of a single sample exceeds the default page size (32MB), users need to call this API to set a proper size.

Parameters:

page_size (int) – Size of page, between 32*1024(32KB) and 256*1024*1024(256MB).

Returns:

MSRStatus, SUCCESS or FAILED.

Raises:

MRMInvalidPageSizeError – If failed to set page size.

Examples

>>> from mindspore.mindrecord import FileWriter
>>> writer = FileWriter(file_name="test.mindrecord", shard_num=1)
>>> status = writer.set_page_size(1 << 26)  # 64MB
write_raw_data(raw_data, parallel_writer=False)[source]

Convert raw data into a series of consecutive MindRecord files after the raw data is verified against the schema.

Note

Please refer to the Examples of class: mindspore.mindrecord.FileWriter .

Parameters:
  • raw_data (list[dict]) – List of raw data.

  • parallel_writer (bool, optional) – Write raw data in parallel if set to True. Default: False.

Returns:

MSRStatus, SUCCESS or FAILED.

Raises:
  • ParamTypeError – If index field is invalid.

  • MRMOpenError – If failed to open MindRecord file.

  • MRMValidateDataError – If data does not match blob fields.

  • MRMSetHeaderError – If failed to set header.

  • MRMWriteDatasetError – If failed to write dataset.

  • TypeError – If parallel_writer is not bool.

class tinyms.data.FileReader(file_name, num_consumer=4, columns=None, operator=None)[source]

Class to read MindRecord files.

Note

If file_name is a file path, it tries to load all MindRecord files generated in a conversion, and throws an exception if a MindRecord file is missing. If file_name is a list of file paths, only the MindRecord files in the list are loaded.

Parameters:
  • file_name (str, list[str]) – A single MindRecord file path or a list of file paths.

  • num_consumer (int, optional) – Number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

  • columns (list[str], optional) – A list of fields where corresponding data would be read. Default: None.

  • operator (int, optional) – Reserved parameter for operators. Default: None.

Raises:

ParamValueError – If file_name , num_consumer or columns is invalid.

Examples

>>> from mindspore.mindrecord import FileReader
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> reader = FileReader(file_name=mindrecord_file)
>>>
>>> # create iterator for mindrecord and get saved data
>>> for _, item in enumerate(reader.get_next()):
...     ori_data = item
>>> reader.close()
close()[source]

Stop reader worker and close file.

get_next()[source]

Yield a batch of data according to columns , one batch at a time.

Returns:

dict, a batch whose keys are the same as columns.

Raises:

MRMUnsupportedSchemaError – If schema is invalid.

len()[source]

Get the number of the samples in MindRecord.

Returns:

int, the number of the samples in MindRecord.

schema()[source]

Get the schema of the MindRecord.

Returns:

dict, the schema info.

class tinyms.data.MindPage(file_name, num_consumer=4)[source]

Class to read MindRecord files in pagination.

Parameters:
  • file_name (Union[str, list[str]]) – A single MindRecord file path or a list of file paths.

  • num_consumer (int, optional) – The number of reader workers which load data. Default: 4. It should not be smaller than 1 or larger than the number of processor cores.

Raises:
  • ParamValueError – If file_name or num_consumer is invalid.

  • MRMInitSegmentError – If failed to initialize ShardSegment.
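
Examples

A minimal paging sketch; the file path and the "label" category field are placeholders for a real MindRecord file and one of its candidate fields:

>>> from mindspore.mindrecord import MindPage
>>>
>>> mindrecord_file = "/path/to/mindrecord/file"
>>> reader = MindPage(mindrecord_file)
>>> fields = reader.candidate_fields
>>> status = reader.set_category_field("label")
>>> info = reader.read_category_info()
>>> rows = reader.read_at_page_by_id(category_id=0, page=0, num_row=16)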

property candidate_fields

Return candidate category fields.

Returns:

list[str], by which data could be grouped.

property category_field

Getter function for category fields.

Returns:

list[str], by which data could be grouped.

get_category_fields()[source]

Return candidate category fields.

Returns:

list[str], by which data could be grouped.

read_at_page_by_id(category_id, page, num_row)[source]

Query by category id in pagination.

Parameters:
  • category_id (int) – Category id, as returned by read_category_info .

  • page (int) – Index of page.

  • num_row (int) – Number of rows in a page.

Returns:

list[dict], data queried by category id.

Raises:
  • ParamValueError – If any parameter is invalid.

  • MRMFetchDataError – If failed to fetch data by category.

  • MRMUnsupportedSchemaError – If schema is invalid.

read_at_page_by_name(category_name, page, num_row)[source]

Query by category name in pagination.

Parameters:
  • category_name (str) – String of the category field’s value, as returned by read_category_info .

  • page (int) – Index of page.

  • num_row (int) – Number of rows in a page.

Returns:

list[dict], data queried by category name.

read_category_info()[source]

Return category information when data is grouped by indicated category field.

Returns:

str, description of group information.

Raises:

MRMReadCategoryInfoError – If failed to read category information.

set_category_field(category_field)[source]

Set category field for reading.

Note

Should be a candidate category field.

Parameters:

category_field (str) – String of category field name.

Returns:

MSRStatus, SUCCESS or FAILED.

class tinyms.data.Cifar10ToMR(source, destination)[source]

A class to transform from cifar10 to MindRecord.

Note

For details about Examples, please refer to Converting the CIFAR-10 Dataset .

Parameters:
  • source (str) – The cifar10 directory to be transformed.

  • destination (str) – MindRecord file path to transform into, ensure that the directory is created in advance and no file with the same name exists in the directory.

Raises:

ValueError – If source or destination is invalid.
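
Examples

A minimal usage sketch; both paths are placeholders, and the extracted CIFAR-10 data files are assumed to sit under source:

>>> from mindspore.mindrecord import Cifar10ToMR
>>>
>>> cifar10_transformer = Cifar10ToMR("/path/to/cifar10", "/path/to/output/cifar10.mindrecord")
>>> status = cifar10_transformer.transform(fields=["label"])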

run(fields=None)[source]

Execute transformation from cifar10 to MindRecord.

Parameters:

fields (list[str], optional) – A list of index fields. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index() .

Returns:

MSRStatus, SUCCESS or FAILED.

transform(fields=None)[source]

Encapsulate the mindspore.mindrecord.Cifar10ToMR.run() function to exit normally.

Parameters:

fields (list[str], optional) – A list of index fields. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index() .

Returns:

MSRStatus, SUCCESS or FAILED.

class tinyms.data.Cifar100ToMR(source, destination)[source]

A class to transform from cifar100 to MindRecord.

Note

For details about Examples, please refer to Converting the CIFAR-100 Dataset .

Parameters:
  • source (str) – The cifar100 directory to be transformed.

  • destination (str) – MindRecord file path to transform into, ensure that the directory is created in advance and no file with the same name exists in the directory.

Raises:

ValueError – If source or destination is invalid.
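
Examples

A minimal usage sketch; both paths are placeholders, and the extracted CIFAR-100 data files are assumed to sit under source:

>>> from mindspore.mindrecord import Cifar100ToMR
>>>
>>> cifar100_transformer = Cifar100ToMR("/path/to/cifar100", "/path/to/output/cifar100.mindrecord")
>>> status = cifar100_transformer.transform(fields=["fine_label", "coarse_label"])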

run(fields=None)[source]

Execute transformation from cifar100 to MindRecord.

Parameters:

fields (list[str], optional) – A list of index fields, e.g. [“fine_label”, “coarse_label”]. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index() .

Returns:

MSRStatus, SUCCESS or FAILED.

transform(fields=None)[source]

Encapsulate the mindspore.mindrecord.Cifar100ToMR.run() function to exit normally.

Parameters:

fields (list[str], optional) – A list of index fields, e.g. [“fine_label”, “coarse_label”]. Default: None. For index field settings, please refer to mindspore.mindrecord.FileWriter.add_index() .

Returns:

MSRStatus, SUCCESS or FAILED.

class tinyms.data.CsvToMR(source, destination, columns_list=None, partition_number=1)[source]

A class to transform from csv to MindRecord.

Note

For details about Examples, please refer to Converting CSV Dataset .

Parameters:
  • source (str) – The file path of csv.

  • destination (str) – The MindRecord file path to transform into, ensure that the directory is created in advance and no file with the same name exists in the directory.

  • columns_list (list[str], optional) – A list of columns to be read. Default: None.

  • partition_number (int, optional) – The partition size. Default: 1.

Raises:
  • ValueError – If source , destination , partition_number is invalid.

  • RuntimeError – If columns_list is invalid.
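
Examples

A minimal usage sketch; both paths are placeholders:

>>> from mindspore.mindrecord import CsvToMR
>>>
>>> csv_transformer = CsvToMR("/path/to/data.csv",
...                           "/path/to/output/data.mindrecord",
...                           partition_number=1)
>>> status = csv_transformer.transform()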

run()[source]

Execute transformation from csv to MindRecord.

Returns:

MSRStatus, SUCCESS or FAILED.

transform()[source]

Encapsulate the mindspore.mindrecord.CsvToMR.run() function to exit normally.

Returns:

MSRStatus, SUCCESS or FAILED.

class tinyms.data.ImageNetToMR(map_file, image_dir, destination, partition_number=1)[source]

A class to transform from imagenet to MindRecord.

Parameters:
  • map_file (str) –

    The map file that indicates the label. This file can be generated by the command ls -l [image_dir] | grep -vE "total|\." | awk -F " " '{print $9, NR-1;}' > [file_path] , where image_dir is the image directory containing the n01440764, n01443537, n01484850 and n15075141 directories, and file_path is the generated map_file . An example of map_file is as below:

    n01440764 0
    n01443537 1
    n01484850 2
    n01491361 3
    ...
    n15075141 999
    

  • image_dir (str) – Image directory that contains the n01440764, n01443537, n01484850 and n15075141 directories.

  • destination (str) – MindRecord file path to transform into, ensure that the directory is created in advance and no file with the same name exists in the directory.

  • partition_number (int, optional) – The partition size. Default: 1.

Raises:

ValueError – If map_file , image_dir or destination is invalid.
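
Examples

A minimal usage sketch; all three paths are placeholders, and map_file is assumed to follow the format shown above:

>>> from mindspore.mindrecord import ImageNetToMR
>>>
>>> imagenet_transformer = ImageNetToMR("/path/to/map_file",
...                                     "/path/to/imagenet/images",
...                                     "/path/to/output/imagenet.mindrecord",
...                                     partition_number=8)
>>> status = imagenet_transformer.transform()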

run()[source]

Execute transformation from imagenet to MindRecord.

Returns:

MSRStatus, SUCCESS or FAILED.

transform()[source]

Encapsulate the mindspore.mindrecord.ImageNetToMR.run() function to exit normally.

Returns:

MSRStatus, SUCCESS or FAILED.

class tinyms.data.MnistToMR(source, destination, partition_number=1)[source]

A class to transform from Mnist to MindRecord.

Parameters:
  • source (str) – Directory that contains t10k-images-idx3-ubyte.gz, train-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz and train-labels-idx1-ubyte.gz.

  • destination (str) – MindRecord file path to transform into, ensure that the directory is created in advance and no file with the same name exists in the directory.

  • partition_number (int, optional) – The partition size. Default: 1.

Raises:

ValueError – If source , destination , partition_number is invalid.
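
Examples

A minimal usage sketch; both paths are placeholders, and the four gzip files listed above are assumed to sit under source:

>>> from mindspore.mindrecord import MnistToMR
>>>
>>> mnist_transformer = MnistToMR("/path/to/mnist", "/path/to/output/mnist.mindrecord")
>>> status = mnist_transformer.transform()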

run()[source]

Execute transformation from Mnist to MindRecord.

Returns:

MSRStatus, SUCCESS or FAILED.

transform()[source]

Encapsulate the mindspore.mindrecord.MnistToMR.run() function to exit normally.

Returns:

MSRStatus, SUCCESS or FAILED.

class tinyms.data.TFRecordToMR(source, destination, feature_dict, bytes_fields=None)[source]

A class to transform from TFRecord to MindRecord.

Note

For details about Examples, please refer to Converting TFRecord Dataset .

Parameters:
  • source (str) – TFRecord file to be transformed.

  • destination (str) – MindRecord file path to transform into, ensure that the directory is created in advance and no file with the same name exists in the directory.

  • feature_dict (dict[str, FixedLenFeature]) – Dictionary that states the feature type, and FixedLenFeature is supported.

  • bytes_fields (list[str], optional) – The bytes fields which are in feature_dict and can be image bytes. Default: None, which means there is no bytes-dtype field such as an image.

Raises:
  • ValueError – If parameter is invalid.

  • Exception – If the tensorflow module is not found or its version is incorrect.
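
Examples

A minimal sketch, assuming TensorFlow is installed; the paths are placeholders and the field names in feature_dict are illustrative only:

>>> import tensorflow as tf
>>> from mindspore.mindrecord import TFRecordToMR
>>>
>>> feature_dict = {"file_name": tf.io.FixedLenFeature([], tf.string),
...                 "image_bytes": tf.io.FixedLenFeature([], tf.string),
...                 "label": tf.io.FixedLenFeature([], tf.int64)}
>>> tfrecord_transformer = TFRecordToMR("/path/to/data.tfrecord",
...                                     "/path/to/output/data.mindrecord",
...                                     feature_dict,
...                                     bytes_fields=["image_bytes"])
>>> status = tfrecord_transformer.transform()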

run()[source]

Execute transformation from TFRecord to MindRecord.

Returns:

MSRStatus, SUCCESS or FAILED.

tfrecord_iterator()[source]

Yield a dictionary whose keys are fields in schema.

Returns:

dict, data dictionary whose keys are the same as columns.

tfrecord_iterator_oldversion()[source]

Yield a dict whose keys are fields in the schema and whose values are data. This function is for old versions of tensorflow whose version number is < 2.1.0.

Returns:

dict, data dictionary whose keys are the same as columns.

transform()[source]

Encapsulate the mindspore.mindrecord.TFRecordToMR.run() function to exit normally.

Returns:

MSRStatus, SUCCESS or FAILED.

tinyms.data.download_dataset(dataset_name, local_path='.')[source]

This function is defined to easily download any public dataset without specifying many details.

Parameters:
  • dataset_name (str) – The official name of dataset, currently supports mnist, cifar10 and cifar100.

  • local_path (str) – Specifies the local location the dataset is downloaded to. Default: ‘.’.

Returns:

str, the location of the downloaded dataset.

Examples

>>> from tinyms.data import download_dataset
>>>
>>> ds_path = download_dataset('mnist')
tinyms.data.generate_image_list(dir_path, max_dataset_size=inf)[source]

Traverse the directory to generate a list of image paths.

Parameters:
  • dir_path (str) – image directory.

  • max_dataset_size (int) – Maximum number of return image paths.

Returns:

Image path list.
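
Examples

A minimal sketch; the directory is a placeholder for an image folder such as ‘{dataset_path}/testA’:

>>> from tinyms.data import generate_image_list
>>>
>>> image_list = generate_image_list('/path/to/testA', max_dataset_size=1000)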

tinyms.data.load_resized_img(path, width=256, height=256)[source]

Load an image in RGB mode and resize it to (width, height). Default: (256, 256).

Parameters:
  • path (str) – image path.

  • width (int) – image width, default: 256.

  • height (int) – image height, default: 256.

Returns:

PIL image class.

tinyms.data.load_img(path)[source]

Load an image in RGB mode.

Parameters:

path (str) – image path.

Returns:

PIL image class.
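
Examples

A minimal sketch chaining the helpers above; the directory is a placeholder:

>>> from tinyms.data import generate_image_list, load_img, load_resized_img
>>>
>>> image_list = generate_image_list('/path/to/testA')
>>> img = load_img(image_list[0])
>>> resized_img = load_resized_img(image_list[0], width=256, height=256)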