tinyms.text

This module supports text processing for NLP tasks. It is a high-performance text processing module built on ICU4C and cppjieba.

class tinyms.text.BertDatasetTransform[source]

Apply preprocessing operations on a GeneratorDataset instance.

class tinyms.text.Lookup(vocab, unknown_token=None, data_type=mindspore.int32)[source]

Look up a word and convert it to an id according to the input vocabulary table.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • unknown_token (str, optional) – Word used for lookup if the word being looked up is out-of-vocabulary (OOV). If unknown_token is OOV, a runtime error will be thrown (default=None).

  • data_type (mindspore.dtype, optional) – mindspore.dtype that Lookup maps strings to (default=mindspore.int32).

Examples

>>> # Load vocabulary from list
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # Use Lookup operator to map tokens to ids
>>> lookup = text.Lookup(vocab)
>>> text_file_dataset = text_file_dataset.map(operations=[lookup])
class tinyms.text.JiebaTokenizer(hmm_path, mp_path, mode=<JiebaMode.MIX: 0>, with_offsets=False)[source]

Tokenize Chinese string into words based on dictionary.

Note

The integrity of the HMMSegment algorithm and MPSegment algorithm files must be confirmed.

Parameters
  • hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mode (JiebaMode, optional) –

    Valid values can be any of [JiebaMode.MP, JiebaMode.HMM, JiebaMode.MIX] (default=JiebaMode.MIX).

    • JiebaMode.MP, tokenize with MPSegment algorithm.

    • JiebaMode.HMM, tokenize with Hidden Markov Model Segment algorithm.

    • JiebaMode.MIX, tokenize with a mix of MPSegment and HMMSegment algorithm.

  • with_offsets (bool, optional) – Whether or not to output offsets of tokens (default=False).

Examples

>>> from mindspore.dataset.text import JiebaMode
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start", "offsets_limit"],
...                                               column_order=["token", "offsets_start", "offsets_limit"])
add_dict(user_dict)[source]

Add user defined words to JiebaTokenizer's dictionary.

Parameters

user_dict (Union[str, dict]) –

Two loading methods are supported: loading from a file path (str), which must follow the Jieba dictionary format, or loading from a Python dictionary (dict) in the format {word1: freq1, word2: freq2, ...} (a file-based loading example is shown after the dictionary example below). The Jieba dictionary format is word (required) followed by freq (optional), for example:

word1 freq1
word2 None
word3 freq3

Examples

>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
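A user dictionary can also be loaded from a file path in the Jieba dictionary format described above (a minimal sketch; the path below is a placeholder):

>>> user_dict_file = "/path/to/user/dict/file"
>>> jieba_op.add_dict(user_dict_file)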
add_word(word, freq=None)[source]

Add a user defined word to JiebaTokenizer's dictionary.

Parameters
  • word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better chance the word will be tokenized (default=None, use default frequency).

Examples

>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
class tinyms.text.UnicodeCharTokenizer(with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string to Unicode characters.

Parameters

with_offsets (bool, optional) – Whether or not to output offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"],
...                                           column_order=["token", "offsets_start", "offsets_limit"])
class tinyms.text.Ngram(n, left_pad=('', 0), right_pad=('', 0), separator=' ')[source]

TensorOp to generate n-grams from a 1-D string Tensor.

Refer to https://en.wikipedia.org/wiki/N-gram#Examples for an overview of what an n-gram is and how it works.

Parameters
  • n (list[int]) – n in n-gram, which is a list of positive integers. For example, if n=[4, 3], then the result would be a 4-gram followed by a 3-gram in the same tensor. If the number of words is not enough to make up an n-gram, an empty string will be returned. For example, 3 grams on ["mindspore", "best"] will result in an empty string.

  • left_pad (tuple, optional) – Padding performed on the left side of the sequence, shaped like ("pad_token", pad_width). pad_width will be capped at n-1. For example, specifying left_pad=("_", 2) would pad the left side of the sequence with "__" (default=('', 0), no padding).

  • right_pad (tuple, optional) – Padding performed on the right side of the sequence, shaped like ("pad_token", pad_width). pad_width will be capped at n-1. For example, specifying right_pad=("_", 2) would pad the right side of the sequence with "__" (default=('', 0), no padding).

  • separator (str, optional) – Symbol used to join strings together. For example, if 2-gram is ["mindspore", "amazing"] with separator="-", the result would be ["mindspore-amazing"] (default=' ', which uses whitespace as the separator).

Examples

>>> ngram_op = text.Ngram(3, separator="-")
>>> output = ngram_op(["WildRose Country", "Canada's Ocean Playground", "Land of Living Skies"])
>>> # output
>>> # ["WildRose Country-Canada's Ocean Playground-Land of Living Skies"]
>>> # same ngram_op called through map
>>> text_file_dataset = text_file_dataset.map(operations=ngram_op)
class tinyms.text.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)[source]

Tokenize a scalar token or 1-D tokens to 1-D subword tokens.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • suffix_indicator (str, optional) – Used to show that the subword is the last part of a word (default='##').

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – When a token cannot be found: if unknown_token is an empty string, return the token directly, else return unknown_token (default='[UNK]').

  • with_offsets (bool, optional) – Whether or not to output offsets of tokens (default=False).

Examples

>>> vocab_list = ["book", "cholera", "era", "favor", "##ite", "my", "is", "love", "dur", "##ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                       max_bytes_per_token=100, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"],
...                                           column_order=["token", "offsets_start", "offsets_limit"])
class tinyms.text.TruncateSequencePair(max_length)[source]

Truncate a pair of rank-1 tensors such that the total length is less than max_length.

This operation takes two input tensors and returns two output Tensors.

Parameters

max_length (int) – Maximum length required.

Examples

>>> dataset = ds.NumpySlicesDataset(data={"col1": [[1, 2, 3]], "col2": [[4, 5]]})
>>> # Data before
>>> # |   col1    |   col2    |
>>> # +-----------+-----------+
>>> # | [1, 2, 3] |  [4, 5]   |
>>> # +-----------+-----------+
>>> truncate_sequence_pair_op = text.TruncateSequencePair(max_length=4)
>>> dataset = dataset.map(operations=truncate_sequence_pair_op)
>>> # Data after
>>> # |   col1    |   col2    |
>>> # +-----------+-----------+
>>> # |  [1, 2]   |  [4, 5]   |
>>> # +-----------+-----------+
class tinyms.text.ToNumber(data_type)[source]

Tensor operation to convert every element of a string tensor to a number.

Strings are cast according to the rules specified in the following links, except that any strings which represent negative numbers cannot be cast to an unsigned integer type: https://en.cppreference.com/w/cpp/string/basic_string/stof, https://en.cppreference.com/w/cpp/string/basic_string/stoul.

Parameters

data_type (mindspore.dtype) – mindspore.dtype to be cast to. Must be a numeric type.

Raises

RuntimeError – If strings are invalid to cast, or are out of range after being cast.

Examples

>>> import mindspore.common.dtype as mstype
>>> data = [["1", "2", "3"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> to_number_op = text.ToNumber(mstype.int8)
>>> dataset = dataset.map(operations=to_number_op)
class tinyms.text.SlidingWindow(width, axis=0)[source]

TensorOp to construct a tensor from data (only 1-D for now), where each element in the dimension axis is a slice of data starting at the corresponding position, with a specified width.

Parameters
  • width (int) – The width of the window. It must be an integer and greater than zero.

  • axis (int, optional) – The axis along which the sliding window is computed (default=0).

Examples

>>> dataset = ds.NumpySlicesDataset(data=[[1, 2, 3, 4, 5]], column_names="col1")
>>> # Data before
>>> # |      col1       |
>>> # +-----------------+
>>> # | [1, 2, 3, 4, 5] |
>>> # +-----------------+
>>> dataset = dataset.map(operations=text.SlidingWindow(3, 0))
>>> # Data after
>>> # |     col1     |
>>> # +--------------+
>>> # |  [[1, 2, 3], |
>>> # |   [2, 3, 4], |
>>> # |   [3, 4, 5]] |
>>> # +--------------+
class tinyms.text.SentencePieceTokenizer(mode, out_type)[source]

Tokenize a scalar token or 1-D tokens to tokens by sentencepiece.

Parameters
  • mode (Union[str, SentencePieceVocab]) – Either a string, which is the path to a SentencePiece model file, or a SentencePieceVocab object.

  • out_type (Union[str, int]) – The type of the output tokens.

Examples

>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)
class tinyms.text.PythonTokenizer(tokenizer)[source]

Callable class to be used for user-defined string tokenizer.

Parameters

tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.

Examples

>>> def my_tokenizer(line):
...     return line.split()
>>> text_file_dataset = text_file_dataset.map(operations=text.PythonTokenizer(my_tokenizer))
tinyms.text.to_str(array, encoding='utf8')[source]

Convert a NumPy array of bytes to an array of str by decoding each element based on the charset encoding.

Parameters
  • array (numpy.ndarray) – Array of type bytes representing strings.

  • encoding (str) – The charset used for decoding (default='utf8').

Returns

numpy.ndarray, NumPy array of str.
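Examples

A minimal usage sketch (the array contents below are illustrative):

>>> import numpy as np
>>> byte_array = np.array([b"deep", b"learning"])
>>> str_array = text.to_str(byte_array, encoding='utf8')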

tinyms.text.to_bytes(array, encoding='utf8')[source]

Convert a NumPy array of str to an array of bytes by encoding each element based on the charset encoding.

Parameters
  • array (numpy.ndarray) – Array of type str representing strings.

  • encoding (str) – The charset used for encoding (default='utf8').

Returns

numpy.ndarray, NumPy array of bytes.
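Examples

A minimal usage sketch (the array contents below are illustrative):

>>> import numpy as np
>>> str_array = np.array(["deep", "learning"])
>>> byte_array = text.to_bytes(str_array, encoding='utf8')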

class tinyms.text.Vocab[source]

Vocab object that is used to look up a word.

It contains a map that maps each word (str) to an id (int).

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)[source]

Build a vocab from a dataset.

This would collect all unique words in a dataset and return a vocab within the frequency range specified by the user in freq_range. The user would be warned if no words fall into the frequency range. Words in the vocab are ordered from highest frequency to lowest frequency. Words with the same frequency would be ordered lexicographically.

Parameters
  • dataset (Dataset) – dataset to build the vocab from.

  • columns (list[str], optional) – column names to get words from. It can be a list of column names (default=None, where all columns will be used; if any column is not of string type, an error will be returned).

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1. max_frequency > total_words is the same as max_frequency = total_words. min_frequency/max_frequency can be None, which corresponds to 0/total_words respectively (default=None, all words are included).

  • top_k (int, optional) – top_k > 0. Number of words to be built into the vocab. The top_k most frequent words are taken. top_k is applied after freq_range. If there are fewer than top_k words, all words will be taken (default=None, all words are included).

  • special_tokens (list, optional) – a list of strings, each one is a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – whether special_tokens will be prepended/appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the dataset.
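Examples

A minimal usage sketch, assuming text_file_dataset has a string column named "text":

>>> vocab = text.Vocab.from_dataset(text_file_dataset, columns=["text"], freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"], special_first=True)
>>> text_file_dataset = text_file_dataset.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])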

classmethod from_dict(word_dict)[source]

Build a vocab object from a dict.

Parameters

word_dict (dict) – A dict that contains word and id pairs, where the word should be str and the id should be int. The id is recommended to start from 0 and be continuous. A ValueError will be raised if the id is negative.

Returns

Vocab, vocab built from the dict.
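Examples

A minimal usage sketch (the word/id pairs below are illustrative):

>>> vocab = text.Vocab.from_dict({"<pad>": 0, "<unk>": 1, "home": 2, "behind": 3})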

classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)[source]

Build a vocab object from a file.

Parameters
  • file_path (str) – path to the file which contains the vocab list.

  • delimiter (str, optional) – a delimiter to break up each line in the file, the first element is taken to be the word (default='').

  • vocab_size (int, optional) – number of words to read from file_path (default=None, all words are taken).

  • special_tokens (list, optional) – a list of strings, each one is a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – whether special_tokens will be prepended/appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the file.
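Examples

A minimal usage sketch (the file path is a placeholder; the file is assumed to contain one word per line):

>>> vocab = text.Vocab.from_file("/path/to/vocab/file", delimiter="",
...                              special_tokens=["<pad>", "<unk>"], special_first=True)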

classmethod from_list(word_list, special_tokens=None, special_first=True)[source]

Build a vocab object from a list of words.

Parameters
  • word_list (list) – a list of strings, where each element is a word of type string.

  • special_tokens (list, optional) – a list of strings, each one is a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – whether special_tokens will be prepended/appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the list.
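Examples

A minimal usage sketch (the word list below is illustrative):

>>> vocab = text.Vocab.from_list(["deep", "learning", "text"], special_tokens=["<unk>"], special_first=True)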

class tinyms.text.SentencePieceVocab[source]

SentencePiece object that is used to segment words.

classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece vocab from a dataset.

Parameters
  • dataset (Dataset) – Dataset to build the SentencePiece vocab from.

  • col_names (list) – A list of column names.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

  • model_type (SentencePieceModel) – Choose from UNIGRAM (default), BPE, CHAR, or WORD. The input sentence must be pretokenized when using the WORD type.

  • params (dict) – A dictionary with no incoming parameters.

Returns

SentencePieceVocab, vocab built from the dataset.
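Examples

A minimal usage sketch, assuming text_file_dataset has a string column named "text":

>>> from mindspore.dataset.text import SentencePieceModel
>>> vocab = text.SentencePieceVocab.from_dataset(text_file_dataset, ["text"], 5000, 0.9995,
...                                              SentencePieceModel.UNIGRAM, {})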

classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a file.

Parameters
  • file_path (list) – Path to the file which contains the sentencepiece list.

  • vocab_size (int) – Vocabulary size (of type uint32_t).

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

  • model_type (SentencePieceModel) – Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using the word type.

  • params (dict) –

    A dictionary with no incoming parameters (the parameters are derived from the SentencePiece library), for example:

    input_sentence_size 0
    max_sentencepiece_length 16

Returns

SentencePieceVocab, vocab built from the file.
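Examples

A minimal usage sketch (the file path below is a placeholder):

>>> from mindspore.dataset.text import SentencePieceModel
>>> vocab = text.SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})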

classmethod save_model(vocab, path, filename)[source]

Save the model to the given file path.

Parameters
  • vocab (SentencePieceVocab) – A SentencePiece vocab object.

  • path (str) – Path to store the model.

  • filename (str) – The name of the file.
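Examples

A minimal usage sketch, assuming vocab is a SentencePieceVocab built as above (the path and filename are placeholders):

>>> text.SentencePieceVocab.save_model(vocab, "./", "m.model")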

class tinyms.text.SentencePieceModel[source]

An enumeration for SentencePieceModel, effective enumeration types are UNIGRAM, BPE, CHAR, WORD.

class tinyms.text.SPieceTokenizerOutType[source]

An enumeration for SPieceTokenizerOutType, effective enumeration types are STRING, INT.

class tinyms.text.SPieceTokenizerLoadType[source]

An enumeration for SPieceTokenizerLoadType, effective enumeration types are FILE, MODEL.

class tinyms.text.Compose(transforms)[source]

Compose a list of transforms into a single transform.

Parameters

transforms (list) – List of transformations to be applied.

Examples

>>> compose = c_transforms.Compose([c_vision.Decode(), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=compose)
class tinyms.text.Concatenate(axis=0, prepend=None, append=None)[source]

Tensor operation that concatenates all columns into a single tensor.

Parameters
  • axis (int, optional) – Concatenate the tensors along the given axis (default=0).

  • prepend (numpy.array, optional) – NumPy array to be prepended to the already concatenated tensors (default=None).

  • append (numpy.array, optional) – NumPy array to be appended to the already concatenated tensors (default=None).

Examples

>>> import numpy as np
>>> # concatenate string
>>> prepend_tensor = np.array(["dw", "df"], dtype='S')
>>> append_tensor = np.array(["dwsdf", "df"], dtype='S')
>>> concatenate_op = c_transforms.Concatenate(0, prepend_tensor, append_tensor)
>>> data = [["This","is","a","string"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> dataset = dataset.map(operations=concatenate_op)
class tinyms.text.Duplicate[source]

Duplicate the input tensor to the output. Only one column can be transformed at a time.

Examples

>>> # Data before
>>> # |  x      |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1,2,3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["x"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Duplicate(),
...                                                 input_columns=["x"],
...                                                 output_columns=["x", "y"],
...                                                 column_order=["x", "y"])
>>> # Data after
>>> # |  x      |  y      |
>>> # +---------+---------+
>>> # | [1,2,3] | [1,2,3] |
>>> # +---------+---------+
class tinyms.text.Fill(fill_value)[source]

Tensor operation to fill all elements in the tensor with the specified value. The output tensor will have the same shape and type as the input tensor.

Parameters

fill_value (Union[str, bytes, int, float, bool]) – scalar value to fill the tensor with.

Examples

>>> import numpy as np
>>> # generate a 1D integer numpy array from 0 to 4
>>> def generator_1d():
...     for i in range(5):
...         yield (np.array([i]),)
>>> generator_dataset = ds.GeneratorDataset(generator_1d, column_names="col1")
>>> # [[0], [1], [2], [3], [4]]
>>> fill_op = c_transforms.Fill(3)
>>> generator_dataset = generator_dataset.map(operations=fill_op)
>>> # [[3], [3], [3], [3], [3]]
class tinyms.text.Mask(operator, constant, dtype=mindspore.bool)[source]

Mask content of the input tensor with the given predicate. Any element of the tensor that matches the predicate will be evaluated to True, otherwise False.

Parameters
  • operator (Relational) – One of the relational operators: EQ, NE, LT, GT, LE or GE.

  • constant (Union[str, int, float, bool]) – Constant to be compared to. The constant will be cast to the type of the input tensor.

  • dtype (mindspore.dtype, optional) – Type of the generated mask (default=bool).

Examples

>>> from mindspore.dataset.transforms.c_transforms import Relational
>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Mask(Relational.EQ, 2))
>>> # Data after
>>> # |        col         |
>>> # +--------------------+
>>> # | [False,True,False] |
>>> # +--------------------+
class tinyms.text.OneHot(num_classes)[source]

Tensor operation to apply one hot encoding.

Parameters

num_classes (int) – Number of classes of objects in the dataset. It should be larger than the largest label number in the dataset.

Raises

RuntimeError – If the feature size is bigger than num_classes.

Examples

>>> # Assume that dataset has 10 classes, thus the label ranges from 0 to 9
>>> onehot_op = c_transforms.OneHot(num_classes=10)
>>> mnist_dataset = mnist_dataset.map(operations=onehot_op, input_columns=["label"])
class tinyms.text.PadEnd(pad_shape, pad_value=None)[source]

Pad the input tensor according to pad_shape; pad_shape needs to have the same rank as the input.

Parameters
  • pad_shape (list(int)) – List of integers representing the shape needed. Dimensions set to None will not be padded (i.e., the original dim will be used). Shorter dimensions will truncate the values.

  • pad_value (Union[str, bytes, int, float, bool], optional) – Value used to pad. Defaults to 0 or an empty string in the case of string tensors.

Examples

>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.PadEnd(pad_shape=[4],
...                                                                                pad_value=10))
>>> # Data after
>>> # |    col     |
>>> # +------------+
>>> # | [1,2,3,10] |
>>> # +------------+
class tinyms.text.RandomApply(transforms, prob=0.5)[source]

Randomly perform a series of transforms with a given probability.

Parameters
  • transforms (list) – List of transformations to be applied.

  • prob (float, optional) – The probability to apply the transformation list (default=0.5).

Examples

>>> rand_apply = c_transforms.RandomApply([c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_apply)
class tinyms.text.RandomChoice(transforms)[source]

Randomly select one transform from a list of transforms to perform the operation.

Parameters

transforms (list) – List of transformations to be chosen from to apply.

Examples

>>> rand_choice = c_transforms.RandomChoice([c_vision.CenterCrop(50), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_choice)
class tinyms.text.Slice(*slices)[source]

Slice operation to extract a tensor out using the given n slices.

The functionality of Slice is similar to NumPy's indexing feature (currently only rank-1 tensors are supported).

Parameters

slices (Union[int, list[int], slice, None, Ellipsis]) –

Maximum n number of arguments to slice a tensor of rank n. One object in slices can be one of:

  1. int: Slice this index only along the first dimension. Negative index is supported.

  2. list(int): Slice these indices along the first dimension. Negative indices are supported.

  3. slice: Slice the generated indices from the slice object along the first dimension. Similar to start:stop:step.

  4. None: Slice the whole dimension. Similar to [:] in Python indexing.

  5. Ellipsis: Slice the whole dimension, same result with None.

Examples

>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> # slice indices 1 and 2 only
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Slice(slice(1,3)))
>>> # Data after
>>> # |   col   |
>>> # +---------+
>>> # |  [2,3]  |
>>> # +---------+
class tinyms.text.TypeCast(data_type)[source]

Tensor operation to cast to a given MindSpore data type.

Parameters

data_type (mindspore.dtype) – mindspore.dtype to be cast to.

Examples

>>> import numpy as np
>>> import mindspore.common.dtype as mstype
>>>
>>> # Generate 1d int numpy array from 0 - 63
>>> def generator_1d():
...     for i in range(64):
...         yield (np.array([i]),)
>>>
>>> dataset = ds.GeneratorDataset(generator_1d, column_names='col')
>>> type_cast_op = c_transforms.TypeCast(mstype.int32)
>>> dataset = dataset.map(operations=type_cast_op)
class tinyms.text.Unique[source]

Perform the unique operation on the input tensor. Only one column can be transformed at a time.

Return 3 tensors: the unique output tensor, the index tensor, and the count tensor.

The unique output tensor contains all the unique elements of the input tensor in the same order that they occur in the input tensor.

The index tensor contains the index of each element of the input tensor in the unique output tensor.

The count tensor contains the count of each element of the output tensor in the input tensor.

Note

Call the batch op before calling this function.

Examples

>>> # Data before
>>> # |  x                 |
>>> # +--------------------+
>>> # | [[0,1,2], [1,2,3]] |
>>> # +--------------------+
>>> data = [[[0,1,2], [1,2,3]]]
>>> dataset = ds.NumpySlicesDataset(data, ["x"])
>>> dataset = dataset.map(operations=c_transforms.Unique(),
...                       input_columns=["x"],
...                       output_columns=["x", "y", "z"],
...                       column_order=["x", "y", "z"])
>>> # Data after
>>> # |     x     |       y       |     z     |
>>> # +-----------+---------------+-----------+
>>> # | [0,1,2,3] | [0,1,2,1,2,3] | [1,2,2,1] |
>>> # +-----------+---------------+-----------+