tinyms.text

This module supports text processing for NLP tasks. It is a high-performance NLP text processing module developed with ICU4C and cppjieba.

class tinyms.text.BertDatasetTransform[source]

Apply a preprocess operation on a GeneratorDataset instance.

class tinyms.text.Lookup(vocab, unknown_token=None, data_type=mindspore.int32)[source]

Look up a word and return its id according to the input vocabulary table.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • unknown_token (str, optional) – Word used when a lookup target is out of vocabulary (OOV). If a word is OOV, the lookup result is replaced with the id of unknown_token. If unknown_token is not specified, or the specified unknown_token is itself OOV, a runtime error will be thrown (default=None, meaning no unknown_token is specified).

  • data_type (mindspore.dtype, optional) – The data type that the lookup operation maps strings to (default=mindspore.int32).

Examples

>>> # Load vocabulary from list
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # Use Lookup operator to map tokens to ids
>>> lookup = text.Lookup(vocab)
>>> text_file_dataset = text_file_dataset.map(operations=[lookup])
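
To handle out-of-vocabulary words, pass an unknown_token that is itself in the vocabulary; a minimal sketch (the token list is illustrative):

>>> # Map any OOV word to the id of '<unk>'
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您', '<unk>'])
>>> lookup = text.Lookup(vocab, unknown_token='<unk>')
>>> text_file_dataset = text_file_dataset.map(operations=[lookup])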
class tinyms.text.JiebaTokenizer(hmm_path, mp_path, mode=<JiebaMode.MIX: 0>, with_offsets=False)[source]

Tokenize a Chinese string into words based on a dictionary.

Note

The integrity of the HMMSegment algorithm and MPSegment algorithm dictionary files must be confirmed.

Parameters
  • hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mode (JiebaMode, optional) –

    Valid values can be any of [JiebaMode.MP, JiebaMode.HMM, JiebaMode.MIX] (default=JiebaMode.MIX).

    • JiebaMode.MP, tokenize with MPSegment algorithm.

    • JiebaMode.HMM, tokenize with Hidden Markov Model Segment algorithm.

    • JiebaMode.MIX, tokenize with a mix of the MPSegment and HMMSegment algorithms.

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> from mindspore.dataset.text import JiebaMode
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start", "offsets_limit"],
...                                               column_order=["token", "offsets_start", "offsets_limit"])
add_dict(user_dict)[source]

Add a user defined word to JiebaTokenizer’s dictionary.

Parameters

user_dict (Union[str, dict]) –

There are two ways to load the user dictionary: from a file path (str), following the Jieba dictionary format, or from a Python dictionary (dict) in the format {word1: freq1, word2: freq2, …}. The Jieba dictionary format has one entry per line with word (required) and freq (optional), such as:

word1 freq1
word2 None
word3 freq3

Only valid word-freq pairs in the user-provided file will be added to the dictionary. Rows containing invalid input will be ignored; no error or warning status is returned.

Examples

>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
add_word(word, freq=None)[source]

Add a user defined word to JiebaTokenizer’s dictionary.

Parameters
  • word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better the chance that the word will be tokenized (default=None, use default frequency).

Examples

>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
class tinyms.text.UnicodeCharTokenizer(with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string to Unicode characters.

Parameters

with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"],
...                                           column_order=["token", "offsets_start", "offsets_limit"])
class tinyms.text.Ngram(n, left_pad=('', 0), right_pad=('', 0), separator=' ')[source]

TensorOp to generate n-gram from a 1-D string Tensor.

Refer to https://en.wikipedia.org/wiki/N-gram#Examples for an overview of what n-gram is and how it works.

Parameters
  • n (list[int]) – n in n-gram, which is a list of positive integers. For example, if n=[4, 3], the result would be a 4-gram followed by a 3-gram in the same tensor. If there are not enough words to make up an n-gram, an empty string is returned. For example, a 3-gram on [“mindspore”, “best”] will produce an empty string.

  • left_pad (tuple, optional) – Padding performed on the left side of the sequence, shaped like (“pad_token”, pad_width). pad_width will be capped at n-1. For example, specifying left_pad=(“_”, 2) would pad the left side of the sequence with “__” (default=('', 0), no padding); see the padding sketch after the examples below.

  • right_pad (tuple, optional) – Padding performed on the right side of the sequence, shaped like (“pad_token”, pad_width). pad_width will be capped at n-1. For example, specifying right_pad=(“_”, 2) would pad the right side of the sequence with “__” (default=('', 0), no padding).

  • separator (str, optional) – Symbol used to join strings together. For example, if a 2-gram is [“mindspore”, “amazing”] with separator=”-”, the result would be [“mindspore-amazing”] (default=' ', a space is used as the separator).

Examples

>>> ngram_op = text.Ngram(3, separator="-")
>>> output = ngram_op(["WildRose Country", "Canada's Ocean Playground", "Land of Living Skies"])
>>> # output
>>> # ["WildRose Country-Canada's Ocean Playground-Land of Living Skies"]
>>> # same ngram_op called through map
>>> text_file_dataset = text_file_dataset.map(operations=ngram_op)
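
The padding parameters can be sketched in the same way; the expected output comment follows the documented behaviour (the input words are illustrative):

>>> # Pad both ends with "_" before generating 2-grams
>>> ngram_op = text.Ngram(2, left_pad=("_", 1), right_pad=("_", 1), separator="-")
>>> output = ngram_op(["mindspore", "amazing"])
>>> # output
>>> # ["_-mindspore", "mindspore-amazing", "amazing-_"]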
class tinyms.text.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)[source]

Tokenize scalar token or 1-D tokens to 1-D subword tokens.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • suffix_indicator (str, optional) – Used to show that the subword is the last part of a word (default=’##’).

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – When a token cannot be found: if ‘unknown_token’ is empty string, return the token directly, else return ‘unknown_token’ (default=’[UNK]’).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> vocab_list = ["book", "cholera", "era", "favor", "##ite", "my", "is", "love", "dur", "##ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                       max_bytes_per_token=100, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"],
...                                           column_order=["token", "offsets_start", "offsets_limit"])
class tinyms.text.TruncateSequencePair(max_length)[source]

Truncate a pair of rank-1 tensors such that the total length is less than max_length.

This operation takes two input tensors and returns two output Tensors.

Parameters

max_length (int) – Maximum length required.

Examples

>>> dataset = ds.NumpySlicesDataset(data={"col1": [[1, 2, 3]], "col2": [[4, 5]]})
>>> # Data before
>>> # |   col1    |   col2    |
>>> # +-----------+-----------|
>>> # | [1, 2, 3] |  [4, 5]   |
>>> # +-----------+-----------+
>>> truncate_sequence_pair_op = text.TruncateSequencePair(max_length=4)
>>> dataset = dataset.map(operations=truncate_sequence_pair_op)
>>> # Data after
>>> # |   col1    |   col2    |
>>> # +-----------+-----------+
>>> # |  [1, 2]   |  [4, 5]   |
>>> # +-----------+-----------+
class tinyms.text.ToNumber(data_type)[source]

Tensor operation to convert every element of a string tensor to a number.

Strings are cast according to the rules specified in the following links, except that any string representing a negative number cannot be cast to an unsigned integer type: https://en.cppreference.com/w/cpp/string/basic_string/stof, https://en.cppreference.com/w/cpp/string/basic_string/stoul.

Parameters

data_type (mindspore.dtype) – Type to be cast to. Must be a numeric type in mindspore.dtype.

Raises

RuntimeError – If strings are invalid to cast, or are out of range after being cast.

Examples

>>> from mindspore import dtype as mstype
>>> data = [["1", "2", "3"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> to_number_op = text.ToNumber(mstype.int8)
>>> dataset = dataset.map(operations=to_number_op)
class tinyms.text.SlidingWindow(width, axis=0)[source]

Construct a tensor from the given data (only 1-D data is supported for now), where each element along the dimension axis is a slice of the data starting at the corresponding position, with the specified width.

Parameters
  • width (int) – The width of the window. It must be an integer and greater than zero.

  • axis (int, optional) – The axis along which the sliding window is computed (default=0).

Examples

>>> dataset = ds.NumpySlicesDataset(data=[[1, 2, 3, 4, 5]], column_names="col1")
>>> # Data before
>>> # |       col1        |
>>> # +-------------------+
>>> # | [[1, 2, 3, 4, 5]] |
>>> # +-------------------+
>>> dataset = dataset.map(operations=text.SlidingWindow(3, 0))
>>> # Data after
>>> # |     col1     |
>>> # +--------------+
>>> # |  [[1, 2, 3], |
>>> # |   [2, 3, 4], |
>>> # |   [3, 4, 5]] |
>>> # +--------------+
class tinyms.text.SentencePieceTokenizer(mode, out_type)[source]

Tokenize a scalar token or 1-D tokens into tokens with the SentencePiece tokenizer.

Parameters
  • mode (Union[str, SentencePieceVocab]) – If the input parameter is a file path, then its type should be str. If the input parameter is a SentencePieceVocab object, then its type should be SentencePieceVocab.

  • out_type (SPieceTokenizerOutType) –

    The type of output, it can be any of [SPieceTokenizerOutType.STRING, SPieceTokenizerOutType.INT].

    • SPieceTokenizerOutType.STRING, means the output type of the SentencePiece tokenizer is string.

    • SPieceTokenizerOutType.INT, means the output type of the SentencePiece tokenizer is int.

Examples

>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)
class tinyms.text.PythonTokenizer(tokenizer)[source]

Class that applies a user-defined string tokenizer to the input string.

Parameters

tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.

Examples

>>> def my_tokenizer(line):
...     return line.split()
>>> text_file_dataset = text_file_dataset.map(operations=text.PythonTokenizer(my_tokenizer))
tinyms.text.to_str(array, encoding='utf8')[source]

Convert NumPy array of bytes to array of str by decoding each element based on charset encoding.

Parameters
  • array (numpy.ndarray) – Array of bytes type representing strings.

  • encoding (str) – The charset used for decoding (default='utf8').

Returns

numpy.ndarray, NumPy array of str.

Examples

>>> text_file_dataset_dir = ["/path/to/text_file_dataset_file"]
>>> dataset = ds.TextFileDataset(dataset_files=text_file_dataset_dir, shuffle=False)
>>> for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(text.to_str(item["text"]))
tinyms.text.to_bytes(array, encoding='utf8')[source]

Convert NumPy array of str to array of bytes by encoding each element based on charset encoding.

Parameters
  • array (numpy.ndarray) – Array of str type representing strings.

  • encoding (str) – The charset used for encoding (default='utf8').

Returns

numpy.ndarray, NumPy array of bytes.
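
Examples

A minimal usage sketch (the array contents are illustrative):

>>> import numpy as np
>>> str_array = np.array(["welcome", "to", "Shenzhen"])
>>> # Encode each string element to bytes using the utf8 charset
>>> bytes_array = text.to_bytes(str_array, encoding='utf8')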

class tinyms.text.Vocab[source]

Vocab object that is used to look up a word.

It contains a map that associates each word (str) with an id (int).

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)[source]

Build a vocab from a dataset.

This collects all unique words in the dataset and returns a vocab containing the words that fall within the frequency range specified by freq_range. The user is warned if no words fall into the frequency range. Words in the vocab are ordered from highest frequency to lowest frequency. Words with the same frequency are ordered lexicographically.

Parameters
  • dataset (Dataset) – dataset to build vocab from.

  • columns (list[str], optional) – Column names to get words from. It can be a list of column names (default=None, where all columns are used; an error is returned if any selected column is not of string type).

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1. max_frequency > total_words is the same as max_frequency = total_words. min_frequency/max_frequency can be None, which corresponds to 0/total_words respectively (default=None, all words are included).

  • top_k (int, optional) – Number of words to be built into the vocab; must be greater than 0. The top_k most frequent words are taken, after freq_range is applied. If there are fewer than top_k words, all words will be taken (default=None, all words are included).

  • special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=[“<pad>”,”<unk>”] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens will be prepended/appended to vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the dataset.

Examples

>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.Vocab.from_dataset(dataset, "text", freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> dataset = dataset.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])
classmethod from_dict(word_dict)[source]

Build a vocab object from a dict.

Parameters

word_dict (dict) – A dict containing word and id pairs, where the word should be a str and the id an int. The id is recommended to start from 0 and be continuous; a ValueError will be raised if the id is negative.

Returns

Vocab, vocab built from the dict.

Examples

>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)[source]

Build a vocab object from a file.

Parameters
  • file_path (str) – Path to the file which contains the vocab list.

  • delimiter (str, optional) – A delimiter to break up each line in file, the first element is taken to be the word (default=””).

  • vocab_size (int, optional) – Number of words to read from file_path (default=None, all words are taken).

  • special_tokens (list, optional) – A list of strings, each one is a special token. for example special_tokens=[“<pad>”,”<unk>”] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens will be prepended/appended to vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the file.

Examples

>>> vocab = text.Vocab.from_file("/path/to/simple/vocab/file", ",", None, ["<pad>", "<unk>"], True)
classmethod from_list(word_list, special_tokens=None, special_first=True)[source]

Build a vocab object from a list of words.

Parameters
  • word_list (list) – A list of string where each element is a word of type string.

  • special_tokens (list, optional) – A list of strings, each one is a special token. for example special_tokens=[“<pad>”,”<unk>”] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens is prepended or appended to vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the list.

Examples

>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
class tinyms.text.SentencePieceVocab[source]

SentencePiece object that is used for word segmentation.

classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece from a dataset.

Parameters
  • dataset (Dataset) – Dataset to build SentencePiece.

  • col_names (list) – The list of column names to build the vocab from.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

  • model_type (SentencePieceModel) –

    It can be any of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD], default is SentencePieceModel.UNIGRAM. The input sentence must be pre-tokenized when using SentencePieceModel.WORD type.

    • SentencePieceModel.UNIGRAM, Unigram Language Model means the next word in the sentence is assumed to be independent of the previous words generated by the model.

    • SentencePieceModel.BPE, refers to byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel.CHAR, refers to the char-based SentencePiece model type.

    • SentencePieceModel.WORD, refers to the word-based SentencePiece model type.

  • params (dict) – A dictionary with no incoming parameters (an empty dict can be passed).

Returns

SentencePieceVocab, vocab built from the dataset.

Examples

>>> from mindspore.dataset.text import SentencePieceModel
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.SentencePieceVocab.from_dataset(dataset, ["text"], 5000, 0.9995,
...                                              SentencePieceModel.UNIGRAM, {})
classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a vocab file.

Parameters
  • file_path (list) – Path to the file which contains the SentencePiece list.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

  • model_type (SentencePieceModel) –

    It can be any of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD], default is SentencePieceModel.UNIGRAM. The input sentence must be pre-tokenized when using SentencePieceModel.WORD type.

    • SentencePieceModel.UNIGRAM, Unigram Language Model means the next word in the sentence is assumed to be independent of the previous words generated by the model.

    • SentencePieceModel.BPE, refers to byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel.CHAR, refers to the char-based SentencePiece model type.

    • SentencePieceModel.WORD, refers to the word-based SentencePiece model type.

  • params (dict) –

    A dictionary with no incoming parameters (the parameters are derived from the SentencePiece library), for example:

    input_sentence_size 0
    max_sentencepiece_length 16

Returns

SentencePieceVocab, vocab built from the file.

Examples

>>> from mindspore.dataset.text import SentencePieceModel
>>> vocab = text.SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
classmethod save_model(vocab, path, filename)[source]

Save the model to the given file path.

Parameters
  • vocab (SentencePieceVocab) – A SentencePiece object.

  • path (str) – Path to store model.

  • filename (str) – The name of the file.

Examples

>>> from mindspore.dataset.text import SentencePieceModel
>>> vocab = text.SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> text.SentencePieceVocab.save_model(vocab, "./", "m.model")
class tinyms.text.SentencePieceModel[source]

An enumeration for SentencePieceModel.

Possible enumeration values are: SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD.

  • SentencePieceModel.UNIGRAM: Unigram Language Model means the next word in the sentence is assumed to be independent of the previous words generated by the model.

  • SentencePieceModel.BPE: refers to byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

  • SentencePieceModel.CHAR: refers to the char-based SentencePiece model type.

  • SentencePieceModel.WORD: refers to the word-based SentencePiece model type.
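
The model type is selected when building a SentencePieceVocab; a minimal sketch with a placeholder vocab file path:

>>> from mindspore.dataset.text import SentencePieceModel
>>> # Build a BPE vocab instead of the default UNIGRAM model
>>> vocab = text.SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                           SentencePieceModel.BPE, {})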

class tinyms.text.SPieceTokenizerOutType[source]

An enumeration for SPieceTokenizerOutType.

Possible enumeration values are: SPieceTokenizerOutType.STRING, SPieceTokenizerOutType.INT.

  • SPieceTokenizerOutType.STRING: means the output type of the SentencePiece tokenizer is string.

  • SPieceTokenizerOutType.INT: means the output type of the SentencePiece tokenizer is int.
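
The output type is passed to SentencePieceTokenizer via its out_type parameter; a minimal sketch with a placeholder vocab file path:

>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>> vocab = text.SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> # Emit token ids (int) instead of token strings
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.INT)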

class tinyms.text.SPieceTokenizerLoadType[source]

An enumeration for SPieceTokenizerLoadType.

Possible enumeration values are: SPieceTokenizerLoadType.FILE, SPieceTokenizerLoadType.MODEL.

  • SPieceTokenizerLoadType.FILE: Load sentencepiece tokenizer from local sentencepiece vocab file.

  • SPieceTokenizerLoadType.MODEL: Load the sentencepiece tokenizer from a SentencePieceVocab instance.
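
The load type corresponds to the kind of mode argument passed to SentencePieceTokenizer (a file path string or a SentencePieceVocab instance); a minimal sketch with a placeholder model file path:

>>> from mindspore.dataset.text import SPieceTokenizerOutType
>>> # Load the tokenizer from a sentencepiece model file on disk
>>> tokenizer = text.SentencePieceTokenizer("/path/to/sentencepiece/model/file",
...                                         out_type=SPieceTokenizerOutType.STRING)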

class tinyms.text.Compose(transforms)[source]

Compose a list of transforms into a single transform.

Parameters

transforms (list) – List of transformations to be applied.

Examples

>>> compose = c_transforms.Compose([c_vision.Decode(), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=compose)
class tinyms.text.Concatenate(axis=0, prepend=None, append=None)[source]

Tensor operation that concatenates all columns into a single tensor.

Parameters
  • axis (int, optional) – Concatenate the tensors along given axis (Default=0).

  • prepend (numpy.array, optional) – NumPy array to be prepended to the already concatenated tensors (Default=None).

  • append (numpy.array, optional) – NumPy array to be appended to the already concatenated tensors (Default=None).

Examples

>>> import numpy as np
>>> # concatenate string
>>> prepend_tensor = np.array(["dw", "df"], dtype='S')
>>> append_tensor = np.array(["dwsdf", "df"], dtype='S')
>>> concatenate_op = c_transforms.Concatenate(0, prepend_tensor, append_tensor)
>>> data = [["This","is","a","string"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> dataset = dataset.map(operations=concatenate_op)
class tinyms.text.Duplicate[source]

Duplicate the input tensor to the output; only one column can be transformed at a time.

Examples

>>> # Data before
>>> # |  x      |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1,2,3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["x"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Duplicate(),
...                                                 input_columns=["x"],
...                                                 output_columns=["x", "y"],
...                                                 column_order=["x", "y"])
>>> # Data after
>>> # |  x      |  y      |
>>> # +---------+---------+
>>> # | [1,2,3] | [1,2,3] |
>>> # +---------+---------+
class tinyms.text.Fill(fill_value)[source]

Tensor operation to fill all elements in the tensor with the specified value. The output tensor will have the same shape and type as the input tensor.

Parameters

fill_value (Union[str, bytes, int, float, bool]) – scalar value to fill the tensor with.

Examples

>>> import numpy as np
>>> # generate a 1D integer numpy array from 0 to 4
>>> def generator_1d():
...     for i in range(5):
...         yield (np.array([i]),)
>>> generator_dataset = ds.GeneratorDataset(generator_1d, column_names="col1")
>>> # [[0], [1], [2], [3], [4]]
>>> fill_op = c_transforms.Fill(3)
>>> generator_dataset = generator_dataset.map(operations=fill_op)
>>> # [[3], [3], [3], [3], [3]]
class tinyms.text.Mask(operator, constant, dtype=mindspore.bool_)[source]

Mask content of the input tensor with the given predicate. Any element of the tensor that matches the predicate will be evaluated as True, otherwise False.

Parameters
  • operator (Relational) – Relational operator; it can be any of [Relational.EQ, Relational.NE, Relational.LT, Relational.GT, Relational.LE, Relational.GE]. Taking Relational.EQ as an example, EQ refers to equal.

  • constant (Union[str, int, float, bool]) – Constant to be compared to. Constant will be cast to the type of the input tensor.

  • dtype (mindspore.dtype, optional) – Type of the generated mask (Default mstype.bool_).

Examples

>>> from mindspore.dataset.transforms.c_transforms import Relational
>>> # Data before
>>> # |  col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Mask(Relational.EQ, 2))
>>> # Data after
>>> # |       col         |
>>> # +--------------------+
>>> # | [False,True,False] |
>>> # +--------------------+
class tinyms.text.OneHot(num_classes)[source]

Tensor operation to apply one hot encoding.

Parameters

num_classes (int) – Number of classes of objects in dataset. It should be larger than the largest label number in the dataset.

Raises

RuntimeError – If the feature size is bigger than num_classes.

Examples

>>> # Assume that dataset has 10 classes, thus the label ranges from 0 to 9
>>> onehot_op = c_transforms.OneHot(num_classes=10)
>>> mnist_dataset = mnist_dataset.map(operations=onehot_op, input_columns=["label"])
class tinyms.text.PadEnd(pad_shape, pad_value=None)[source]

Pad the input tensor according to pad_shape; the input tensor's rank must equal the length of pad_shape.

Parameters
  • pad_shape (list(int)) – List of integers representing the target shape. Dimensions set to None will not be padded (i.e., the original dimension will be used). Dimensions shorter than the original will truncate the values.

  • pad_value (Union[str, bytes, int, float, bool], optional) – Value used to pad. Defaults to 0, or an empty string in the case of string tensors.

Examples

>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------|
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.PadEnd(pad_shape=[4],
...                                                                                pad_value=10))
>>> # Data after
>>> # |    col     |
>>> # +------------+
>>> # | [1,2,3,10] |
>>> # +------------|
class tinyms.text.RandomApply(transforms, prob=0.5)[source]

Randomly perform a series of transforms with a given probability.

Parameters
  • transforms (list) – List of transformations to be applied.

  • prob (float, optional) – The probability to apply the transformation list (default=0.5).

Examples

>>> rand_apply = c_transforms.RandomApply([c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_apply)
class tinyms.text.RandomChoice(transforms)[source]

Randomly select one transform from a list of transforms and apply it.

Parameters

transforms (list) – List of transformations to be chosen from to apply.

Examples

>>> rand_choice = c_transforms.RandomChoice([c_vision.CenterCrop(50), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_choice)
class tinyms.text.Slice(*slices)[source]

Slice operation to extract a tensor out using the given n slices.

The functionality of Slice is similar to NumPy’s indexing feature (Currently only rank-1 tensors are supported).

Parameters

slices (Union[int, list[int], slice, None, Ellipsis]) –

Maximum of n arguments to slice a tensor of rank n. Each object in slices can be one of the following (a short sketch of these forms follows the example below):

  1. int: Slice this index only along the first dimension. Negative index is supported.

  2. list(int): Slice these indices along the first dimension. Negative indices are supported.

  3. slice: Slice the generated indices from the slice object along the first dimension. Similar to start:stop:step.

  4. None: Slice the whole dimension. Similar to [:] in Python indexing.

  5. Ellipsis: Slice the whole dimension, same result with None.

Examples

>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------|
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> # slice indices 1 and 2 only
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Slice(slice(1,3)))
>>> # Data after
>>> # |   col   |
>>> # +---------+
>>> # |  [2,3]  |
>>> # +---------|
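
The other documented slice forms can be sketched the same way; the indices are illustrative:

>>> # Keep only the element at index 0 along the first dimension
>>> slice_index_op = c_transforms.Slice(0)
>>> # Keep the elements at indices 0 and 2
>>> slice_list_op = c_transforms.Slice([0, 2])
>>> # Keep the whole dimension, same as [:]
>>> slice_all_op = c_transforms.Slice(None)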
class tinyms.text.TypeCast(data_type)[source]

Tensor operation to cast to a given MindSpore data type.

Parameters

data_type (mindspore.dtype) – mindspore.dtype to be cast to.

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>>
>>> # Generate 1d int numpy array from 0 - 63
>>> def generator_1d():
...     for i in range(64):
...         yield (np.array([i]),)
>>>
>>> dataset = ds.GeneratorDataset(generator_1d, column_names='col')
>>> type_cast_op = c_transforms.TypeCast(mstype.int32)
>>> dataset = dataset.map(operations=type_cast_op)
class tinyms.text.Unique[source]

Perform the unique operation on the input tensor; only one column can be transformed at a time.

Return 3 tensors: the unique output tensor, the index tensor, and the count tensor.

The unique output tensor contains all the unique elements of the input tensor, in the same order in which they occur in the input tensor.

The index tensor contains the index of each element of the input tensor in the unique output tensor.

The count tensor contains the count of each element of the output tensor in the input tensor.

Note

Call batch op before calling this function.

Examples

>>> # Data before
>>> # |  x                 |
>>> # +--------------------+
>>> # | [[0,1,2], [1,2,3]] |
>>> # +--------------------+
>>> data = [[[0,1,2], [1,2,3]]]
>>> dataset = ds.NumpySlicesDataset(data, ["x"])
>>> dataset = dataset.map(operations=c_transforms.Unique(),
...                       input_columns=["x"],
...                       output_columns=["x", "y", "z"],
...                       column_order=["x", "y", "z"])
>>> # Data after
>>> # |     x     |       y       |     z     |
>>> # +-----------+---------------+-----------+
>>> # | [0,1,2,3] | [0,1,2,1,2,3] | [1,2,2,1] |
>>> # +-----------+---------------+-----------+