tinyms.text¶
This module supports text processing for NLP tasks. It is a high-performance text processing module developed with ICU4C and cppjieba.
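The examples throughout this reference assume that the relevant modules have already been imported under short aliases. A minimal sketch of one possible setup (the alias names are only the conventions used by the examples):
>>> import tinyms.text as text
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> import mindspore.dataset.vision.c_transforms as c_vision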
-
class
tinyms.text.
BertDatasetTransform
[source]¶ Apply preprocessing operations on a GeneratorDataset instance.
-
class
tinyms.text.
Lookup
(vocab, unknown_token=None, data_type=mindspore.int32)[source]¶ Look up a word and convert it to an id according to the input vocabulary table.
- Parameters
vocab (Vocab) – A vocabulary object.
unknown_token (str, optional) – Word used for lookup when the word being looked up is out-of-vocabulary (OOV). If unknown_token is not specified or is itself OOV, a runtime error will be thrown (default=None).
data_type (mindspore.dtype, optional) – mindspore.dtype that Lookup maps strings to (default=mindspore.int32).
Examples
>>> # Load vocabulary from list
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # Use Lookup operator to map tokens to ids
>>> lookup = text.Lookup(vocab)
>>> text_file_dataset = text_file_dataset.map(operations=[lookup])
-
class
tinyms.text.
JiebaTokenizer
(hmm_path, mp_path, mode=<JiebaMode.MIX: 0>, with_offsets=False)[source]¶ Tokenize a Chinese string into words based on a dictionary.
Note
The integrity of the HMMSegment algorithm and MPSegment algorithm files must be confirmed.
- Parameters
hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.
mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.
mode (JiebaMode, optional) –
Valid values can be any of [JiebaMode.MP, JiebaMode.HMM, JiebaMode.MIX] (default=JiebaMode.MIX).
JiebaMode.MP, tokenize with the MPSegment algorithm.
JiebaMode.HMM, tokenize with the Hidden Markov Model Segment algorithm.
JiebaMode.MIX, tokenize with a mix of the MPSegment and HMMSegment algorithms.
with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).
Examples
>>> from mindspore.dataset.text import JiebaMode
>>> # If with_offsets=False, the default output is one column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> # ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start", "offsets_limit"],
...                                               column_order=["token", "offsets_start", "offsets_limit"])
-
add_dict
(user_dict)[source]¶ Add user defined word to JiebaTokenizer’s dictionary.
- Parameters
user_dict (Union[str, dict]) –
There are two loading methods: a file path (str), where the file follows the Jieba dictionary format, or a Python dictionary (dict) of the form {word1: freq1, word2: freq2, ...}. In the Jieba dictionary format, each line contains a word (required) and a freq (optional), such as:
word1 freq1
word2 None
word3 freq3
Examples
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
-
add_word
(word, freq=None)[source]¶ Add user defined word to JiebaTokenizer’s dictionary.
- Parameters
word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.
freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better chance the word will be tokenized (default=None, use default frequency).
Examples
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
-
class
tinyms.text.
UnicodeCharTokenizer
(with_offsets=False)[source]¶ Tokenize a scalar tensor of UTF-8 string to Unicode characters.
- Parameters
with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).
Examples
>>> # If with_offsets=False, the default output is one column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> # ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"],
...                                           column_order=["token", "offsets_start", "offsets_limit"])
-
class
tinyms.text.
Ngram
(n, left_pad=('', 0), right_pad=('', 0), separator=' ')[source]¶ TensorOp to generate n-gram from a 1-D string Tensor.
Refer to https://en.wikipedia.org/wiki/N-gram#Examples for an overview of what n-gram is and how it works.
- Parameters
n (list[int]) – n in n-gram, which is a list of positive integers. For example, if n=[4, 3], the result would be a 4-gram followed by a 3-gram in the same tensor. If there are not enough words to make up an n-gram, an empty string will be returned. For example, 3-grams on ["mindspore", "best"] will result in an empty string being produced.
left_pad (tuple, optional) – Padding performed on the left side of the sequence, shaped like ("pad_token", pad_width). pad_width will be capped at n-1. For example, specifying left_pad=("_", 2) would pad the left side of the sequence with "__" (default=("", 0), no padding).
right_pad (tuple, optional) – Padding performed on the right side of the sequence, shaped like ("pad_token", pad_width). pad_width will be capped at n-1. For example, specifying right_pad=("_", 2) would pad the right side of the sequence with "__" (default=("", 0), no padding).
separator (str, optional) – Symbol used to join strings together. For example, if the 2-gram is ["mindspore", "amazing"] with separator="-", the result would be ["mindspore-amazing"] (default=" ", a single space).
Examples
>>> ngram_op = text.Ngram(3, separator="-")
>>> output = ngram_op(["WildRose Country", "Canada's Ocean Playground", "Land of Living Skies"])
>>> # output
>>> # ["WildRose Country-Canada's Ocean Playground-Land of Living Skies"]
>>> # same ngram_op called through map
>>> text_file_dataset = text_file_dataset.map(operations=ngram_op)
-
class
tinyms.text.
WordpieceTokenizer
(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)[source]¶ Tokenize scalar token or 1-D tokens to 1-D subword tokens.
- Parameters
vocab (Vocab) – A vocabulary object.
suffix_indicator (str, optional) – Used to show that the subword is the last part of a word (default=’##’).
max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).
unknown_token (str, optional) – When a token cannot be found: if ‘unknown_token’ is empty string, return the token directly, else return ‘unknown_token’ (default=’[UNK]’).
with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).
Examples
>>> vocab_list = ["book", "cholera", "era", "favor", "##ite", "my", "is", "love", "dur", "##ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> # If with_offsets=False, the default output is one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, the output is three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> # ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"],
...                                           column_order=["token", "offsets_start", "offsets_limit"])
-
class
tinyms.text.
TruncateSequencePair
(max_length)[source]¶ Truncate a pair of rank-1 tensors such that the total length is no longer than max_length.
This operation takes two input tensors and returns two output tensors.
- Parameters
max_length (int) – Maximum length required.
Examples
>>> dataset = ds.NumpySlicesDataset(data={"col1": [[1, 2, 3]], "col2": [[4, 5]]})
>>> # Data before
>>> # |   col1    |   col2    |
>>> # +-----------+-----------+
>>> # | [1, 2, 3] |  [4, 5]   |
>>> # +-----------+-----------+
>>> truncate_sequence_pair_op = text.TruncateSequencePair(max_length=4)
>>> dataset = dataset.map(operations=truncate_sequence_pair_op)
>>> # Data after
>>> # |   col1    |   col2    |
>>> # +-----------+-----------+
>>> # |  [1, 2]   |  [4, 5]   |
>>> # +-----------+-----------+
-
class
tinyms.text.
ToNumber
(data_type)[source]¶ Tensor operation to convert every element of a string tensor to a number.
Strings are cast according to the rules specified in the following links: https://en.cppreference.com/w/cpp/string/basic_string/stof, https://en.cppreference.com/w/cpp/string/basic_string/stoul, except that any string representing a negative number cannot be cast to an unsigned integer type.
- Parameters
data_type (mindspore.dtype) – mindspore.dtype to be cast to. Must be a numeric type.
- Raises
RuntimeError – If strings are invalid for casting, or are out of range after being cast.
Examples
>>> import mindspore.common.dtype as mstype
>>> data = [["1", "2", "3"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> to_number_op = text.ToNumber(mstype.int8)
>>> dataset = dataset.map(operations=to_number_op)
-
class
tinyms.text.
SlidingWindow
(width, axis=0)[source]¶ TensorOp to construct a tensor from data (only 1-D for now), where each element in the dimension axis is a slice of data starting at the corresponding position, with a specified width.
- Parameters
width (int) – The width of the window. It must be an integer greater than zero.
axis (int, optional) – The axis along which the sliding window is computed (default=0).
Examples
>>> dataset = ds.NumpySlicesDataset(data=[[1, 2, 3, 4, 5]], column_names="col1")
>>> # Data before
>>> # |      col1       |
>>> # +-----------------+
>>> # | [1, 2, 3, 4, 5] |
>>> # +-----------------+
>>> dataset = dataset.map(operations=text.SlidingWindow(3, 0))
>>> # Data after
>>> # |     col1    |
>>> # +-------------+
>>> # | [[1, 2, 3], |
>>> # |  [2, 3, 4], |
>>> # |  [3, 4, 5]] |
>>> # +-------------+
-
class
tinyms.text.
SentencePieceTokenizer
(mode, out_type)[source]¶ Tokenize a scalar token or 1-D tokens into tokens with SentencePiece.
- Parameters
mode (Union[str, SentencePieceVocab]) – SentencePiece model to use. If the input is a string, it is treated as the path to a SentencePiece model file; if the input is a SentencePieceVocab object, it is used directly.
out_type (SPieceTokenizerOutType) – The type of the output, either SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.
Examples
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)
-
class
tinyms.text.
PythonTokenizer
(tokenizer)[source]¶ Callable class to be used as a user-defined string tokenizer.
- Parameters
tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.
Examples
>>> def my_tokenizer(line):
...     return line.split()
>>> text_file_dataset = text_file_dataset.map(operations=text.PythonTokenizer(my_tokenizer))
-
tinyms.text.
to_str
(array, encoding='utf8')[source]¶ Convert NumPy array of bytes to array of str by decoding each element based on charset encoding.
- Parameters
array (numpy.ndarray) – Array of type bytes representing strings.
encoding (str) – The charset used for decoding (default='utf8').
- Returns
numpy.ndarray, NumPy array of str.
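Example (a minimal sketch; the array contents are illustrative and the module is assumed to be imported as text):
>>> import numpy as np
>>> byte_array = np.array([b"hello", b"world"])
>>> str_array = text.to_str(byte_array)
>>> # str_array is a NumPy array of str, e.g. array(['hello', 'world'], dtype='<U5')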
-
tinyms.text.
to_bytes
(array, encoding='utf8')[source]¶ Convert NumPy array of str to array of bytes by encoding each element based on charset encoding.
- Parameters
array (numpy.ndarray) – Array of type str representing strings.
encoding (str) – The charset used for encoding (default='utf8').
- Returns
numpy.ndarray, NumPy array of bytes.
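Example (a minimal sketch mirroring to_str; the array contents are illustrative):
>>> import numpy as np
>>> str_array = np.array(["hello", "world"])
>>> byte_array = text.to_bytes(str_array)
>>> # byte_array is a NumPy array of bytes, e.g. array([b'hello', b'world'], dtype='|S5')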
-
class
tinyms.text.
Vocab
[source]¶ Vocab object that is used to look up a word.
It contains a map that maps each word (str) to an id (int).
-
classmethod
from_dataset
(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)[source]¶ Build a vocab from a dataset.
This collects all unique words in the dataset and returns a vocab containing the words within the frequency range specified by freq_range. The user is warned if no words fall into the frequency range. Words in the vocab are ordered from the highest to the lowest frequency; words with the same frequency are ordered lexicographically.
- Parameters
dataset (Dataset) – dataset to build vocab from.
columns (list[str], optional) – Column names to get words from. It can be a list of column names (default=None, where all columns will be used; if any column is not of string type, an error will be raised).
freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1, and max_frequency > total_words is the same as max_frequency = total_words. min_frequency/max_frequency can be None, which corresponds to 0/total_words respectively (default=None, all words are included).
top_k (int, optional) – top_k > 0. Number of words to be built into the vocab. The top_k most frequent words are taken; top_k is applied after freq_range. If there are fewer than top_k words, all words will be taken (default=None, all words are included).
special_tokens (list, optional) – a list of strings, each one is a special token. for example special_tokens=[“<pad>”,”<unk>”] (default=None, no special tokens will be added).
special_first (bool, optional) – Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended (default=True).
- Returns
Vocab, vocab built from the dataset.
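Example (a hedged sketch; the dataset path is a placeholder, "text" is the default column name of TextFileDataset, and ds is the assumed dataset module alias):
>>> text_file_dataset = ds.TextFileDataset("/path/to/text/file")
>>> vocab = text.Vocab.from_dataset(text_file_dataset, columns=["text"],
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> text_file_dataset = text_file_dataset.map(operations=text.Lookup(vocab, "<unk>"))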
-
classmethod
from_dict
(word_dict)[source]¶ Build a vocab object from a dict.
- Parameters
word_dict (dict) – A dict containing word and id pairs, where word should be str and id should be int. Ids are recommended to start from 0 and be continuous; a ValueError will be raised if an id is negative.
- Returns
Vocab, vocab built from the dict.
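Example (a minimal sketch; the word/id pairs are illustrative):
>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
>>> lookup = text.Lookup(vocab, "<unk>")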
-
classmethod
from_file
(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)[source]¶ Build a vocab object from a file containing a word list.
- Parameters
file_path (str) – path to the file which contains the vocab list.
delimiter (str, optional) – a delimiter to break up each line in file, the first element is taken to be the word (default=””).
vocab_size (int, optional) – number of words to read from file_path (default=None, all words are taken).
special_tokens (list, optional) – a list of strings, each one is a special token. for example special_tokens=[“<pad>”,”<unk>”] (default=None, no special tokens will be added).
special_first (bool, optional) – Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended (default=True).
- Returns
Vocab, vocab built from the file.
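Example (a hedged sketch; the vocab file path is a placeholder and each line is assumed to start with a word followed by the delimiter):
>>> vocab = text.Vocab.from_file("/path/to/vocab/file", delimiter=",",
...                              special_tokens=["<pad>", "<unk>"], special_first=True)
>>> lookup = text.Lookup(vocab, "<unk>")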
-
classmethod
from_list
(word_list, special_tokens=None, special_first=True)[source]¶ Build a vocab object from a list of words.
- Parameters
word_list (list) – a list of strings, where each element is a word.
special_tokens (list, optional) – a list of strings, each one is a special token. for example special_tokens=[“<pad>”,”<unk>”] (default=None, no special tokens will be added).
special_first (bool, optional) – Whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended (default=True).
- Returns
Vocab, vocab built from the list.
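Example (a minimal sketch with an illustrative word list):
>>> vocab = text.Vocab.from_list(["deep", "learning", "rocks"], special_tokens=["<unk>"], special_first=True)
>>> lookup = text.Lookup(vocab, "<unk>")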
-
class
tinyms.text.
SentencePieceVocab
[source]¶ SentencePiece object that is used to segment words.
-
classmethod
from_dataset
(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]¶ Build a SentencePiece vocab from a dataset.
- Parameters
dataset (Dataset) – Dataset to build sentencepiece.
col_names (list) – The list of column names.
vocab_size (int) – Vocabulary size.
character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
model_type (SentencePieceModel) – Choose from SentencePieceModel.UNIGRAM (default), SentencePieceModel.BPE, SentencePieceModel.CHAR, or SentencePieceModel.WORD. The input sentence must be pretokenized when using the WORD type.
params (dict) – A dictionary with no incoming parameters.
- Returns
SentencePieceVocab, vocab built from the dataset.
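Example (a hedged sketch; the dataset path is a placeholder and "text" is the default column name of TextFileDataset):
>>> from mindspore.dataset.text import SentencePieceModel
>>> text_file_dataset = ds.TextFileDataset("/path/to/text/file")
>>> vocab = text.SentencePieceVocab.from_dataset(text_file_dataset, ["text"], 5000, 0.9995,
...                                              SentencePieceModel.UNIGRAM, {})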
-
classmethod
from_file
(file_path, vocab_size, character_coverage, model_type, params)[source]¶ Build a SentencePieceVocab object from a file containing a word list.
- Parameters
file_path (list) – Path to the file which contains the sentencepiece list.
vocab_size (int) – Vocabulary size (of type uint32_t).
character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
model_type (SentencePieceModel) – Choose from SentencePieceModel.UNIGRAM (default), SentencePieceModel.BPE, SentencePieceModel.CHAR, or SentencePieceModel.WORD. The input sentence must be pretokenized when using the WORD type.
params (dict) –
A dictionary with no incoming parameters (the parameters are derived from the SentencePiece library):
input_sentence_size: 0
max_sentencepiece_length: 16
- Returns
SentencePieceVocab, vocab built from the file.
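Example (a hedged sketch, mirroring the SentencePieceTokenizer example above; the file path is a placeholder):
>>> from mindspore.dataset.text import SentencePieceModel
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})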
-
classmethod
save_model
(vocab, path, filename)[source]¶ Save the model to the given file path.
- Parameters
vocab (SentencePieceVocab) – A sentencepiece object.
path (str) – Path to store model.
filename (str) – The name of the file.
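Example (a hedged sketch; the vocab file path, output directory, and model file name are placeholders):
>>> from mindspore.dataset.text import SentencePieceModel
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> text.SentencePieceVocab.save_model(vocab, "./", "m.model")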
-
class
tinyms.text.
SentencePieceModel
[source]¶ An enumeration for SentencePieceModel, effective enumeration types are UNIGRAM, BPE, CHAR, WORD.
-
class
tinyms.text.
SPieceTokenizerOutType
[source]¶ An enumeration for SPieceTokenizerOutType, effective enumeration types are STRING, INT.
-
class
tinyms.text.
SPieceTokenizerLoadType
[source]¶ An enumeration for SPieceTokenizerLoadType, effective enumeration types are FILE, MODEL.
-
class
tinyms.text.
Compose
(transforms)[source]¶ Compose a list of transforms into a single transform.
- Parameters
transforms (list) – List of transformations to be applied.
Examples
>>> compose = c_transforms.Compose([c_vision.Decode(), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=compose)
-
class
tinyms.text.
Concatenate
(axis=0, prepend=None, append=None)[source]¶ Tensor operation that concatenates all columns into a single tensor.
- Parameters
axis (int, optional) – Concatenate the tensors along given axis (Default=0).
prepend (numpy.array, optional) – NumPy array to be prepended to the already concatenated tensors (Default=None).
append (numpy.array, optional) – NumPy array to be appended to the already concatenated tensors (Default=None).
Examples
>>> import numpy as np
>>> # concatenate string
>>> prepend_tensor = np.array(["dw", "df"], dtype='S')
>>> append_tensor = np.array(["dwsdf", "df"], dtype='S')
>>> concatenate_op = c_transforms.Concatenate(0, prepend_tensor, append_tensor)
>>> data = [["This", "is", "a", "string"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> dataset = dataset.map(operations=concatenate_op)
-
class
tinyms.text.
Duplicate
[source]¶ Duplicate the input tensor to the output; only one column can be transformed at a time.
Examples
>>> # Data before
>>> # |    x    |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["x"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Duplicate(),
...                                                 input_columns=["x"],
...                                                 output_columns=["x", "y"],
...                                                 column_order=["x", "y"])
>>> # Data after
>>> # |    x    |    y    |
>>> # +---------+---------+
>>> # | [1,2,3] | [1,2,3] |
>>> # +---------+---------+
-
class
tinyms.text.
Fill
(fill_value)[source]¶ Tensor operation to fill all elements in the tensor with the specified value. The output tensor will have the same shape and type as the input tensor.
- Parameters
fill_value (Union[str, bytes, int, float, bool]) – scalar value to fill the tensor with.
Examples
>>> import numpy as np
>>> # generate a 1D integer numpy array from 0 to 4
>>> def generator_1d():
...     for i in range(5):
...         yield (np.array([i]),)
>>> generator_dataset = ds.GeneratorDataset(generator_1d, column_names="col1")
>>> # [[0], [1], [2], [3], [4]]
>>> fill_op = c_transforms.Fill(3)
>>> generator_dataset = generator_dataset.map(operations=fill_op)
>>> # [[3], [3], [3], [3], [3]]
-
class
tinyms.text.
Mask
(operator, constant, dtype=mindspore.bool_)[source]¶ Mask content of the input tensor with the given predicate. Any element of the tensor that matches the predicate will be evaluated to True, otherwise False.
- Parameters
operator (Relational) – relational operator, one of [Relational.EQ, Relational.NE, Relational.GT, Relational.GE, Relational.LT, Relational.LE] (EQ refers to equal).
constant (Union[str, int, float, bool]) – Constant to be compared to.
dtype (mindspore.dtype, optional) – Type of the generated mask (default=mindspore.bool_).
Examples
>>> from mindspore.dataset.transforms.c_transforms import Relational
>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Mask(Relational.EQ, 2))
>>> # Data after
>>> # |        col         |
>>> # +--------------------+
>>> # | [False,True,False] |
>>> # +--------------------+
-
class
tinyms.text.
OneHot
(num_classes)[source]¶ Tensor operation to apply one hot encoding.
- Parameters
num_classes (int) – Number of classes of objects in dataset. It should be larger than the largest label number in the dataset.
- Raises
RuntimeError – If the feature size is larger than num_classes.
Examples
>>> # Assume that dataset has 10 classes, thus the label ranges from 0 to 9
>>> onehot_op = c_transforms.OneHot(num_classes=10)
>>> mnist_dataset = mnist_dataset.map(operations=onehot_op, input_columns=["label"])
-
class
tinyms.text.
PadEnd
(pad_shape, pad_value=None)[source]¶ Pad the input tensor according to pad_shape; pad_shape must have the same rank as the input tensor.
- Parameters
pad_shape (list(int)) – List of integers representing the shape needed. Dimensions that are set to None will not be padded (i.e., the original dim will be used). Dimensions shorter than the input will truncate the values.
pad_value (Union[str, bytes, int, float, bool], optional) – Value used to pad. Defaults to 0 or an empty string for tensors of strings.
Examples
>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.PadEnd(pad_shape=[4],
...                                                                                pad_value=10))
>>> # Data after
>>> # |    col     |
>>> # +------------+
>>> # | [1,2,3,10] |
>>> # +------------+
-
class
tinyms.text.
RandomApply
(transforms, prob=0.5)[source]¶ Randomly perform a series of transforms with a given probability.
- Parameters
transforms (list) – List of transformations to be applied.
prob (float, optional) – The probability of applying the transformation list (default=0.5).
Examples
>>> rand_apply = c_transforms.RandomApply([c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_apply)
-
class
tinyms.text.
RandomChoice
(transforms)[source]¶ Randomly select one transform from a list of transforms to perform the operation.
- Parameters
transforms (list) – List of transformations to be chosen from to apply.
Examples
>>> rand_choice = c_transforms.RandomChoice([c_vision.CenterCrop(50), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_choice)
-
class
tinyms.text.
Slice
(*slices)[source]¶ Slice operation to extract a tensor using the given n slices.
The functionality of Slice is similar to NumPy’s indexing feature. (Currently only rank-1 tensors are supported).
- Parameters
slices (Union[int, list[int], slice, None, Ellipsis]) –
Maximum n number of arguments to slice a tensor of rank n. One object in slices can be one of:
int: Slice this index only along the first dimension. Negative index is supported.
list(int): Slice these indices along the first dimension. Negative indices are supported.
slice: Slice the generated indices from the slice object along the first dimension. Similar to start:stop:step.
None: Slice the whole dimension. Similar to [:] in Python indexing.
Ellipsis: Slice the whole dimension, same result as None.
Examples
>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> # slice indices 1 and 2 only
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Slice(slice(1, 3)))
>>> # Data after
>>> # |   col   |
>>> # +---------+
>>> # |  [2,3]  |
>>> # +---------+
-
class
tinyms.text.
TypeCast
(data_type)[source]¶ Tensor operation to cast to a given MindSpore data type.
- Parameters
data_type (mindspore.dtype) – mindspore.dtype to be cast to.
Examples
>>> import numpy as np
>>> import mindspore.common.dtype as mstype
>>>
>>> # Generate a 1D int numpy array from 0 to 63
>>> def generator_1d():
...     for i in range(64):
...         yield (np.array([i]),)
>>>
>>> dataset = ds.GeneratorDataset(generator_1d, column_names='col')
>>> type_cast_op = c_transforms.TypeCast(mstype.int32)
>>> dataset = dataset.map(operations=type_cast_op)
-
class
tinyms.text.
Unique
[source]¶ Perform the unique operation on the input tensor; only one column can be transformed at a time.
Returns 3 tensors: a unique output tensor, an index tensor, and a count tensor.
The unique output tensor contains all the unique elements of the input tensor, in the same order that they occur in the input tensor.
The index tensor contains the index of each element of the input tensor in the unique output tensor.
The count tensor contains the count of each element of the output tensor in the input tensor.
Note
Call batch op before calling this function.
Examples
>>> # Data before
>>> # |         x          |
>>> # +--------------------+
>>> # | [[0,1,2], [1,2,3]] |
>>> # +--------------------+
>>> data = [[[0, 1, 2], [1, 2, 3]]]
>>> dataset = ds.NumpySlicesDataset(data, ["x"])
>>> dataset = dataset.map(operations=c_transforms.Unique(),
...                       input_columns=["x"],
...                       output_columns=["x", "y", "z"],
...                       column_order=["x", "y", "z"])
>>> # Data after
>>> # |     x     |       y       |     z     |
>>> # +-----------+---------------+-----------+
>>> # | [0,1,2,3] | [0,1,2,1,2,3] | [1,2,2,1] |
>>> # +-----------+---------------+-----------+