tinyms.text

This module supports text processing for NLP tasks. It is a high-performance text processing module developed with ICU4C and cppjieba.

class tinyms.text.BertDatasetTransform[source]

Apply preprocessing operations to a GeneratorDataset instance.

class tinyms.text.Lookup(vocab, unknown_token=None, data_type=mindspore.int32)[source]

Look up a word and map it to an id according to the input vocabulary table.

Parameters:
  • vocab (Vocab) – A vocabulary object.

  • unknown_token (str, optional) – Word used for lookup when a word is out of vocabulary (OOV); the OOV word is looked up as unknown_token instead. If unknown_token is not specified, or is itself OOV, a runtime error will be raised. Default: None, means no unknown_token is specified.

  • data_type (mindspore.dtype, optional) – The data type that lookup operation maps string to. Default: mindspore.int32.

Raises:
  • TypeError – If vocab is not of type text.Vocab.

  • TypeError – If unknown_token is not of type string.

  • TypeError – If data_type is not of type mindspore.dtype.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> # Load vocabulary from list
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # Use Lookup operation to map tokens to ids
>>> lookup = text.Lookup(vocab)
>>> text_file_dataset = text_file_dataset.map(operations=[lookup])
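
A worked illustration follows; this is a minimal sketch assuming the installed MindSpore version supports eager execution of text operations (calling the op directly on a string):

>>> eager_lookup = text.Lookup(vocab, unknown_token='您')
>>> # '深' is the first word in the vocab list, so its id is expected to be 0
>>> print(eager_lookup('深'))
>>> # An OOV word such as '北' is expected to fall back to the id of '您', i.e. 4
>>> print(eager_lookup('北'))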
class tinyms.text.JiebaTokenizer(hmm_path, mp_path, mode=<JiebaMode.MIX: 0>, with_offsets=False)[source]

Tokenize a Chinese string into words based on a dictionary.

Note

The integrity of the HMMSegment algorithm and MPSegment algorithm dictionary files must be confirmed.

Parameters:
  • hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mode (JiebaMode, optional) –

    Valid values can be any of [JiebaMode.MP, JiebaMode.HMM, JiebaMode.MIX]. Default: JiebaMode.MIX.

    • JiebaMode.MP, tokenize with MPSegment algorithm.

    • JiebaMode.HMM, tokenize with Hidden Markov Model Segment algorithm.

    • JiebaMode.MIX, tokenize with a mix of MPSegment and HMMSegment algorithm.

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens. Default: False.

Raises:
  • ValueError – If path of HMMSegment dict is not provided.

  • ValueError – If path of MPSegment dict is not provided.

  • TypeError – If hmm_path or mp_path is not of type string.

  • TypeError – If with_offsets is not of type bool.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start", "offsets_limit"])
add_dict(user_dict)[source]

Add a user defined word to JiebaTokenizer’s dictionary.

Parameters:

user_dict (Union[str, dict]) –

One loading method is from a file path (str), following the Jieba dictionary format; the other is from a Python dictionary (dict), in the format {word1: freq1, word2: freq2, …}. The Jieba dictionary format is word (required) followed by freq (optional), such as:

word1 freq1
word2 None
word3 freq3

Only valid word-freq pairs in the user-provided file will be added into the dictionary. Rows containing invalid input will be ignored; no error or warning status is returned.

Examples

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
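
The file-based loading described above is used the same way; the path below is a hypothetical user dictionary file in the Jieba format:

>>> user_dict_file = "/path/to/user/dict/file"
>>> jieba_op.add_dict(user_dict_file)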
add_word(word, freq=None)[source]

Add a user defined word to JiebaTokenizer’s dictionary.

Parameters:
  • word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better chance the word will be tokenized. Default: None, use default frequency.

Examples

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
class tinyms.text.UnicodeCharTokenizer(with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string to Unicode characters.

Parameters:

with_offsets (bool, optional) – Whether or not to output the offsets of tokens. Default: False.

Raises:

TypeError – If with_offsets is not of type bool.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeCharTokenizer(with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"])
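
For intuition, a minimal eager-mode sketch (assuming the installed MindSpore version supports calling the op directly on a string); the input is expected to be split into individual Unicode characters:

>>> tokenizer = text.UnicodeCharTokenizer()
>>> # Expected: ['北', '京', '欢', '迎', '您']
>>> print(tokenizer('北京欢迎您'))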
class tinyms.text.Ngram(n, left_pad=('', 0), right_pad=('', 0), separator=' ')[source]

Generate n-gram from a 1-D string Tensor.

Refer to N-gram for an overview of what n-gram is and how it works.

Parameters:
  • n (list[int]) – n in n-gram, which is a list of positive integers. For example, if n=[4, 3], the result would be a 4-gram followed by a 3-gram in the same tensor. If the number of words is not enough to make up an n-gram, an empty string will be returned. For example, applying a 3-gram to [“mindspore”, “best”] will produce an empty string.

  • left_pad (tuple, optional) – Padding performed on left side of the sequence shaped like (“pad_token”, pad_width). pad_width will be capped at n-1. For example, specifying left_pad=(“_”, 2) would pad left side of the sequence with “__”. Default: (‘’, 0).

  • right_pad (tuple, optional) – Padding performed on right side of the sequence shaped like (“pad_token”, pad_width). pad_width will be capped at n-1. For example, specifying right_pad=(“_”, 2) would pad right side of the sequence with “__”. Default: (‘’, 0).

  • separator (str, optional) – Symbol used to join strings together. For example, if 2-gram is [“mindspore”, “amazing”] with separator=”-”, the result would be [“mindspore-amazing”]. Default: ‘ ‘, which will use whitespace as separator.

Raises:
  • TypeError – If values of n are not of type int.

  • ValueError – If values of n are not positive.

  • ValueError – If left_pad is not a tuple of length 2.

  • ValueError – If right_pad is not a tuple of length 2.

  • TypeError – If separator is not of type string.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> ngram_op = text.Ngram(3, separator="-")
>>> output = ngram_op(["WildRose Country", "Canada's Ocean Playground", "Land of Living Skies"])
>>> # output
>>> # ["WildRose Country-Canada's Ocean Playground-Land of Living Skies"]
>>> # same ngram_op called through map
>>> text_file_dataset = text_file_dataset.map(operations=ngram_op)
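
The padding parameters can be illustrated with the same eager-call style used above (a hedged sketch; the pad token "_" is arbitrary):

>>> padded_ngram_op = text.Ngram([2], left_pad=("_", 1), right_pad=("_", 1), separator="-")
>>> output = padded_ngram_op(["mindspore", "best"])
>>> # Expected output
>>> # ["_-mindspore", "mindspore-best", "best-_"]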
class tinyms.text.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)[source]

Tokenize the input text to subword tokens.

Parameters:
  • vocab (Vocab) – Vocabulary used to look up words.

  • suffix_indicator (str, optional) – Prefix flags used to indicate subword suffixes. Default: ‘##’.

  • max_bytes_per_token (int, optional) – The maximum length of tokenization, words exceeding this length will not be split. Default: 100.

  • unknown_token (str, optional) – The output for unknown words. When set to an empty string, the corresponding unknown word will be directly returned as the output. Otherwise, the set string will be returned as the output. Default: ‘[UNK]’.

  • with_offsets (bool, optional) – Whether to return the offsets of tokens. Default: False.

Raises:
  • TypeError – If vocab is not of type mindspore.dataset.text.Vocab .

  • TypeError – If suffix_indicator is not of type str.

  • TypeError – If max_bytes_per_token is not of type int.

  • TypeError – If unknown_token is not of type str.

  • TypeError – If with_offsets is not of type bool.

  • ValueError – If max_bytes_per_token is negative.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> vocab_list = ["book", "cholera", "era", "favor", "##ite", "my", "is", "love", "dur", "##ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                       max_bytes_per_token=100, with_offsets=True)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                           output_columns=["token", "offsets_start", "offsets_limit"])
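
As a hedged eager-mode sketch (assuming the op can be called directly on a 1-D array of pre-tokenized words), "favorite" is expected to split into the subwords "favor" and "##ite" given the vocabulary above:

>>> import numpy as np
>>> eager_tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')
>>> # Expected output: ['my', 'favor', '##ite', 'book']
>>> print(eager_tokenizer(np.array(["my", "favorite", "book"])))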
class tinyms.text.TruncateSequencePair(max_length)[source]

Truncate a pair of rank-1 tensors such that the total length does not exceed max_length.

This operation takes two input tensors and returns two output Tensors.

Parameters:

max_length (int) – Maximum length required.

Raises:

TypeError – If max_length is not of type int.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset = ds.NumpySlicesDataset(data={"col1": [[1, 2, 3]], "col2": [[4, 5]]})
>>> # Data before
>>> # |   col1    |   col2    |
>>> # +-----------+-----------|
>>> # | [1, 2, 3] |  [4, 5]   |
>>> # +-----------+-----------+
>>> truncate_sequence_pair_op = text.TruncateSequencePair(max_length=4)
>>> dataset = dataset.map(operations=truncate_sequence_pair_op)
>>> # Data after
>>> # |   col1    |   col2    |
>>> # +-----------+-----------+
>>> # |  [1, 2]   |  [4, 5]   |
>>> # +-----------+-----------+
class tinyms.text.ToNumber(data_type)[source]

Tensor operation to convert every element of a string tensor to a number.

Strings are cast according to the rules specified in the following links, except that any string representing a negative number cannot be cast to an unsigned integer type: https://en.cppreference.com/w/cpp/string/basic_string/stof, https://en.cppreference.com/w/cpp/string/basic_string/stoul.

Parameters:

data_type (mindspore.dtype) – Type to be cast to. Must be a numeric type in mindspore.dtype.

Raises:
  • TypeError – If data_type is not of type mindspore.dtype.

  • RuntimeError – If strings are invalid to cast, or are out of range after being cast.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore import dtype as mstype
>>> data = [["1", "2", "3"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> to_number_op = text.ToNumber(mstype.int8)
>>> dataset = dataset.map(operations=to_number_op)
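>>> # After the map, the column is expected to hold the int8 values 1, 2 and 3;
>>> # a string such as "300" would raise RuntimeError because it is out of the int8 range.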
class tinyms.text.SlidingWindow(width, axis=0)[source]

Construct a tensor from the given data (only 1-D data is supported for now), where each element along the dimension axis is a slice of the data starting at the corresponding position, with a specified width.

Parameters:
  • width (int) – The width of the window. It must be an integer and greater than zero.

  • axis (int, optional) – The axis along which the sliding window is computed. Default: 0.

Raises:
  • TypeError – If width is not of type int.

  • ValueError – If width is not positive.

  • TypeError – If axis is not of type int.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset = ds.NumpySlicesDataset(data=[[1, 2, 3, 4, 5]], column_names="col1")
>>> # Data before
>>> # |     col1     |
>>> # +--------------+
>>> # | [[1, 2, 3, 4, 5]] |
>>> # +--------------+
>>> dataset = dataset.map(operations=text.SlidingWindow(3, 0))
>>> # Data after
>>> # |     col1     |
>>> # +--------------+
>>> # |  [[1, 2, 3], |
>>> # |   [2, 3, 4], |
>>> # |   [3, 4, 5]] |
>>> # +--------------+
class tinyms.text.SentencePieceTokenizer(mode, out_type)[source]

Tokenize a scalar token or 1-D tokens into tokens by SentencePiece.

Parameters:
  • mode (Union[str, SentencePieceVocab]) – SentencePiece model. If the input parameter is a file path (str), it represents the path of the SentencePiece model to be loaded. If the input parameter is a SentencePieceVocab object, it should be constructed in advance.

  • out_type (SPieceTokenizerOutType) –

    The type of output, it can be any of [SPieceTokenizerOutType.STRING, SPieceTokenizerOutType.INT].

    • SPieceTokenizerOutType.STRING, means the output type of the SentencePiece Tokenizer is string.

    • SPieceTokenizerOutType.INT, means the output type of the SentencePiece Tokenizer is int.

Raises:
  • TypeError – If mode is not of type string or SentencePieceVocab.

  • TypeError – If out_type is not of type SPieceTokenizerOutType.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 5000, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer)
class tinyms.text.PythonTokenizer(tokenizer)[source]

Class that applies a user-defined string tokenizer to the input string.

Parameters:

tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.

Raises:

TypeError – If tokenizer is not a callable Python function.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset.text as text
>>> def my_tokenizer(line):
...     return line.split()
>>> text_file_dataset = text_file_dataset.map(operations=text.PythonTokenizer(my_tokenizer))
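
A more self-contained sketch follows; the inline dataset and sentences are purely illustrative:

>>> import mindspore.dataset as ds
>>> dataset = ds.NumpySlicesDataset(data=["Welcome to Beijing", "Hello world"], column_names=["text"])
>>> dataset = dataset.map(operations=text.PythonTokenizer(my_tokenizer), input_columns=["text"])
>>> # Each row is expected to become a list of whitespace-separated tokens,
>>> # e.g. ["Welcome", "to", "Beijing"]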
tinyms.text.to_str(array, encoding='utf8')[source]

Convert a NumPy array of bytes to an array of str by decoding each element based on the charset encoding.

Parameters:
  • array (numpy.ndarray) – Array of bytes type representing strings.

  • encoding (str) – Indicating the charset for decoding. Default: ‘utf8’.

Returns:

numpy.ndarray, NumPy array of str.

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> data = np.array([["1", "2", "3"]], dtype=np.bytes_)
>>> dataset = ds.NumpySlicesDataset(data, column_names=["text"])
>>> for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     str_data = text.to_str(item["text"])
tinyms.text.to_bytes(array, encoding='utf8')[source]

Convert a NumPy array of str to an array of bytes by encoding each element based on the charset encoding.

Parameters:
  • array (numpy.ndarray) – Array of str type representing strings.

  • encoding (str) – Indicating the charset for encoding. Default: ‘utf8’.

Returns:

numpy.ndarray, NumPy array of bytes.

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> data = np.array([["1", "2", "3"]], dtype=np.str_)
>>> dataset = ds.NumpySlicesDataset(data, column_names=["text"])
>>> for item in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     bytes_data = text.to_bytes(item["text"])
class tinyms.text.Vocab[source]

Vocab object that is used to save pairs of words and ids.

It contains a map that maps each word (str) to an id (int), and the reverse.

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)[source]

Build a Vocab from a dataset.

This would collect all unique words in a dataset and return a vocab within the frequency range specified by the user in freq_range. The user will be warned if no words fall into the frequency range. Words in the vocab are ordered from highest frequency to lowest frequency. Words with the same frequency are ordered lexicographically.

Parameters:
  • dataset (Dataset) – dataset to build vocab from.

  • columns (list[str], optional) – column names to get words from. It can be a list of column names. Default: None.

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range would be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1. max_frequency > total_words is the same as max_frequency = total_words. min_frequency/max_frequency can be None, which corresponds to 0/total_words separately. Default: None, all words are included.

  • top_k (int, optional) – Number of words to be built into the vocab; must be greater than 0. The top_k most frequent words are taken. top_k is applied after freq_range. If there are not enough words, all words will be taken. Default: None, all words are included.

  • special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=[“<pad>”,”<unk>”]. Default: None, no special tokens will be added.

  • special_first (bool, optional) – Whether special_tokens will be prepended/appended to vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended. Default: True.

Returns:

Vocab, Vocab object built from the dataset.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.Vocab.from_dataset(dataset, "text", freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> dataset = dataset.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])
classmethod from_dict(word_dict)[source]

Build a vocab object from a dict.

Parameters:

word_dict (dict) – Dict contains word and id pairs, where word should be str and id be int. id is recommended to start from 0 and be continuous. ValueError will be raised if id is negative.

Returns:

Vocab, Vocab object built from the dict.

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
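
The resulting mapping can be checked with the tokens_to_ids and ids_to_tokens methods documented below:

>>> ids = vocab.tokens_to_ids(["home", "world"])
>>> # Expected: [3, 5], matching the ids given in the dict
>>> token = vocab.ids_to_tokens(6)
>>> # Expected: '<unk>'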
classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)[source]

Build a vocab object from a file.

Parameters:
  • file_path (str) – Path to the file which contains the vocab list.

  • delimiter (str, optional) – A delimiter to break up each line in file, the first element is taken to be the word. Default: ‘’, the whole line will be treated as a word.

  • vocab_size (int, optional) – Number of words to read from file_path. Default: None, all words are taken.

  • special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=[“<pad>”,”<unk>”]. Default: None, no special tokens will be added.

  • special_first (bool, optional) – Whether special_tokens will be prepended/appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended. Default: True.

Returns:

Vocab, Vocab object built from the file.

Examples

>>> import mindspore.dataset.text as text
>>> # Assume vocab file contains the following content:
>>> # --- begin of file ---
>>> # apple,apple2
>>> # banana, 333
>>> # cat,00
>>> # --- end of file ---
>>>
>>> # Read file through this API and specify "," as delimiter.
>>> # The delimiter will break up each line in file, then the first element is taken to be the word.
>>> vocab = text.Vocab.from_file("/path/to/simple/vocab/file", ",", None, ["<pad>", "<unk>"], True)
>>>
>>> # Finally, there are 5 words in the vocab: "<pad>", "<unk>", "apple", "banana", "cat".
>>> vocabulary = vocab.vocab()
classmethod from_list(word_list, special_tokens=None, special_first=True)[source]

Build a vocab object from a list of words.

Parameters:
  • word_list (list) – A list of string where each element is a word of type string.

  • special_tokens (list, optional) – A list of strings, each one is a special token. For example special_tokens=[“<pad>”,”<unk>”]. Default: None, no special tokens will be added.

  • special_first (bool, optional) – Whether special_tokens is prepended or appended to vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended. Default: True.

Returns:

Vocab, Vocab object built from the list.

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
ids_to_tokens(ids)[source]

Converts a single index or a sequence of indices into a token or a sequence of tokens. If an id does not exist, an empty string is returned.

Parameters:

ids (Union[int, list[int]]) – The token id (or token ids) to convert to tokens.

Returns:

The decoded token(s).

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> token = vocab.ids_to_tokens(0)
tokens_to_ids(tokens)[source]

Converts a token string or a sequence of tokens into a single integer id or a sequence of ids. If a token does not exist, the id -1 is returned.

Parameters:

tokens (Union[str, list[str]]) – One or several token(s) to convert to token id(s).

Returns:

The token id or list of token ids.

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> ids = vocab.tokens_to_ids(["w1", "w3"])
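>>> # With special_first=True, "<unk>" is expected to take id 0, so ids is expected to be [1, 3]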
vocab()[source]

Get the vocabulary table as a dict.

Returns:

A vocabulary consisting of word and id pairs.

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["word_1", "word_2", "word_3", "word_4"])
>>> vocabulary_dict = vocab.vocab()
class tinyms.text.SentencePieceVocab[source]

SentencePiece object that is used to do word segmentation.

classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece from a dataset.

Parameters:
  • dataset (Dataset) – Dataset to build SentencePiece.

  • col_names (list) – A list of column names.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are: 0.9995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages with small character sets.

  • model_type (SentencePieceModel) –

    It can be any of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD], default is SentencePieceModel.UNIGRAM. The input sentence must be pre-tokenized when using SentencePieceModel.WORD type.

    • SentencePieceModel.UNIGRAM, Unigram Language Model means the next word in the sentence is assumed to be independent of the previous words generated by the model.

    • SentencePieceModel.BPE, refers to byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel.CHAR, refers to char based sentencePiece Model type.

    • SentencePieceModel.WORD, refers to word based sentencePiece Model type.

  • params (dict) – A dictionary with no incoming parameters.

Returns:

SentencePieceVocab, vocab built from the dataset.

Examples

>>> import mindspore.dataset as ds
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = SentencePieceVocab.from_dataset(dataset, ["text"], 5000, 0.9995,
...                                         SentencePieceModel.UNIGRAM, {})
classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a file.

Parameters:
  • file_path (list) – Path to the file which contains the SentencePiece list.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are: 0.9995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages with small character sets.

  • model_type (SentencePieceModel) –

    It can be any of [SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD], default is SentencePieceModel.UNIGRAM. The input sentence must be pre-tokenized when using SentencePieceModel.WORD type.

    • SentencePieceModel.UNIGRAM, Unigram Language Model means the next word in the sentence is assumed to be independent of the previous words generated by the model.

    • SentencePieceModel.BPE, refers to byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

    • SentencePieceModel.CHAR, refers to char based sentencePiece Model type.

    • SentencePieceModel.WORD, refers to word based sentencePiece Model type.

  • params (dict) – A dictionary with no incoming parameters(The parameters are derived from SentencePiece library).

Returns:

SentencePieceVocab, vocab built from the file.

Examples

>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
classmethod save_model(vocab, path, filename)[source]

Save the model to the given file path.

Parameters:
  • vocab (SentencePieceVocab) – A SentencePiece object.

  • path (str) – Path to store model.

  • filename (str) – The name of the file.

Examples

>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
>>> SentencePieceVocab.save_model(vocab, "./", "m.model")
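
The saved model file can then be passed as the mode argument of SentencePieceTokenizer (a usage sketch based on the parameter description of that class):

>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SPieceTokenizerOutType
>>> tokenizer = text.SentencePieceTokenizer("./m.model", out_type=SPieceTokenizerOutType.STRING)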
class tinyms.text.SentencePieceModel[source]

An enumeration for SentencePieceModel.

Possible enumeration values are: SentencePieceModel.UNIGRAM, SentencePieceModel.BPE, SentencePieceModel.CHAR, SentencePieceModel.WORD.

  • SentencePieceModel.UNIGRAM: Unigram Language Model means the next word in the sentence is assumed to be independent of the previous words generated by the model.

  • SentencePieceModel.BPE: refers to byte pair encoding algorithm, which replaces the most frequent pair of bytes in a sentence with a single, unused byte.

  • SentencePieceModel.CHAR: refers to char based sentencePiece Model type.

  • SentencePieceModel.WORD: refers to word based sentencePiece Model type.

class tinyms.text.SPieceTokenizerOutType[source]

An enumeration for the output type of mindspore.dataset.text.SentencePieceTokenizer.

Possible enumeration values are: SPieceTokenizerOutType.STRING, SPieceTokenizerOutType.INT.

  • SPieceTokenizerOutType.STRING: means output type of SentencePiece Tokenizer is string.

  • SPieceTokenizerOutType.INT: means output type of SentencePiece Tokenizer is int.

class tinyms.text.SPieceTokenizerLoadType[source]

An enumeration for loading type of mindspore.dataset.text.SentencePieceTokenizer .

Possible enumeration values are: SPieceTokenizerLoadType.FILE, SPieceTokenizerLoadType.MODEL.

  • SPieceTokenizerLoadType.FILE: Load SentencePiece tokenizer from a Vocab file.

  • SPieceTokenizerLoadType.MODEL: Load SentencePiece tokenizer from a SentencePieceVocab object.

class tinyms.text.Compose(**kwargs)[source]

Compose a list of transforms into a single transform.

Parameters:

transforms (list) – List of transformations to be applied.

Raises:
  • TypeError – If transforms is not of type list.

  • ValueError – If transforms is empty.

  • TypeError – If elements of transforms are neither Python callable objects nor data processing operations in c_transforms.

Supported Platforms:

CPU

Examples

>>> compose = c_transforms.Compose([c_vision.Decode(), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=compose)
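
Since this class is exposed under tinyms.text, the same mechanism can compose text operations; the following is a hedged sketch (the vocabulary is illustrative and text_file_dataset is assumed to provide a "text" column):

>>> import mindspore.dataset.text as text
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> vocab = text.Vocab.from_list(["深", "圳", "欢", "迎", "您"], special_tokens=["<unk>"])
>>> compose = c_transforms.Compose([text.UnicodeCharTokenizer(), text.Lookup(vocab, "<unk>")])
>>> text_file_dataset = text_file_dataset.map(operations=compose, input_columns=["text"])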
class tinyms.text.Concatenate(**kwargs)[source]

Tensor operation that concatenates all columns into a single tensor.

Parameters:
  • axis (int, optional) – Concatenate the tensors along given axis. Default: 0.

  • prepend (numpy.array, optional) – NumPy array to be prepended to the already concatenated tensors. Default: None.

  • append (numpy.array, optional) – NumPy array to be appended to the already concatenated tensors. Default: None.

Raises:
  • TypeError – If axis is not of type int.

  • TypeError – If prepend is not of type numpy.ndarray.

  • TypeError – If append is not of type numpy.ndarray.

Supported Platforms:

CPU

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> # concatenate string
>>> prepend_tensor = np.array(["dw", "df"], dtype='S')
>>> append_tensor = np.array(["dwsdf", "df"], dtype='S')
>>> concatenate_op = c_transforms.Concatenate(0, prepend_tensor, append_tensor)
>>> data = [["This","is","a","string"]]
>>> dataset = ds.NumpySlicesDataset(data)
>>> dataset = dataset.map(operations=concatenate_op)
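>>> # After the map, each row is expected to contain, in order, the prepended values,
>>> # the original strings and the appended values:
>>> # ["dw", "df", "This", "is", "a", "string", "dwsdf", "df"]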
class tinyms.text.Duplicate(**kwargs)[source]

Duplicate the input tensor into the output; only supports transforming one column at a time.

Raises:

RuntimeError – If given tensor has two columns.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> # Data before
>>> # |  x      |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1,2,3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["x"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Duplicate(),
...                                                 input_columns=["x"],
...                                                 output_columns=["x", "y"])
>>> # Data after
>>> # |  x      |  y      |
>>> # +---------+---------+
>>> # | [1,2,3] | [1,2,3] |
>>> # +---------+---------+
class tinyms.text.Fill(**kwargs)[source]

Tensor operation to fill all elements in the tensor with the specified value. The output tensor will have the same shape and type as the input tensor.

Parameters:

fill_value (Union[str, bytes, int, float, bool]) – scalar value to fill the tensor with.

Raises:

TypeError – If fill_value is not of type str, float, bool, int or bytes.

Supported Platforms:

CPU

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> # generate a 1D integer numpy array from 0 to 4
>>> def generator_1d():
...     for i in range(5):
...         yield (np.array([i]),)
>>> generator_dataset = ds.GeneratorDataset(generator_1d, column_names="col1")
>>> # [[0], [1], [2], [3], [4]]
>>> fill_op = c_transforms.Fill(3)
>>> generator_dataset = generator_dataset.map(operations=fill_op)
>>> # [[3], [3], [3], [3], [3]]
class tinyms.text.Mask(**kwargs)[source]

Mask content of the input tensor with the given predicate. Any element of the tensor that matches the predicate will be evaluated to True, otherwise False.

Parameters:
  • operator (Relational) – relational operators, it can be any of [Relational.EQ, Relational.NE, Relational.LT, Relational.GT, Relational.LE, Relational.GE], take Relational.EQ as example, EQ refers to equal.

  • constant (Union[str, int, float, bool]) – Constant to be compared to.

  • dtype (mindspore.dtype, optional) – Type of the generated mask. Default: mstype.bool_.

Raises:
  • TypeError – If operator is not of type Relational.

  • TypeError – If constant is not of type string, int, float or bool.

  • TypeError – If dtype is not of type mindspore.dtype.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> from mindspore.dataset.transforms.c_transforms import Relational
>>> # Data before
>>> # |  col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------+
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Mask(Relational.EQ, 2))
>>> # Data after
>>> # |       col         |
>>> # +--------------------+
>>> # | [False,True,False] |
>>> # +--------------------+
class tinyms.text.OneHot(**kwargs)[source]

Tensor operation to apply one hot encoding.

Parameters:

num_classes (int) – Number of classes of objects in dataset. It should be larger than the largest label number in the dataset.

Raises:
Supported Platforms:

CPU

Examples

>>> # Assume that dataset has 10 classes, thus the label ranges from 0 to 9
>>> onehot_op = c_transforms.OneHot(num_classes=10)
>>> mnist_dataset = mnist_dataset.map(operations=onehot_op, input_columns=["label"])
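>>> # Each scalar label i is expected to become a one-hot vector of length 10 with a 1 at index i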
class tinyms.text.PadEnd(**kwargs)[source]

Pad input tensor according to pad_shape, input tensor needs to have same rank.

Parameters:
  • pad_shape (list(int)) – List of integers representing the target shape. Dimensions set to None will not be padded (i.e., the original dim will be used). Shorter dimensions will truncate the values.

  • pad_value (Union[str, bytes, int, float, bool], optional) – Value used to pad. Default: 0, or an empty string in case of tensors of strings.

Raises:
  • TypeError – If pad_shape is not of type list.

  • TypeError – If pad_value is not of type str, float, bool, int or bytes.

  • TypeError – If elements of pad_shape are not of type int.

  • ValueError – If elements of pad_shape are not positive.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------|
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.PadEnd(pad_shape=[4],
...                                                                                pad_value=10))
>>> # Data after
>>> # |    col     |
>>> # +------------+
>>> # | [1,2,3,10] |
>>> # +------------|
class tinyms.text.RandomApply(**kwargs)[source]

Randomly perform a series of transforms with a given probability.

Parameters:
  • transforms (list) – List of transformations to be applied.

  • prob (float, optional) – The probability to apply the transformation list. Default: 0.5.

Raises:
  • TypeError – If transforms is not of type list.

  • ValueError – If transforms is empty.

  • TypeError – If elements of transforms are neither Python callable objects nor data processing operations in c_transforms.

  • TypeError – If prob is not of type float.

  • ValueError – If prob is not in range [0.0, 1.0].

Supported Platforms:

CPU

Examples

>>> rand_apply = c_transforms.RandomApply([c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_apply)
class tinyms.text.RandomChoice(**kwargs)[source]

Randomly select one transform from a list of transforms to perform operation.

Parameters:

transforms (list) – List of transformations from which one will be randomly chosen to apply.

Raises:
  • TypeError – If transforms is not of type list.

  • ValueError – If transforms is empty.

  • TypeError – If elements of transforms are neither Python callable objects nor data processing operations in c_transforms.

Supported Platforms:

CPU

Examples

>>> rand_choice = c_transforms.RandomChoice([c_vision.CenterCrop(50), c_vision.RandomCrop(512)])
>>> image_folder_dataset = image_folder_dataset.map(operations=rand_choice)
class tinyms.text.Slice(**kwargs)[source]

Slice operation to extract a tensor using the given n slices.

The functionality of Slice is similar to NumPy’s indexing feature (Currently only rank-1 tensors are supported).

Parameters:

slices (Union[int, list[int], slice, None, Ellipsis]) –

Maximum n number of arguments to slice a tensor of rank n . One object in slices can be one of:

  1. int: Slice this index only along the first dimension. Negative index is supported.

  2. list(int): Slice these indices along the first dimension. Negative indices are supported.

  3. slice: Slice the generated indices from the slice object along the first dimension. Similar to start:stop:step.

  4. None: Slice the whole dimension. Similar to [:] in Python indexing.

  5. Ellipsis: Slice the whole dimension, same result with None .

Raises:

TypeError – If slices is not of type int, list[int], slice , None or Ellipsis .

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> # Data before
>>> # |   col   |
>>> # +---------+
>>> # | [1,2,3] |
>>> # +---------|
>>> data = [[1, 2, 3]]
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data, ["col"])
>>> # slice indices 1 and 2 only
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=c_transforms.Slice(slice(1,3)))
>>> # Data after
>>> # |   col   |
>>> # +---------+
>>> # |  [2,3]  |
>>> # +---------|
class tinyms.text.TypeCast(**kwargs)[source]

Tensor operation to cast to a given MindSpore data type.

Note

This operation supports running on Ascend or GPU platforms by Offload.

Parameters:

data_type (mindspore.dtype) – mindspore.dtype to be cast to.

Raises:

TypeError – If data_type is not of type bool, int, float or string.

Supported Platforms:

Ascend GPU CPU

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> from mindspore import dtype as mstype
>>>
>>> # Generate 1d int numpy array from 0 - 63
>>> def generator_1d():
...     for i in range(64):
...         yield (np.array([i]),)
>>>
>>> dataset = ds.GeneratorDataset(generator_1d, column_names='col')
>>> type_cast_op = c_transforms.TypeCast(mstype.int32)
>>> dataset = dataset.map(operations=type_cast_op)
class tinyms.text.Unique(**kwargs)[source]

Perform the unique operation on the input tensor; only supports transforming one column at a time.

Return 3 tensors: the unique output tensor, the index tensor, and the count tensor.

  • Output tensor contains all the unique elements of the input tensor in the same order that they occur in the input tensor.

  • Index tensor that contains the index of each element of the input tensor in the unique output tensor.

  • Count tensor that contains the count of each element of the output tensor in the input tensor.

Note

Call batch op before calling this function.

Raises:

RuntimeError – If given Tensor has two columns.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.transforms.c_transforms as c_transforms
>>> # Data before
>>> # |  x                 |
>>> # +--------------------+
>>> # | [[0,1,2], [1,2,3]] |
>>> # +--------------------+
>>> data = [[[0,1,2], [1,2,3]]]
>>> dataset = ds.NumpySlicesDataset(data, ["x"])
>>> dataset = dataset.map(operations=c_transforms.Unique(),
...                       input_columns=["x"],
...                       output_columns=["x", "y", "z"])
>>> # Data after
>>> # |  x          |  y              |  z         |
>>> # +-------------+-----------------+------------+
>>> # | [0,1,2,3]   | [0,1,2,1,2,3]   | [1,2,2,1]  |
>>> # +-------------+-----------------+------------+