"""Tokenization class for VITS."""

import json
import os
import re
from typing import Any, Dict, List, Optional, Tuple, Union

from ...tokenization_utils import PreTrainedTokenizer
from ...utils import is_phonemizer_available, is_uroman_available, logging


if is_phonemizer_available():
    import phonemizer

if is_uroman_available():
    import uroman as ur


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "vocab.json"}


def has_non_roman_characters(input_string):
    # Find any character outside the ASCII range
    non_roman_pattern = re.compile(r"[^\x00-\x7F]")

    # Search the input string for non-Roman characters
    match = non_roman_pattern.search(input_string)
    has_non_roman = match is not None
    return has_non_roman
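
# e.g. has_non_roman_characters("hello") -> False, has_non_roman_characters("héllo") -> True,
# since "é" falls outside the ASCII range matched by the pattern above.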


class VitsTokenizer(PreTrainedTokenizer):
    """
    Construct a VITS tokenizer. Also supports MMS-TTS.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        language (`str`, *optional*):
            Language identifier.
        add_blank (`bool`, *optional*, defaults to `True`):
            Whether to insert token id 0 in between the other tokens.
        normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the input text by removing all casing and punctuation.
        phonemize (`bool`, *optional*, defaults to `True`):
            Whether to convert the input text into phonemes.
        is_uroman (`bool`, *optional*, defaults to `False`):
            Whether the `uroman` Romanizer needs to be applied to the input text prior to tokenizing.
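
    Example (a minimal usage sketch; the checkpoint name is illustrative and assumes a
    downloadable VITS/MMS-TTS repository):

    ```python
    >>> from transformers import VitsTokenizer

    >>> tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
    >>> inputs = tokenizer("hello world", return_tensors="pt")
    >>> # with `add_blank=True`, token id 0 is interleaved between the character ids
    >>> inputs["input_ids"]
    ```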
    """

    vocab_files_names = VOCAB_FILES_NAMES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        pad_token="<pad>",
        unk_token="<unk>",
        language=None,
        add_blank=True,
        normalize=True,
        phonemize=True,
        is_uroman=False,
        **kwargs,
    ) -> None:
        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)

        self.decoder = {v: k for k, v in self.encoder.items()}
        self.language = language
        self.add_blank = add_blank
        self.normalize = normalize
        self.phonemize = phonemize
        self.is_uroman = is_uroman

        super().__init__(
            pad_token=pad_token,
            unk_token=unk_token,
            language=language,
            add_blank=add_blank,
            normalize=normalize,
            phonemize=phonemize,
            is_uroman=is_uroman,
            **kwargs,
        )

    @property
    def vocab_size(self):
        return len(self.encoder)

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def normalize_text(self, input_string):
        """Lowercase the input string, respecting any special token ids that may be part or entirely upper-cased."""
        all_vocabulary = list(self.encoder.keys()) + list(self.added_tokens_encoder.keys())
        filtered_text = ""

        i = 0
        while i < len(input_string):
            found_match = False
            for word in all_vocabulary:
                if input_string[i : i + len(word)] == word:
                    filtered_text += word
                    i += len(word)
                    found_match = True
                    break

            if not found_match:
                filtered_text += input_string[i].lower()
                i += 1

        return filtered_text

    def _preprocess_char(self, text):
        """Special treatment of characters in certain languages"""
        if self.language == "ron":
            text = text.replace("ț", "ţ")
        return text

    def prepare_for_tokenization(
        self, text: str, is_split_into_words: bool = False, normalize: Optional[bool] = None, **kwargs
    ) -> Tuple[str, Dict[str, Any]]:
        """
        Performs any necessary transformations before tokenization.

        This method should pop the arguments from kwargs and return the remaining `kwargs` as well. We test the
        `kwargs` at the end of the encoding process to be sure all the arguments have been used.

        Args:
            text (`str`):
                The text to prepare.
            is_split_into_words (`bool`, *optional*, defaults to `False`):
                Whether or not the input is already pre-tokenized (e.g., split into words). If set to `True`, the
                tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
                which it will tokenize.
            normalize (`bool`, *optional*, defaults to `None`):
                Whether or not to apply punctuation and casing normalization to the text inputs. Typically, VITS is
                trained on lower-cased and un-punctuated text. Hence, normalization is used to ensure that the input
                text consists only of lower-case characters.
            kwargs (`Dict[str, Any]`, *optional*):
                Keyword arguments to use for the tokenization.

        Returns:
            `Tuple[str, Dict[str, Any]]`: The prepared text and the unused kwargs.
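
        Example (an illustrative sketch, assuming a character-level checkpoint with
        `phonemize=False` so the `phonemizer` package is not required):

        ```python
        >>> from transformers import VitsTokenizer

        >>> tokenizer = VitsTokenizer.from_pretrained("facebook/mms-tts-eng")
        >>> text, remaining_kwargs = tokenizer.prepare_for_tokenization("Hello, World!", normalize=True)
        >>> # `text` is lower-cased; characters missing from the vocabulary are stripped
        ```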
        """
        normalize = normalize if normalize is not None else self.normalize

        if normalize:
            # normalise for casing
            text = self.normalize_text(text)

        filtered_text = self._preprocess_char(text)

        if has_non_roman_characters(filtered_text) and self.is_uroman:
            if not is_uroman_available():
                logger.warning(
                    "Text to the tokenizer contains non-Roman characters. To apply the `uroman` pre-processing "
                    "step automatically, ensure the `uroman` Romanizer is installed with: `pip install uroman`. "
                    "Note `uroman` requires python version >= 3.10. "
                    "Otherwise, apply the Romanizer manually as per the instructions: https://github.com/isi-nlp/uroman"
                )
            else:
                uroman = ur.Uroman()
                filtered_text = uroman.romanize_string(filtered_text)

        if self.phonemize:
            if not is_phonemizer_available():
                raise ImportError("Please install the `phonemizer` Python package to use this tokenizer.")

            filtered_text = phonemizer.phonemize(
                filtered_text,
                language="en-us",
                backend="espeak",
                strip=True,
                preserve_punctuation=True,
                with_stress=True,
            )
            filtered_text = re.sub(r"\s+", " ", filtered_text)
        elif normalize:
            # strip any chars outside of the vocab (punctuation)
            filtered_text = "".join(list(filter(lambda char: char in self.encoder, filtered_text))).strip()

        return filtered_text, kwargs

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a string by inserting the `<pad>` token at the boundary between adjacent characters."""
        tokens = list(text)

        if self.add_blank:
            interspersed = [self._convert_id_to_token(0)] * (len(tokens) * 2 + 1)
            interspersed[1::2] = tokens
            tokens = interspersed

        return tokens

    def convert_tokens_to_string(self, tokens: List[str]) -> str:
        if self.add_blank and len(tokens) > 1:
            tokens = tokens[1::2]
        return "".join(tokens)
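
    # Round-trip illustration: with `add_blank=True` and id 0 decoding to "<pad>",
    # `_tokenize("ab")` produces ["<pad>", "a", "<pad>", "b", "<pad>"], and
    # `convert_tokens_to_string` recovers "ab" by keeping `tokens[1::2]`.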

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index)

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Union[Tuple[str], None]:
        if not os.path.isdir(save_directory):
            logger.error(f"Vocabulary path ({save_directory}) should be a directory")
            return

        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False))

        return (vocab_file,)


__all__ = ["VitsTokenizer"]