"""Tokenization class for model MyT5."""

import json
import os
import warnings
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union

from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging


logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {"vocab_file": "byte_maps.json"}


class ByteRewriter:
    """
    Byte rewriter class for MyT5 tokenizer.
    This class is used to rewrite bytes using a hash tree. The hash tree is constructed from a set of rewriting rules.

    Args:
        rewriting_rules (`str` or `Dict[str, str]`):
            A path to a json file containing the rewriting rules or a dictionary containing the rewriting rules.

    z[LEAF]rewriting_rulesc                 C   s   t |tr t|d}t|}W d    n1 sw   Y  nt |ts.tdt| | || _	dd |
 D }| || _d S )NrzDrewriting_rules should be either a path to json file or a dict, got c                 S   s   i | ]\}}||qS  r   ).0kvr   r   Y/var/www/auris/lib/python3.10/site-packages/transformers/models/myt5/tokenization_myt5.py
<dictcomp>8   s    z)ByteRewriter.__init__.<locals>.<dictcomp>)
isinstancestropenjsonloaddict
ValueErrortypeconstruct_hash_tree	hash_treeitemsreverse_hash_tree)selfr   fZreverse_rewriting_rulesr   r   r   __init__.   s   

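
    # A toy illustration of the rule format (hypothetical rules; the real
    # decompose/merge rule sets ship in byte_maps.json): keys and values are
    # space-separated two-digit hex bytes, so {"61 62": "63"} rewrites the
    # UTF-8 byte pair of "ab" into the single byte of "c".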
zByteRewriter.__init__r   byte_in_sequencebyte_out_sequencec                 C   sH   | d}| d}|}|D ]}||vri ||< || }q||| j< dS )zL
        Add a leaf with the output byte sequence to the hash tree.
         N)splitLEAF)r"   r   r%   r&   Zbyte_in_listZbyte_out_listtree_pointerbr   r   r   add_leaf;   s   


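
    # E.g. (hypothetical rule) add_leaf(tree, "61 62", "63") leaves the tree with
    # tree["61"]["62"]["[LEAF]"] == ["63"], alongside the default single-byte
    # leaves tree[b]["[LEAF]"] == [b] installed by construct_hash_tree.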
zByteRewriter.add_leafreturnc                 C   sT   t t}dd tdD D ]
}|g|| | j< q| D ]\}}| ||| q|S )zE
        Construct a hash tree for rewritten byte sequences.
        """
        hash_tree = defaultdict(dict)
        for b in (f"{x:02x}" for x in range(256)):
            hash_tree[b][self.LEAF] = [b]

        for in_sequence, out_sequence in rewriting_rules.items():
            self.add_leaf(hash_tree, in_sequence, out_sequence)

        return hash_tree

    def search_hash_tree(self, byte_sequence: List[str]) -> Union[None, List[str]]:
        """
        Search the hash tree and return the rewritten byte sequence if found.
        N)r   r)   )r"   r3   r*   r+   r   r   r   search_hash_treeW   s   

zByteRewriter.search_hash_treeFin_bytesc           
      C   s   g }d}d}|t |k rS|s| jn| j}t|t |D ](}|| }||v r*|| }n||kr5|g}	|} n n| j|v rC|| j }	|}q||	 |d }|t |k s|S )a6  
        Rewrite a sequence of bytes using the hash tree.

        Args:
            in_bytes (`List[str]`): A list of bytes to be rewritten.
            reverse (`bool`): If True, decoding is performed with the reverse hash tree.
        Returns:
            `List[str]`: The rewritten byte sequence.
        """
        out_bytes = []
        b_start = 0
        b_end = 0

        while b_start < len(in_bytes):
            # Walk the forward tree when encoding, the reverse tree when decoding.
            tree_pointer = self.hash_tree if not reverse else self.reverse_hash_tree
            for j in range(b_start, len(in_bytes)):
                b = in_bytes[j]
                if b in tree_pointer:
                    tree_pointer = tree_pointer[b]
                elif j == b_start:
                    # Defensive fallback: copy the byte through unchanged.
                    cur_leaf = [b]
                    b_end = j
                    break
                else:
                    break
                if self.LEAF in tree_pointer:
                    # Remember the longest rule matched so far.
                    cur_leaf = tree_pointer[self.LEAF]
                    b_end = j

            out_bytes.extend(cur_leaf)
            b_start = b_end + 1

        return out_bytes
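
# A minimal sketch of ByteRewriter on its own, with a hypothetical one-rule map
# (the real decompose/merge rule sets ship in byte_maps.json):
#
#     rewriter = ByteRewriter({"61 62": "63"})
#     rewriter.rewrite_bytes(["61", "62", "64"])          # -> ["63", "64"]
#     rewriter.rewrite_bytes(["63", "64"], reverse=True)  # -> ["61", "62", "64"]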
*." r   c                
       sr  e Zd ZdZddgZeZ					d/	d0 fd
dZedd Z	dd Z
	d1dee deee  ded	ee f fddZdee d	ee fddZ	d2dee deee  d	ee fddZ	d2dee deee  d	ee fddZded	ee fddZd d! Zd"d# Zd$ee d	ee fd%d&Zd$ee d	ee fd'd(Zd)d* Zd2d+ed,ee d	ee fd-d.Z  ZS )3MyT5Tokenizera  
    Construct a MyT5 tokenizer.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`): The file containing the byte rewriting rules.
        eos_token (`str`, *optional*, defaults to `"</s>"`):
            The end of sequence token.

        unk_token (`str`, *optional*, defaults to `"<unk>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        pad_token (`str`, *optional*, defaults to `"<pad>"`):
            The token used for padding, for example when batching sequences of different lengths.
        extra_ids (`int`, *optional*, defaults to 125):
            Add a number of extra ids added to the end of the vocabulary for use as sentinels. These tokens are
            accessible as "<extra_id_{%d}>" where "{%d}" is a number between 0 and extra_ids-1. Extra tokens are
            indexed from the end of the vocabulary up to the beginning ("<extra_id_0>" is the last token in the
            vocabulary, like in ByT5 preprocessing; see
            [here](https://github.com/google-research/text-to-text-transfer-transformer/blob/9fd7b14a769417be33bc6c850f9598764913c833/t5/data/preprocessors.py#L2117)).
        additional_special_tokens (`List[str]`, *optional*):
            Additional special tokens used by the tokenizer.
    """

    model_input_names = ["input_ids", "attention_mask"]
    vocab_files_names = VOCAB_FILES_NAMES

    def __init__(
        self,
        vocab_file,
        eos_token="</s>",
        unk_token="<unk>",
        pad_token="<pad>",
        extra_ids=125,
        additional_special_tokens=None,
        **kwargs,
    ) -> None:
        # Add the extra_ids to the list of additional special tokens.
        if extra_ids > 0 and additional_special_tokens is None:
            additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
        elif extra_ids > 0 and additional_special_tokens is not None and len(additional_special_tokens) > 0:
            # Check that we have the right number of extra_id special tokens.
            extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
            if extra_tokens != extra_ids:
                raise ValueError(
                    f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are"
                    " provided to MyT5Tokenizer. In this case the additional_special_tokens must include the"
                    " extra_ids tokens"
                )

        pad_token = AddedToken(pad_token, lstrip=True, rstrip=True) if isinstance(pad_token, str) else pad_token
        eos_token = AddedToken(eos_token, lstrip=True, rstrip=True) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=True, rstrip=True) if isinstance(unk_token, str) else unk_token

        self._added_tokens_decoder = {0: pad_token, 1: eos_token, 2: unk_token}
        self.offset = len(self._added_tokens_decoder)
        self._utf_vocab_size = 2**8  # utf is 8 bits

        # Load the decompose/merge byte rewriting maps and build their rewriters.
        with open(vocab_file, "r") as f:
            self.byte_maps = json.load(f)
        self.decompose_rewriter = ByteRewriter(self.byte_maps["decompose_map"])
        self.merge_rewriter = ByteRewriter(self.byte_maps["merge_map"])

        super().__init__(
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            extra_ids=0,
            additional_special_tokens=additional_special_tokens,
            **kwargs,
        )

    @property
    def vocab_size(self):
        return self._utf_vocab_size

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size + self.offset)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        T)re   rf   rg   Nr   r6   )r[   get_special_tokens_maskr7   )r"   re   rf   rg   r]   r   r   rh      s   (z%MyT5Tokenizer.get_special_tokens_mask	token_idsc                 C   s>   t |dkr|d | jkrtd| j d |S || jg S )z.Do not add eos again if user already added it.r   zThis sequence already has zQ. In future versions this behavior may lead to duplicated eos tokens being added.)r7   eos_token_idwarningswarnrO   )r"   ri   r   r   r   _add_eos_if_not_present   s   z%MyT5Tokenizer._add_eos_if_not_presentc                 C   s<   | j g}|du rt|| dg S t|| | | dg S )a  
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. MyT5 does not
        make use of token type ids, therefore a list of zeros is returned.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of zeros.
        """
        eos = [self.eos_token_id]

        if token_ids_1 is None:
            return len(token_ids_0 + eos) * [0]
        return len(token_ids_0 + eos + token_ids_1 + eos) * [0]

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
        adding special tokens. A sequence has the following format:

        - single sequence: `X </s>`
        - pair of sequences: `A </s> B </s>`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        N)rn   )r"   re   rf   r   r   r    build_inputs_with_special_tokens  s
   

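
    # E.g. with the default eos id of 1: build_inputs_with_special_tokens([5, 6], [7])
    # returns [5, 6, 1, 7, 1], and get_special_tokens_mask([5, 6], [7]) returns the
    # matching mask [0, 0, 1, 0, 1].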

    def _tokenize(self, text: str, **kwargs) -> List[str]:
        """Take as input a string and return a list of strings (tokens) for words/sub-words.
        Represents tokens in two character hex format."""
        tokens = [f"{i:02x}" for i in text.encode("utf-8")]
        tokens = self.morphological_encode(tokens)
        return tokens
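
    # E.g. "ab" -> UTF-8 bytes b"ab" -> hex tokens ["61", "62"], which are then
    # rewritten by morphological_encode before ids are assigned.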
zMyT5Tokenizer._tokenizec                 C   s(   t |dkr
d}|S t|d| j }|S )z0Converts a token (str) in an id using the vocab.rN   N   )r7   intrV   )r"   tokenZtoken_idr   r   r   _convert_token_to_id>  s
   z"MyT5Tokenizer._convert_token_to_idc                 C   s   || j  d}|S )z=Converts an index (integer) in a token (str) using the vocab.r.   )rV   )r"   indexry   r   r   r   _convert_id_to_tokenH  s   z"MyT5Tokenizer._convert_id_to_tokenindicesc                 C   $   | j j|dd}| jj|dd}|S )NFr9   )rY   r;   rZ   r"   r}   r   r   r   rt   M     z"MyT5Tokenizer.morphological_encodec                 C   r~   )NTr   )rZ   r;   rY   r   r   r   r   morphological_decodeS  r   z"MyT5Tokenizer.morphological_decodec                 C   s   d}g }|D ] }|| j v r|| j |  q|| jv r!|| q|| q| |}t| j  t| jB }|D ]}||v rH|t|d7 }q:|t|7 }q:|jddd}|S )z:Converts a sequence of tokens (string) in a single string.    rr   ignore)errors)	Zadded_tokens_decoderappendrc   r   rT   valuesbytesfromhexdecode)r"   ru   bstringZ
out_tokensry   Z_added_tokensstringr   r   r   convert_tokens_to_stringY  s    


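
    # Rough round trip through the private helpers (hypothetical byte maps):
    #
    #     tokens = tokenizer._tokenize("ab")                         # e.g. ["61", "62"]
    #     ids = [tokenizer._convert_token_to_id(t) for t in tokens]  # hex value + offset
    #     back = [tokenizer._convert_id_to_token(i) for i in ids]
    #     tokenizer.convert_tokens_to_string(back)                   # -> "ab"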
z&MyT5Tokenizer.convert_tokens_to_stringsave_directoryfilename_prefixc                 C   s   t j|rt j||r|d ndtd  }n
|r|d nd| }t|ddd}|tj| j	ddd	 W d    |fS 1 sBw   Y  |fS )
N- r   wrr   )encodingrN   F)indentensure_ascii)
ospathisdirjoinVOCAB_FILES_NAMESr   writer   dumpsrX   )r"   r   r   r   writerr   r   r   save_vocabularyp  s   
zMyT5Tokenizer.save_vocabulary)rA   rB   rC   rD   N)r-   N)NFr_   )r<   r=   r>   r?   Zmodel_input_namesr   Zvocab_files_namesr$   propertyra   rd   r   rx   r   rJ   rh   rn   ro   rp   r   rv   rz   r|   rt   r   r   r   r   __classcell__r   r   r]   r   r@      sb    	.






(r@   )r?   r   r   rl   collectionsr   typingr   r   r   r   r   Ztokenization_utilsr	   r
   utilsr   Z
get_loggerr<   loggerr   r   r@   __all__r   r   r   r   <module>   s   
f 
v