o ZŽh<]ã@s0dZddlmZmZmZmZmZddlmZddl m Z ddlmZddl mZmZmZmZmZmZddlmZmZdd lmZmZdd lmZeƒrOddlZeƒrVddlZdZGd d„deddZ Gdd„deddZ!Gdd„deddZ"d"dd„Z#dd„Z$dd„Z%dd„Z&dd„Z'Gd d!„d!eƒZ(d!gZ)dS)#z Processor class for IDEFICS. é)ÚCallableÚDictÚListÚOptionalÚUnion)Úurlparseé)ÚBatchFeature)Ú ImageInput)ÚImagesKwargsÚProcessingKwargsÚProcessorMixinÚ TextKwargsÚUnpackÚ!_validate_images_text_input_order)ÚPreTokenizedInputÚ TextInput)Úis_tf_availableÚis_torch_available)Údeprecate_kwargNúc@s^eZdZUeeed<eeeefed<ee e ee fed<ee e ee fed<dS)ÚIdeficsImagesKwargsZ transformÚ image_sizeZ image_meanZ image_stdN)Ú__name__Ú __module__Ú__qualname__rrÚ__annotations__rÚstrÚintrÚfloatr©r r ú]/var/www/auris/lib/python3.10/site-packages/transformers/models/idefics/processing_idefics.pyr.s rF)Útotalc@s&eZdZUeeed<eeed<dS)ÚIdeficsTextKwargsÚ add_eos_tokenÚadd_end_of_utterance_tokenN)rrrrÚboolrr r r r!r#5s r#c@s6eZdZUeed<eed<ddddœiddidœZd S) ÚIdeficsProcessorKwargsÚtext_kwargsÚ images_kwargsFÚlongest)Zadd_special_tokensÚpaddingr$Úreturn_tensorsÚpt)r(r)Z common_kwargsN)rrrr#rrÚ _defaultsr r r r!r':s ý ùr'éÿÿÿÿcCsÊ|dkr|dkrd|||k<n |dkrt ||kd|¡}|dkr;|dk}d||<tjjj||d}d||dd…f<|S|dkrct |d¡}t |d|¡}tj||d}t |d¡}t |t |¡|¡}|S)Nr/r-Útfr©Únum_classes)Údepth) r0ÚwhereÚtorchÚnnZ functionalZone_hotÚequalZexpand_dimsZ zeros_like)Zincremental_maskr,r2Z negativesZ attn_maskZnegatives_expandedr r r!Ú$incremental_to_binary_attention_maskIs$ ør8cCs(|dkr t||ƒS|dkrt||ƒSdS)Nr-r0)Ú,image_attention_mask_for_packed_input_ids_ptÚ,image_attention_mask_for_packed_input_ids_tf)Ú input_idsÚ tokenizerr,r r r!Ú)image_attention_mask_for_packed_input_idscs ÿr=cCsvtj|dd}tj|dd}| t¡}|j}t| d¡ƒD]6}d}d}t||ƒD])\} } | |kr>|d7}|||| <d}n|||| <|rLd||| <| |krRd}q)qt| d¡ƒD][}d}d}t|| d¡dddƒD]-} ||| } | |kr‡|d7}|||| <d}n|||| <| |kr“d}|r›d||| <qn||dk}||||8<|||d9<q[||fS)Nr/)Z fill_valuerFéT)r5Z full_likeÚconvert_tokens_to_idsÚIMAGE_TOKENÚeos_token_idÚrangeÚsizeÚ enumerate)r;r<Úimage_attention_maskÚnext_image_attention_maskÚimage_token_idÚeod_token_idÚ batch_idxÚcountÚseen_eodÚidxÚtoken_idZnon_negative_indicesr r r!r9jsL €ô€r9cCs.| t¡}|j}t |¡d}t t |¡d¡}t t |¡d¡}t|ƒD]m}d}d} t |¡d} t| dddƒD]W}|||f ¡}||krc|d7}||gg} |g}t || |¡}t || |¡}n||kr|| s|d} d}||gg} |g}t || |¡}| r‘||kr‘||gg} dg}t || |¡}q:q%||fS)Nrr/Fr>T) r?r@rAr0ÚshapeÚfillrBÚnumpyÚtensor_scatter_nd_update)r;r<rGrHZ batch_sizerErFrIrJrKZ seq_lengthrLrMÚindicesÚupdatesr r r!r:™s< €ïr:cCs$d|vrdSt|ƒ}t|j|jgƒS)z…Checks if the passed string contains a valid url and nothing else. e.g. if space is included it's immediately invalidated the urlú F)rÚallÚschemeÚnetloc)ÚstringÚresultr r r!Úis_urlºsrZcsÔeZdZdZddgZddgZdZdZd‡fd d„ Ze dd ddd dde eeee ee eee fde eeeeeeeeeeeefdeedefdd„ƒZdd„Zdd„Zedd„ƒZ‡ZS)ÚIdeficsProcessorah Constructs a IDEFICS processor which wraps a LLama tokenizer and IDEFICS image processor into a single processor. [`IdeficsProcessor`] offers all the functionalities of [`IdeficsImageProcessor`] and [`LlamaTokenizerFast`]. See the docstring of [`~IdeficsProcessor.__call__`] and [`~IdeficsProcessor.decode`] for more information. Args: image_processor (`IdeficsImageProcessor`): An instance of [`IdeficsImageProcessor`]. The image processor is a required input. tokenizer (`LlamaTokenizerFast`): An instance of [`LlamaTokenizerFast`]. The tokenizer is a required input. image_size (`int`, *optional*, defaults to 224): Image size (assuming a square image) add_end_of_utterance_token (`str`, *optional*): The string representation of token representing end of utterance Úimage_processorr<rr%ZIdeficsImageProcessorZLlamaTokenizerFastNéàcs’|durtdƒ‚|durtdƒ‚tƒ ||¡|j|_t|dƒr#|jn| t¡|_|jj |jj |jj f|_d|jj dg¡vrDd|_dSd|_dS)Nz)You need to specify an `image_processor`.z"You need to specify a `tokenizer`.Úimage_tokenúZadditional_special_tokensTF)Ú ValueErrorÚsuperÚ__init__r\Zcurrent_processorÚhasattrrGr?r@Zimage_num_channelsrÚdefault_image_dimsr<Zspecial_tokens_mapÚgetÚ1tokenizer_was_trained_with_end_of_utterance_token)Úselfr\r<rr%Úkwargs©Ú __class__r r!rbÚs&ÿýýÿÿýzIdeficsProcessor.__init__Úpromptsz5.0.0ÚtextT)Zold_nameÚversionÚnew_nameZraise_if_both_namesÚimagesrhÚreturnc.sF|dur|durtdƒ‚t||ƒ\}}|dur|}nM|durgt|ttfƒs(|g}t|tƒr0|g}t|ttfƒrCt|ƒt|ƒkrCtdƒ‚tdd„|DƒƒsPtdƒ‚t|dttfƒr`dd „|Dƒ}tt||ƒƒ}|j t fd |jji|¤Ž}|d dd ¡}|d dd¡} | dur‹|j} tdd„|Dƒƒs—|g}d‰d‰d} ‡‡fdd„}g}g} |D]|}|jj›}g}d }d }t|ƒD]L\}}|dkrÉ|sÇdnd }t|tƒrú| d¡}t|ƒrë|j |¡}|||ƒ7}| |¡d}q»| ró|ró|| 7}||7}d }q»|||ƒ7}| |¡d}q»|r||jj7}|j|fi|d¤Ž}| |¡| |¡qª|d dd¡}|j|fi|d¤Ž}|d}|d}tdd„| Dƒƒ}td|ƒ}tdd„| Dƒƒdk}g}g}g}t||| ƒD]È\}} }!|}"|" |j¡}#t|#|ƒ}$|!d|$…}%t|%ƒdkrÞ|dkr¤tj|g|% ¡dd…¢RŽ}&|%|&d|% d¡…<nY|dkrÝt !|%¡dd…}'t j"|g|'gdd }(t j|(|%j#d!}&t !|%¡d})t $t %|)¡d"¡}*|%}+t &|&|*|+¡}&n|dkrîtj|g|j'¢RŽ}&n|dkrýt |g|j'¢R¡}&| |&¡|dkr| t (|"¡¡| t (| ¡¡qg|dkr.| t j)|"t j*d!¡| | ¡qg|dkrEt +|¡}t +|¡}t +|¡}n|dkrYt +|¡}t +|¡}t +|¡}|rmt,||j|ƒ\},}-t-|,||d#},n,|dkrƒtj|j!d|j!ddtj.d!},n|dkr™t j|j!d|j!ddft j.d!},t/||||,d$œd%S)&a§This method takes batched or non-batched prompts made of text and images and converts them into prompts that the model was trained on and prepares the image pixel values for the model to process. Args: images (`Union[ImageInput, List[ImageInput], str, List[str], List[List[str]]]`): either a single image or a batched list of images - can be passed in when text contains only text prompts, in order to use the image-text-to-text behavior. text (`Union[List[TextInput], [List[List[TextInput]]]]`): either a single prompt or a batched list of prompts - see the detailed description immediately after the end of the arguments doc section. return_tensors (`str` or `TensorType`, *optional*, defaults to `TensorType.PYTORCH`): The type of tensors to return. Can be one of: - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. Returns: a dict with entries: `input_ids`, `attention_mask`, `pixel_values`, `image_attention_mask` which can be directly passed to `model.generate` Detailed explanation: Each entry in `text` is either a text to be passed as is or an image that will be processed. An image can be either an image object (`PIL.Image`) or a url from which the image can be retrieved. When the processor encounters an image it'll inject `` entry into the prompt. Example: ```python checkpoint = "HuggingFaceM4/idefics-9b" processor = AutoProcessor.from_pretrained(checkpoint) url = "https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-in-grass-1593184777.jpg" img = processor.image_processor.fetch_images([url])[0] prompts = [ "User:", img, "Describe this image. Assistant: An image of two kittens in grass. ", "User:", "https://hips.hearstapps.com/hmg-prod/images/dog-puns-1581708208.jpg", "Describe this image. Assistant:", ] inputs = processor(text=prompts, return_tensors="pt") generated_ids = model.generate(**inputs, max_length=100) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] ``` In this example the `prompts` will be converted into: ``` User:Describe this image. Assistant: An image of two kittens in grass. User:Describe this image. Assistant:' ``` and the two images will be massaged using [`IdeficsImageProcessor.__call__`] method and placed inside the `pixel_values` dict entry of the return value. This example also exemplifies that images can be passed as objects or as text urls. It can be seen that the first image is passed as object and the second one as a url. To do training do: ```python image_transform = transforms.Compose( [ transforms.RandomResizedCrop( (w, h), scale=(0.9, 1.0), interpolation=transforms.InterpolationMode.BICUBIC ), transforms.ToTensor(), transforms.Normalize(mean=self.image_mean, std=self.image_std), ] ) inputs = processor(text=prompts, transform=image_transform, return_tensors="pt") ``` In order to help debug prompt generation enable `debug=True` which will show you what's happening. Nz9You need to specify either `text` or `images` and `text`.aWhen providing both images and text arguments, the number of text prompts should be the same as the number of images.If you want to have several images per prompt, images should be nested as such: images=[[img1, img2], [img3, img4], ...] for text=[prompt1, prompt2, ...].css|]}t|tƒVqdS©N)Ú isinstancer©Ú.0Úir r r!Ú ls€z,IdeficsProcessor.__call__..zQWhen using the image-text-to-text behavior, the prompts should only contain text.rcSsg|]}|g‘qSr r rsr r r!Ú psz-IdeficsProcessor.__call__..Ztokenizer_init_kwargsr(r$Fr%css|] }t|ttfƒVqdSrq)rrÚlistÚtuplersr r r!rv€s€zrr_cs|rˆˆSˆˆˆSrqr )Úlast_was_image©Z fake_tokenr^r r!Úimage_tokens‡sz/IdeficsProcessor.__call__..image_tokensTrTr)r,r-r;Úattention_maskcsó|]}t|ƒVqdSrq©Úlen©rtÚxr r r!rv½ó€r>csr~rqrrr r r!rvÀrƒr0)Zaxis)Údtype)r/r>r1)r;r}Zpixel_valuesrE)Údata)0r`rrrrxryrr€rUÚzipZ _merge_kwargsr'r<Zinit_kwargsÚpoprfÚanyZ bos_tokenrDÚstriprZr\Zfetch_imagesÚappendZ eos_tokenÚmaxÚsumrJrGÚminr5ZzerosrCr0rNÚconcatr„ZreshaperBrQrdZtensorZconvert_to_tensorZint32Ústackr=r8r&r ).rgrorlZaudioZvideosrhrkZ output_kwargsr$r%Zend_of_utterance_tokenr|Zall_promptsZ all_imagesÚsampleÚ full_textZ image_objectsrzZ last_was_textruÚitemÚimager,Ú text_encodingZ all_textsZall_attention_masksZmax_num_imagesZat_least_one_imageZoutput_input_idsZ output_imagesZoutput_attention_masksZtext_singler}Zextracted_imagesZpadded_input_idsZimage_countZlocal_max_num_imagesZcurrent_imagesZpadded_image_tensorZimage_shapeZpadded_shapeZ num_imagesrRrSrEÚ_r r{r!Ú__call__ôsb ÿÿþý € € ÿÿ ÿ ÿüÿzIdeficsProcessor.__call__cOó|jj|i|¤ŽS)zÂ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please refer to the docstring of this method for more information. )r<Úbatch_decode©rgÚargsrhr r r!r˜ózIdeficsProcessor.batch_decodecOr—)z¼ This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to the docstring of this method for more information. )r<Údecoder™r r r!rœr›zIdeficsProcessor.decodecCs"|jj}|jj}tt ||¡ƒSrq)r<Úmodel_input_namesr\rxÚdictÚfromkeys)rgZtokenizer_input_namesZimage_processor_input_namesr r r!rsz"IdeficsProcessor.model_input_names)Nr]N)NNNN)rrrÚ__doc__Ú attributesZvalid_kwargsZimage_processor_classZtokenizer_classrbrrr rrrrrr'r r–r˜rœÚpropertyrÚ __classcell__r r rir!r[ÃsFô þ ûÿý óòr[)r/)*r ÚtypingrrrrrÚurllib.parserZfeature_extraction_utilsr Zimage_utilsr Zprocessing_utilsrrr rrrZtokenization_utils_baserrÚutilsrrZutils.deprecationrr5Z tensorflowr0r@rr#r'r8r=r9r:rZr[Ú__all__r r r r!Ús4 /! c