PhotoMakerEncode

Documentation

  • Class name: PhotoMakerEncode
  • Category: _for_testing/photomaker
  • Output node: False

The PhotoMakerEncode node integrates visual identity information from an image into text prompt embeddings. It projects the image's pixel values into an embedding space, strips the "photomaker" trigger word from the tokenized prompt, and applies a fusion module that blends the image embedding into the text embeddings at the trigger's position.
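
At a high level, the flow looks like the sketch below. It is illustrative only: project, encode, and fuse are hypothetical stand-ins for the PhotoMakerIDEncoder and CLIP internals, not ComfyUI's actual API.

def photomaker_encode_sketch(photomaker, image, clip, text):
    image_embed = photomaker.project(image)           # pixels -> embedding space (hypothetical)
    text_embeds = clip.encode(clip.tokenize(text))    # prompt -> embeddings (hypothetical names)
    return photomaker.fuse(text_embeds, image_embed)  # fusion module blends visual and textual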

Input types

Required

  • photomaker
    • The PhotoMaker model that encodes the image's visual information into a form that can be fused with the text embeddings.
    • Comfy dtype: PHOTOMAKER
    • Python dtype: PhotoMakerIDEncoder
  • image
    • The reference (ID) image whose visual information is encoded and fused with the text embeddings.
    • Comfy dtype: IMAGE
    • Python dtype: torch.Tensor
  • clip
    • The CLIP model used to tokenize and encode the text prompt.
    • Comfy dtype: CLIP
    • Python dtype: comfy.sd.CLIP
  • text
    • The text prompt describing the image. It should contain the trigger word "photomaker"; if the trigger is missing, the node returns the plain text conditioning unchanged (see the snippet after this list).
    • Comfy dtype: STRING
    • Python dtype: str
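
For example, with the node's default prompt the trigger word is the third word, and the node resolves its 1-based word id like this (mirroring the source below):

text = "photograph of photomaker"
words = text.split(" ")                 # ["photograph", "of", "photomaker"]
index = words.index("photomaker") + 1   # 1-based word id -> 3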

Output types

  • conditioning
    • The text conditioning after the image's identity embedding has been fused in, combining textual and visual cues.
    • Comfy dtype: CONDITIONING
    • Python dtype: list[list[torch.Tensor | dict]]
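
Concretely, the value returned is ComfyUI's standard list-of-pairs conditioning, as the source below shows:

conditioning = [[cond, {"pooled_output": pooled}]]  # cond: token embeddings, pooled: pooled CLIP output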

Usage tips

  • Infra type: GPU
  • Common nodes: unknown

Source code

import torch

import comfy.clip_vision


class PhotoMakerEncode:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": { "photomaker": ("PHOTOMAKER",),
                              "image": ("IMAGE",),
                              "clip": ("CLIP", ),
                              "text": ("STRING", {"multiline": True, "dynamicPrompts": True, "default": "photograph of photomaker"}),
                             }}

    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "apply_photomaker"

    CATEGORY = "_for_testing/photomaker"

    def apply_photomaker(self, photomaker, image, clip, text):
        special_token = "photomaker"
        # Preprocess the image into CLIP-vision pixel values on the model's device.
        pixel_values = comfy.clip_vision.clip_preprocess(image.to(photomaker.load_device)).float()
        # Find the trigger word's position in the prompt. Word ids from the
        # tokenizer are 1-based, hence the +1; -1 means no trigger word.
        try:
            index = text.split(" ").index(special_token) + 1
        except ValueError:
            index = -1
        # Tokenize with word ids, drop every token belonging to the trigger
        # word, and pad with the trailing token to keep the sequence length.
        tokens = clip.tokenize(text, return_word_ids=True)
        out_tokens = {}
        for k in tokens:
            out_tokens[k] = []
            for t in tokens[k]:
                f = list(filter(lambda x: x[2] != index, t))
                while len(f) < len(t):
                    f.append(t[-1])
                out_tokens[k].append(f)

        # Encode the trigger-stripped prompt to get the base text conditioning.
        cond, pooled = clip.encode_from_tokens(out_tokens, return_pooled=True)

        if index > 0:
            # Mask the token position of the class word just before the removed
            # trigger so the fusion module injects the ID image embedding there.
            token_index = index - 1
            num_id_images = 1
            class_tokens_mask = [True if token_index <= i < token_index+num_id_images else False for i in range(77)]
            out = photomaker(id_pixel_values=pixel_values.unsqueeze(0), prompt_embeds=cond.to(photomaker.load_device),
                            class_tokens_mask=torch.tensor(class_tokens_mask, dtype=torch.bool, device=photomaker.load_device).unsqueeze(0))
        else:
            # No trigger word found: return the plain text conditioning unchanged.
            out = cond

        return ([[out, {"pooled_output": pooled}]], )
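
A hypothetical standalone invocation, assuming photomaker, id_image, and clip were produced by upstream nodes (e.g. PhotoMakerLoader, LoadImage, and a checkpoint loader) in a ComfyUI graph:

node = PhotoMakerEncode()
(conditioning,) = node.apply_photomaker(
    photomaker=photomaker,  # PHOTOMAKER model
    image=id_image,         # reference IMAGE tensor
    clip=clip,              # CLIP model
    text="photograph of a person photomaker, studio lighting",
)
# conditioning can then be wired into a sampler's positive input.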