  • Class name: LLMMultiModalImageEvaluation
  • Category: SALT/Language Toolkit/Querying
  • Output node: False

This node evaluates images against a query using a multimodal LLM. It accepts image paths or documents, loads the images, and sends them to the model together with the query, returning a text assessment of the images.

Input types

Required

  • llm_model
    • The model that performs the evaluation. The node reads the model object from the "llm" key of this dict and raises a ValueError if it is missing.
    • Comfy dtype: LLM_MODEL
    • Python dtype: dict
  • image_documents
    • The image documents to evaluate. They are sent to the model together with the query.
    • Comfy dtype: DOCUMENT
    • Python dtype: list
  • llm_message
    • The query, given as a list of messages. System and user messages are combined into a single prompt that guides the evaluation.
    • Comfy dtype: LIST
    • Python dtype: list

Optional

  • max_tokens
    • The maximum number of tokens the model may generate, which caps the length of the result. Accepts 1 to 4096 and defaults to 1024.
    • Comfy dtype: INT
    • Python dtype: int
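
The input handling described above can be sketched as a small standalone function. This is a minimal sketch that mirrors the checks in the node's source; `normalize_inputs` and `FakeModel` are hypothetical names introduced only for illustration and are not part of the node.

```python
def normalize_inputs(llm_model, max_tokens=1024):
    """Mirror the node's checks: default max_tokens to 1024, require an "llm" entry."""
    # A missing, zero, or non-integer max_tokens falls back to the default.
    if not max_tokens or not isinstance(max_tokens, int):
        max_tokens = 1024

    # The model object is expected under the "llm" key of the LLM_MODEL dict.
    model = llm_model.get("llm", None)
    if not model:
        raise ValueError("unable to detect valid model")
    return model, max_tokens


class FakeModel:
    """Hypothetical stand-in for a real multimodal model object."""


fake = FakeModel()
model, tokens = normalize_inputs({"llm": fake}, max_tokens=None)
```

Passing `max_tokens=None` here yields the default of 1024, and an `llm_model` dict without an "llm" entry raises a ValueError.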

Output types

  • response
    • The evaluation result text from the LLM model, providing a text-based assessment of the images.
    • Comfy dtype: STRING
    • Python dtype: str
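
As an illustration of the output shape, the node returns a one-element tuple whose single item is the response text. The sketch below uses a hypothetical `EchoModel` stand-in, and the `complete(prompt, image_documents)` signature is an assumption made for this example rather than a confirmed API.

```python
from types import SimpleNamespace


class EchoModel:
    """Hypothetical stand-in; a real model would inspect the images."""

    def complete(self, prompt, image_documents):
        # Return an object with a .text attribute, like the response the node reads.
        summary = f"{len(image_documents)} image(s): {prompt.strip()}"
        return SimpleNamespace(text=summary)


def run_node(model, prompt, image_documents):
    # Assumed call shape: prompt plus the list of image documents.
    response = model.complete(prompt, image_documents)
    # Outputs are wrapped in a tuple, one item per declared output name.
    return (response.text,)


result = run_node(EchoModel(), "USER: Describe the image.\n\n", ["img-1", "img-2"])
(response_text,) = result
```

Downstream code can unpack the tuple as shown to obtain the plain string.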

Usage tips

  • Infra type: GPU
  • Common nodes: unknown

Source code

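# Assumed import path (llama-index 0.10 or later) for the message types used below.
from llama_index.core.llms import ChatMessage, MessageRole

class LLMMultiModalImageEvaluation:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "llm_model": ("LLM_MODEL",),
                "image_documents": ("DOCUMENT",),
                "llm_message": ("LIST",),
            },
            "optional": {
                "max_tokens": ("INT", {"min": 1, "max": 4096, "default": 1024})
            },
        }

    RETURN_TYPES = ("STRING",)
    RETURN_NAMES = ("response", )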

    FUNCTION = "complete"

    def complete(self, llm_model, image_documents, llm_message, max_tokens=1024):

        if not max_tokens or not isinstance(max_tokens, int):
            max_tokens = 1024

        model = llm_model.get("llm", None)

        if not model:
            raise ValueError("LLMMultiModalImageEvaluation unable to detect valid model")

        prompt = ""
        llm_message = sorted(llm_message, key=lambda message: message.role.value)
        for msg in llm_message:
            if isinstance(msg, ChatMessage) and msg.role == MessageRole.SYSTEM:
                if "SYSTEM:" not in prompt:
                    prompt += "SYSTEM: "
                prompt += msg.content + "\n\n"
            if isinstance(msg, ChatMessage) and msg.role == MessageRole.USER:
                if "USER:" not in prompt:
                    prompt += "USER: "
                prompt += msg.content + "\n\n"
            if isinstance(msg, str):
                prompt += msg + "\n\n"

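        # The call's arguments are cut off in this listing; passing the prompt and
        # the image documents is an assumption based on the node description.
        response = model.complete(prompt, image_documents)
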
        return (response.text, )
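
The prompt assembly in `complete` can be tried in isolation. The following is a minimal sketch, assuming stand-in `MessageRole` and `ChatMessage` types defined locally so that it runs without any LLM library; `build_prompt` is a hypothetical helper name, not part of the node.

```python
from dataclasses import dataclass
from enum import Enum


class MessageRole(str, Enum):
    """Stand-in for the library's role enum (only the two roles used here)."""
    SYSTEM = "system"
    USER = "user"


@dataclass
class ChatMessage:
    """Stand-in for the library's chat message type."""
    role: MessageRole
    content: str


def build_prompt(messages):
    prompt = ""
    # Sorting by role value places "system" before "user" (alphabetical order).
    ordered = sorted(messages, key=lambda message: message.role.value)
    for msg in ordered:
        if msg.role == MessageRole.SYSTEM:
            # The label is added only once, before the first system message.
            if "SYSTEM:" not in prompt:
                prompt += "SYSTEM: "
            prompt += msg.content + "\n\n"
        if msg.role == MessageRole.USER:
            if "USER:" not in prompt:
                prompt += "USER: "
            prompt += msg.content + "\n\n"
    return prompt


messages = [
    ChatMessage(MessageRole.USER, "Describe the image."),
    ChatMessage(MessageRole.SYSTEM, "You are a careful image reviewer."),
]
prompt = build_prompt(messages)
# prompt == "SYSTEM: You are a careful image reviewer.\n\nUSER: Describe the image.\n\n"
```

One detail worth knowing: in the listing, the sort key reads `message.role`, so a plain-string entry in `llm_message` would raise an AttributeError during sorting, before the `isinstance(msg, str)` branch is reached.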