SVD_img2vid_Conditioning¶
Documentation¶
- Class name:
SVD_img2vid_Conditioning
- Category:
conditioning/video_models
- Output node:
False
This node generates the conditioning data used to drive image-to-video generation with Stable Video Diffusion (SVD) img2vid models. It encodes an initial image with a CLIP vision model and a VAE, combines the result with the video parameters (resolution, frame count, motion bucket, fps, augmentation level), and returns positive and negative conditioning together with an empty latent batch that seeds the sampling of the video frames.
Input types¶
Required¶
clip_vision
- The CLIP vision model used to encode the initial image into image embeddings; these embeddings condition the video model on the content of the input image.
- Comfy dtype:
CLIP_VISION
- Python dtype:
torch.nn.Module
init_image
- The initial image from which the video will be generated, serving as the starting point for the video generation process.
- Comfy dtype:
IMAGE
- Python dtype:
torch.Tensor
vae
- The Variational Autoencoder (VAE) used to encode the resized initial image into a latent; this latent is attached to the conditioning as the image the video is generated from.
- Comfy dtype:
VAE
- Python dtype:
torch.nn.Module
width
- The desired width of the video frames to be generated, allowing for customization of the video's resolution.
- Comfy dtype:
INT
- Python dtype:
int
height
- The desired height of the video frames, enabling control over the video's aspect ratio and resolution.
- Comfy dtype:
INT
- Python dtype:
int
video_frames
- Specifies the number of frames to be generated for the video, determining the video's length.
- Comfy dtype:
INT
- Python dtype:
int
motion_bucket_id
- Controls how much motion the model puts into the generated video; higher values generally produce more movement. The value is passed through to the model as part of the conditioning.
- Comfy dtype:
INT
- Python dtype:
int
fps
- The frames per second (fps) rate for the video, influencing the smoothness and realism of the generated video.
- Comfy dtype:
INT
- Python dtype:
int
augmentation_level
- The strength of the Gaussian noise added to the initial image before VAE encoding; higher values let the generated frames deviate further from the input image (see the sketch after this list).
- Comfy dtype:
FLOAT
- Python dtype:
float
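For a concrete sense of how `width`, `height`, and `augmentation_level` interact, the sketch below mirrors the preprocessing the node applies before VAE encoding (compare the `encode` method in the source code further down). It is a minimal illustration with made-up tensors; the node itself resizes with `comfy.utils.common_upscale` and encodes through the supplied `vae`.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the node's inputs (not real loader outputs).
init_image = torch.rand(1, 512, 512, 3)   # ComfyUI IMAGE layout: [batch, H, W, C], values in 0..1
width, height = 1024, 576
augmentation_level = 0.1

# Resize to the requested resolution (the node uses comfy.utils.common_upscale).
pixels = init_image.movedim(-1, 1)                          # -> [batch, C, H, W]
pixels = F.interpolate(pixels, size=(height, width), mode="bilinear")
pixels = pixels.movedim(1, -1)[:, :, :, :3]                 # -> [batch, H, W, C], RGB only

# Noise augmentation: higher levels let the video deviate further from the input image.
if augmentation_level > 0:
    pixels = pixels + torch.randn_like(pixels) * augmentation_level

# `pixels` would then be passed to vae.encode(...) to produce the concat latent.
```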
Output types¶
positive
- Comfy dtype:
CONDITIONING
- The positive conditioning data, consisting of encoded features and parameters for guiding the video generation process in a desired direction.
- Python dtype:
List[Tuple[torch.Tensor, Dict]]
negative
- Comfy dtype:
CONDITIONING
- The negative conditioning data, providing a contrast to the positive conditioning, which can be used to avoid certain patterns or features in the generated video.
- Python dtype:
List[Tuple[torch.Tensor, Dict]]
latent
- Comfy dtype:
LATENT
- A zero-initialized latent batch with one latent per requested video frame; the sampler fills these in during video generation (see the sketch after this list).
- Python dtype:
Dict
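The structures below are a schematic of what the `encode` method returns, with illustrative sizes for the default 1024x576, 14-frame settings; the pooled-embedding size depends on the CLIP vision model and is assumed here.

```python
import torch

# Illustrative structures only; real tensors come from the CLIP vision model and VAE.
pooled = torch.zeros(1, 1, 1024)                        # pooled CLIP vision embedding (size assumed)
concat_latent = torch.zeros(1, 4, 576 // 8, 1024 // 8)  # VAE-encoded init image

extra = {
    "motion_bucket_id": 127,
    "fps": 6,
    "augmentation_level": 0.0,
    "concat_latent_image": concat_latent,
}

positive = [[pooled, extra]]                   # CONDITIONING
negative = [[torch.zeros_like(pooled),         # zeroed embedding and zeroed concat latent
             {**extra, "concat_latent_image": torch.zeros_like(concat_latent)}]]
latent = {"samples": torch.zeros(14, 4, 576 // 8, 1024 // 8)}  # LATENT: one latent per frame
```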
Usage tips¶
- Infra type:
GPU
- Common nodes:
Source code¶
```python
import torch

import comfy.utils
import nodes


class SVD_img2vid_Conditioning:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"clip_vision": ("CLIP_VISION",),
                             "init_image": ("IMAGE",),
                             "vae": ("VAE",),
                             "width": ("INT", {"default": 1024, "min": 16, "max": nodes.MAX_RESOLUTION, "step": 8}),
                             "height": ("INT", {"default": 576, "min": 16, "max": nodes.MAX_RESOLUTION, "step": 8}),
                             "video_frames": ("INT", {"default": 14, "min": 1, "max": 4096}),
                             "motion_bucket_id": ("INT", {"default": 127, "min": 1, "max": 1023}),
                             "fps": ("INT", {"default": 6, "min": 1, "max": 1024}),
                             "augmentation_level": ("FLOAT", {"default": 0.0, "min": 0.0, "max": 10.0, "step": 0.01})
                             }}

    RETURN_TYPES = ("CONDITIONING", "CONDITIONING", "LATENT")
    RETURN_NAMES = ("positive", "negative", "latent")
    FUNCTION = "encode"
    CATEGORY = "conditioning/video_models"

    def encode(self, clip_vision, init_image, vae, width, height, video_frames, motion_bucket_id, fps, augmentation_level):
        # Encode the init image with the CLIP vision model and keep the pooled embedding.
        output = clip_vision.encode_image(init_image)
        pooled = output.image_embeds.unsqueeze(0)

        # Resize to the requested resolution and drop any alpha channel.
        pixels = comfy.utils.common_upscale(init_image.movedim(-1, 1), width, height, "bilinear", "center").movedim(1, -1)
        encode_pixels = pixels[:, :, :, :3]

        # Optionally add Gaussian noise before VAE encoding (noise augmentation).
        if augmentation_level > 0:
            encode_pixels += torch.randn_like(pixels) * augmentation_level

        # VAE-encode the image; this latent is concatenated into the conditioning.
        t = vae.encode(encode_pixels)

        positive = [[pooled, {"motion_bucket_id": motion_bucket_id, "fps": fps, "augmentation_level": augmentation_level, "concat_latent_image": t}]]
        negative = [[torch.zeros_like(pooled), {"motion_bucket_id": motion_bucket_id, "fps": fps, "augmentation_level": augmentation_level, "concat_latent_image": torch.zeros_like(t)}]]

        # One empty latent per frame for the sampler to denoise.
        latent = torch.zeros([video_frames, 4, height // 8, width // 8])
        return (positive, negative, {"samples": latent})
```
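A minimal usage sketch, assuming `clip_vision`, `vae`, and `init_image` have already been produced upstream (for example by the usual loader nodes); it simply calls the `encode` method above directly.

```python
# Hypothetical direct call outside a ComfyUI graph; clip_vision, vae, and
# init_image are assumed to come from loader nodes or your own loading code.
node = SVD_img2vid_Conditioning()
positive, negative, latent = node.encode(
    clip_vision=clip_vision,   # CLIP_VISION model
    init_image=init_image,     # IMAGE tensor, [batch, H, W, C]
    vae=vae,                   # VAE model
    width=1024,
    height=576,
    video_frames=14,
    motion_bucket_id=127,
    fps=6,
    augmentation_level=0.0,
)
# positive/negative plug into a sampler's conditioning inputs;
# latent["samples"] is the empty per-frame latent batch the sampler denoises.
```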