Struggling to get this to run on a 24gb gpu

#3
by CodeExplode - opened

The default script seemingly loads everything to the GPU which is a bit too heavy for a 24gb GPU due to the large text encoder. I've tried to write a standalone basic inference script, which loads and moves models manually in sequence, by looking through the pipeline code and trying to work out what options are used and what's training related, versus what's needed for basic inference.

I'm currently up to the point where it seems to use CFG by default, but I can't find a guidance scale to use (and I'm still not sure about things like implementing the shift yet).
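For anyone else stuck at the same step, here is a minimal sketch of the two pieces mentioned above: the standard classifier-free guidance combination and the static timestep shift used by flow-matching schedulers. The function names are hypothetical and the guidance scale of 5.0 is only a placeholder, not the model's recommended value. The shift of 15.0 and the evenly spaced sigmas are inferred from the scheduler config and step timestamps in the log posted further down this thread, so treat them as an assumption rather than the pipeline's confirmed behaviour.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Standard classifier-free guidance: start from the unconditional
    prediction and move towards the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def shift_sigma(sigma: float, shift: float) -> float:
    """Static flow-matching shift: sigma' = s * sigma / (1 + (s - 1) * sigma)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_sigmas(num_steps: int, shift: float) -> List[float]:
    """Evenly spaced sigmas 1.0, 1 - 1/n, ..., 1/n, each passed through the shift.

    Assumption: this spacing is inferred from the logged timesteps, not read
    from the pipeline source.
    """
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in base]


if __name__ == "__main__":
    # Placeholder guidance scale, purely for illustration.
    print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=5.0))
    # Second sigma of a 50-step schedule with shift 15.0, scaled to a timestep.
    print(shifted_sigmas(50, 15.0)[1] * 1000)
```

With 50 steps and a shift of 15.0, the second value comes out at about 998.6413, which matches the timestep logged for step 2 later in the thread.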

Is it possible to use the pipeline with a sequential offloading strategy? The model being 2B is incredibly promising, and it seems it should be doable, but some of the (very interesting) training tricks used are making it a bit hard to get through the code to implement a more basic bare bones inference script.
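To make the question concrete, this is roughly what I mean by a sequential offloading strategy: only one component sits on the GPU at a time. It's a minimal sketch, not the pipeline's own mechanism. `on_device` is a hypothetical helper that works with anything exposing a torch-style `.to(device)` method, and the component names in the trailing comments are placeholders.

```python
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def on_device(module, device: str = "cuda", offload_device: str = "cpu") -> Iterator[object]:
    """Move `module` to `device` for the duration of the block, then move it back.

    The module is moved back even if the body raises, so a failed step does
    not leave weights stranded on the GPU.
    """
    module.to(device)
    try:
        yield module
    finally:
        module.to(offload_device)


# Intended usage (hypothetical component names, shown as comments because the
# real objects need the model weights):
#
#   with on_device(text_encoder):
#       prompt_embeds = ...   # encode the prompt
#   with on_device(transformer):
#       latents = ...         # run the denoising loop
#   with on_device(vae):
#       frames = ...          # decode latents to frames
```

Freed blocks can still be held by the CUDA caching allocator after a module is moved off, so calling `torch.cuda.empty_cache()` between stages may be needed to actually see the memory drop.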

edit: My very early attempt at basic inference with offloading in sequence: https://pastebin.com/BcEVTiXH

Motif Technologies org

Thanks for raising this, and also for sharing your attempt.

We have a ComfyUI-based lower-memory setup that seems to be working well so far. We’re currently validating it properly, and once that’s done we’ll update the repo/docs with the recommended way to run it under lower VRAM. @gkalstn0

Really appreciate the feedback.

Motif Technologies org

I appreciate your interest! As beomgyu mentioned, we're working on cleaning up a lower-memory setup we've been experimenting with. You are correct: the current inference script is the vanilla version of inference. Sorry for the misleading README.

btw:

some of the (very interesting) training tricks used are making it a bit hard to get through the code to implement a more basic bare bones inference script.

The "training tricks" we used (to fill the gap in resources, both data and compute) are in fact not used during inference. Methods like TREAD and REPA are only applied at training time, so the inference code remains unchanged. If you are interested in the details, please refer to our technical report.

Again, thank you for the feedback and interest!

Thanks, a Comfy implementation sounds perfect. I suspect this model might be really well received given its size.

Bit of a rambling thought while waiting for this, but the way that REPA works got me thinking about similar optimizations regarding the text encoder, in this case for size. Could it theoretically be possible to remove the text model altogether, by just cloning the input embeddings and tokenizer, and then training a mini network to produce the same cross-attention vectors (keys / values?), trained against what is produced by running prompts through Gemma and then, I think, the first few isolated text layers in Motif to get a target to train towards? It seems that potentially a lot of the functionality of the larger model might not be needed for video gen, especially with a 512 max length, and so a sort of post-training step could potentially remove the text encoder entirely after it was used to create a stable basis. And then further training of the full model (including the new text path) against video targets again could potentially achieve something better suited to the end task.

The encoder used is a bit overkill compared to the model.

It's an interesting experiment that I've not seen tested, to see if a SD1.X sized model (and a video model at that) with a powerful text encoder and presumably higher quality captions works well, which it appears it does (particularly given that it's video, and yet relatively tiny).

Motif Technologies org

Hi @CodeExplode
we've just pushed enable_model_cpu_offload() support to the repo. You no longer need to write a custom offloading script. See: Memory-efficient Inference

ComfyUI nodes are also coming in the next 1–2 days.

Thanks for looking into it. enable_model_cpu_offload is what I tried originally, though for some reason the vram usage still peaks at 100% on a 24gb GPU, and it seems to be using 100% cuda but not getting past step 0, which made me think the whole model isn't fitting on the GPU (since it seems it shouldn't be too hard to run the Gemma model fairly fast on an RTX 3090).
However thinking about it further, it might be due to the extra vram requirements to run the Gemma model on top of just loading the weights, and I might need to wait for a Comfy implementation and then use an fp8 scaled version of the Gemma model (or even a q4 version, which is how I run ~30b size Gemma models locally). That being said, the fact that your tests showed a 19gb peak is interesting, and I'll keep investigating.
edit: Running just Gemma locally with this script peaked at 15.2gb, so it's definitely possible https://pastebin.com/qpNHTyhy
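For reference, a quick way to estimate the weights-only footprint at different precisions. This is a rough sketch: it ignores activations, quantization scales and allocator overhead, `weight_gib` is a hypothetical helper, and 2e9 is just the nominal "2B" size of the DiT. The text encoder's parameter count isn't stated in this thread, so plug that in yourself.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "q4": 0.5}


def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GiB for `num_params` parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("bf16", "fp8", "q4"):
        # Nominal 2B-parameter model; real checkpoints will differ somewhat.
        print(f"{dtype}: {weight_gib(2e9, dtype):.2f} GiB")
```

A nominal 2B model in bf16 is only about 3.7 GiB of weights, so whatever fills the rest of a 24gb card is something other than the DiT's weights.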

Looking at the inference code again I see a guider initialization which I missed, so I can probably implement a pipeline-free inference script which should work now.
edit 2: It's not related to the text encoder sorry, or rather once offloading is enabled it then runs into the next issue which is that the 2B model seems to require >24GB of vram to run at the default res. I'll probably just have to wait for the ComfyUI nodes.

edit 3: It definitely runs with a single frame, and struggles at 5 frames. Comfy must have some trick to make this doable with larger models.

Motif Technologies org
edited 28 days ago

Hi @CodeExplode
we just pushed a fix for enable_model_cpu_offload(). There was a bug where the text encoder's offload hook wasn't firing properly, which would explain the 100% VRAM usage you saw.

Two things to try:

  1. Clear your HF cache so you get the updated pipeline:
    rm -rf ~/.cache/huggingface/modules/diffusers_modules/local/

  2. Set this env var before running; it prevents CUDA allocator fragmentation that can push VRAM over 24GB:
    export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

With both of these, we measured ~19GB peak at 720p/121 frames on our end. Should fit on your 4090/3090.
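If you launch from a Python script rather than a shell, the two steps above can be sketched as follows. This is only an illustration: `clear_cached_pipeline` is a hypothetical helper, and the default path simply mirrors the `rm -rf` command above. The environment variable has to be set before torch initialises CUDA, so it goes at the very top of the script.

```python
import os
import shutil
from pathlib import Path

# Set before `import torch` so the CUDA allocator picks it up.
# setdefault leaves any value the user has already exported untouched.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Same location as the `rm -rf` command above.
DEFAULT_CACHE = Path.home() / ".cache/huggingface/modules/diffusers_modules/local"


def clear_cached_pipeline(cache_dir: Path = DEFAULT_CACHE) -> bool:
    """Delete cached custom pipeline code so the next load re-downloads it.

    Returns True if a directory was removed, False if there was nothing to do.
    """
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
        return True
    return False


# clear_cached_pipeline()  # uncomment to actually delete the cache
```

To compare against the ~19GB figure, `torch.cuda.max_memory_allocated()` after a run reports the peak allocated by PyTorch in bytes.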

ComfyUI nodes are also coming in the next couple of days.

Sorry I should have mentioned that I get a warning "expandable_segments not supported on this platform," though from what I understand of it, it doesn't seem likely to be able to cause a major difference here as the GPU is starting completely empty when the transformer loads.
The offloading definitely works now and allowed the text model to run and then unload, but even with it offloaded (only loading the 2B DiT model) it just can't handle processing sequence lengths more than ~5 frames long, at the default sample resolution.
It's possible that some core torch optimization like a newer version of flash attention isn't working on my machine, which might explain the difference. The paper mentioned an attention trick of doing a second cross-attention step in the single-stream blocks with the tokens against the video latents, which might have also had a custom implementation which means not getting the usual vram savings of torch's scaled_dot_product_attention implementation, but I haven't checked how it's implemented in the code yet. (Just checked and the code is using scaled_dot_product_attention)
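A rough back-of-the-envelope for why frame count matters so much here. The scale factors, patch sizes and head count are taken from the model configs in the log posted further down this thread. The latent-frame formula is my assumption about how the causal VAE compresses time, text tokens are ignored, and both function names are hypothetical.

```python
def latent_token_count(
    width: int,
    height: int,
    num_frames: int,
    spatial_scale: int = 8,   # VAE scale_factor_spatial from the log
    temporal_scale: int = 4,  # VAE scale_factor_temporal from the log
    patch_size: int = 2,      # transformer patch_size from the log
    patch_size_t: int = 1,    # transformer patch_size_t from the log
) -> int:
    """Number of video tokens the transformer attends over."""
    # Assumption: the first frame is kept, the rest are compressed in groups.
    latent_frames = (num_frames - 1) // temporal_scale + 1
    latent_h = height // spatial_scale
    latent_w = width // spatial_scale
    return (latent_frames // patch_size_t) * (latent_h // patch_size) * (latent_w // patch_size)


def naive_attention_scores_gib(tokens: int, num_heads: int = 12, bytes_per_elem: int = 2) -> float:
    """Size of a fully materialized tokens x tokens score matrix across all heads."""
    return num_heads * tokens * tokens * bytes_per_elem / 1024**3


if __name__ == "__main__":
    for frames in (1, 5, 121):
        n = latent_token_count(1280, 736, frames)
        print(frames, n, f"{naive_attention_scores_gib(n):.2f} GiB")
```

At 1280x736 and 121 frames this gives 114,080 video tokens, and a naive implementation that materializes the bf16 score matrix for all 12 heads at once would need around 291 GiB, so which attention kernel gets selected makes an enormous difference.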

Motif Technologies org
edited 30 days ago

We’ve removed the T5Gemma2 decoder weights from the checkpoint, so loading the model should now require less GPU memory.
Also, please update to transformers==5.5.4 to support importing T5Gemma2Encoder in our code.
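A small sketch for checking the installed version before loading the pipeline. It treats 5.5.4 as a minimum, which is an assumption (the line above pins it exactly), and the helper names are hypothetical. The parser only reads the leading numeric part of the version, so pre-release suffixes are handled loosely.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric release segments, e.g. '5.6.0.dev0' -> (5, 6, 0)."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric segment such as 'dev0'
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(package: str, minimum: str) -> bool:
    """True if `package` is installed and its version is at least `minimum`."""
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    print("transformers:", installed_version("transformers"))
    print("new enough for T5Gemma2Encoder:", meets_minimum("transformers", "5.5.4"))
```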

Hello, I also ran into the same issue when using enable_model_cpu_offload(): the pipeline builds fine, but then it gets stuck at step 0... I am on an RTX 5090m.

It says modular diffusers is currently an experimental feature under active development. I see it eats around 19gb ram and sits at 100% utilization.

I have it running, it just needs some time..

2%|▏ | 1/50 [02:05<1:42:45, 125.83s/it][HAMI-core Msg(31:133165477787328:olares_client.c:479)]: GPU Utilization = 100 %
But it's super slow, i.e. it needs about 2 hrs?
Is this expected, or does it need quantization?
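For what it's worth, the 2-hour figure is just the per-step time multiplied out. A tiny sketch of that arithmetic, using the numbers from the progress bar above (helper names are hypothetical):

```python
def remaining_seconds(sec_per_it: float, done: int, total: int) -> float:
    """Projected time left, assuming every remaining step takes `sec_per_it`."""
    return sec_per_it * (total - done)


def fmt_hms(seconds: float) -> str:
    """Format seconds as H:MM:SS, truncating any fractional second."""
    whole = int(seconds)
    hours, rem = divmod(whole, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # 1 of 50 steps done at 125.83 s/it, as in the progress bar above.
    print(fmt_hms(remaining_seconds(125.83, done=1, total=50)))  # -> 1:42:45
```

So getting the total down means reducing the per-step time (a faster attention path, fewer frames or a lower resolution) or the step count.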

If you try a small number of frames (e.g. 1-5) it should be fast. The issue seems to be that the frames and necessary model activations can't be fit on the GPU at once. Somehow Comfy manages it with much larger models which use the same VAE to encode the frames, so hopefully the Comfy implementation will be much faster.

Motif Technologies org

Hi @slai1988 @CodeExplode ,

Thank you for your interest and the detailed reports; they've been really helpful in identifying and prioritizing these issues.
We've pushed two updates that should help:

  1. Attention mask removal
    We removed the attention_mask argument from scaled_dot_product_attention calls for batch_size=1 inference.
    This allows PyTorch SDPA to select the Flash Attention backend (requires torch >= 2.0, CUDA compute capability >= 8.0) instead of falling back to the math/memory-efficient kernel. Previously the mask was forcing a slower attention path.

  2. FP8 weight quantization guide
    Using torchao's Float8WeightOnlyConfig, you can store transformer weights in FP8 (8-bit) while keeping all computation in BF16. See fp8-weight-quantization-optional.
    On our Hopper GPU, this reduced peak VRAM from ~19 GB to ~15 GB. We're currently verifying on RTX GPUs and will update the guide accordingly.

  3. ComfyUI
    Slightly delayed, but we're planning to release official custom nodes this week.
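On item 1, the reason dropping the mask is safe for batch_size=1 is that a mask which keeps every key has no effect on the result. The toy below illustrates that in pure Python. It is not the repo's code and not how PyTorch's scaled_dot_product_attention is implemented, just a small reference attention for checking the claim. It assumes at least one key is kept per query.

```python
import math
from typing import List, Optional

Vector = List[float]


def attention(
    q: List[Vector],
    k: List[Vector],
    v: List[Vector],
    mask: Optional[List[List[bool]]] = None,
) -> List[Vector]:
    """Single-head scaled dot-product attention on plain lists.

    mask[i][j] is True when query i may attend to key j.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        if mask is not None:
            # Masked-out keys get -inf so they receive zero softmax weight.
            scores = [s if keep else float("-inf") for s, keep in zip(scores, mask[i])]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total for d in range(len(v[0]))])
    return out


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    keep_all = [[True, True, True], [True, True, True]]
    print(attention(q, k, v) == attention(q, k, v, keep_all))
```

Once any entry of the mask is False the outputs differ, which is why the mask is only dropped in the unpadded, single-prompt case.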

@slai1988 The 2-hour inference time on RTX 5090m is likely caused by SDPA falling back to the math kernel instead of Flash Attention.
Please try again after clearing your HF cache (rm -rf ~/.cache/huggingface/modules/diffusers_modules/) to pick up the latest pipeline update.

Thanks for the update.

I deleted the modules folder, and also cleared the Motif pipeline .py so it will re-download.
I logged the output below from Hugging Face:

Loading pipeline components...: 0%| | 0/6 [00:00<?, ?it/s]Instantiating MotifVideoTransformer3DModel model under default dtype torch.bfloat16.
Updating config from {'in_channels': 33, 'out_channels': 16, 'num_attention_heads': 12, 'attention_head_dim': 128, 'num_layers': 12, 'num_single_layers': 24, 'num_decoder_layers': 8, 'mlp_ratio': 4.0, 'patch_size': 2, 'patch_size_t': 1, 'qk_norm': 'rms_norm', 'norm_type': 'layer_norm', 'text_embed_dim': 2560, 'image_embed_dim': 1152, 'pooled_projection_dim': None, 'rope_theta': 10000.0, 'rope_axes_dim': [16, 56, 56], 'base_latent_size': None, 'enable_text_cross_attention_dual': False, 'enable_text_cross_attention_single': True} to {'in_channels': 33, 'out_channels': 16, 'num_attention_heads': 12, 'attention_head_dim': 128, 'num_layers': 12, 'num_single_layers': 24, 'num_decoder_layers': 8, 'mlp_ratio': 4.0, 'patch_size': 2, 'patch_size_t': 1, 'qk_norm': 'rms_norm', 'norm_type': 'layer_norm', 'text_embed_dim': 2560, 'image_embed_dim': 1152, 'pooled_projection_dim': None, 'rope_theta': 10000.0, 'rope_axes_dim': [16, 56, 56], 'base_latent_size': None, 'enable_text_cross_attention_dual': False, 'enable_text_cross_attention_single': True, '_class_name': 'MotifVideoTransformer3DModel', '_diffusers_version': '0.36.0', '_library': 'diffusers'}
All model checkpoint weights were used when initializing MotifVideoTransformer3DModel.

All the weights of MotifVideoTransformer3DModel were initialized from the model checkpoint at /models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/transformer.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MotifVideoTransformer3DModel for predictions without further training.
Updating config from {'in_channels': 33, 'out_channels': 16, 'num_attention_heads': 12, 'attention_head_dim': 128, 'num_layers': 12, 'num_single_layers': 24, 'num_decoder_layers': 8, 'mlp_ratio': 4.0, 'patch_size': 2, 'patch_size_t': 1, 'qk_norm': 'rms_norm', 'norm_type': 'layer_norm', 'text_embed_dim': 2560, 'image_embed_dim': 1152, 'pooled_projection_dim': None, 'rope_theta': 10000.0, 'rope_axes_dim': [16, 56, 56], 'base_latent_size': None, 'enable_text_cross_attention_dual': False, 'enable_text_cross_attention_single': True, '_class_name': 'MotifVideoTransformer3DModel', '_diffusers_version': '0.36.0', '_library': 'diffusers'} to {'in_channels': 33, 'out_channels': 16, 'num_attention_heads': 12, 'attention_head_dim': 128, 'num_layers': 12, 'num_single_layers': 24, 'num_decoder_layers': 8, 'mlp_ratio': 4.0, 'patch_size': 2, 'patch_size_t': 1, 'qk_norm': 'rms_norm', 'norm_type': 'layer_norm', 'text_embed_dim': 2560, 'image_embed_dim': 1152, 'pooled_projection_dim': None, 'rope_theta': 10000.0, 'rope_axes_dim': [16, 56, 56], 'base_latent_size': None, 'enable_text_cross_attention_dual': False, 'enable_text_cross_attention_single': True, '_class_name': 'MotifVideoTransformer3DModel', '_diffusers_version': '0.36.0', '_library': 'diffusers', '_name_or_path': '/models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/transformer'}
Loaded transformer as MotifVideoTransformer3DModel from transformer subfolder of Motif-Technologies/Motif-Video-2B.

Loading pipeline components...: 17%|█▋ | 1/6 [00:00<00:02, 1.67it/s]Loaded tokenizer as GemmaTokenizer from tokenizer subfolder of Motif-Technologies/Motif-Video-2B.

Loading pipeline components...: 33%|███▎ | 2/6 [00:02<00:04, 1.09s/it]Instantiating AutoencoderKLWan model under default dtype torch.bfloat16.
Updating config from {'base_dim': 96, 'decoder_base_dim': None, 'z_dim': 16, 'dim_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_scales': [], 'temperal_downsample': [False, True, True], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'is_residual': False, 'in_channels': 3, 'out_channels': 3, 'patch_size': None, 'scale_factor_temporal': 4, 'scale_factor_spatial': 8} to {'base_dim': 96, 'decoder_base_dim': None, 'z_dim': 16, 'dim_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_scales': [], 'temperal_downsample': [False, True, True], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'is_residual': False, 'in_channels': 3, 'out_channels': 3, 'patch_size': None, 'scale_factor_temporal': 4, 'scale_factor_spatial': 8, '_class_name': 'AutoencoderKLWan', '_diffusers_version': '0.35.2', '_name_or_path': 'Wan-AI/Wan2.2-T2V-A14B-Diffusers'}
All model checkpoint weights were used when initializing AutoencoderKLWan.

All the weights of AutoencoderKLWan were initialized from the model checkpoint at /models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/vae.
If your task is similar to the task the model of the checkpoint was trained on, you can already use AutoencoderKLWan for predictions without further training.
Updating config from {'base_dim': 96, 'decoder_base_dim': None, 'z_dim': 16, 'dim_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_scales': [], 'temperal_downsample': [False, True, True], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'is_residual': False, 'in_channels': 3, 'out_channels': 3, 'patch_size': None, 'scale_factor_temporal': 4, 'scale_factor_spatial': 8, '_class_name': 'AutoencoderKLWan', '_diffusers_version': '0.35.2', '_name_or_path': 'Wan-AI/Wan2.2-T2V-A14B-Diffusers'} to {'base_dim': 96, 'decoder_base_dim': None, 'z_dim': 16, 'dim_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_scales': [], 'temperal_downsample': [False, True, True], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'is_residual': False, 'in_channels': 3, 'out_channels': 3, 'patch_size': None, 'scale_factor_temporal': 4, 'scale_factor_spatial': 8, '_class_name': 'AutoencoderKLWan', '_diffusers_version': '0.35.2', '_name_or_path': '/models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/vae'}
Loaded vae as AutoencoderKLWan from vae subfolder of Motif-Technologies/Motif-Video-2B.
loading configuration file /models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/text_encoder/config.json
text_config is None, using default T5Gemma2EncoderTextConfig text config.
vision_config is None, using default SiglipVisionConfig vision config.
Model config T5Gemma2EncoderConfig {
"architectures": [
"T5Gemma2Encoder"
],
"attention_dropout": 0.0,
"boi_token_index": 255999,
"dropout_rate": 0.0,
"dtype": "bfloat16",
"eoi_token_index": 256000,
"image_token_index": 256001,
"initializer_range": 0.02,
"mm_tokens_per_image": 256,
"model_type": "t5gemma2_encoder",
"text_config": {
"_sliding_window_pattern": 6,
"add_cross_attention": false,
"attention_bias": false,
"attention_dropout": 0.0,
"attn_logit_softcapping": null,
"bos_token_id": 2,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"dropout_rate": 0.0,
"dtype": "bfloat16",
"eos_token_id": 1,
"final_logit_softcapping": null,
"finetuning_task": null,
"head_dim": 256,
"hidden_activation": "gelu_pytorch_tanh",
"hidden_size": 2560,
"initializer_range": 0.02,
"intermediate_size": 10240,
"is_decoder": false,
"layer_types": [
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"full_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"full_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"full_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"full_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"full_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention",
"sliding_attention"
],
"max_position_embeddings": 131072,
"model_type": "t5gemma2_text",
"num_attention_heads": 8,
"num_hidden_layers": 34,
"num_key_value_heads": 4,
"pad_token_id": 0,
"prefix": null,
"query_pre_attn_scalar": 256,
"rms_norm_eps": 1e-06,
"rope_parameters": {
"full_attention": {
"factor": 8.0,
"rope_theta": 1000000,
"rope_type": "linear"
},
"sliding_attention": {
"rope_theta": 10000,
"rope_type": "default"
}
},
"sep_token_id": null,
"sliding_window": 1024,
"task_specific_params": null,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"use_bidirectional_attention": false,
"use_cache": true,
"vocab_size": 262144
},
"tie_word_embeddings": true,
"transformers_version": "5.5.4",
"vision_config": {
"add_cross_attention": false,
"attention_dropout": 0.0,
"bos_token_id": null,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"dropout_rate": 0.0,
"dtype": "bfloat16",
"eos_token_id": null,
"finetuning_task": null,
"hidden_act": "gelu_pytorch_tanh",
"hidden_size": 1152,
"image_size": 896,
"intermediate_size": 4304,
"is_decoder": false,
"layer_norm_eps": 1e-06,
"model_type": "siglip_vision_model",
"num_attention_heads": 16,
"num_channels": 3,
"num_hidden_layers": 27,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"sep_token_id": null,
"task_specific_params": null,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"vision_use_head": false,
"vocab_size": 262144
},
"vocab_size": 262144
}

loading weights file /models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/text_encoder/model.safetensors

Loading weights: 0%| | 0/884 [00:00<?, ?it/s]
Loading weights: 100%|██████████| 884/884 [00:00<00:00, 13182.23it/s]
Loaded text_encoder as T5Gemma2Encoder from text_encoder subfolder of Motif-Technologies/Motif-Video-2B.

Loading pipeline components...: 67%|██████▋ | 4/6 [00:02<00:00, 2.14it/s]loading configuration file /models/huggingface/hub/models--Motif-Technologies--Motif-Video-2B/snapshots/dd6b2de778909218c72ad7b9541086e2cfe5977b/feature_extractor/preprocessor_config.json
Image processor SiglipImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": false,
"do_resize": true,
"image_mean": [
0.5,
0.5,
0.5
],
"image_processor_type": "SiglipImageProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"height": 896,
"width": 896
}
}

Loaded feature_extractor as SiglipImageProcessor from feature_extractor subfolder of Motif-Technologies/Motif-Video-2B.
Updating config from {'num_train_timesteps': 1000, 'shift': 15.0, 'use_dynamic_shifting': False, 'base_shift': 0.5, 'max_shift': 1.15, 'base_image_seq_len': 256, 'max_image_seq_len': 4096, 'invert_sigmas': False, 'shift_terminal': None, 'use_karras_sigmas': False, 'use_exponential_sigmas': False, 'use_beta_sigmas': False, 'time_shift_type': 'exponential', 'stochastic_sampling': False} to {'num_train_timesteps': 1000, 'shift': 15.0, 'use_dynamic_shifting': False, 'base_shift': 0.5, 'max_shift': 1.15, 'base_image_seq_len': 256, 'max_image_seq_len': 4096, 'invert_sigmas': False, 'shift_terminal': None, 'use_karras_sigmas': False, 'use_exponential_sigmas': False, 'use_beta_sigmas': False, 'time_shift_type': 'exponential', 'stochastic_sampling': False, '_class_name': 'FlowMatchEulerDiscreteScheduler', '_diffusers_version': '0.36.0'}
Loaded scheduler as FlowMatchEulerDiscreteScheduler from scheduler subfolder of Motif-Technologies/Motif-Video-2B.

Loading pipeline components...: 100%|██████████| 6/6 [00:02<00:00, 2.70it/s]
Updating config from {'vae': ('diffusers', 'AutoencoderKLWan')} to {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder')}
Updating config from {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder')} to {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer')}
Updating config from {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer')} to {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel')}
Updating config from {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel')} to {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel'), 'scheduler': ('diffusers', 'FlowMatchEulerDiscreteScheduler')}
Updating config from {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel'), 'scheduler': ('diffusers', 'FlowMatchEulerDiscreteScheduler')} to {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel'), 'scheduler': ('diffusers', 'FlowMatchEulerDiscreteScheduler'), 'feature_extractor': ('transformers', 'SiglipImageProcessor')}
Updating config from {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel'), 'scheduler': ('diffusers', 'FlowMatchEulerDiscreteScheduler'), 'feature_extractor': ('transformers', 'SiglipImageProcessor')} to {'vae': ('diffusers', 'AutoencoderKLWan'), 'text_encoder': ('transformers', 'T5Gemma2Encoder'), 'tokenizer': ('transformers', 'GemmaTokenizer'), 'transformer': ('diffusers_modules.local.transformer_motif_video', 'MotifVideoTransformer3DModel'), 'scheduler': ('diffusers', 'FlowMatchEulerDiscreteScheduler'), 'feature_extractor': ('transformers', 'SiglipImageProcessor'), '_name_or_path': 'Motif-Technologies/Motif-Video-2B'}
2026-04-22 15:08:22,682 INFO motifvideo2bone: from_pretrained done in 5.0s
2026-04-22 15:08:22,774 INFO motifvideo2bone: after_from_pretrained cuda_mem allocated=0.000GiB reserved=0.000GiB
2026-04-22 15:08:22,774 INFO motifvideo2bone: enable_model_cpu_offload()
2026-04-22 15:08:22,793 INFO motifvideo2bone: device placement done in 0.0s
2026-04-22 15:08:22,793 INFO motifvideo2bone: after_device_placement cuda_mem allocated=0.000GiB reserved=0.000GiB
2026-04-22 15:08:22,793 INFO motifvideo2bone: pipeline ready in 5.1s
2026-04-22 15:08:22,793 INFO motifvideo2bone: starting inference w=1280 h=736 frames=121 steps=50 offload=True seed=42
2026-04-22 15:08:22,793 INFO motifvideo2bone: before_inference cuda_mem allocated=0.000GiB reserved=0.000GiB
Guiders are currently an experimental feature under active development. The API is subject to breaking changes in future releases.

0%| | 0/50 [00:00<?, ?it/s]Modular Diffusers is currently an experimental feature under active development. The API is subject to breaking changes in future releases.
2026-04-22 15:10:31,032 INFO motifvideo2bone: diffusion step 1/50 timestep=tensor(1000., device='cuda:0') dt=128.239s

2%|▏ | 1/50 [02:05<1:42:36, 125.65s/it]
2026-04-22 15:11:55,880 INFO motifvideo2bone: diffusion step 2/50 timestep=tensor(998.6413, device='cuda:0') dt=84.848s

4%|▍ | 2/50 [04:08<1:39:17, 124.11s/it]

Still projecting to 2 hrs..

Am I doing anything wrong?

Motif Technologies org

Hi @slai1988 @CodeExplode

Sorry for the late reply, and thanks for your patience!
We just added GGUF quantized weights and SageAttention support. GGUF saves up to 2.7 GB VRAM, and SageAttention gives ~1.6x faster inference, both with no quality loss.
We're also currently testing on RTX GPUs.

Also, for those waiting on ComfyUI: we're planning to open the custom nodes today.
There's a slight delay on the ComfyUI Manager PR merge, but we'll release it as-is for now.

Details here: GGUF + SageAttention
