Any plans for a new version?
Since version 0.4 was released over a year ago, I was wondering if there are any plans for a new release?
Work in progress, expected in March if everything goes well.
Can't wait! I hope it's going well. I know the latest small Qwen 3.5 models are really good too.
Baking, qwen3.5 as the base. 4b model first.
Have you ever considered doing a version for a great new anime model called Anima, which is actually co-developed by Comfy Org? It's in its second preview stage, but it's already very promising, with some great checkpoints and LoRAs available. However, it only uses the Qwen3 0.6B base as its text encoder, which is still miles better than the standard CLIP you get with Illustrious/SDXL/NoobAI (I haven't tried your Gemma adapter yet).
I think it could now benefit from a model like Qwen 3.5 as the text encoder. The creator of the model has said he's looking into this on the community page of the official Anima Hugging Face repo. I'll quote his message: "I am actually running a parallel experiment with changing the text encoder to Qwen3.5-2B-Base.
It is fairly straightforward to train a new LLM adapter from scratch to align to the existing text embeddings and produce coherent images. I've already done this and it works.
What takes (potentially) much longer, is fully recovering the character details and artist knowledge after switching to the new text encoder. A surprising amount of knowledge, especially for styles, is contained in the LLM adapter, not the DiT. It has to relearn all this knowledge from the full dataset. I will see how fast it is able to recover the knowledge.
It's entirely possible, and even likely, that it would just take way too much time to fully adapt to a new text encoder and so I don't go with this option. There's also some reasons to believe that the model is bottlenecked in other ways and that Qwen3-0.6b isn't actually hurting quality or prompt comprehension that much. But I am investigating whether it's feasible to switch to Qwen3.5-2B."
So it's definitely worth trying, but again, the creator might be doing this already.
I hope you're having a great year so far. Thank you for all your hard work.
Not just considered it, but actually trained it. However, due to a lack of time, and wanting first to achieve more complete dataset coverage with fresh characters, content, and several extra things (which is proceeding in parallel), I decided to focus on training the LLM first.
In general, 'replacing' the TE in the Anima model can be done in two ways: training a small adapter model that makes tensors from the new encoder mimic the original conditioning, or reinitializing the cross-attention (xattn) layers and internal adapter layers to match the new encoder's dimensions and training them. The complexity and potential outcomes differ between these options.
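The first option (a small adapter that mimics the original conditioning) can be sketched roughly as below. This is a minimal illustration, not Anima's actual code: the dimensions, the MLP shape, and the teacher/student setup are all my assumptions; the real pipeline would feed actual encoder outputs for matching prompts rather than random tensors.

```python
import torch
import torch.nn as nn

# Illustrative, assumed dimensions (not Anima's real ones):
NEW_TE_DIM = 2048   # hidden size of the new, larger encoder
OLD_TE_DIM = 1024   # conditioning width the DiT was trained on

class TEAdapter(nn.Module):
    """Small MLP mapping new-encoder hidden states into the old
    conditioning space that the frozen DiT already understands."""
    def __init__(self, d_in: int, d_out: int, d_hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_in),
            nn.Linear(d_in, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_in) hidden states from the new encoder
        return self.net(h)

adapter = TEAdapter(NEW_TE_DIM, OLD_TE_DIM)

# One training step: make adapted embeddings mimic the original
# conditioning for the same prompt (teacher = frozen old TE output,
# student = new TE output passed through the adapter). Random tensors
# stand in for real encoder outputs here.
new_h = torch.randn(2, 77, NEW_TE_DIM)  # new encoder, same prompt
old_h = torch.randn(2, 77, OLD_TE_DIM)  # old encoder, same prompt
loss = nn.functional.mse_loss(adapter(new_h), old_h)
loss.backward()
```

The appeal of this route is that the DiT stays frozen; the downside, as noted above, is that knowledge baked into the original adapter/encoder pairing still has to be recovered by further training.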
However, what seems both important and interesting is the possibility of combining these approaches by additionally incorporating a CLIP model. This could potentially improve the mixing of various styles, similar to what many users are used to in SDXL, leveraging the inherent capabilities of CLIP models themselves. The input should be optional and remain masked when not in use. Furthermore, instead of CLIP, a more complex network could be employed to provide additional control over composition and style, including features like vibe transfer.
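The "optional and masked when not in use" idea could look something like this: CLIP tokens are projected into the conditioning space and appended after the LLM tokens, and when no CLIP input is supplied, the style slots are zeroed and masked out so the DiT's cross-attention ignores them. All names and sizes here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

COND_DIM = 1024      # conditioning width the DiT consumes (assumed)
CLIP_DIM = 768       # CLIP hidden size (assumed)
N_STYLE_TOKENS = 8   # style slots reserved for CLIP (assumed)

# Project CLIP features into the same space as the LLM conditioning.
clip_proj = nn.Linear(CLIP_DIM, COND_DIM)

def build_conditioning(llm_tokens, clip_tokens=None):
    """llm_tokens: (B, T, COND_DIM); clip_tokens: optional
    (B, N_STYLE_TOKENS, CLIP_DIM). Returns the concatenated
    conditioning and a boolean mask for cross-attention."""
    B, T, _ = llm_tokens.shape
    if clip_tokens is None:
        # CLIP branch off: zeroed slots, masked out of attention.
        style = llm_tokens.new_zeros(B, N_STYLE_TOKENS, COND_DIM)
        style_mask = torch.zeros(B, N_STYLE_TOKENS, dtype=torch.bool)
    else:
        style = clip_proj(clip_tokens)
        style_mask = torch.ones(B, N_STYLE_TOKENS, dtype=torch.bool)
    cond = torch.cat([llm_tokens, style], dim=1)
    mask = torch.cat(
        [torch.ones(B, T, dtype=torch.bool), style_mask], dim=1
    )
    return cond, mask  # mask would feed the DiT's cross-attention

llm = torch.randn(2, 77, COND_DIM)
cond_off, mask_off = build_conditioning(llm)  # no CLIP input
cond_on, mask_on = build_conditioning(
    llm, torch.randn(2, N_STYLE_TOKENS, CLIP_DIM)
)
```

Keeping the style slots masked by default means the base prompt path behaves identically whether or not the CLIP branch is trained or attached, which is what makes the input genuinely optional.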
AFAIK, some work like this has already been done, which is awesome. I just think it would be better to leverage the CLIP part only for style and related conditioning, leaving all the main work to the LLM.
There are many great and interesting possibilities. But for now, let's take things step by step, especially since the author is already working on this. Thanks for the interesting suggestions and kind words! I'd be glad to know more when you have any further results and ideas.