@omarkamali on Hugging Face: "I just might have cracked tokenizer-free LLMs. No vocab, no softmax. I'm…"

posted an update 4 days ago

I just might have cracked tokenizer-free LLMs. No vocab, no softmax.

I'm training a 22M params LLM rn to test this "thing" and it's able to formulate coherent sentences 🤯

Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.

Check the explainer video to understand what's happening. Feedback welcome on this approach!

unmodeled-tyler

3 days ago

Cool! any chance at sharing a repo so others can play around with it? I'd love to give it a try! 😀

omarkamali

3 days ago

That's planned! I'm just running a few more experiments before locking it in.

Will let you know first @unmodeled-tyler :)

kiidfreak

3 days ago

Nicee.. How do you map from continuous outputs to readable text?

omarkamali

3 days ago

I added a decoding head to the LLM, so the MLP generates a latent word vector that gets decoded by a GRU into a valid word.

I'm using the same input representation and training a joint encoder-decoder, which gets further fine-tuned as part of the "Next Latent Prediction"(?) objective, and it seems pretty decent for a first shot. Still working out some of the kinks.
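
To make the decoding step above concrete, here is a minimal sketch of a GRU that unrolls a latent word vector into characters. It is not the author's code: the character alphabet, the hidden size, the use of the end-of-word symbol as the start symbol, greedy argmax decoding, and every name (`GRUCharDecoder`, `HIDDEN`, and so on) are assumptions made for illustration, and the weights are random rather than trained. The post does not say what units the GRU emits; characters are assumed here.

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # assumed output alphabet
EOS = len(ALPHABET)                       # index of the end-of-word symbol
VOCAB = len(ALPHABET) + 1                 # characters plus end-of-word
HIDDEN = 16                               # assumed latent / GRU state size


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCharDecoder:
    """Hypothetical decoding head: latent word vector -> string of characters."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # One input matrix and one recurrent matrix per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(HIDDEN, VOCAB, rng) for g in "zrn"}
        self.U = {g: rand_matrix(HIDDEN, HIDDEN, rng) for g in "zrn"}
        self.out = rand_matrix(VOCAB, HIDDEN, rng)  # state -> character scores

    def step(self, x, h):
        """One GRU update (biases omitted for brevity)."""
        wz, uz = matvec(self.W["z"], x), matvec(self.U["z"], h)
        wr, ur = matvec(self.W["r"], x), matvec(self.U["r"], h)
        z = [sigmoid(a + b) for a, b in zip(wz, uz)]
        r = [sigmoid(a + b) for a, b in zip(wr, ur)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        wn, un = matvec(self.W["n"], x), matvec(self.U["n"], reset_h)
        n = [math.tanh(a + b) for a, b in zip(wn, un)]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def decode(self, latent, max_len=12):
        """Greedy decoding: the latent vector seeds the GRU hidden state."""
        h = list(latent)
        prev = EOS  # assumption: reuse the end-of-word symbol as the start symbol
        chars = []
        for _ in range(max_len):
            x = [1.0 if i == prev else 0.0 for i in range(VOCAB)]  # one-hot input
            h = self.step(x, h)
            scores = matvec(self.out, h)
            prev = max(range(VOCAB), key=scores.__getitem__)  # argmax, no softmax
            if prev == EOS:
                break
            chars.append(ALPHABET[prev])
        return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(1)
    latent = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # stand-in for the MLP output
    print(repr(GRUCharDecoder(seed=0).decode(latent)))
```

With untrained weights the output is gibberish (possibly an empty string); the point is only the data flow, in which a single continuous vector is expanded into a variable-length word without any vocabulary-sized output layer.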

omarkamali

3 days ago

Quick update, it seems to mostly work as intended 🤯

More details here:
https://x.com/OmarKamali/status/2036932984226320748

alfredo-ottomate

2 days ago

You just killed 23 dyslexic people (and counting) with that video, be ca use of the we ird wo rd split ting. hahaha

Jokes aside, this looks absolutely amazing, but I think tokenizers are there because this might not work fast enough at scale. I'd be excited and extremely happy to be proven wrong, because the concept is certainly great.

omarkamali

2 days ago

I knowww. Need to fix the video pipeline lol

Thanks @alfredo-ottomate ! In principle, it should be faster than a conventional LLM at the same scale while also using less VRAM. Mostly because it removes the softmax layer, which is one of the more expensive operations in standard language models. It also removes the embedding table, which usually accounts for roughly 10-20% of the parameters. For example, in Qwen 3.5 4B, that’s about 700M embedding parameters eliminated.

Raw performance-wise, I expect roughly a 10% per-token generation speedup, about 10% less VRAM usage, and better use of the context window, since each token is a full word rather than a subword piece.

The question then is how many parameters my replacement mechanism will ultimately need to stay competitive. The approach is already working surprisingly well at around 4M parameters, which is about 0.6% of the ~700M embedding parameters it would replace in a 4B model. Even if that number grows, the efficiency upside still looks very promising.

Fingers crossed! ✌︎
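
The parameter arithmetic in the comment above can be checked in a few lines. This is only a back-of-the-envelope sketch: the three figures are the round numbers quoted in the thread (a 4B-parameter model, about 700M embedding parameters, about 4M parameters for the replacement), not measured values, and the variable names are made up for illustration.

```python
# Round figures quoted in the thread (approximate, not measured).
total_params = 4_000_000_000       # "4B total"
embedding_params = 700_000_000     # "about 700M embedding parameters"
replacement_params = 4_000_000     # "around 4M parameters"

# Share of the model taken up by the embedding table.
embedding_share = embedding_params / total_params

# Size of the replacement mechanism relative to the table it removes.
replacement_ratio = replacement_params / embedding_params

# Parameters saved after paying for the replacement.
net_saved = embedding_params - replacement_params

print(f"embedding share of model:  {embedding_share:.1%}")    # 17.5%
print(f"replacement vs embeddings: {replacement_ratio:.1%}")  # 0.6%
print(f"net parameters saved:      {net_saved:,}")            # 696,000,000
```

The 17.5% share sits inside the 10-20% range mentioned above, and 4M out of 700M is where the roughly 0.6% figure comes from.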
