omarkamali posted an update 4 days ago
I just might have cracked tokenizer-free LLMs. No vocab, no softmax.

I'm training a 22M params LLM rn to test this "thing" and it's able to formulate coherent sentences 🤯

Bear in mind, this is a completely new, tokenizer-free LLM architecture with built-in language universality.

Check the explainer video to understand what's happening. Feedback welcome on this approach!

Cool! Any chance of sharing a repo so others can play around with it? I'd love to give it a try! 😀


That's planned! I'm just running a few more experiments before locking it in.

Will let you know first @unmodeled-tyler :)

Nicee.. How do you map from continuous outputs to readable text?


I added a decoding head to the LLM, so the MLP generates a latent word vector that gets decoded by a GRU into a valid word.

I'm using the same input representation and train a joint encoder-decoder, which gets further fine-tuned as part of the "Next Latent Prediction"(?) objective. It seems pretty decent for a first shot; still working out some of the kinks.
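For anyone curious what that could look like mechanically, here's a minimal NumPy sketch of a GRU greedily decoding a latent word vector into character ids. Everything here is hypothetical — the dimensions, weight names, character-level output head, and start id are illustrative and randomly initialised, standing in for the trained decoder described above (this is not the actual code).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not from the post.
LATENT, HIDDEN, N_CHARS, MAX_LEN = 32, 32, 64, 8

def mat(shape):
    """Small random weight matrix standing in for trained parameters."""
    return rng.standard_normal(shape) * 0.1

W_init = mat((LATENT, HIDDEN))          # latent word vector -> initial hidden state
E = mat((N_CHARS, HIDDEN))              # character embeddings
Wz, Uz = mat((HIDDEN, HIDDEN)), mat((HIDDEN, HIDDEN))
Wr, Ur = mat((HIDDEN, HIDDEN)), mat((HIDDEN, HIDDEN))
Wh, Uh = mat((HIDDEN, HIDDEN)), mat((HIDDEN, HIDDEN))
W_out = mat((HIDDEN, N_CHARS))          # tiny per-character head, no word vocab

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(x @ Wz + h @ Uz)        # update gate
    r = sigmoid(x @ Wr + h @ Ur)        # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde

def decode_word(latent):
    """Greedily decode one latent word vector into a character-id sequence."""
    h = np.tanh(latent @ W_init)        # the latent vector seeds the hidden state
    char = 0                            # hypothetical start-of-word id
    out = []
    for _ in range(MAX_LEN):
        h = gru_step(E[char], h)
        char = int(np.argmax(h @ W_out))
        out.append(char)
    return out

chars = decode_word(rng.standard_normal(LATENT))
print(len(chars))  # 8
```

The key property this illustrates: the expensive softmax over a full subword vocabulary is replaced by a tiny head over characters, and the word identity lives entirely in the continuous latent.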


Quick update, it seems to mostly work as intended 🤯

More details here:
https://x.com/OmarKamali/status/2036932984226320748

You just killed 23 dyslexic people (and counting) with that video, be ca use of the we ird wo rd split ting. hahaha

Jokes aside, this looks absolutely amazing, but I think tokenizers are there because this might not work fast enough at scale. I'd be excited and extremely happy to be proven wrong, because the concept is certainly great.


I knowww. Need to fix the video pipeline lol

Thanks @alfredo-ottomate! In principle, it should be faster than a conventional LLM at the same scale while also using less VRAM, mostly because it removes the softmax over the vocabulary, one of the more expensive operations in standard language models. It also removes the embedding table, which usually accounts for roughly 10-20% of the parameters. In Qwen 3.5 4B, for example, that's about 700M embedding parameters eliminated.
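A quick back-of-envelope check on that parameter share. The vocab and hidden sizes below are illustrative round numbers, not any model's real config:

```python
# Hypothetical figures for a ~4B-parameter model.
vocab_size = 150_000            # illustrative subword vocab size
hidden_size = 4_608             # illustrative model width
total_params = 4_000_000_000

# Embedding table: one hidden-size vector per vocab entry.
emb = vocab_size * hidden_size
print(f"embedding params: {emb / 1e6:.0f}M")           # ~691M
print(f"share of model:   {emb / total_params:.1%}")   # ~17.3%
```

With numbers in that ballpark, the embedding table alone lands squarely in the 10-20% range quoted above.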

Raw performance-wise, I expect around a ~10% per-token generation speed-up, ~10% less VRAM usage, and better use of the context window, since each token is a full word rather than a subword piece.

The question then is how many parameters my replacement mechanism will ultimately need to stay competitive. The approach is already working surprisingly well at around 4M parameters, roughly 0.6% of the ~700M embedding parameters it would replace in a 4B model. Even if that number grows, the efficiency upside still looks very promising.
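The ratio behind that 0.6% figure, using the two rounded counts from this thread:

```python
# Illustrative comparison: the ~4M-parameter decoding head vs. the
# ~700M embedding parameters it would replace (rounded figures).
decoder_params = 4_000_000
embedding_params = 700_000_000
print(f"{decoder_params / embedding_params:.1%}")  # 0.6%
```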

Fingers crossed! ✌︎