Oh, I see. Your research is interesting, nice work! Keep going 😀 🤗
Oh nice! Good work
You're welcome. If you haven't already, you can review my master notes in the dataset repo card, https://huggingface.co/datasets/Ujjwal-Tyagi/ai-ml-foundations-book-collection#my-master-notes-and-main-concept-understanding-after-i-read-those-books
It looks interesting, but is there any implementation plan, or any results from implementing it? In simple terms, could you please explain what it is for and how we can implement it?
Ujjwal-Tyagi/ai-ml-foundations-book-collection
Oh, I see, thanks!
Interesting, so why don't you write a research paper? I'd love to see the training recipe, configuration, and setup.
glad to see them
Interesting
Glad to see
What precision are you training the model in, BF16 or FP8?
I think the most likely reason you're seeing R1 hit 100% constantly is some training/validation overlap. When you moved from the 200k set to the 500k set, a portion of the validation samples probably ended up inside the training pool, so the model is essentially seeing the answers beforehand and memorizing them.
The best fix would be to rebuild the validation split from a completely separate dataset (or at least re-split the full dataset with strict deduplication so no caption/image pairs appear in both sets). Once the validation set is clean and never seen during training, the recall numbers should drop to something more realistic and you'll get a proper measure of generalization.
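To make the "strict deduplication" idea concrete, here's a minimal pure-Python sketch of one way to do it (the function name and the `(image_id, caption)` pair format are my own assumptions, not from your setup): hash each pair, drop exact duplicates, then split, so no identical pair can land in both train and validation.

```python
import hashlib
import random

def dedup_split(pairs, val_fraction=0.5, seed=42):
    """Split (image_id, caption) pairs into train/val with no overlap.

    Deduplicates by a hash of the full pair first, so identical
    caption/image pairs can never appear in both splits.
    """
    seen = set()
    unique = []
    for image_id, caption in pairs:
        key = hashlib.sha256(f"{image_id}\x00{caption}".encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((image_id, caption))

    # Shuffle deterministically, then carve off the validation slice.
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_val = int(len(unique) * val_fraction)
    return unique[n_val:], unique[:n_val]  # (train, val)

# Toy data with one exact duplicate pair.
pairs = [("img1", "a cat"), ("img1", "a cat"),
         ("img2", "a dog"), ("img3", "a bird")]
train, val = dedup_split(pairs, val_fraction=0.5)
assert not set(train) & set(val)  # splits are disjoint
```

A stricter variant would hash only `image_id`, so different captions of the same image also can't straddle the split; which one you want depends on how your captions were generated.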
This has happened to me many times. When the model starts memorizing the training data, I use these common methods:
Dropout: randomly turns off some neurons during training so the model doesn’t rely on the same paths and just memorize the data.
Weight decay (L2 regularization): slightly penalizes big weights so the model learns simpler patterns instead of fitting exact samples.
Data augmentation: adds small variations to the data (image crops, jitter, caption noise) so the model sees slightly different versions instead of the exact same inputs.
Label smoothing: stops the model from being overly confident about the “correct” answer, which helps reduce memorization.
Early stopping: you stop training once validation stops improving so the model doesn’t keep training and start memorizing.
Hard negative mining: give the model harder wrong examples so it actually learns the differences instead of just remembering pairs.
Wow, amazing
Which hardware are you using to train that model? And if you ever release the distilled data from the 5 BERT teacher models, that would also be really helpful.
That's great! Keep doing the work :)
Where is that model?
Oh wow
Interesting
That's good to hear, but all of those groups are doing large-scale distillation of both open- and closed-source models, so it's very common. Still, these Chinese models have too much censorship and are full of propaganda, so they aren't worth using as a base model, but they are good for distillation anyway ;)