Releasing 8 multilingual datasets from the People's Archive of Rural India (PAARI). Indian languages represent 1B+ speakers but remain underrepresented in quality training data. These datasets help address that gap.

Languages: Hindi, Urdu, Punjabi, Tamil, Telugu, Marathi, Gujarati, English
Scripts: Devanagari, Arabic, Gurmukhi, Tamil, Telugu, Gujarati
Total: 7,650 articles, 19.9M tokens, 51MB

Content covers rural life, agriculture, social issues, and cultural traditions. Professionally written journalism, not web scrapes. Free to use.

Collection: https://huggingface.co/collections/keplersystems/paari-datasets
Technical details: https://kepler.systems/blog/introducing-paari-datasets
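A quick sketch of loading one of these with the `datasets` library. The per-language repo id and the column name below are assumptions, so check the collection page for the actual names:

```python
from datasets import load_dataset

# Hypothetical repo id -- see the collection link above for the real dataset names.
ds = load_dataset("keplersystems/paari-hindi", split="train")

print(ds)                    # inspect size and columns
print(ds.column_names)       # confirm the text field before indexing into it
print(ds[0]["text"][:500])   # assumes a "text" column holds the article body
```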
Liquid just released two VLMs, at 450M and 1.6B params!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, along with inference and fine-tuning Colab notebooks.
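A hedged inference sketch using the generic transformers image-text-to-text API. The model id and the exact preprocessing calls are assumptions; refer to the model cards and Colab notebooks for the canonical usage:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "LiquidAI/LFM2-VL-450M"  # assumed repo id -- check the release page
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
# Build the prompt string, then preprocess text + image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```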
Open source models are immutable, and that's a big pain.
When you open source a piece of software, users leave feedback via issues or PRs. You can merge that feedback in near real time, which creates a positive cycle. Then you have a community.
LLMs don't have these nice micro-steps. There are no hotfixes. Even a minor version bump is an endeavor. I'm quite confident my model is being used by teams somewhere, but until the next launch, it's awfully quiet.
I don't know the solution. Just a regular lament before the weekend.
Reacted to nataliaElv's post:
Would you like to get a high-quality dataset to pre-train LLMs in your language?
At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.
Follow the link below, check if your language is listed and sign up to be a Language Lead!
KTO offers an easier way to preference-train LLMs: only binary 👍/👎 ratings are required. As part of #DataIsBetterTogether, I've written a tutorial on creating a preference dataset using Argilla and Spaces.
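For context, a KTO-style dataset is just single completions with a binary label, rather than the chosen/rejected pairs DPO needs. A minimal sketch of that format (the example rows are made up):

```python
from datasets import Dataset

# Each row: one prompt, one completion, and a binary desirability label.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris.", "label": True},   # 👍
    {"prompt": "What is the capital of France?", "completion": "Berlin.", "label": False},  # 👎
]
kto_dataset = Dataset.from_list(records)
```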
Using this approach, you can create a dataset that anyone with a Hugging Face account can contribute to.
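A hedged sketch of what setting up such a feedback dataset can look like with the Argilla SDK. This uses the current 2.x Python API, which may differ from the tutorial's version; the Space URL and API key are placeholders:

```python
import argilla as rg

# Connect to an Argilla instance running on a Hugging Face Space (placeholder values).
client = rg.Argilla(api_url="https://<your-argilla-space>.hf.space", api_key="<api-key>")

# Show annotators a prompt + response, and ask for a single 👍/👎 rating.
settings = rg.Settings(
    fields=[rg.TextField(name="prompt"), rg.TextField(name="response")],
    questions=[rg.LabelQuestion(name="rating", labels=["👍", "👎"])],
)
dataset = rg.Dataset(name="model-feedback", settings=settings, client=client)
dataset.create()
```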
New tutorial covers:
- Generating responses with open models
- Collecting human feedback (do you like this model response? Yes/No)
- Preparing a TRL-compatible dataset for training aligned models
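Once the feedback is exported into the prompt/completion/label format sketched above, the training step with TRL is short. A minimal sketch for recent TRL versions; the model name and hyperparameters are placeholders, not the tutorial's actual choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = KTOConfig(output_dir="kto-model", per_device_train_batch_size=2)
trainer = KTOTrainer(
    model=model,
    args=training_args,
    train_dataset=kto_dataset,   # the prompt/completion/label dataset from above
    processing_class=tokenizer,
)
trainer.train()
```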