Releasing 8 multilingual datasets from the People's Archive of Rural India (PAARI). Indian languages represent 1B+ speakers but remain underrepresented in quality training data. These datasets help address that gap.

Languages: Hindi, Urdu, Punjabi, Tamil, Telugu, Marathi, Gujarati, English
Scripts: Devanagari, Arabic, Gurmukhi, Tamil, Telugu, Gujarati
Total: 7,650 articles, 19.9M tokens, 51MB

Content covers rural life, agriculture, social issues, and cultural traditions. Professionally written journalism, not web scrapes. Free to use.

Collection: https://huggingface.co/collections/keplersystems/paari-datasets
Technical details: https://kepler.systems/blog/introducing-paari-datasets
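A quick sketch of loading one of these with the `datasets` library. The per-language repo id and the column name below are assumptions, so check the collection page for the actual names:

```python
from datasets import load_dataset

# Hypothetical repo id -- see the collection link above for the real dataset names.
ds = load_dataset("keplersystems/paari-hindi", split="train")

print(ds)                    # inspect size and columns
print(ds.column_names)       # confirm the text field before indexing into it
print(ds[0]["text"][:500])   # assumes a "text" column holds the article body
```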
Liquid just released two VLMs, at 450M and 1.6B params!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion, making them ideal for on-device deployment in constrained environments like phones.
They're available today on Hugging Face, along with inference and fine-tuning Colab notebooks.
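A hedged inference sketch using the generic transformers image-text-to-text API. The model id and the exact preprocessing calls are assumptions; refer to the model cards and Colab notebooks for the canonical usage:

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "LiquidAI/LFM2-VL-450M"  # assumed repo id -- check the release page
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
# Build the prompt string, then preprocess text + image together.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0], skip_special_tokens=True))
```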
Open source models are immutable, and that's a big pain.
When you open source a piece of software, users leave feedback via issues or PRs. You can merge that feedback in near real time, which creates a positive cycle. Then you have a community.
LLMs don't have these nice micro-steps. There are no hotfixes. Even a minor version bump is an endeavor. I'm quite confident my model is being used by teams somewhere, but until the next launch, it's awfully quiet.
I don't know the solution. Just a regular lament before the weekend.
Reacted to nataliaElv's post:
Would you like to get a high-quality dataset to pre-train LLMs in your language?
At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.
Follow the link below, check if your language is listed and sign up to be a Language Lead!
KTO offers an easier way to preference-train LLMs: only binary 👍/👎 ratings are required. As part of #DataIsBetterTogether, I've written a tutorial on creating a preference dataset using Argilla and Spaces.
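For context, a KTO-style dataset is just single completions with a binary label, rather than the chosen/rejected pairs DPO needs. A minimal sketch of that format (the example rows are made up):

```python
from datasets import Dataset

# Each row: one prompt, one completion, and a binary desirability label.
records = [
    {"prompt": "What is the capital of France?", "completion": "Paris.", "label": True},   # 👍
    {"prompt": "What is the capital of France?", "completion": "Berlin.", "label": False},  # 👎
]
kto_dataset = Dataset.from_list(records)
```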
Using this approach, you can create a dataset that anyone with a Hugging Face account can contribute to.
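A hedged sketch of what setting up such a feedback dataset can look like with the Argilla SDK. This uses the current 2.x Python API, which may differ from the tutorial's version; the Space URL and API key are placeholders:

```python
import argilla as rg

# Connect to an Argilla instance running on a Hugging Face Space (placeholder values).
client = rg.Argilla(api_url="https://<your-argilla-space>.hf.space", api_key="<api-key>")

# Show annotators a prompt + response, and ask for a single 👍/👎 rating.
settings = rg.Settings(
    fields=[rg.TextField(name="prompt"), rg.TextField(name="response")],
    questions=[rg.LabelQuestion(name="rating", labels=["👍", "👎"])],
)
dataset = rg.Dataset(name="model-feedback", settings=settings, client=client)
dataset.create()
```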
New tutorial covers:
- Generating responses with open models
- Collecting human feedback (do you like this model response? Yes/No)
- Preparing a TRL-compatible dataset for training aligned models
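Once the feedback is exported into the prompt/completion/label format sketched above, the training step with TRL is short. A minimal sketch for recent TRL versions; the model name and hyperparameters are placeholders, not the tutorial's actual choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

training_args = KTOConfig(output_dir="kto-model", per_device_train_batch_size=2)
trainer = KTOTrainer(
    model=model,
    args=training_args,
    train_dataset=kto_dataset,   # the prompt/completion/label dataset from above
    processing_class=tokenizer,
)
trainer.train()
```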