Keyword Extraction from Short Texts with T5
Our vlT5 model is a keyword generation model based on encoder-decoder architecture using Transformer blocks presented by Google (https://huggingface.co/t5-base). The vlT5 was trained on scientific articles corpus to predict a given set of keyphrases based on the concatenation of the article’s abstract and title. It generates precise, yet not always complete keyphrases that describe the content of the article based only on the abstract.
Keywords generated with vlT5-base-keywords: encoder-decoder architecture, keyword generation
Results on demo model (different generation method, one model per language):
Our vlT5 model is a keyword generation model based on encoder-decoder architecture using Transformer blocks presented by Google (https://huggingface.co/t5-base). The vlT5 was trained on scientific articles corpus to predict a given set of keyphrases based on the concatenation of the article’s abstract and title. It generates precise, yet not always complete keyphrases that describe the content of the article based only on the abstract.
Keywords generated with vlT5-base-keywords: encoder-decoder architecture, vlT5, keyword generation, scientific articles corpus
vlT5
The biggest advantage of the vlT5 model is its transferability: it works well across domains and types of text. The downside is that the input length and the number of generated keywords mirror the training data: a text of abstract length yields approximately 3 to 5 keywords. The model works both extractively and abstractively. Longer texts must be split into smaller chunks, each of which is then passed to the model separately.
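The chunking step mentioned above could look roughly like the following. This is a minimal sketch, not the authors' preprocessing: the helper name `split_into_chunks`, the regex-based sentence splitting, and the default of 150 words per chunk are all assumptions chosen only to approximate "abstract-length" input.

```python
import re


def split_into_chunks(text: str, max_words: int = 150) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Hypothetical helper: the 150-word default is an assumption, not a value
    taken from the model card. A single sentence longer than max_words is
    kept as its own chunk rather than being cut mid-sentence.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be prefixed with the task prefix and passed to the model on its own, as in the Usage section.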
Overview
- Language model: t5-base
- Language: pl, en (but works relatively well with others)
- Training data: POSMAC
- Online Demo: Visit our online demo for better results https://nlp-demo-1.voicelab.ai/
- Paper: Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer, ACIIDS 2022
Corpus
The model was trained on the POSMAC corpus. The Polish Open Science Metadata Corpus (POSMAC) is a collection of 216,214 abstracts of scientific publications compiled in the CURLICAT project.
| Domains | Documents | With keywords |
|---|---|---|
| Engineering and technical sciences | 58 974 | 57 165 |
| Social sciences | 58 166 | 41 799 |
| Agricultural sciences | 29 811 | 15 492 |
| Humanities | 22 755 | 11 497 |
| Exact and natural sciences | 13 579 | 9 185 |
| Humanities, Social sciences | 12 809 | 7 063 |
| Medical and health sciences | 6 030 | 3 913 |
| Medical and health sciences, Social sciences | 828 | 571 |
| Humanities, Medical and health sciences, Social sciences | 601 | 455 |
| Engineering and technical sciences, Humanities | 312 | 312 |
Tokenizer
As in the original plT5 implementation, the training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
Usage
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

# Every input must start with this task prefix
task_prefix = "Keywords: "
inputs = [
    "Christina Katrakis, who spoke to the BBC from Vorokhta in western Ukraine, relays the account of one family, who say Russian soldiers shot at their vehicles while they were leaving their village near Chernobyl in northern Ukraine. She says the cars had white flags and signs saying they were carrying children.",
    "Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.",
    "Hello, I'd like to order a pizza with salami topping.",
]

for sample in inputs:
    input_sequences = [task_prefix + sample]
    # Tokenize, truncating inputs that exceed the model's maximum length
    input_ids = tokenizer(
        input_sequences, return_tensors="pt", truncation=True
    ).input_ids
    # Beam search with the settings recommended in the Inference section
    output = model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
    predicted = tokenizer.decode(output[0], skip_special_tokens=True)
    print(sample, "\n --->", predicted)
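The decoded prediction is a single string. Assuming the keywords are separated by commas, as in the example outputs shown earlier in this card, a small post-processing step can turn it into a clean list. The helper name `parse_keywords` and the case-insensitive de-duplication are illustrative assumptions, not part of the released code.

```python
def parse_keywords(predicted: str) -> list[str]:
    """Split a comma-separated prediction into a de-duplicated keyword list.

    Hypothetical helper: assumes the model output uses ", " as a separator,
    which matches the examples in this card but is not guaranteed.
    """
    seen = set()
    keywords = []
    for part in predicted.split(","):
        keyword = part.strip()
        key = keyword.lower()  # compare case-insensitively
        if keyword and key not in seen:
            seen.add(key)
            keywords.append(keyword)  # keep the first spelling encountered
    return keywords
```

When a long text has been split into chunks, the per-chunk predictions can be joined with commas and passed through the same helper to merge them while preserving order.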
Inference
In our experiments, the best generation results were achieved with beam search using num_beams=4 and no_repeat_ngram_size=3.
Results
| Method | Rank | Micro P | Micro R | Micro F1 | Macro P | Macro R | Macro F1 |
|---|---|---|---|---|---|---|---|
| extremeText | 1 | 0.175 | 0.038 | 0.063 | 0.007 | 0.004 | 0.005 |
| extremeText | 3 | 0.117 | 0.077 | 0.093 | 0.011 | 0.011 | 0.011 |
| extremeText | 5 | 0.090 | 0.099 | 0.094 | 0.013 | 0.016 | 0.015 |
| extremeText | 10 | 0.060 | 0.131 | 0.082 | 0.015 | 0.025 | 0.019 |
| vlT5kw | 1 | 0.345 | 0.076 | 0.124 | 0.054 | 0.047 | 0.050 |
| vlT5kw | 3 | 0.328 | 0.212 | 0.257 | 0.133 | 0.127 | 0.129 |
| vlT5kw | 5 | 0.318 | 0.237 | 0.271 | 0.143 | 0.140 | 0.141 |
| KeyBERT | 1 | 0.030 | 0.007 | 0.011 | 0.004 | 0.003 | 0.003 |
| KeyBERT | 3 | 0.015 | 0.010 | 0.012 | 0.006 | 0.004 | 0.005 |
| KeyBERT | 5 | 0.011 | 0.012 | 0.011 | 0.006 | 0.005 | 0.005 |
| TermoPL | 1 | 0.118 | 0.026 | 0.043 | 0.004 | 0.003 | 0.003 |
| TermoPL | 3 | 0.070 | 0.046 | 0.056 | 0.006 | 0.005 | 0.006 |
| TermoPL | 5 | 0.051 | 0.056 | 0.053 | 0.007 | 0.007 | 0.007 |
| TermoPL | all | 0.025 | 0.339 | 0.047 | 0.017 | 0.030 | 0.022 |
| extremeText | 1 | 0.210 | 0.077 | 0.112 | 0.037 | 0.017 | 0.023 |
| extremeText | 3 | 0.139 | 0.152 | 0.145 | 0.045 | 0.042 | 0.043 |
| extremeText | 5 | 0.107 | 0.196 | 0.139 | 0.049 | 0.063 | 0.055 |
| extremeText | 10 | 0.072 | 0.262 | 0.112 | 0.041 | 0.098 | 0.058 |
| vlT5kw | 1 | 0.377 | 0.138 | 0.202 | 0.119 | 0.071 | 0.089 |
| vlT5kw | 3 | 0.361 | 0.301 | 0.328 | 0.185 | 0.147 | 0.164 |
| vlT5kw | 5 | 0.357 | 0.316 | 0.335 | 0.188 | 0.153 | 0.169 |
| KeyBERT | 1 | 0.018 | 0.007 | 0.010 | 0.003 | 0.001 | 0.001 |
| KeyBERT | 3 | 0.009 | 0.010 | 0.009 | 0.004 | 0.001 | 0.002 |
| KeyBERT | 5 | 0.007 | 0.012 | 0.009 | 0.004 | 0.001 | 0.002 |
| TermoPL | 1 | 0.076 | 0.028 | 0.041 | 0.002 | 0.001 | 0.001 |
| TermoPL | 3 | 0.046 | 0.051 | 0.048 | 0.003 | 0.001 | 0.002 |
| TermoPL | 5 | 0.033 | 0.061 | 0.043 | 0.003 | 0.001 | 0.002 |
| TermoPL | all | 0.021 | 0.457 | 0.040 | 0.004 | 0.008 | 0.005 |
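For readers unfamiliar with the micro-averaged columns, the sketch below shows one common way to compute precision, recall and F1 at a given rank: counts are pooled over all documents before dividing. It is only an illustration under stated assumptions. The function name `micro_prf_at_k`, the exact-string matching, and the toy data are hypothetical, and the evaluation protocol used for the table may differ (for example in how keywords are normalised).

```python
def micro_prf_at_k(
    gold: list[set[str]], predicted: list[list[str]], k: int
) -> tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over the top-k predictions.

    gold[i] is the set of reference keywords for document i and
    predicted[i] is the ranked list of predictions (assumed free of
    duplicates). Matching is by exact string equality, an assumption.
    """
    true_pos = n_predicted = n_gold = 0
    for gold_set, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        true_pos += sum(1 for keyword in top_k if keyword in gold_set)
        n_predicted += len(top_k)
        n_gold += len(gold_set)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, not taken from POSMAC
toy_gold = [{"a", "b", "c"}, {"d"}]
toy_pred = [["a", "x", "b"], ["d", "e"]]
print(micro_prf_at_k(toy_gold, toy_pred, k=1))
print(micro_prf_at_k(toy_gold, toy_pred, k=3))
```

Because the predicted count grows with the rank while the gold count stays fixed, precision tends to fall and recall to rise as the rank increases, which is the pattern visible in the table.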
License
CC BY 4.0
Citation
If you use this model, please cite the following paper: Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Żarnecki, F., Nitoń, B., Ogrodniczuk, M. (2023). Transferable Keyword Extraction and Generation with Text-to-Text Language Models. In: Mikyška, J., de Mulatier, C., Paszynski, M., Krzhizhanovskaya, V.V., Dongarra, J.J., Sloot, P.M. (eds) Computational Science – ICCS 2023. ICCS 2023. Lecture Notes in Computer Science, vol 14074. Springer, Cham. https://doi.org/10.1007/978-3-031-36021-3_42
Authors
The model was trained by the NLP Research Team at Voicelab.ai.
You can contact us here.