Dataset Research Field Classifier
A fine-tuned static embedding model for classifying scientific datasets into research fields. It maps datasets to the 4,516 topics in the OpenAlex taxonomy, along with their hierarchical subfield, field, and domain classifications. It was developed as part of our NIH S-index Challenge Phase 2 proposal. See the S-index Hub for more information about our S-index and the Challenge.
Model Description
This model is fine-tuned from minishlab/potion-base-32m on ground-truth topic classifications aligned with the OpenAlex topics taxonomy. It uses Model2Vec's static embedding approach for fast, efficient inference without requiring a GPU.
Topic Hierarchy
The classifier uses a 4-level hierarchical classification system based on the OpenAlex topics taxonomy:
- 4 Domains: Physical Sciences, Life Sciences, Social Sciences, Health Sciences
- ~26 Fields: Chemistry, Physics, Medicine, Computer Science, etc.
- ~250 Subfields: More specific research areas
- 4,516 Topics: Granular research topics
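To illustrate how the four levels nest, the sketch below stores each topic together with its parent subfield, field, and domain, so that a single topic prediction resolves to all four levels. The `TopicPath` class, the `resolve` helper, and the two example entries are hypothetical placeholders; real topic and subfield names come from the OpenAlex topics taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicPath:
    """One leaf topic together with its three ancestor levels."""
    topic: str
    subfield: str
    field: str
    domain: str


# Hypothetical entries for illustration only; real names and IDs come from
# the OpenAlex topics taxonomy.
TAXONOMY = {
    "example-topic-a": TopicPath(
        "Example Topic A", "Example Subfield X", "Computer Science", "Physical Sciences"
    ),
    "example-topic-b": TopicPath(
        "Example Topic B", "Example Subfield Y", "Medicine", "Health Sciences"
    ),
}


def resolve(topic_id: str) -> TopicPath:
    """Return all four hierarchy levels for a predicted topic ID."""
    return TAXONOMY[topic_id]


path = resolve("example-topic-b")
print(" > ".join([path.domain, path.field, path.subfield, path.topic]))
```

Keeping the ancestors on the leaf means the classifier only has to predict a topic; the coarser levels follow from a lookup.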
Performance
Evaluated on a held-out test set (1,525 samples):
| Level | Accuracy |
|---|---|
| Domain | 92.6% |
| Field | 85.8% |
| Subfield | 73.6% |
| Topic (exact) | 62.6% |
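One way to read these figures is as accuracy after rolling each predicted topic up to its ancestors: a wrong topic can still count as correct at the subfield, field, or domain level. The sketch below shows that computation on made-up labels. The tuple layout, the `accuracy_by_level` helper, and the sample predictions are assumptions, and the evaluation script in the repository may differ.

```python
from typing import Dict, List, Tuple

# Each label is (domain, field, subfield, topic); the values below are made up.
Label = Tuple[str, str, str, str]
LEVELS = ("domain", "field", "subfield", "topic")


def accuracy_by_level(y_true: List[Label], y_pred: List[Label]) -> Dict[str, float]:
    """Fraction of samples whose prediction matches the truth at each level."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    scores = {}
    for i, level in enumerate(LEVELS):
        # Compare the full prefix, so a subfield only counts as correct
        # when its field and domain are correct as well.
        hits = sum(t[: i + 1] == p[: i + 1] for t, p in zip(y_true, y_pred))
        scores[level] = hits / len(y_true)
    return scores


truth = [
    ("D1", "F1", "S1", "T1"),
    ("D1", "F1", "S2", "T3"),
    ("D2", "F4", "S7", "T9"),
    ("D2", "F5", "S8", "T10"),
]
preds = [
    ("D1", "F1", "S1", "T1"),  # correct at every level
    ("D1", "F1", "S2", "T4"),  # wrong topic, right subfield
    ("D2", "F4", "S6", "T8"),  # wrong subfield, right field
    ("D1", "F2", "S3", "T5"),  # wrong at every level
]
print(accuracy_by_level(truth, preds))
```

Under this definition accuracy can only stay flat or fall from domain to topic, which matches the pattern in the table.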
Comparison with Base Model
| Model | Domain | Field | Subfield | Topic |
|---|---|---|---|---|
| Base (potion-base-32m) | 77.2% | 60.5% | 27.9% | 16.2% |
| Fine-tuned | 92.6% | 85.8% | 73.6% | 62.6% |
| Improvement (percentage points) | +15.4 | +25.3 | +45.7 | +46.4 |
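The improvement row is the difference in percentage points between the two rows above it. The sketch below recomputes it from the table values and also derives the share of the base model's errors that fine-tuning removes. That second figure is a derived quantity, not one reported in the table, and the helper names are hypothetical.

```python
# Accuracy values copied from the comparison table, in percent.
base = {"Domain": 77.2, "Field": 60.5, "Subfield": 27.9, "Topic": 16.2}
tuned = {"Domain": 92.6, "Field": 85.8, "Subfield": 73.6, "Topic": 62.6}


def gains(base, tuned):
    """Absolute gain per level in percentage points, rounded to one decimal."""
    return {level: round(tuned[level] - base[level], 1) for level in base}


def error_reduction(base, tuned):
    """Share of the base model's errors removed by fine-tuning, in percent."""
    return {
        level: round(100 * (tuned[level] - base[level]) / (100 - base[level]), 1)
        for level in base
    }


print(gains(base, tuned))
print(error_reduction(base, tuned))
```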
Usage
```python
from model2vec import StaticModel
import numpy as np

# Load model
model = StaticModel.from_pretrained("jimnoneill/dataset-to-field")

# Prepare your text
text = "Machine learning approaches for protein structure prediction using deep neural networks"

# Get embedding
embedding = model.encode([text])

# For full classification pipeline, see:
# https://github.com/data-S-index/dataset-to-field
```
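The snippet above stops at the embedding. One plausible way to turn an embedding into a topic, sketched below, is nearest-neighbour search by cosine similarity against one reference vector per topic. This is an assumption for illustration and not necessarily what the linked repository does. `top_k_topics`, the topic names, and the toy 3-dimensional vectors are all hypothetical. In practice `query` would be `model.encode([text])[0]` and `topic_vectors` would hold one embedding per topic.

```python
import numpy as np


def top_k_topics(query, topic_vectors, topic_names, k=3):
    """Rank topics by cosine similarity to a single query embedding."""
    query = np.asarray(query, dtype=float)
    topic_vectors = np.asarray(topic_vectors, dtype=float)
    # Normalise both sides so the dot product equals cosine similarity.
    q = query / (np.linalg.norm(query) + 1e-12)
    t = topic_vectors / (np.linalg.norm(topic_vectors, axis=1, keepdims=True) + 1e-12)
    sims = t @ q
    # Highest similarity first, truncated to the k best matches.
    order = np.argsort(-sims)[:k]
    return [(topic_names[i], float(sims[i])) for i in order]


# Toy stand-ins for real embeddings (hypothetical values).
names = ["topic-a", "topic-b", "topic-c"]
vectors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])

for name, score in top_k_topics(query, vectors, names, k=2):
    print(f"{name}: {score:.3f}")
```

Returning the top few candidates with their scores, instead of a single label, makes it easier to spot ambiguous records.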
Training Details
- Base Model: minishlab/potion-base-32m
- Training Samples: 8,636
- Test Samples: 1,525
- Topics Covered: 1,135 / 4,516 (25%)
- Training Time: ~2 minutes on RTX 4090
- Framework: Model2Vec + PyTorch Lightning
Training Data
The model was trained on ground-truth topic classifications derived from the OpenAlex topics taxonomy. The training data consists of scientific records whose titles, subjects, and descriptions are mapped to specific topics.
Dataset: jimnoneill/dataset-to-field-training-10k
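Because the training records combine titles, subjects, and descriptions, a metadata record has to be flattened into one string before it is encoded. The sketch below shows one way to do that. The `record_to_text` helper, the field order, the separator, and the example record are assumptions; the exact format used to build the training set is not specified here.

```python
from typing import Iterable, Optional


def record_to_text(
    title: Optional[str],
    subjects: Optional[Iterable[str]] = None,
    description: Optional[str] = None,
    sep: str = ". ",
) -> str:
    """Join the non-empty metadata fields into a single input string."""
    parts = []
    if title and title.strip():
        parts.append(title.strip())
    cleaned = [s.strip() for s in (subjects or []) if s and s.strip()]
    if cleaned:
        parts.append(", ".join(cleaned))
    if description and description.strip():
        parts.append(description.strip())
    # Strip trailing periods so the separator does not produce "..".
    return sep.join(p.rstrip(".") for p in parts)


# Hypothetical record for illustration.
text = record_to_text(
    "Soil moisture measurements",
    ["hydrology", "remote sensing"],
    "Daily readings from field sensors.",
)
print(text)
```

Skipping empty fields keeps sparse records, such as those with only a title, from producing stray separators in the model input.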
Intended Use
- Classifying scientific publications, datasets, and research outputs into research fields
- Mapping DataCite records to research topics
- Bibliometric analysis and research trend identification
- Integration with the S-Index scientific indexing pipeline
Limitations
- Training data covers only 1,135 of the 4,516 topics (~25%); topics that are rare or absent in training are likely to be classified less accurately
- Domain distribution is skewed toward Physical Sciences (53%)
- Best suited for English-language scientific content
Citation
If you use this model, please cite:
@software{dataset-to-field,
author = {O'Neill, James and Patel, Bhavesh},
title = {Dataset Research Field Classifier},
year = {2026},
url = {https://github.com/data-S-index/dataset-to-field}
}
License
MIT
Acknowledgments
- OpenAlex for the topic taxonomy
- minishlab/potion-base-32m base embedding model
- Model2Vec for model distillation and training