Dataset Research Field Classifier

A fine-tuned static embedding model for classifying scientific datasets into research fields. It maps datasets to the 4,516 topics in the OpenAlex taxonomy, along with their hierarchical subfield, field, and domain classifications. The model was developed as part of our NIH S-index Challenge Phase 2 proposal. See the S-index Hub for more information about our S-index and the Challenge.

Model Description

This model is fine-tuned from minishlab/potion-base-32m on ground truth topic classifications aligned with the OpenAlex topics taxonomy. It uses Model2Vec's static embedding approach for fast, efficient inference without requiring a GPU.

Topic Hierarchy

The classifier uses a 4-level hierarchical classification system based on the OpenAlex topics taxonomy:

  • 4 Domains: Physical Sciences, Life Sciences, Social Sciences, Health Sciences
  • ~26 Fields: Chemistry, Physics, Medicine, Computer Science, etc.
  • ~250 Subfields: More specific research areas
  • 4,516 Topics: Granular research topics
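One way to picture the four levels is as a chain of labels running from a domain down to a single topic. The sketch below is illustrative only: the `TopicPath` class and the example entry are hypothetical and are not taken from the model's code or from the OpenAlex taxonomy files.

```python
from dataclasses import dataclass

# The four levels, ordered from coarsest to finest.
LEVELS = ("domain", "field", "subfield", "topic")


@dataclass(frozen=True)
class TopicPath:
    """One leaf topic together with its three ancestor labels (hypothetical helper)."""
    domain: str
    field: str
    subfield: str
    topic: str

    def at_level(self, level: str) -> str:
        """Return the label stored at the requested level, e.g. 'field'."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        return getattr(self, level)


# Placeholder entry for illustration; the subfield and topic names are made up.
example = TopicPath(
    domain="Physical Sciences",
    field="Computer Science",
    subfield="Example Subfield",
    topic="Example Topic",
)

labels = [example.at_level(level) for level in LEVELS]
```

Because every topic has exactly one parent at each level, predicting a topic also fixes a subfield, field, and domain.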

Performance

Evaluated on a held-out test set (1,525 samples):

Level            Accuracy
Domain           92.6%
Field            85.8%
Subfield         73.6%
Topic (exact)    62.6%
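Per-level accuracy of this kind can be computed by comparing the predicted and expected label at each level separately. The following is a minimal sketch, not the evaluation script used for the numbers above; `level_accuracy` and the toy labels are hypothetical, and it assumes a match at a given level only requires that level's label to agree.

```python
LEVELS = ("domain", "field", "subfield", "topic")


def level_accuracy(predicted, expected):
    """Return the fraction of samples whose label matches at each level.

    Both arguments are equal-length lists of
    (domain, field, subfield, topic) tuples.
    """
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    scores = {}
    for index, level in enumerate(LEVELS):
        hits = sum(p[index] == e[index] for p, e in zip(predicted, expected))
        scores[level] = hits / len(expected)
    return scores


# Toy labels, made up for illustration.
expected = [
    ("D1", "F1", "S1", "T1"),
    ("D1", "F2", "S3", "T5"),
    ("D2", "F4", "S7", "T9"),
    ("D2", "F4", "S8", "T11"),
]
predicted = [
    ("D1", "F1", "S1", "T1"),   # exact topic match
    ("D1", "F2", "S3", "T6"),   # right subfield, wrong topic
    ("D2", "F4", "S8", "T10"),  # right field, wrong subfield
    ("D1", "F1", "S1", "T1"),   # wrong domain
]

scores = level_accuracy(predicted, expected)
```

Accuracy can only stay the same or fall when moving from domain to topic, since a correct topic implies correct ancestors.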

Comparison with Base Model

Model                      Domain   Field    Subfield   Topic
Base (potion-base-32m)     77.2%    60.5%    27.9%      16.2%
Fine-tuned                 92.6%    85.8%    73.6%      62.6%
Improvement (pct. points)  +15.4    +25.3    +45.7      +46.4
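The improvement row is an absolute difference in percentage points, not a relative percentage change. A short check of that arithmetic, using the figures from the comparison above (the variable names are only for illustration):

```python
# Accuracy in percent, copied from the comparison above.
base = {"domain": 77.2, "field": 60.5, "subfield": 27.9, "topic": 16.2}
fine_tuned = {"domain": 92.6, "field": 85.8, "subfield": 73.6, "topic": 62.6}

# Absolute gain in percentage points, rounded to one decimal place.
improvement = {
    level: round(fine_tuned[level] - base[level], 1) for level in base
}

# Ratio of fine-tuned to base accuracy, shown for contrast with the absolute gain.
ratio = {level: round(fine_tuned[level] / base[level], 2) for level in base}
```

The absolute gain grows toward the finer levels, where the base model starts from a much lower accuracy.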

Usage

from model2vec import StaticModel
import numpy as np

# Load model
model = StaticModel.from_pretrained("jimnoneill/dataset-to-field")

# Prepare your text
text = "Machine learning approaches for protein structure prediction using deep neural networks"

# Get embedding (a NumPy array with one row per input text)
embedding = model.encode([text])

# For full classification pipeline, see:
# https://github.com/data-S-index/dataset-to-field
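The full classification pipeline lives in the repository linked above and is not reproduced here. As a rough illustration of how an embedding can be turned into a label, the sketch below ranks candidate labels by cosine similarity to a query vector. This is an assumption about one common approach, not a description of the repository's method; `cosine`, `rank_labels`, and the tiny vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_labels(query_vec, label_vecs, top_k=3):
    """Return the top_k (label, similarity) pairs, most similar first."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in label_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional vectors standing in for real embeddings.
label_vecs = {
    "label_a": [1.0, 0.0, 0.0],
    "label_b": [0.0, 1.0, 0.0],
    "label_c": [0.7, 0.7, 0.0],
}
query_vec = [0.9, 0.1, 0.0]

ranking = rank_labels(query_vec, label_vecs, top_k=2)
```

With the real model, the query vector would be `model.encode([text])[0]`; where the label vectors come from (for example, encoded topic names or descriptions) is left open here.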

Training Details

  • Base Model: minishlab/potion-base-32m
  • Training Samples: 8,636
  • Test Samples: 1,525
  • Topics Covered: 1,135 / 4,516 (25%)
  • Training Time: ~2 minutes on an RTX 4090
  • Framework: Model2Vec + PyTorch Lightning
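The figures above imply a few derived quantities that help when judging coverage. The sketch below only restates the listed numbers; the per-topic average assumes every covered topic appears in the training split, which the list does not state.

```python
# Figures copied from the training details above.
train_samples = 8_636
test_samples = 1_525
topics_covered = 1_135
topics_total = 4_516

total_samples = train_samples + test_samples

# Share of all samples held out for testing.
test_share = test_samples / total_samples

# Share of the taxonomy's topics that have at least one sample.
coverage = topics_covered / topics_total

# Average training samples per covered topic (assumes all covered
# topics occur in the training split).
samples_per_covered_topic = train_samples / topics_covered
```

The split works out to roughly 85% training and 15% test, with fewer than eight training samples per covered topic on average.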

Training Data

The model was trained on ground truth topic classifications derived from the OpenAlex topics taxonomy. The training data includes scientific records with titles, subjects, and descriptions mapped to specific topics.

Dataset: jimnoneill/dataset-to-field-training-10k
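A record with titles, subjects, and descriptions has to be flattened into a single string before it can be embedded. The sketch below shows one way to do that for a DataCite-style record; the `record_to_text` helper, the separator, and the field order are assumptions, since the exact input format used for training is not stated here.

```python
def record_to_text(record):
    """Join the titles, subjects, and descriptions of a record into one string."""
    titles = [t.get("title", "") for t in record.get("titles", [])]
    subjects = [s.get("subject", "") for s in record.get("subjects", [])]
    descriptions = [d.get("description", "") for d in record.get("descriptions", [])]

    parts = titles + subjects + descriptions
    # Collapse runs of whitespace and drop empty fragments.
    cleaned = [" ".join(part.split()) for part in parts if part and part.strip()]
    return ". ".join(cleaned)


# Made-up record for illustration.
record = {
    "titles": [{"title": "Example dataset title"}],
    "subjects": [{"subject": "example subject"}, {"subject": ""}],
    "descriptions": [{"description": "  An example   description "}],
}

text = record_to_text(record)
```

The resulting string can be passed to `model.encode([text])` in the same way as the text in the usage example.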

Intended Use

  • Classifying scientific publications, datasets, and research outputs into research fields
  • Mapping DataCite records to research topics
  • Bibliometric analysis and research trend identification
  • Integration with the S-Index scientific indexing pipeline

Limitations

  • Training data covers only 1,135 of the 4,516 topics (~25%); accuracy may be lower for rare or uncovered topics
  • Domain distribution is skewed toward Physical Sciences (53%)
  • Best suited for English-language scientific content

Citation

If you use this model, please cite:

@software{dataset-to-field,
  author = {O'Neill, James and Patel, Bhavesh},
  title = {Dataset Research Field Classifier},
  year = {2026},
  url = {https://github.com/data-S-index/dataset-to-field}
}

License

MIT

Acknowledgments
