Dataset Research Field Classifier

A fine-tuned static embedding model for classifying scientific datasets into research fields. It maps datasets to the 4,516 topics in the OpenAlex taxonomy, along with their hierarchical subfield, field, and domain classifications. The model was developed as part of our NIH S-index Challenge Phase 2 proposal. See the S-index Hub for more information about our S-index and the Challenge.

Model Description

This model is fine-tuned from minishlab/potion-base-32m on ground truth topic classifications aligned with the OpenAlex topics taxonomy. It uses Model2Vec's static embedding approach for fast, efficient inference without requiring a GPU.

Topic Hierarchy

The classifier uses a 4-level hierarchical classification system based on the OpenAlex topics taxonomy:

4 Domains: Physical Sciences, Life Sciences, Social Sciences, Health Sciences
~26 Fields: Chemistry, Physics, Medicine, Computer Science, etc.
~250 Subfields: More specific research areas
4,516 Topics: Granular research topics
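
As a minimal sketch of how one prediction could be represented across the four levels, the snippet below stores a topic together with its ancestors. The class name `TopicPath` and the example labels are illustrative assumptions and are not taken from the model's repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicPath:
    """One leaf topic together with its three ancestor levels (hypothetical helper)."""
    domain: str
    field: str
    subfield: str
    topic: str

    def levels(self) -> dict:
        # Ordered from coarsest (domain) to finest (topic).
        return {
            "domain": self.domain,
            "field": self.field,
            "subfield": self.subfield,
            "topic": self.topic,
        }


# Placeholder labels for illustration only.
example = TopicPath(
    domain="Physical Sciences",
    field="Computer Science",
    subfield="Example Subfield",
    topic="Example Topic",
)
print(example.levels())
```

Because every topic has exactly one parent at each level, predicting the topic fixes the subfield, field, and domain as well.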

Performance

Evaluated on a held-out test set (1,525 samples):

Level	Accuracy
Domain	92.6%
Field	85.8%
Subfield	73.6%
Topic (exact)	62.6%
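
The sketch below shows one way per-level accuracy like the figures above could be computed. It assumes each level is scored independently by exact label match; the card does not state how the evaluation was implemented, and the function name and toy labels are hypothetical.

```python
LEVELS = ("domain", "field", "subfield", "topic")


def accuracy_by_level(y_true, y_pred):
    """Return the fraction of exact matches at each hierarchy level.

    y_true and y_pred are equal-length lists of
    (domain, field, subfield, topic) tuples.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    return {
        level: sum(t[i] == p[i] for t, p in zip(y_true, y_pred)) / n
        for i, level in enumerate(LEVELS)
    }


# Toy labels, not real OpenAlex topics.
y_true = [
    ("A", "a1", "s1", "t1"),
    ("A", "a1", "s1", "t2"),
    ("B", "b1", "s3", "t5"),
    ("B", "b1", "s3", "t6"),
]
y_pred = [
    ("A", "a1", "s1", "t1"),  # correct at every level
    ("A", "a1", "s1", "t3"),  # wrong topic only
    ("B", "b1", "s9", "t9"),  # wrong from subfield down
    ("A", "a2", "s4", "t7"),  # wrong at every level
]
scores = accuracy_by_level(y_true, y_pred)
print(scores)
```

A prediction that misses the exact topic can still land in the right subfield, field, or domain, which is why accuracy rises at coarser levels.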

Comparison with Base Model

Model	Domain	Field	Subfield	Topic
Base (potion-32m)	77.2%	60.5%	27.9%	16.2%
Fine-tuned	92.6%	85.8%	73.6%	62.6%
Improvement (percentage points)	+15.4	+25.3	+45.7	+46.4
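
The improvement row is the fine-tuned accuracy minus the base accuracy, in percentage points. The arithmetic can be reproduced from the table values as follows; the variable names are illustrative.

```python
# Accuracies (%) copied from the comparison table.
base = {"domain": 77.2, "field": 60.5, "subfield": 27.9, "topic": 16.2}
fine_tuned = {"domain": 92.6, "field": 85.8, "subfield": 73.6, "topic": 62.6}

# Absolute gain in percentage points, rounded to one decimal place.
gain_pp = {level: round(fine_tuned[level] - base[level], 1) for level in base}

for level, gain in gain_pp.items():
    print(f"{level}: +{gain:.1f}")
```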

Usage

from model2vec import StaticModel
import numpy as np

# Load model
model = StaticModel.from_pretrained("jimnoneill/dataset-to-field")

# Prepare your text
text = "Machine learning approaches for protein structure prediction using deep neural networks"

# Get embedding (a NumPy array with one row per input text)
embedding = model.encode([text])

# For full classification pipeline, see:
# https://github.com/data-S-index/dataset-to-field
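
The snippet above stops at the embedding. As a rough illustration of how an embedding might be turned into a label, the sketch below ranks candidate labels by cosine similarity. This is a swapped-in technique for illustration and is not the pipeline from the repository; the function `cosine_top_k`, the toy vectors, and the label names are all assumptions.

```python
import numpy as np


def cosine_top_k(query, label_matrix, labels, k=3):
    """Rank labels by cosine similarity between a query vector and label vectors.

    Hypothetical helper: label_matrix has one row per label, in the same
    order as `labels`.
    """
    query = np.asarray(query, dtype=float)
    label_matrix = np.asarray(label_matrix, dtype=float)
    q = query / np.linalg.norm(query)
    m = label_matrix / np.linalg.norm(label_matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(-scores)[:k]  # indices of the highest scores first
    return [(labels[i], float(scores[i])) for i in order]


# Toy 3-dimensional vectors standing in for real embeddings.
labels = ["topic_a", "topic_b", "topic_c"]
label_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([0.9, 0.1, 0.0])
ranked = cosine_top_k(query, label_matrix, labels, k=2)
print(ranked)
```

With the real model, `query` would be `model.encode([text])[0]`, and the label vectors would need to come from whatever reference data the full pipeline uses.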

Training Details

Base Model: minishlab/potion-base-32m
Training Samples: 8,636
Test Samples: 1,525
Topics Covered: 1,135 / 4,516 (25%)
Training Time: ~2 minutes on RTX 4090
Framework: Model2Vec + PyTorch Lightning
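
For context on these counts, the short calculation below derives the topic coverage, the share of samples held out for testing, and the average number of training samples per covered topic (about 7.6). It assumes the training and test samples come from the same pool; the variable names are illustrative.

```python
TOTAL_TOPICS = 4516
COVERED_TOPICS = 1135
TRAIN_SAMPLES = 8636
TEST_SAMPLES = 1525

# Share of the taxonomy's topics that appear in the data.
topic_coverage = COVERED_TOPICS / TOTAL_TOPICS

# Share of all samples that were held out for testing.
test_share = TEST_SAMPLES / (TRAIN_SAMPLES + TEST_SAMPLES)

# Average number of training samples per covered topic.
samples_per_covered_topic = TRAIN_SAMPLES / COVERED_TOPICS

print(f"coverage: {topic_coverage:.1%}")
print(f"test share: {test_share:.1%}")
print(f"train samples per covered topic: {samples_per_covered_topic:.1f}")
```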

Training Data

The model was trained on ground truth topic classifications derived from the OpenAlex topics taxonomy. The training data includes scientific records with titles, subjects, and descriptions mapped to specific topics.

Dataset: jimnoneill/dataset-to-field-training-10k
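
Since the records combine titles, subjects, and descriptions, an input string for the model might be assembled along the following lines. This is a sketch: the key names follow the general shape of DataCite JSON metadata (lists of `title`, `subject`, and `description` entries), while the concatenation order, the separator, and the function name `record_to_text` are assumptions. The preprocessing actually used for training may differ.

```python
def record_to_text(record, sep=" "):
    """Join titles, subjects, and descriptions of a record into one string.

    Hypothetical helper; missing or empty fields are skipped.
    """
    parts = []
    parts += [t.get("title", "") for t in record.get("titles", [])]
    parts += [s.get("subject", "") for s in record.get("subjects", [])]
    parts += [d.get("description", "") for d in record.get("descriptions", [])]
    return sep.join(p.strip() for p in parts if p and p.strip())


# Made-up record for illustration.
record = {
    "titles": [{"title": "Ocean temperature profiles"}],
    "subjects": [{"subject": "oceanography"}],
    "descriptions": [{"description": " Daily CTD casts. "}],
}
text = record_to_text(record)
print(text)
```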

Intended Use

Classifying scientific publications, datasets, and research outputs into research fields
Mapping DataCite records to research topics
Bibliometric analysis and research trend identification
Integration with the S-Index scientific indexing pipeline

Limitations

Training data covers only 1,135 of the 4,516 topics (~25%); topics that are rare in or absent from the training data may be predicted with lower accuracy
Domain distribution is skewed toward Physical Sciences (53%)
Best suited for English-language scientific content

Citation

If you use this model, please cite:

@software{dataset-to-field,
  author = {O'Neill, James and Patel, Bhavesh},
  title = {Dataset Research Field Classifier},
  year = {2026},
  url = {https://github.com/data-S-index/dataset-to-field}
}

License

MIT

Acknowledgments

OpenAlex for the topic taxonomy
minishlab/potion-base-32m base embedding model
Model2Vec for model distillation and training
