ProtSent ESM-2 35M
Contrastively fine-tuned ESM-2 35M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.
This is the best-performing 35M variant. It was trained without hard negatives, which improved 20 of 23 downstream tasks, compared with 16 of 23 for the full model.
- Paper: ProtSent: Protein Sentence Transformers
- Code: github.com/oriel9p/ProtSent
- 150M model: oriel9p/protsent-esm2-150M
Training
ProtSent applies contrastive fine-tuning using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT on ESM-2 backbones.
This variant was trained on four complementary data sources with round-robin sampling:
| Dataset | Rows/Pairs | Loss |
|---|---|---|
| Pfam families (linclust@70%) | 32.9M domains | MNRL |
| AlphaFold DB structural pairs (Foldseek-grouped) | 133.9M sequences | MNRL |
| STRING-DB v12 PPI (score >= 400) | 36.5M pairs | MNRL |
| ProteinGym DMS / clinical | 2.2M pairs | CoSENT |
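The round-robin sampling over these sources can be pictured with a minimal sketch. It assumes that one batch is drawn from each source in turn and that iteration stops as soon as any source runs out. That stopping rule and the toy batch labels are illustrative assumptions, not details confirmed by the training code.

```python
def round_robin_batches(sources):
    """Yield (source_name, batch) pairs, visiting each source in turn.

    `sources` maps a dataset name to an iterable of batches.
    Assumption for illustration: stop when the first source is exhausted.
    """
    iterators = {name: iter(batches) for name, batches in sources.items()}
    while True:
        for name, iterator in iterators.items():
            try:
                batch = next(iterator)
            except StopIteration:
                return
            yield name, batch


# Toy stand-ins for the four data sources (hypothetical batch labels).
toy_sources = {
    "pfam": ["pfam-0", "pfam-1"],
    "afdb": ["afdb-0", "afdb-1"],
    "string": ["string-0", "string-1"],
    "proteingym": ["pg-0", "pg-1"],
}
schedule = list(round_robin_batches(toy_sources))
print([name for name, _ in schedule[:4]])
```

Each batch comes from a single source, so a per-source loss (MNRL or CoSENT, as in the table) can be applied to it.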
Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Training ran on a single NVIDIA RTX 6000 Ada (48 GB) and took roughly 3 to 4 hours.
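To make the role of the temperature concrete, here is a minimal pure-Python sketch of an in-batch ranking loss of the kind MNRL computes. It assumes cosine similarity divided by the temperature as the logit, so a temperature of 0.05 corresponds to a scale factor of 20. The function names and toy vectors are hypothetical, and this is not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(anchors, positives, temperature=0.05):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positives[j] acts as an in-batch negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings": aligned pairs should give a much lower loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = in_batch_ranking_loss(anchors, aligned)
loss_swapped = in_batch_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

A lower temperature sharpens the softmax, so near misses among the in-batch negatives are penalized more heavily.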
Quick Start
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-35M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar
    "AGWYRSPQEGLKPVDTFKDIV",  # different
]
embeddings = model.encode(sequences)
```
Compute similarity:
```python
from sentence_transformers.util import cos_sim

# Compare the first sequence against the other two
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
```
Results
Evaluation uses a KNN probe (k=3, Euclidean distance) on 23 downstream tasks. This variant (w/o hard negatives) improves 20 of the 23 tasks over baseline ESM-2 35M, with a mean relative improvement of +7.9%.
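For readers unfamiliar with KNN probing, the sketch below shows the idea on toy 2-D vectors: embeddings stay frozen, and each query is labeled from its three nearest training embeddings by Euclidean distance. Majority voting for classification and unweighted averaging for regression are assumptions here, as are the function names. The card does not state how the evaluation handles weighting or ties.

```python
import math
from collections import Counter


def nearest_targets(train_embeddings, train_targets, query, k=3):
    """Return the targets of the k training points closest to `query`."""
    ranked = sorted(
        zip(train_embeddings, train_targets),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    return [target for _, target in ranked[:k]]


def knn_classify(train_embeddings, train_labels, query, k=3):
    """Majority vote over the k nearest neighbours."""
    votes = Counter(nearest_targets(train_embeddings, train_labels, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_embeddings, train_values, query, k=3):
    """Unweighted mean over the k nearest neighbours."""
    neighbours = nearest_targets(train_embeddings, train_values, query, k)
    return sum(neighbours) / len(neighbours)


# Toy stand-ins for frozen embeddings (hypothetical data).
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
values = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]

predicted_label = knn_classify(train, labels, [0.2, 0.1])
predicted_value = knn_regress(train, values, [0.2, 0.1])
print(predicted_label, predicted_value)
```

Because the probe has no trainable parameters, differences in its score reflect the geometry of the embedding space rather than the capacity of a downstream head.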
Selected highlights vs. baseline ESM-2 35M:
| Task | Metric | Baseline | ProtSent | Change |
|---|---|---|---|---|
| Remote Homology (Fold) | F1 Macro | .223 | .313 | +40.5% |
| RhlA Enzyme Mutations | Spearman | .236 | .418 | +77.2% |
| Beta-lactamase (PEER) | Spearman | .670 | .793 | +18.5% |
| Fluorescence (TAPE) | Spearman | .490 | .567 | +15.6% |
| PPI (Bernett) | AUC | .560 | .589 | +5.3% |
Intended Use
General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships.
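As an illustration of the retrieval and similarity-search use case, the sketch below ranks a small corpus of precomputed embeddings against a query by cosine similarity. The vectors are toy placeholders rather than real model outputs, and `top_k_similar` is a hypothetical helper. In practice the inputs would come from `model.encode`, and a library routine such as `sentence_transformers.util.semantic_search` may be preferable for large corpora.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_similar(query, corpus, k=2):
    """Return (corpus_index, cosine_similarity) for the k best matches."""
    unit_query = normalize(query)
    scored = []
    for index, vec in enumerate(corpus):
        score = sum(a * b for a, b in zip(unit_query, normalize(vec)))
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy placeholder embeddings (hypothetical, 3-D for readability).
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
hits = top_k_similar([1.0, 0.05, 0.0], corpus, k=2)
print([index for index, _ in hits])
```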
Citation
```bibtex
@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```