ProtSent ESM-2 35M

Contrastively fine-tuned ESM-2 35M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.

This is the best-performing 35M variant. It was trained without hard negatives, which improved 20 of 23 downstream tasks over baseline, compared with 16 of 23 for the full model (trained with hard negatives).

Paper: ProtSent: Protein Sentence Transformers
Code: github.com/oriel9p/ProtSent
150M model: oriel9p/protsent-esm2-150M

Training

ProtSent applies contrastive fine-tuning using the SentenceTransformers framework with MultipleNegativesRankingLoss (MNRL) and CoSENT on ESM-2 backbones.
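As a rough illustration of what MNRL optimizes: for a batch of (anchor, positive) pairs, every other positive in the batch serves as a negative, and the loss is a cross-entropy over scaled cosine similarities. The sketch below is a plain-Python toy, not the SentenceTransformers implementation. The function names are hypothetical, and treating a temperature of 0.05 as a divisor on the cosine similarity (equivalent to a scale of 20) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, temperature=0.05):
    """In-batch-negatives ranking loss (toy version).

    For anchor i, positives[i] is the correct match and every other
    positives[j] acts as a negative. The loss is the mean cross-entropy
    of picking the correct column from the scaled similarity row.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the row.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings": matched pairs point in similar directions.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnrl_loss(anchors, aligned)
loss_swapped = mnrl_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each anchor is closest to its own positive and grows large when the pairs are swapped, which is the pressure that pulls related proteins together in embedding space.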

This variant was trained on four complementary data sources with round-robin sampling:

| Dataset | Rows/Pairs | Loss |
|---|---|---|
| Pfam families (linclust@70%) | 32.9M domains | MNRL |
| AlphaFold DB structural pairs (Foldseek-grouped) | 133.9M sequences | MNRL |
| STRING-DB v12 PPI (score >= 400) | 36.5M pairs | MNRL |
| ProteinGym DMS / clinical | 2.2M pairs | CoSENT |
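Round-robin sampling over several sources can be sketched as follows. This is a minimal illustration, not the training code: the function name and the fake batch lists are hypothetical, and stopping as soon as one source is exhausted is an assumption (it mirrors how SentenceTransformers documents its `MultiDatasetBatchSamplers.ROUND_ROBIN` option, but this card does not state which sampler was used).

```python
def round_robin_batches(sources):
    """Yield (source_name, batch) pairs, taking one batch per source in turn.

    `sources` maps a name to an iterable of batches. Iteration stops as
    soon as any source runs out, so no source gets more than one batch
    ahead of another.
    """
    iterators = {name: iter(batches) for name, batches in sources.items()}
    while True:
        for name, iterator in iterators.items():
            try:
                batch = next(iterator)
            except StopIteration:
                return
            yield name, batch


# Hypothetical stand-ins for the four data sources; the lists are fake batches.
sources = {
    "pfam": ["pfam_0", "pfam_1", "pfam_2"],
    "afdb": ["afdb_0", "afdb_1", "afdb_2"],
    "string": ["string_0", "string_1"],
    "proteingym": ["pg_0", "pg_1", "pg_2"],
}

schedule = list(round_robin_batches(sources))
for name, batch in schedule:
    print(name, batch)
```

Yielding the source name alongside each batch lets a training loop pick the matching loss (MNRL or CoSENT) for that batch.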

Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~3-4 hours.
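For readers unfamiliar with a cosine LR schedule, the sketch below shows the usual decay formula. The peak learning rate and step count are hypothetical placeholders (neither is stated here), and the absence of a warmup phase is an assumption.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values: the peak LR and step count are not given in this card.
PEAK_LR = 2e-5
TOTAL_STEPS = 1000

for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS, PEAK_LR))
```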

Quick Start

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oriel9p/protsent-esm2-35M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar
    "AGWYRSPQEGLKPVDTFKDIV",  # different
]

embeddings = model.encode(sequences)

Compute similarity:

from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

Results

Evaluation uses a KNN probe (k=3, Euclidean distance) on 23 downstream tasks. This variant (w/o hard negatives) improves 20 of 23 tasks over baseline ESM-2 35M, with a mean relative improvement of +7.9%.

Selected highlights vs. baseline ESM-2 35M:

| Task | Metric | Baseline | ProtSent | Change |
|---|---|---|---|---|
| Remote Homology (Fold) | F1 Macro | .223 | .313 | +40.5% |
| RhlA Enzyme Mutations | Spearman | .236 | .418 | +77.2% |
| Beta-lactamase (PEER) | Spearman | .670 | .793 | +18.5% |
| Fluorescence (TAPE) | Spearman | .490 | .567 | +15.6% |
| PPI (Bernett) | AUC | .560 | .589 | +5.3% |
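A KNN probe of the kind described above (k=3, Euclidean distance) can be sketched for a classification task as below. This is a toy illustration rather than the evaluation code: the function name, the 2-D vectors standing in for frozen embeddings, and the labels are all made up. Regression tasks scored with Spearman correlation would presumably average the neighbors' target values instead of voting, which is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-D vectors standing in for frozen protein embeddings.
train_x = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
train_y = ["fold_a", "fold_a", "fold_a", "fold_b", "fold_b", "fold_b"]
test_x = [[0.05, 0.05], [0.95, 0.95]]
test_y = ["fold_a", "fold_b"]

predictions = [knn_predict(train_x, train_y, q) for q in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

Because the probe has no trainable parameters, its score depends only on how well the embedding geometry separates the labels.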

Intended Use

General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships.
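For retrieval and similarity search, embeddings can be ranked against a query by cosine similarity. The sketch below uses small hand-written vectors in place of real `model.encode` output so that it runs without downloading the model; the `top_k` helper and the protein IDs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k(query, corpus, k=2):
    """Return the k (id, score) pairs most similar to the query vector."""
    scored = [(seq_id, cosine(query, vec)) for seq_id, vec in corpus.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up 3-D vectors standing in for model.encode(...) output.
corpus = {
    "protein_a": [0.9, 0.1, 0.0],
    "protein_b": [0.8, 0.2, 0.1],
    "protein_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

hits = top_k(query, corpus, k=2)
print(hits)
```

For large corpora, an approximate nearest-neighbor index would typically replace this exhaustive scan.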

Citation

@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}