ProtSent ESM-2 35M

Contrastively fine-tuned ESM-2 35M protein language model, producing fixed-length embeddings where biological similarity maps to embedding proximity.

This is the best-performing 35M variant. It was trained without hard negatives and improves 20 of 23 downstream tasks over baseline ESM-2 35M, versus 16 of 23 for the full model.

Paper: ProtSent: Protein Sentence Transformers
Code: github.com/oriel9p/ProtSent
150M model: oriel9p/protsent-esm2-150M

Training

ProtSent contrastively fine-tunes ESM-2 backbones with the SentenceTransformers framework, using two losses: MultipleNegativesRankingLoss (MNRL) and CoSENT.

This variant was trained on four complementary data sources with round-robin sampling:

Dataset	Rows/Pairs	Loss
Pfam families (linclust@70%)	32.9M domains	MNRL
AlphaFold DB structural pairs (Foldseek-grouped)	133.9M sequences	MNRL
STRING-DB v12 PPI (score >= 400)	36.5M pairs	MNRL
ProteinGym DMS / clinical	2.2M pairs	CoSENT
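As a rough illustration of round-robin sampling over the four sources above, the sketch below cycles through one batch iterator per dataset, so consecutive training steps draw from different sources. The dataset keys, the toy batches, and the `round_robin` helper are hypothetical. How ProtSent treats sources of unequal size is not stated here; this sketch assumes exhausted sources are simply dropped while the rest keep cycling.

```python
from typing import Dict, Iterable, Iterator, Tuple


def round_robin(sources: Dict[str, Iterable]) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, batch) pairs, visiting the sources in turn.

    Assumption: exhausted sources are dropped and the rest keep cycling.
    """
    iterators = {name: iter(batches) for name, batches in sources.items()}
    while iterators:
        for name in list(iterators):
            try:
                yield name, next(iterators[name])
            except StopIteration:
                del iterators[name]  # this source has no batches left


# Toy stand-ins for the four data sources (hypothetical names and contents).
toy_sources = {
    "pfam": ["pfam_batch_0", "pfam_batch_1"],
    "afdb": ["afdb_batch_0", "afdb_batch_1", "afdb_batch_2"],
    "string_ppi": ["ppi_batch_0"],
    "proteingym": ["dms_batch_0", "dms_batch_1"],
}

schedule = [name for name, _ in round_robin(toy_sources)]
print(schedule)
```

Because each batch comes from a single source, the loss can be chosen per batch (MNRL for the first three sources, CoSENT for the ProteinGym pairs).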

Key hyperparameters: AdamW optimizer, cosine LR schedule, batch size 1024, temperature 0.05, dropout 0.1. Trained on a single NVIDIA RTX 6000 Ada 48GB in ~3-4 hours.
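To make the role of the temperature concrete, the following NumPy sketch computes an MNRL-style loss: each anchor's paired sequence is its positive, and every other positive in the batch acts as a negative. It is a simplified illustration under stated assumptions (cosine similarity, similarities divided by the temperature, random toy vectors in place of real embeddings), not the ProtSent training code. The `mnrl_loss` helper is hypothetical.

```python
import numpy as np


def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.05) -> float:
    """In-batch softmax cross-entropy over cosine similarities.

    Row i of `anchors` is paired with row i of `positives`; all other rows
    of `positives` serve as negatives for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature  # shape: (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # correct pairs sit on the diagonal


rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))  # toy vectors, not real protein embeddings
aligned = anchors + 0.01 * rng.normal(size=(8, 16))  # positives near their anchors
shuffled = np.roll(aligned, shift=1, axis=0)  # every positive paired with the wrong anchor

print(f"aligned pairs:  {mnrl_loss(anchors, aligned):.4f}")
print(f"shuffled pairs: {mnrl_loss(anchors, shuffled):.4f}")
```

A temperature of 0.05 multiplies cosine similarities by 20 before the softmax, which sharpens the penalty on in-batch negatives that lie close to the anchor. A larger batch (1024 here) supplies more negatives per anchor.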

Quick Start

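from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("oriel9p/protsent-esm2-35M")

sequences = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYN",  # similar: differs from the first only at the last residue
    "AGWYRSPQEGLKPVDTFKDIV",  # different
]

# One fixed-length embedding per input sequence.
embeddings = model.encode(sequences)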

Compute similarity:

from sentence_transformers.util import cos_sim

# Cosine similarity of the first sequence to each of the other two.
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)

Results

Evaluation uses a KNN probe (k=3, Euclidean distance) on 23 downstream tasks. This variant (w/o hard negatives) improves 20 of 23 tasks over baseline ESM-2 35M, with a mean relative improvement of +7.9%.
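A KNN probe of this kind can be sketched as follows: embeddings stay frozen, and each test item takes the majority label of its three nearest training embeddings under Euclidean distance. The `knn_predict` helper and the synthetic two-cluster data are illustrative assumptions; the actual evaluation tasks, splits, and handling of regression targets are not reproduced here.

```python
import numpy as np


def knn_predict(train_x: np.ndarray, train_y: np.ndarray, test_x: np.ndarray, k: int = 3) -> np.ndarray:
    """Classify each test vector by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    # Pairwise distances, shape (n_test, n_train).
    dists = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest
    votes = train_y[nearest]  # labels of those neighbours
    return np.array([np.bincount(row).argmax() for row in votes])


# Synthetic stand-in for frozen embeddings: two well-separated clusters.
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0.0, 0.1, (20, 8)), rng.normal(1.0, 0.1, (20, 8))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = np.vstack([rng.normal(0.0, 0.1, (5, 8)), rng.normal(1.0, 0.1, (5, 8))])
test_y = np.array([0] * 5 + [1] * 5)

pred = knn_predict(train_x, train_y, test_x)
print("accuracy:", (pred == test_y).mean())
```

With real data, `train_x` and `test_x` would be the outputs of `model.encode(...)` for the training and test sequences. Since the probe has no trainable parameters, differences in score reflect the geometry of the embedding space rather than the capacity of a downstream classifier.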

Selected highlights vs. baseline ESM-2 35M:

Task	Metric	Baseline	ProtSent	Change
Remote Homology (Fold)	F1 Macro	.223	.313	+40.5%
RhlA Enzyme Mutations	Spearman	.236	.418	+77.2%
Beta-lactamase (PEER)	Spearman	.670	.793	+18.5%
Fluorescence (TAPE)	Spearman	.490	.567	+15.6%
PPI (Bernett)	AUC	.560	.589	+5.3%
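The Change column reads as a relative improvement, (ProtSent - Baseline) / Baseline, rather than an absolute difference. The short check below recomputes it from the scores in the table. Those scores are rounded to three decimals, so the recomputed percentages can differ from the reported ones by a few tenths of a point, presumably because the reported changes were derived from unrounded scores. The `relative_change` helper is illustrative.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# (task, baseline, ProtSent) taken from the table above.
rows = [
    ("Remote Homology (Fold)", 0.223, 0.313),
    ("RhlA Enzyme Mutations", 0.236, 0.418),
    ("Beta-lactamase (PEER)", 0.670, 0.793),
    ("Fluorescence (TAPE)", 0.490, 0.567),
    ("PPI (Bernett)", 0.560, 0.589),
]

for task, baseline, protsent in rows:
    print(f"{task}: {relative_change(baseline, protsent):+.1f}%")
```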

Intended Use

General-purpose protein embeddings for downstream tasks including classification, regression, retrieval, clustering, and similarity search. The embeddings capture evolutionary, structural, and functional relationships.

Citation

@article{ofer2026protsent,
  title={ProtSent: Protein Sentence Transformers},
  author={Ofer, Dan and Perets, Oriel and Linial, Michal and Rappoport, Nadav},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
