DNA To Proteins Translator
A GPT-2 model fine-tuned to translate DNA into protein sequences, trained on a large cross-species GenBank dataset.
Model Architecture
- Base model: GPT-2
- Approach: DNA-to-protein sequence translation
Usage
You can use this model through its own custom pipeline:
```python
from transformers import pipeline

pipe = pipeline(
    task="gpt2-dna-translator",
    model="GustavoHCruz/DNATranslatorGPT2",
    trust_remote_code=True,
)

out = pipe({
    "sequence": "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
    "organism": "Homo sapiens"
})
print(out)  # LTWSFLCIQR

out = pipe({
    "sequence": "ACACCAGCCTAGTTCTATGTCAGGTTCTAAAATATTTTCTGGTTCAATAAATAAAACATCAACATCTCACATAAAAGAAGTACGGAAAAGATTTAAAGGCAGTAACATATGAACGTAGGACGTTTAGGAGAAAAATGCTAAAAAAGTAGCTATTGTTAATTGAACATTACTCAGGGATGATCGGTTGTTTTTGTATTGACTTACCAAGACCACCATTGCCGAGTGCTGCATCCATTTCACGTTCTTCTAATTCTTCAATATCTAAATTCAACTCATAAAGAGCTTAATCA",
    "organism": "Rotaria socialis"
})
print(out)  # MDAALGNGGL
```
This model uses the same maximum context length as standard GPT-2 (1024 tokens). Training was performed so that the DNA sequence and the resulting protein always fit within this context. An additional (and highly recommended) piece of context is available: the organism.
When using this pipeline, a few rules are applied so that inference matches the conditions used during training (a validation sketch follows this list):
- DNA sequences will be limited to 1000 tokens (each nucleotide becomes a token).
- The organism (raw text) is limited to a maximum of 10 characters.
- The generated response is limited to 1024 minus the size of the received input; even when the input is at its limit, at least 11 new tokens can be generated.
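The pipeline applies these limits itself, but it can be convenient to validate inputs before calling it. The helper below is a minimal, hypothetical sketch that simply mirrors the limits listed above; it is not part of the released pipeline:

```python
def build_input(sequence: str, organism: str = "") -> dict:
    """Illustrative pre-check mirroring the pipeline's stated limits."""
    sequence = sequence.upper()
    if len(sequence) > 1000:
        raise ValueError("DNA sequence exceeds the 1000-nucleotide limit")
    if len(organism) > 10:
        print("warning: organism text exceeds 10 characters and may be truncated")
    return {"sequence": sequence, "organism": organism}

# Same input as the first example above
out = pipe(build_input(
    "GTTTCTTTGCTTTTTAMGCTTGTATCTATTCTTCCATCGTAGACTGACCTGGTCATTTCTTTGCATCCAACGTA",
    "Homo sapiens",
))
print(out)  # LTWSFLCIQR
```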
Custom Usage Information
Prompt format:
The model expects the following input format:
```
<|DNA|>[DNA_G][DNA_T][DNA_T][DNA_T]...<|ORGANISM|>Homo sapiens
```
The model will generate a response in the following expected format:
```
<|PROTEIN|>[PROT_L][PROT_T][PROT_W]...<|END|>
```
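To make the format concrete, the sketch below builds the prompt by hand and calls the base causal LM directly instead of using the custom pipeline. It assumes the special markers (`<|DNA|>`, `<|ORGANISM|>`, `<|PROTEIN|>`, `<|END|>`) and the per-nucleotide `[DNA_*]` / per-residue `[PROT_*]` tokens are registered in the released tokenizer; in practice the pipeline above handles all of this for you.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GustavoHCruz/DNATranslatorGPT2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

dna = "GTTTCT"
organism = "Homo sapiens"

# Build the prompt in the expected input format:
# <|DNA|>[DNA_G][DNA_T]...<|ORGANISM|>Homo sapiens
prompt = "<|DNA|>" + "".join(f"[DNA_{base}]" for base in dna) + "<|ORGANISM|>" + organism

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens and keep what lies between
# <|PROTEIN|> and <|END|>
generated = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
protein_part = generated.split("<|PROTEIN|>")[-1].split("<|END|>")[0]
print(protein_part)  # expected form: [PROT_X][PROT_Y]...
```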
Dataset
The model was trained on a processed version of GenBank sequences spanning multiple species, available at the DNA Coding Regions Dataset.
Training
- Trained on a setup of 8x H100 GPUs.
Metrics
The model is still in the initial stages of evaluation; it currently reaches an average similarity of approximately 0.75 to the target sequences in the test set (computed from edit distance).
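The exact normalization behind this similarity score is not spelled out here; a common convention, assumed in the sketch below, is one minus the Levenshtein edit distance divided by the length of the longer sequence:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

def similarity(pred: str, target: str) -> float:
    """1.0 for identical sequences, 0.0 for completely different ones."""
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))

print(similarity("LTWSFLCIQR", "LTWSFLCIQR"))  # 1.0
```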
GitHub Repository
The full code for data processing, model training, and inference is available on GitHub:
CodingDNATransformers
You can find scripts for:
- Preprocessing GenBank sequences
- Fine-tuning models
- Evaluating and using the trained models