---
license: cc-by-nc-nd-4.0
pipeline_tag: token-classification
language:
- la
---

**CaputEmendatoris** is a projection head for [Emendator](https://huggingface.co/aimgo/Emendator) trained to identify OCR artifacts in Latin text at the character level.

The model is intended to be used on segments of **250** characters. Other segment lengths will compromise performance.

In initial testing, using **0.25** as a character probability threshold typically produced the best F1 score across all degrees of corruption.

---

### Light Corruption

    Orig: Antistes mihi milibus trecentis.
    OCR:  Antiftes mihi milibus trecentis: " . .. .ijiscnn p inr: h
              ^                          ^^^^^^^^^^^^^^^^^^^^^^^^^^

### Heavy Corruption

    Orig: Cognoscenda virtute circumscripta est scientia, quae ad experientiam pertinet et ad rationem.
    OCR:  C0gn0fccndauirtutccircurnfcriptacftfcientia:quacadcxpcricntiarnpcrtinct&adrationcrn«
          ^ ^^^^ ^ ^^ ^^^ ^ ^^^^ ^ ^^^ ^ ^ ^^ ^ ^ ^ ^^^^

To use CaputEmendatoris, you can load it via the Transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model_repo = "aimgo/CaputEmendatoris"
tokenizer_repo = "aimgo/Emendator"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_repo)
model = AutoModel.from_pretrained(
    model_repo,
    trust_remote_code=True,  # <=== NECESSARY, THIS HEAD HAS A CUSTOM MODELING FILE
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device)
model.eval()

text = "quandoquidcrn natura anirni rnortalis habctur."
enc = tokenizer(text, return_tensors="pt").to(device)

# detector
with torch.no_grad():
    probs = model.detect(enc["input_ids"], enc.get("attention_mask", None))

# First (only) sequence in the batch, without its final position.
byte_probs = probs[0][:-1].detach().cpu().tolist()

# Collapse per-byte probabilities to per-character ones: a multi-byte UTF-8
# character takes the maximum over its bytes.
char_probs = []
byte_idx = 0
for c in text:
    n = len(c.encode("utf-8"))
    if byte_idx + n <= len(byte_probs):
        char_probs.append(max(byte_probs[byte_idx:byte_idx + n]))
    else:
        # Not enough byte probabilities left for this character.
        char_probs.append(0.0)
    byte_idx += n

print(char_probs)
```

If you use this in your work, please cite:

```
@misc{mccarthy2026Emendator,
  author = {McCarthy, A. M.},
  title = {{Emendator}: Latin OCR Artifact Correction},
  year = {2026},
  howpublished = {\url{https://huggingface.co/aimgo/CaputEmendatoris}},
  note = {Model}
}
```
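
The 250-character segment length and the 0.25 probability threshold mentioned above can be applied with two small helpers. This is only a minimal sketch and not part of the model's own code: `split_segments` and `mark_artifacts` are hypothetical names, the inclusive `>=` comparison is an assumption, and the probabilities in the demo are made up for illustration rather than produced by the model.

```python
from typing import List


def split_segments(text: str, size: int = 250) -> List[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def mark_artifacts(text: str, char_probs: List[float], threshold: float = 0.25) -> str:
    """Build a caret line flagging characters at or above the threshold.

    Assumption: a probability equal to the threshold counts as an artifact.
    """
    if len(text) != len(char_probs):
        raise ValueError("text and char_probs must have the same length")
    marks = "".join("^" if p >= threshold else " " for p in char_probs)
    return marks.rstrip()


# Illustrative probabilities only, NOT real model output.
sample = "Antiftes mihi"
fake_probs = [0.01, 0.02, 0.03, 0.05, 0.91] + [0.02] * 8

print(sample)
print(mark_artifacts(sample, fake_probs))
```

In practice, each chunk returned by `split_segments` would be passed through the detection code above to obtain its `char_probs`. The final chunk of a document will usually be shorter than 250 characters, so results on it may be less reliable.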