---
tags:
- spacy
- token-classification
- ancient-greek
language:
- grc
license: mit
model-index:
- name: grc_dep_web_trf
  results:
  - task:
      name: POS Tagging
      type: token-classification
    metrics:
    - name: POS Accuracy
      type: accuracy
      value: 0.9728
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9740
  - task:
      name: Lemmatization
      type: token-classification
    metrics:
    - name: Lemma Accuracy
      type: accuracy
      value: 0.9399
  - task:
      name: Dependency Parsing
      type: token-classification
    metrics:
    - name: Labeled Attachment Score
      type: f_score
      value: 0.8027
---

# grc_dep_web_trf

**Ancient Greek** pipeline for [spaCy](https://spacy.io), part of the [LatinCy](https://huggingface.co/latincy) project.

**Experimental beta release.** This is part of the first generation of Ancient Greek models porting the [LatinCy](https://huggingface.co/latincy) Latin pipeline infrastructure to Ancient Greek. Expect rough edges; scores and component behavior will improve as training data is harmonized and curated through the LatinCy flywheel (train, evaluate, curate, retrain).

Transformer model powered by PhilBerta (Ancient Greek RoBERTa). Trained on Universal Dependencies Ancient Greek treebanks (PTNK, PROIEL, Perseus) with a 1.2M-entry lookup lemmatizer overlay built from CLTK Morpheus, UD treebanks, and Wiktionary.

| Feature | Description |
| --- | --- |
| **Name** | `grc_dep_web_trf` |
| **Version** | `3.8.1` |
| **spaCy** | `>=3.8.11,<3.9.0` |
| **Default Pipeline** | `senter`, `transformer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `parser` |
| **Components** | `senter`, `transformer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `lookup_lemmatizer`, `parser` |
| **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
| **License** | `MIT` |
| **Author** | [Patrick J. Burns](https://huggingface.co/latincy) |

## Install

```bash
pip install https://huggingface.co/latincy/grc_dep_web_trf/resolve/main/grc_dep_web_trf-3.8.1-py3-none-any.whl
```

## Usage

```python
import spacy

nlp = spacy.load("grc_dep_web_trf")
doc = nlp("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος")
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_)
```

## Evaluation

Scores on held-out UD test data (combined PTNK + PROIEL + Perseus).

| Metric | Score |
| --- | --- |
| **POS (UPOS) Accuracy** | 97.28 |
| **TAG (XPOS) Accuracy** | 97.40 |
| **Morph (UFeats) Accuracy** | 93.61 |
| **Lemma Accuracy** | 93.99 |
| **Unlabeled Attachment Score (UAS)** | 85.12 |
| **Labeled Attachment Score (LAS)** | 80.27 |
| **Sentences F-Score** | 88.18 |

## Training Data

| Source | Description |
| --- | --- |
| [UD_Ancient_Greek-PTNK](https://github.com/UniversalDependencies/UD_Ancient_Greek-PTNK) | Septuagint (Codex Alexandrinus) |
| [UD_Ancient_Greek-PROIEL](https://github.com/UniversalDependencies/UD_Ancient_Greek-PROIEL) | PROIEL Ancient Greek treebank |
| [UD_Ancient_Greek-Perseus](https://github.com/UniversalDependencies/UD_Ancient_Greek-Perseus) | Perseus Ancient Greek treebank |

## Components

- **transformer** -- PhilBerta transformer backbone (Ancient Greek RoBERTa)
- **tagger** -- Fine-grained POS tagger (XPOS, harmonized 16-tag tagset)
- **morphologizer** -- Morphological feature assignment (UPOS + UFeats)
- **trainable_lemmatizer** -- Edit-tree lemmatizer
- **lookup_lemmatizer** -- 1.2M-entry dictionary lemmatizer overlay (CLTK Morpheus + UD + Wiktionary); normalizes grave accents to acute at query time
- **parser** -- Dependency parser (transition-based)
- **senter** -- Sentence segmenter

## Label Scheme

The label scheme comprises 1796 labels across 3 components:

| Component | Labels |
| --- | --- |
| **`tagger`** | `adjective`, `adverb`, `conjunction`, `conjunction_adverb`, `conjunction_pronoun`, `determiner`, `interjection`, `noun`, `number`, `particle`, `preposition`, `pronoun`, `proper_noun`, `punc`, `unknown`, `verb` |
| **`morphologizer`** | 1749 morphological feature combinations |
| **`parser`** | `ROOT`, `acl`, `advcl`, `advmod`, `amod`, `appos`, `aux`, `case`, `cc`, `ccomp`, `conj`, `cop`, `csubj`, `dep`, `det`, `discourse`, `dislocated`, `fixed`, `flat`, `iobj`, `mark`, `nmod`, `nsubj`, `nummod`, `obj`, `obl`, `orphan`, `parataxis`, `punct`, `vocative`, `xcomp` |
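
The Components list notes that `lookup_lemmatizer` normalizes grave accents to acute at query time, so that a form such as θεὰ (grave, as written in running text) matches a dictionary entry stored as θεά. The pipeline's actual implementation is not shown in this card; the following is only a minimal sketch of one way such a normalization could work, using Python's standard `unicodedata` module. The names `grave_to_acute`, `LOOKUP`, and `lookup_lemma` are hypothetical and exist for illustration only.

```python
import unicodedata
from typing import Optional

COMBINING_GRAVE = "\u0300"  # combining grave accent (varia)
COMBINING_ACUTE = "\u0301"  # combining acute accent (oxia / tonos)


def grave_to_acute(form: str) -> str:
    """Return `form` with every grave accent replaced by an acute accent."""
    # Decompose so that each accent becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", form)
    swapped = decomposed.replace(COMBINING_GRAVE, COMBINING_ACUTE)
    # Recompose so that the result uses precomposed characters again.
    return unicodedata.normalize("NFC", swapped)


# Hypothetical two-entry table standing in for the 1.2M-entry dictionary.
# Keys are passed through the same normalization as queries (an assumption).
LOOKUP = {
    grave_to_acute("θεά"): "θεά",
    grave_to_acute("καί"): "καί",
}


def lookup_lemma(form: str) -> Optional[str]:
    """Normalize the query form, then look it up; None if there is no entry."""
    return LOOKUP.get(grave_to_acute(form))


# A grave-accented form from running text resolves to the acute-accented entry.
print(lookup_lemma("θεὰ"))
```

Working on the decomposed (NFD) form means breathings, iota subscripts, and circumflexes are left untouched; only the grave mark itself is swapped before the string is recomposed.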