Geometric Structural Vocabulary: Scaling Deterministic Mathematics Into Universal Model Conditioning
AbstractPhil · February 2026
The Geometry Works
Over the past year, a sustained series of experiments has demonstrated that deterministic geometric mathematics — pentachoron structures, Cantor set theory, crystalline embeddings, k-simplex constructions — can classify, condition, and transfer structural information across neural network architectures in ways that statistical learning alone cannot.
I've published a companion model repo whose README.md and FORMULAS.md summarize the key geometric results as bullet points: https://huggingface.co/AbstractPhil/geometric-experiment-history
The David multi-scale crystal classifier used geometric pentachora as class prototypes with role-weighted similarity across multiple embedding scales — extracting more from frozen features than architectures orders of magnitude larger, with a fraction of the trainable parameters. The Geometric Basin Classifier proved that cross-entropy can be replaced entirely: classification via triadic compatibility, self-similarity, Cantor coherence, and hierarchical basin checks, no softmax, no probability distribution. The learnable alpha parameter in the Devil's Staircase positional encoding converged to 0.5 under geometric losses — the exact triadic equilibrium that maximizes fractal structure — while cross-entropy actively destroyed it. The K-Simplex LLM prototype maintained 100% geometric validity through full training on Shakespeare while achieving reasonable perplexity. geo-beatrix beat vision transformers on CIFAR-100 using no attention mechanisms at all — ancient convolutions plus geometric basin compatibility.
Most recently, a frozen PatchMaker trained exclusively on 27 synthetic geometric primitives produced features that outperformed FLUX VAE latents on CIFAR-100 natural image classification by nearly 8 points, without ever seeing a natural image during its own training. The geometric signal transfers.
Now we're scaling it.
The Core Insight
Current deep learning treats every model as an island. Each architecture learns its own internal representation from scratch, its own attention patterns, its own way of encoding "what matters" about the data. Transfer learning and distillation help, but they're approximations — lossy compression of one model's learned statistics into another's parameter space.
Geometric structural vocabulary takes a fundamentally different approach. Instead of learning representations statistically, it constructs them mathematically. Every token in the vocabulary corresponds to a deterministic geometric structure — a specific configuration of vertices, edges, faces, and their relational properties within higher-dimensional space. These structures don't approximate relationships. They are the relationships.
The critical constraint, discovered empirically through extensive training experiments on the geometric vocabulary dataset, is that these structures must be navigated, not optimized. Direct gradient descent on pentachora collapses them to zero. But when geometric crystals serve as frozen anchors — starting points from which a model makes minor trajectory shifts — they retain full cohesion and the training path remains backtrackable. The geometry survives because the model learns where to go within it, not how to reshape it.
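The anchored-navigation principle above can be sketched in a few lines. This is a minimal illustration, assuming numpy; the names `navigate` and `max_shift` are mine, not the project's API, and the bound is a stand-in for whatever regularization the real training uses.

```python
import numpy as np

def navigate(anchor, offset, max_shift=0.1):
    """Move from a frozen anchor by a bounded trajectory shift.

    anchor: (5, d) frozen pentachoron vertices; never receives gradients.
    offset: (5, d) the only trainable quantity, clipped so the crystal
    keeps cohesion and the path back to the anchor stays recoverable.
    """
    shift = np.clip(offset, -max_shift, max_shift)
    return anchor + shift

rng = np.random.default_rng(0)
anchor = rng.normal(size=(5, 16))   # frozen crystal from the vocabulary
offset = rng.normal(size=(5, 16))   # stands in for a learned parameter
moved = navigate(anchor, offset)
```

Because the anchor itself is never updated, subtracting it from the moved crystal recovers the exact trajectory taken — the "backtrackable" property described above.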
This means the same geometric token carries the same structural meaning regardless of which model consumes it. A "boundary token with high curvature and 3 active axes" describes the same geometric reality whether it's conditioning a diffusion model, biasing a vision transformer, or anchoring a language model's attention. The vocabulary is universal not because it was trained to be, but because mathematics is.
What We've Built So Far
David — a multi-scale crystal classifier that uses pentachora (4-simplexes) as class prototypes across multiple embedding scales (64 through 1024 dimensions). Each scale captures a different level of semantic granularity. Deep efficiency gating with cross-attention routes between scales per-sample, and Rose loss provides role-weighted geometric regularization. David achieved 74.87% on CIFAR-100 with 393K trainable parameters — extracting more from frozen CLIP embeddings than linear probes by navigating geometric structure that flat classifiers miss. The multi-scale architecture has been validated extensively across multiple feature extractors and datasets.
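The role-weighted similarity David uses can be sketched as a weighted cosine score against a pentachoron prototype. A hypothetical weighting for illustration only, assuming numpy; this is not David's exact Rose-loss formulation.

```python
import numpy as np

def crystal_similarity(feature, crystal, role_weights):
    """Role-weighted cosine similarity between a feature and a pentachoron.

    feature: (d,) embedding; crystal: (5, d) class-prototype vertices;
    role_weights: (5,) per-role importance (anchor, need, relation,
    purpose, observer). The weighting scheme here is illustrative.
    """
    f = feature / (np.linalg.norm(feature) + 1e-8)
    v = crystal / (np.linalg.norm(crystal, axis=1, keepdims=True) + 1e-8)
    sims = v @ f                    # cosine similarity per vertex
    return float(role_weights @ sims / np.sum(role_weights))
```

Classification then reduces to taking the argmax of this score over all class crystals at each embedding scale, with the gating network deciding which scale's verdict to trust per sample.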
Geometric Basin Classifier — a classification system that replaces cross-entropy entirely with geometric formula basin checks: triadic compatibility, self-similarity, Cantor coherence, and hierarchical structure verification. The compatibility scores are combined via geometric mean — all checks must pass, like a logical AND in compatibility space. Proved that geometric cognition is viable: alpha converged to the triadic equilibrium (~0.5) under geometric losses, the same value that cross-entropy actively destroyed when it controlled the gradient.
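The geometric-mean combination described above behaves like a soft logical AND: one failed check drags the whole score toward zero, so no single strong signal can compensate. A minimal sketch, assuming numpy; the individual compatibility formulas are abstracted into plain scores here.

```python
import numpy as np

def basin_score(checks, eps=1e-8):
    """Combine compatibility checks via geometric mean (soft AND).

    checks: iterable of scores in (0, 1], one per basin check
    (e.g. triadic compatibility, self-similarity, Cantor coherence,
    hierarchical structure). Any near-zero check vetoes the basin.
    """
    c = np.clip(np.asarray(checks, dtype=float), eps, 1.0)
    return float(np.exp(np.mean(np.log(c))))

basin_score([0.9, 0.8, 0.85, 0.9])  # high: every check passes
basin_score([0.9, 0.8, 0.01, 0.9])  # low: one failed check vetoes
```

An arithmetic mean of the same scores would stay above 0.6 in the second case; the geometric mean collapses it, which is exactly the all-checks-must-pass semantics.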
Devil's Staircase Positional Encoding — fractal position encoding using Cantor's 1883 function with a learnable alpha parameter. The staircase generates hierarchical geometric structure across levels, with alpha controlling the middle-interval weighting. Under geometric losses, alpha consistently converges to ~0.5 (triadic equilibrium). This is now understood as the natural resting state of the fractal hierarchy — the point where geometric formulas are maximally satisfied.
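The staircase map itself can be sketched by walking the ternary expansion of a position. This is a simplified, self-contained version: the published encoding derives hierarchical features across levels, and the alpha-weighted recursion below is my illustration of the middle-interval weighting, with alpha = 0.5 recovering Cantor's classic function.

```python
def staircase(x, alpha=0.5, levels=24):
    """Alpha-weighted Devil's Staircase on [0, 1].

    Walks the ternary expansion of x: landing in the middle third takes
    alpha of the remaining mass and plateaus; the lower third recurses
    with mass alpha, the upper third with mass 1 - alpha. alpha = 0.5
    is the symmetric (triadic-equilibrium) case.
    """
    value, scale = 0.0, 1.0
    for _ in range(levels):
        x *= 3.0
        digit = int(x)
        x -= digit
        if digit == 1:            # middle interval: plateau here
            value += scale * alpha
            break
        if digit >= 2:            # upper third
            value += scale * alpha
            scale *= 1.0 - alpha
        else:                     # lower third
            scale *= alpha
    return value
```

At alpha = 0.5 the plateaus split the remaining mass evenly at every level, which is one way to see why that value is the symmetric resting point of the hierarchy.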
K-Simplex Constructions — deterministic geometric linear layers where tokens are represented as k-dimensional simplexes with Cayley-Menger distance validation. The LLM prototype on Shakespeare maintained 100% geometric validity throughout training while learning language structure, with deformation scales of 0.15–0.35 identified as the sweet spot for differentiation between k-levels.
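Cayley-Menger validation means checking that a candidate simplex has genuine k-dimensional volume: the determinant of the bordered squared-distance matrix yields the volume, and a (near-)zero result flags a degenerate, invalid simplex. A standard-formula sketch, assuming numpy; how the prototype thresholds validity is not shown here.

```python
import math
import numpy as np

def cayley_menger_volume(points):
    """k-simplex volume from squared pairwise distances (Cayley-Menger).

    points: (k+1, d) vertex array. Returns 0.0 for degenerate simplexes,
    which is the failure case geometric validity checks guard against.
    """
    p = np.asarray(points, dtype=float)
    n = len(p)                                        # k + 1 vertices
    d2 = np.sum((p[:, None, :] - p[None, :, :]) ** 2, axis=-1)
    cm = np.ones((n + 1, n + 1))                      # bordered matrix
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    k = n - 1
    coeff = (-1) ** (k + 1) / (2 ** k * math.factorial(k) ** 2)
    vol_sq = coeff * np.linalg.det(cm)
    return math.sqrt(max(vol_sq, 0.0))
```

"100% geometric validity" in the Shakespeare run then amounts to every token simplex keeping strictly positive volume throughout training.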
Geometric Vocabulary Dataset (AbstractPhil/geometric-vocab) — the published foundation. SHA-256 deterministic hashing generates unique, reproducible pentachora for every token: 140K Unicode characters and ~210K WordNet English entries, each stored as a 5-vertex simplex across 38 dimensional splits from 16d to 4096d. Two encoding modes: WordNet provides direct one-word-one-crystal lookup for known vocabulary, while Unicode composition decomposes unknown tokens into character-level pentachora and averages them — the system never fails on unseen input. Each crystal's five vertices carry semantic roles (anchor, need, relation, purpose, observer), with the anchor frozen during training and the remaining four learnable.

This dataset was the testbed for the most important finding in the entire research program: pentachora collapse to zero under weighted decay when trained directly, but retain full cohesion and remain backtrackable when used as starting points with only minor trajectory shifts toward a goal. That single observation — published as a research update on the dataset page in September 2025 — became the design principle for everything that followed. Direct training destroys geometric structure. Anchored navigation preserves it.

Early experiments using the 32d vocabulary split as frozen class anchors on CIFAR-100 reached 15.85% test accuracy with only 72K trainable parameters (282KB model), proving the crystals carried genuine structural information even at tiny scale. The vocabulary also enabled expert crystal governance — mixture-of-experts style systems where primary crystals guide constellations of secondary crystals through softmin routing on L1 distances, with promotion and demotion based on geometric stability.

The dataset's known limitation — that individual crystals aren't specific enough to represent the full complexity of language — drove the development of n-gram variant tokenization and ultimately the scaled CHUNK architecture described below.
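The reproducibility property of hash-derived crystals can be sketched as follows. This is my illustration of the idea, not the dataset's actual construction: the vertex-seeding scheme, role strings, and distributions below are assumptions, but the guarantee it demonstrates (same token in, byte-identical pentachoron out, anywhere, with no training) is the one the dataset relies on.

```python
import hashlib
import numpy as np

def token_pentachoron(token, dim=16):
    """Derive a deterministic (5, dim) pentachoron from a token string.

    One SHA-256 digest per (token, role) pair seeds a PRNG, so the same
    token always yields the same five vertices. Seeding scheme and
    distribution are illustrative, not the geometric-vocab pipeline.
    """
    roles = ("anchor", "need", "relation", "purpose", "observer")
    vertices = []
    for role in roles:
        digest = hashlib.sha256(f"{token}:{role}".encode()).digest()
        seed = int.from_bytes(digest[:8], "big")
        rng = np.random.default_rng(seed)
        vertices.append(rng.normal(size=dim))
    return np.stack(vertices)   # (5, dim): anchor first, as in the dataset
```

Unknown tokens would then be handled compositionally, e.g. by averaging the character-level pentachora of their constituent Unicode characters.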
PatchMaker — a two-tier gated geometric transformer that carves latent volumes into macro-patches and classifies their geometric properties. It outputs per-patch gate vectors (dimensionality, curvature, boundary classification, surface roles, topology) alongside learned feature embeddings. Trained on 27 geometric primitives derived from text-conditioned latent spaces, frozen PatchMaker features transferred directly to natural image classification — outperforming raw FLUX VAE latents without ever seeing a natural image.
FiLM Gate Conditioning — a mechanism for injecting geometric structural information into arbitrary transformer architectures. Gates modulate processing through learned scale and shift operations rather than concatenation. Zero-initialized, so training begins with a vanilla architecture and geometric influence emerges gradually. Verified to improve generalization over ungated baselines on CIFAR-100.
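The zero-initialization property is worth making concrete: with both projections starting at zero, the gate computes an exact identity at step 0, so gradients initially flow as if the gate were absent. A minimal numpy sketch of the scale-and-shift mechanics; the class name and shapes are illustrative, and a real implementation would hold these as trainable parameters in a deep learning framework.

```python
import numpy as np

class FiLMGate:
    """Feature-wise linear modulation driven by geometric gate vectors.

    Zero-initialized projections mean scale == 0 and shift == 0 at
    step 0, so the gated model starts as the vanilla architecture and
    geometric influence emerges only as the projections are trained.
    """
    def __init__(self, gate_dim, feat_dim):
        self.W_scale = np.zeros((gate_dim, feat_dim))   # trainable in practice
        self.W_shift = np.zeros((gate_dim, feat_dim))

    def __call__(self, features, gates):
        scale = gates @ self.W_scale
        shift = gates @ self.W_shift
        return features * (1.0 + scale) + shift
```

Modulation rather than concatenation also keeps the host architecture's tensor shapes untouched, which is what lets the same gates plug into arbitrary transformers.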
What We're Building Next
The experiments so far operated across a wide range of architectures and tasks — classification, language modeling, diffusion conditioning, cross-model transfer — and the published geometric vocabulary provided the deterministic foundation for all of them. But the vocabulary has known limits: individual crystals across 38 dimensional splits aren't specific enough to house the full representational complexity of language, and the constellation governance systems that worked at small scale (softmin on L1 distances across expert clusters) didn't scale cleanly. The architecture worked. The transfer worked. The mathematics held across every domain tested. The vocabulary itself needs to grow.
Now we're scaling the vocabulary and the structural resolution to match the complexity of real model internals.
The CHUNK
The fundamental unit of the scaled system is a CHUNK: a geometric attention volume with dimensions 512 × 256 × 64. This maps directly to attention mechanics — 512 heads, 256 tokens per head, 64 dimensions per head — following the tested principle of 1 head per 64 dimensions that has proven reliable across transformer architectures.
A single CHUNK supports approximately 8.39 million (512 × 256 × 64 = 8,388,608) dense, 256-dimensional, sequentially aligned tokens as geometric associative patch conditioning. This isn't a context window in the language model sense — it's a structured geometric mesh that represents the full relational state of a model's internal potential.
The CHUNK is the first fully compliant geometric structure in the scaled system. It is designed to:
Scale bidirectionally. The pretrained CHUNK includes a needs-based loader that activates only the geometric resolution required for a given task. Simple tasks use a fraction of the structure. Complex relational reasoning across model boundaries uses more. Hardware consumption scales with actual need, not maximum capacity.
Anchor differential learning. The geometric tokens within a CHUNK describe structural relationships between model responses, not the responses themselves. When two different models process the same input, their outputs can be geometrically aligned through the shared vocabulary — producing a relational map of where the models agree, disagree, and why, encoded as reusable geometric structure rather than opaque weight differences.
Enable conjunctive multi-model reasoning. Because geometric tokens are mathematically universal, a CHUNK can simultaneously represent the structural states of multiple models. Attention patterns, embedding geometries, and decision boundaries from entirely different architectures become comparable — not through statistical alignment, but through shared deterministic description.
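The needs-based loading behavior described above has a simple mechanical core, sketched here with numpy. The function name and the head/token granularity are illustrative assumptions; the point is that slicing a CHUNK yields a view, not a copy, so memory cost tracks what a task actually activates.

```python
import numpy as np

CHUNK_SHAPE = (512, 256, 64)    # heads, tokens per head, dims per head

def chunk_view(chunk, heads_needed, tokens_needed):
    """Materialize only the sub-volume a task requires.

    Numpy basic slicing returns a view into the parent array, so a
    simple task touches a fraction of the structure while the full
    CHUNK stays addressable for tasks that need deeper resolution.
    """
    h = min(heads_needed, CHUNK_SHAPE[0])
    t = min(tokens_needed, CHUNK_SHAPE[1])
    return chunk[:h, :t, :]
```

A pretrained loader would additionally decide `heads_needed` and `tokens_needed` from the task itself; that policy, not the slicing, is where the real design work sits.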
The SECTOR
A SECTOR aggregates CHUNKs into a geometric space roughly equivalent to 2 trillion parameters. For context: this approaches the structural complexity of the largest current language models, but encoded as deterministic geometric relationships rather than learned weights.
A SECTOR will almost never be instantiated in full. The needs-based loading system means most tasks will activate a small fraction of the available structure. The SECTOR exists as a capacity boundary — the maximum geometric resolution available for tasks that require it.
Beyond SECTORs
512 SECTORs — roughly the equivalent of one quadrillion parameters — constitute a single level of the full geometric hierarchy. This is not the final global state. The complete structural representation of all utilizable geometric relationships extends substantially beyond what current hardware can support.
The deterministic mathematics defines the structure, but computing the final state at that scale is a training problem. Correct geometric alignment across that volume of data requires moving, validating, and relationally anchoring an enormous amount of structural information. The math tells you what the relationships should be. Training ensures they are — that every geometric token is correctly aligned with every other across the full hierarchy. At scales beyond current hardware, this is a real and unsolved engineering challenge.
What current hardware does support is working precisely within CHUNKs and across SECTORs. The needs-based loading system means useful geometric conditioning is available now, at the resolutions current compute permits. The framework is designed so that as hardware scales, deeper hierarchical layers can be trained and integrated without restructuring the levels that already exist.
Why This Matters
The practical implications are direct:
Dramatically less training. Geometric structural vocabulary provides dense relational priors derived from deterministic mathematics. Models that consume it start with structured geometric anchors rather than random initialization. Early experiments show competitive performance with considerably fewer epochs.
Cross-model transfer without retraining. When Model A and Model B share a geometric vocabulary, their internal states become directly comparable. Insights learned by one model — which attention patterns work, which embedding regions are productive, which structural relationships matter — transfer through the shared geometric description, not through weight copying or distillation.
Hardware-efficient inference. The geometric structures are designed for efficient projection. CPU-viable consumption of pretrained geometric state is an explicit design goal, not an afterthought. The needs-based loader ensures that inference cost scales with task complexity, not vocabulary size.
Structurally-aware decision making. Current models produce outputs based on weighted numerical combinations. Geometric conditioning produces outputs based on structural relationships — the topology, curvature, and boundary properties of the decision space itself. This enables reasoning about why a particular response is appropriate, not just what response has the highest probability.
The Roadmap
The immediate work is building the first production CHUNK — the 512 × 256 × 64 geometric attention volume with full pretrained vocabulary, needs-based loading, and verified transfer to at least two distinct model architectures. David's multi-scale crystal approach and the Geometric Basin's compatibility checks provide the proven mechanisms. The CHUNK provides the scale.
From there, SECTOR assembly and the cross-model differential experiments — aligning geometric descriptions of different architectures' internal states to enable direct structural transfer. The mathematics says this should work. A year of experiments across vision, language, and diffusion says the mathematics is right. Now we find out what happens at scale.
Building starts now.
The geometric vocabulary system, PatchMaker, and associated tools are developed under AbstractPhil on Hugging Face. The foundational geometric-vocab dataset — containing deterministic pentachora for Unicode and WordNet across 38 dimensional splits — is publicly available. Experimental results, model weights, and datasets from the development process are published as they're produced.