First off, a progress report

As disappointing as this is, I could not fully converge the geolip-svd-transformer yet.

I deeply apologize for not being able to handle this task yet, and I will be doing my very best to implement the structure as a broadly useful scaling methodology, using synthetic pretrained information as guideposts.

I have NOT given up this structure. I am expanding the entire differentiation underlying the system.

I have begun a heavy series of sweeps testing large numbers of synthetic shapes, structural variances, coloration differences, and structural variants, in a series of intended pretrain convergences that will feed into the synthetic pixel solver structure.

These weight sets will begin in notebook form and evolve into structural SVD weight infusions, intentionally amplifying learning speed to introduce large numbers of potential autosolving encoder structures targeting very, very small sizes.

INTENTIONALLY small. These are going to be imperfect, but there will be MANY OPTIONS.

The "auto" spectrum will have a series of prefabricated "init" spectra, intentionally meant to allow skipping large amounts of early pretraining using organized, spectrally attuned SVD attenuation mechanisms.

There will be multiple capable patchworks, multiple capable potentials, and multiple capable substructure options, each with their own benefits, negatives, and convergence speeds.

The current goal is to use synthetic shapes to expand the structural invariance of systems like this, and to introduce prefabricated utility-driven patchworks using SVD as a catalyst.

The ultimate goal is to allow the system to learn independently and usefully as a rapid-learning alternative to standard encoding techniques, since the SVD provides enough information. That information is just highly difficult to extract and use without a specific architectural behavior involved.

I'm working out every element.

This does not negatively impact distillation research

This is an independent structure from the distillation systems I've produced and will be specifically marked as such.

This does not negatively impact the distillations that have already been constructed; it will only POTENTIALLY snap the lids off those systems once the system is updated and prepared to the correct states.

I CANNOT guarantee success. This has been a very tough journey, though I have found some guideposts to assist.

The guideposts are NOT ENOUGH to create the whole thing yet.

geolip-svd-transformer API


from geolip_core import svd_transformer

# S = S^N squarewise calculations of differentiation for utilization; the default is 16, which is S^15.
#     Future versions will support rectangular shapes once experimentation permits.
# V = the calculated embedding differentiation utilization size of the SVD, used for embedding and memory recall.
# D = the decomposition size of the SVD, i.e. the size of the square matrix to decompose.
#     This is the most computationally expensive part and should be kept as small as possible for large batch
#     sizes, but it is also the most important for performance, so it should not be too small. The sweet spot
#     is usually around 16 or 32, though it can be smaller or larger depending on the use case and hardware.
#     Smaller D is specifically good for the Triton kernels.
# Currently the custom geolip Triton kernels support 2x2 and 3x3; solvers in the works will add 2x3, 3x2, and
#     up to 6x6 for fp64, which would allow more flexibility in the decomposition size while still maintaining
#     performance.
# D still carries a large number of open problems, and the focus of the Triton kernels is to allow larger D
#     with good performance on many kinds of hardware. The sweet spot for most use cases remains around 16 or
#     32 for pure encoding and decoding accuracy, and the default is set to 16 for this reason.

former = svd_transformer(
  x,                      # primary learning tensor used for valid shape. Will be used for SVD.
  y=None,                 # tensor or None; tensor masks, specifically meant to handle QKV or SUVt/SUV tokenizations.
  z=None,                 # tensor or None; [patchworks, embeddings, encodings, encapsulants, structural invariants, etc]
                          # There is a series of points that can be hooked or overridden for experimentation, they will all be fully transparent.
                          # Z is used for experimentation tooling, the primary systems will not touch this by default.
  svd=None,               # the SVD sizes.
                          # None = use x.shape with the batch dimension omitted.
                          # [S, V, D]
                          # (S, V, D)
                          # tensor.shape
                          # Hard crashes if D is too large (> 128) or the shape looks incorrect, but it is up
                          # to the user to ensure the SVD size is correct and that D is not too large for
                          # their hardware and batch size.
  bypass_crash=True,      # If True, bypasses the crash that occurs when D is too large for the hardware and
                          # batch size; it will still warn that this is likely to crash and that D or the
                          # batch size should be reduced.
  heads=64,               # heads per SVD relation
  hidden_size=4,          # Internal MLP hidden size for encoder projection
  depth=4,                # Internal passes through the cell to process the SVD
  encode="mlp",           # Encodes with MLP as the primary utility.
                          # "transformer" = standard transformer encoder layer over the input features; captures
                          #     complex relationships but is more computationally expensive than the other options.
                          # "conv" = convolutional layer over the input features; effective for local patterns.
                          # "mlp" = Multi-Layer Perceptron, the standard neural network encoding.
                          # "film" = Feature-wise Linear Modulation; a learned affine transformation of the input
                          #     features, often more parameter efficient.
                          # "ffn" = Feed-Forward Network; a simple two-layer network with a non-linearity, as used
                          #     in transformer blocks.
                          # "rotary" = rotary positional embeddings applied to the input features; useful for
                          #     positional information in sequences.
                          # "lstm" = Long Short-Term Memory layer; captures sequential dependencies.
                          # "gru" = Gated Recurrent Unit layer; similar to LSTM with fewer parameters.
  attention_layers=2,     # Internal attention layers and purposes of them.
  activation="gelu",      # The activation formula for non-geometric portions.
  geo_activation="star",  # ReLU squared is all positive, so it is used by default; there are others:
                          # "relu" = standard ReLU, max(0, x); simple and effective for many tasks.
                          # "star" = ReLU squared, max(0, x)^2; a smoother gradient that works well for
                          #     geometric tokens.
                          # "gelu" = Gaussian Error Linear Unit, x * Phi(x) with Phi the Gaussian CDF
                          #     (commonly approximated as x * sigmoid(1.702 * x)).
                          # "silu" = Sigmoid Linear Unit, x * sigmoid(x); a smooth, unbounded-above activation.
                          # "tanh" = hyperbolic tangent, (exp(x) - exp(-x)) / (exp(x) + exp(-x)); allows both
                          #     positive and negative values.
                          # "sigmoid" = 1 / (1 + exp(-x)); squashes values into (0, 1).
                          # "leaky_relu" = Leaky ReLU, max(0.01 * x, x); keeps a small gradient for negative
                          #     inputs, which helps prevent dead neurons.
                          # "swilu" = Sigmoid Weighted Linear Unit, x * sigmoid(x); effectively the same
                          #     formulation as silu.
  token_out="all",        # the format of token expected out
                          # "all" or None will return all tokens, which applies transformer logic automatically.
                          # "QKV" standard attention token, applies transformer logic internally and can accept rotary behavior
                          # "SUVt" or "SUV" geometric tokens returned only, QKV transformation learning not applied.
  target="SVD",           # "SVD" targets all 3, good for complex tasks.
                          # "VD" targets only VD for attention
                          # "SV" targets only SV for attention, still hardware limited by D as that's an upstream task.
                          # "S" or "V" targets only one of these, would not recommend but it has uses.
  svd_solver="auto",      # svd solver type,
                          # "auto" backend default, picks best from the combination using the benchmarks
                          # "torch" use torch default for svd
                          # "triton" = use the Triton SVD with torch eigh; the Triton kernels currently cover
                          #     2x2 and 3x3, with 2x3, 3x2, and up to 6x6 (fp64) in the works, targeting
                          #     D=size constraints.
  eigh_solver="auto",     # "auto" backend default, picks best from the combination using the benchmarks
                          # "torch" use torch default specifically
                          # "fl" use gram with a different formula, increased accuracy with speed cost, compiles up to D=12

)
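
For orientation, here is a minimal usage sketch built only from the signature documented above. The input shape, the SVD sizes, and the handling of the result are assumptions for illustration, not a confirmed contract.

import torch
from geolip_core import svd_transformer   # assumes the experimental branch is installed

# Hypothetical input: a small batch of 32x32 RGB images.
x = torch.randn(8, 3, 32, 32)

former = svd_transformer(
  x,
  svd=[16, 64, 16],       # [S, V, D]; D kept small, which suits the Triton kernels
  heads=64,
  encode="mlp",           # the pure-MLP path, no conv or transformer encoder
  token_out="QKV",        # standard attention tokens out
  target="SVD",
  svd_solver="auto",
  eigh_solver="auto",
)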

What Works

Huggingface Transformers

If you snap transformers in to process the tokens, it will work. Transformers are a beast with years of accumulated power behind them. Using Huggingface transformers will definitely work as a setting; they just add substantial overhead and eliminate a piece of the experiment.
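
To make the overhead point concrete, here is a hedged sketch of one way to push the resulting tokens through a Hugging Face encoder via inputs_embeds. The token shape and the config values are assumptions, and none of this bridging is part of geolip-core.

import torch
from transformers import BertConfig, BertModel

# Hypothetical: tokens coming out of svd_transformer, shaped (batch, seq_len, token_dim).
tokens = torch.randn(8, 49, 256)

config = BertConfig(hidden_size=256, num_hidden_layers=4, num_attention_heads=8)
encoder = BertModel(config)

# Feeding pre-built embeddings skips the tokenizer entirely. This is the "substantial overhead"
# path: a full transformer stack sitting on top of the SVD tokens.
out = encoder(inputs_embeds=tokens).last_hidden_state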

Conv2d, Conv3d

Using CONV will definitely work as a setting. Convergence reaches high accuracy when correctly aligned on CIFAR-100, TinyImageNet, ImageNet128, and multiple other datasets.
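
A sketch of the conv setting on CIFAR-100-style data, assuming the same call signature as above; the data plumbing and batch size are placeholders and the training loop is omitted.

import torch
from torchvision import datasets, transforms
from geolip_core import svd_transformer

ds = datasets.CIFAR100(root="./data", train=True, download=True, transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(ds, batch_size=64, shuffle=True)

x, _ = next(iter(loader))                     # (64, 3, 32, 32)
former = svd_transformer(x, encode="conv")    # the conv encoder path described above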

Kymatio Scattering2D

This requires some conv, but not much, and it produces powerhouse behavior stronger than Conv alone when adjudicating large amounts of SVD information with the attention alignment spectrum.
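
A hedged sketch of the scattering front-end, assuming Kymatio's torch Scattering2D on CIFAR-sized inputs. How the scattering coefficients are actually routed into the SVD path is not specified here, so the reshaping and the final call are purely illustrative.

import torch
from kymatio.torch import Scattering2D
from geolip_core import svd_transformer

x = torch.randn(8, 3, 32, 32)

# Second-order scattering transform; output is (batch, channels, K, H / 2**J, W / 2**J).
scattering = Scattering2D(J=2, shape=(32, 32))
sx = scattering(x)

# Fold the scattering channels together and hand them to the SVD transformer ("requires some
# conv but not much") -- illustrative only.
sx = sx.reshape(sx.shape[0], -1, sx.shape[-2], sx.shape[-1])
former = svd_transformer(sx, encode="conv")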

What Needs To Work

Using MLP should reach fair accuracy without using CONV or TRANSFORMERS.

I have seen around 60% on CIFAR-100 with no traditional encoders, but the system was leaning on the M_path as a crutch to fill the gaps after enough epochs of the SVD path. This structure is under the microscope now.

Instability allows SGD optimization to heavily benefit some image tasks while it fails completely on text tasks.

Out Projection SUVt tokens are iffy

The out projection is an MLP multiscale projection that took a while to set up, and it produces approximate transformer QKV with useful SUVt tokens downstream.
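
To make the shape of the problem concrete, here is a rough sketch of what a multiscale MLP out-projection from SUVt tokens to approximate QKV could look like. Every name and dimension is hypothetical; this is not the implementation in geolip-core.

import torch
from torch import nn

class MultiscaleOutProjection(nn.Module):
    """Hypothetical sketch: project SUVt tokens to approximate Q, K, V at two scales."""
    def __init__(self, token_dim: int, model_dim: int):
        super().__init__()
        self.coarse = nn.Sequential(nn.Linear(token_dim, model_dim), nn.GELU())
        self.fine = nn.Sequential(nn.Linear(token_dim, model_dim), nn.GELU())
        self.to_qkv = nn.Linear(2 * model_dim, 3 * model_dim)

    def forward(self, suvt: torch.Tensor):
        # suvt: (batch, seq, token_dim) geometric tokens
        mixed = torch.cat([self.coarse(suvt), self.fine(suvt)], dim=-1)
        q, k, v = self.to_qkv(mixed).chunk(3, dim=-1)
        return q, k, v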

Many activations corrupt geometry

They are in there for experimentation. Feel free to experiment.

Without the expanded Triton core spectrum, larger systems suffer with Triton

Claude Code is having trouble with this one as a full task, so I'll need to build it in pieces. I've had OpenClaw working on it, but the outcome isn't looking good: the 4x4 and 5x4 won't converge, while the 6x6 crashes the system entirely instead of building it.

I'll need to wait for a fix for Claude Code; this is a known issue, apparently.

Magnitudes are WILDLY hard to control

High magnitudes, and the complexity associations tied to them, must be controlled. The vocabulary spectrum of a token is widely diverse, so the noise generated from large and small magnitudes is very difficult to curate for. I'm debugging this as I progress; the dominant source of shape-learning corruption is currently magnitude differentiation.
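
For context on why this is hard, a small illustration of how widely singular-value magnitudes spread even on random blocks, plus one generic control (scaling each spectrum by its leading singular value). This is a standard normalization trick, not necessarily the direction the project is taking.

import torch

# Singular values of random 16x16 blocks: the ratio between the leading and trailing values
# varies widely from block to block.
blocks = torch.randn(8, 16, 16)
s = torch.linalg.svdvals(blocks)     # (8, 16), sorted in descending order
print(s[:, 0] / s[:, -1])            # per-block spread, condition-number-like

# Generic control: scale each block's spectrum by its leading singular value so every token
# sees magnitudes in (0, 1]. Whether this preserves the information the geometric path
# needs is exactly the open question.
s_normalized = s / s[:, :1]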

I have some potentials and this is my current direction.

Additionally

There are multiple torch-access components meant to be used with this structure, so be aware there will be many ways to use this transformer in line with standard torch use. There is no rigid backing structure to it; just install geolip-core and you're set - once I actually get the experimental branch live.

Claude loves to inline an invalid eigh gram SVD instead of actually using the imports, so I need to make sure Claude respects the structure every single time.

Experiments are slow going; I need more hardware.
