funding-parsing-token-probe-Llama-3.2-1B-lora

Token-level funding-statement classifier built on top of meta-llama/Llama-3.2-1B. Given an article, the model emits a per-token logit scoring whether the token lies inside a funding-acknowledgment span; tokens whose logit exceeds a threshold are grouped into spans.

Repository contents

adapter_config.json         # PEFT LoRA config (base: Llama-3.2-1B)
adapter_model.safetensors   # LoRA weights (q_proj, v_proj)
classifier.pt               # Conv1d head on top of last-4-layer hidden states
README.md

Architecture

Llama-3.2-1B (frozen base)
    + LoRA(r=32, α=32, target=[q_proj, v_proj], dropout=0.05)
    └── concat(hidden_states[-4:], dim=-1)  →  shape (B, T, 8192)
Conv1d head (classifier.pt, flat nn.Sequential keys 0/2/4):
    Conv1d(8192,  512, k=5, padding=2) + GELU
    Conv1d( 512,  128, k=3, padding=1) + GELU
    Conv1d( 128,    1, k=1)
  → per-token logits (B, T)

The head has ~21M params (the first Conv1d alone holds 8192 × 512 × 5 ≈ 21M weights), substantially more than the LoRA adapter. This is deliberate: the probe learns rich span-boundary features from the concatenated hidden states of the last four layers.
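The relative sizes can be checked by counting parameters directly from the layer shapes listed above. This is a rough sketch, not taken from the training code: the helper names are hypothetical, and the adapter count assumes the published Llama-3.2-1B dimensions (hidden size 2048, 16 layers, v_proj output width 512) with LoRA applied to q_proj and v_proj in every layer.

```python
def conv1d_params(c_in, c_out, kernel, bias=True):
    """Weights plus optional bias of a torch-style Conv1d layer."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) to one linear layer."""
    return r * (d_in + d_out)


# Assumed Llama-3.2-1B dimensions (not stated in this card except 4 * 2048 = 8192).
HIDDEN, N_LAYERS, KV_DIM, RANK = 2048, 16, 512, 32

head_params = (
    conv1d_params(4 * HIDDEN, 512, 5)
    + conv1d_params(512, 128, 3)
    + conv1d_params(128, 1, 1)
)
adapter_params = N_LAYERS * (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
)

print(f"{head_params:,}")     # 21,168,897
print(f"{adapter_params:,}")  # 3,407,872
```

Under these assumptions the head is roughly six times the size of the adapter, almost all of it in the first convolution.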

Usage

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

BASE = "meta-llama/Llama-3.2-1B"
REPO = "cometadata/funding-parsing-token-probe-Llama-3.2-1B-lora"
N_HIDDEN_LAYERS = 4
THRESHOLD = -4.0  # tuned on validation; raise for more precision, lower for more recall

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
peft_model = PeftModel.from_pretrained(base, REPO)

class FundingProbe(nn.Module):
    def __init__(self, peft_model, n_hidden_layers=N_HIDDEN_LAYERS):
        super().__init__()
        self.model = peft_model
        self.n_hidden_layers = n_hidden_layers
        h = self.model.config.hidden_size
        input_dim = h * n_hidden_layers
        self.classifier = nn.Sequential(
            nn.Conv1d(input_dim, 512, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(512, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.model(
            input_ids=input_ids, attention_mask=attention_mask,
            output_hidden_states=True,
        )
        layers = outputs.hidden_states[-self.n_hidden_layers:]
        hidden = torch.cat(layers, dim=-1).float()
        # mask padding BEFORE conv (otherwise it learns padding patterns)
        hidden = hidden * attention_mask.unsqueeze(-1).float()
        logits = self.classifier(hidden.transpose(1, 2)).squeeze(1)  # (B, T)
        return logits

model = FundingProbe(peft_model)
classifier_path = hf_hub_download(repo_id=REPO, filename="classifier.pt")
model.classifier.load_state_dict(
    torch.load(classifier_path, map_location="cpu", weights_only=True)
)
model.eval()

Predicting on a document

Funding statements typically appear late in academic articles, so truncate from the front rather than the back:

MAX_LENGTH = 4096

def predict_spans(article_text, threshold=THRESHOLD):
    enc = tokenizer(
        article_text, return_offsets_mapping=True,
        add_special_tokens=False, truncation=False,
    )
    input_ids = enc["input_ids"]
    offsets = enc["offset_mapping"]
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[-MAX_LENGTH:]
        offsets = offsets[-MAX_LENGTH:]

    ids = torch.tensor([input_ids])
    mask = torch.ones_like(ids)
    with torch.no_grad():
        logits = model(ids, mask)[0].cpu().numpy()
    preds = logits > threshold

    spans, in_span, start_char = [], False, 0
    for i, (p, (s, e)) in enumerate(zip(preds, offsets)):
        if p and not in_span:
            in_span, start_char = True, s
        elif not p and in_span:
            spans.append(article_text[start_char:offsets[i-1][1]])
            in_span = False
    if in_span:
        spans.append(article_text[start_char:offsets[-1][1]])
    return spans

Long documents — sliding window

A 4096-token window covers only ~33% of the average article. For full coverage, run a sliding window with 50% overlap and union the predicted spans (the full implementation lives at extraction_probe/predict_sliding.py in the source repo).
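The windowing and union steps can be sketched as follows. This is not the repo's predict_sliding.py, only a minimal illustration of the idea: `window_ranges` and `merge_char_spans` are hypothetical names, and the sketch assumes each window's predictions are kept as `(start_char, end_char)` pairs (a small change to `predict_spans`, which returns strings) so that spans from overlapping windows can be merged.

```python
def window_ranges(n_tokens, window=4096, overlap=0.5):
    """Return (start, end) token index pairs that together cover [0, n_tokens)."""
    if n_tokens <= window:
        return [(0, n_tokens)]
    stride = max(1, int(window * (1 - overlap)))
    ranges = []
    start = 0
    while start + window < n_tokens:
        ranges.append((start, start + window))
        start += stride
    # Anchor the last window at the end so the document tail is fully covered.
    ranges.append((n_tokens - window, n_tokens))
    return ranges


def merge_char_spans(spans):
    """Union overlapping or touching (start_char, end_char) intervals."""
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


print(window_ranges(10000))
# [(0, 4096), (2048, 6144), (4096, 8192), (5904, 10000)]
print(merge_char_spans([(10, 20), (15, 30), (40, 50)]))
# [(10, 30), (40, 50)]
```

Each `(start, end)` pair would slice `input_ids` and `offsets` before the forward pass. Because offsets are absolute character positions, spans found in different windows can be merged directly and then cut out of `article_text`.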

Training recipe

Base: meta-llama/Llama-3.2-1B (frozen)
LoRA: r=32, α=32, dropout=0.05, target=[q_proj, v_proj]
Head: 3-layer Conv1d over concatenated last-4 hidden-layer states (~21M trainable head params)
Loss: asymmetric focal loss (ASL; Ridnik et al.) with gamma_neg=4, gamma_pos=0, pos_weight=50, plus a soft-IoU term (weight 1.0). ASL aggressively down-weights easy negatives; pos_weight balances the ~1% positive rate.
Truncation: no truncation at tokenization time; take the last 4096 tokens, so the funding section is preserved (it sits around position 0.78 of the token stream on average).
Padding: hidden states are zeroed at padding positions before the Conv1d so the head doesn't learn padding-specific features.
Single-GPU training: DataParallel caused subtle train/inference discrepancies in logit magnitudes, so both training and inference run on a single GPU.
Optimizer: AdamW with differential learning rates — classifier head at LR=1e-3, LoRA at LR=1e-4 (0.1× multiplier), 20-step warmup.
Schedule: 5 epochs, batch 8, grad-accum 16, bf16 mixed precision.
Target mask: for each training row, the gold funding statement is located in the source markdown (exact substring) and converted to a per-token 0/1 span mask. This yields noisy labels when the markdown contains OCR artifacts, but works well in practice.
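Two items from the recipe above, the per-token target mask and the loss, can be sketched in plain Python on a toy example. None of this is taken from the training code. `span_to_token_mask` and `asl_plus_soft_iou` are hypothetical names, and whitespace tokens stand in for the tokenizer's `offset_mapping`. The sketch also makes several assumptions: rows with no exact match get an all-zero mask, `pos_weight` multiplies the positive term, the focal term is averaged over tokens, ASL's optional probability margin is omitted, and soft-IoU is taken as 1 − intersection/union over sigmoid probabilities.

```python
import math
import re


def span_to_token_mask(text, statement, offsets):
    """0/1 label per token: 1 if the token overlaps the exact-substring match."""
    start = text.find(statement)
    if start < 0 or not statement:
        return [0] * len(offsets)  # assumption: unmatched rows get no positives
    end = start + len(statement)
    return [1 if (s < end and e > start) else 0 for s, e in offsets]


def asl_plus_soft_iou(logits, labels, gamma_neg=4.0, gamma_pos=0.0,
                      pos_weight=50.0, iou_weight=1.0, eps=1e-8):
    """Asymmetric focal term (mean over tokens) plus a weighted soft-IoU term."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    asl = 0.0
    for p, y in zip(probs, labels):
        if y:
            asl -= pos_weight * (1.0 - p) ** gamma_pos * math.log(p + eps)
        else:
            # p ** gamma_neg shrinks the contribution of easy (low-p) negatives.
            asl -= (p ** gamma_neg) * math.log(1.0 - p + eps)
    asl /= len(labels)
    inter = sum(p * y for p, y in zip(probs, labels))
    union = sum(probs) + sum(labels) - inter
    return asl + iou_weight * (1.0 - inter / (union + eps))


text = "Methods here. Funded by NSF grant 123. The end."
offsets = [m.span() for m in re.finditer(r"\S+", text)]  # toy offset_mapping
labels = span_to_token_mask(text, "Funded by NSF grant 123.", offsets)
print(labels)  # [0, 0, 1, 1, 1, 1, 1, 0, 0]

good = [6.0 if y else -6.0 for y in labels]  # confident and correct
bad = [-z for z in good]                     # confident and wrong
print(asl_plus_soft_iou(good, labels) < asl_plus_soft_iou(bad, labels))  # True
```

With gamma_pos=0 the positive term reduces to weighted log loss, while gamma_neg=4 makes a negative token at probability 0.02 contribute almost nothing, which matches the "down-weights easy negatives" description above.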

Test-set results (held-out `data/test.jsonl`, threshold sweep)

threshold	precision	recall	F1
−6.0	0.660	1.000	0.795
−5.0	0.661	0.998	0.795
−4.0	0.666	0.990	0.796
−3.0	0.666	0.964	0.788
−2.0	0.667	0.905	0.768
−1.0	0.671	0.838	0.745

Best F1 = 0.796 at threshold = −4.0 (recall 0.99, precision 0.67), evaluated on the full 3084 test spans from 2034 positive articles.
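The thresholds in the table are raw logits, not probabilities. Assuming the head's output is a sigmoid logit (the binary loss suggests this, but the card does not state it outright), the swept values can be converted to probabilities as follows; `sigmoid` here is a local helper, not part of the released code.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for t in (-6.0, -5.0, -4.0, -3.0, -2.0, -1.0):
    print(f"logit {t:+.1f} -> probability {sigmoid(t):.4f}")
```

Under that assumption the −4.0 operating point corresponds to a probability of roughly 0.018, so the classifier is tuned to flag tokens at very low confidence and favors recall.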

Citation / provenance

Training code: see the extraction_probe/ directory of the funding-statement-identification repo.

Trained by the comet-data / funding extraction effort, 2026-04.
