funding-parsing-token-probe-Llama-3.2-1B-lora

Token-level funding-statement classifier built on top of meta-llama/Llama-3.2-1B. Given an article, the model emits a per-token logit scoring whether the token lies inside a funding-acknowledgment span; apply a sigmoid to turn the logit into a probability.

Repository contents

adapter_config.json         # PEFT LoRA config (base: Llama-3.2-1B)
adapter_model.safetensors   # LoRA weights (q_proj, v_proj)
classifier.pt               # Conv1d head on top of last-4-layer hidden states
README.md

Architecture

Llama-3.2-1B (frozen base)
    + LoRA(r=32, α=32, target=[q_proj, v_proj], dropout=0.05)
    └── concat(hidden_states[-4:], dim=-1)  →  shape (B, T, 8192)
Conv1d head (classifier.pt, flat nn.Sequential keys 0/2/4):
    Conv1d(8192,  512, k=5, padding=2) + GELU
    Conv1d( 512,  128, k=3, padding=1) + GELU
    Conv1d( 128,    1, k=1)
  → per-token logits (B, T)

The head has ~21M parameters (the first Conv1d alone holds 8192 × 512 × 5 ≈ 21.0M weights), substantially more than the LoRA adapter. This is deliberate: the probe learns rich span-boundary features from the concatenated last-4-layer hidden states.
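As a sanity check on the relative sizes, the parameter counts can be derived from the layer shapes listed in the architecture. The sketch below is plain arithmetic, not code from the source repo. The Llama-3.2-1B dimensions (hidden size 2048, 16 layers, v_proj output width 512) are assumptions taken from the base model's public config, and the helper names are hypothetical.

```python
def conv1d_params(c_in, c_out, k):
    """Weights (c_out * c_in * k) plus one bias per output channel."""
    return c_out * c_in * k + c_out

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) per adapted projection."""
    return r * (d_in + d_out)

# Head layers exactly as listed in the architecture block.
head = (conv1d_params(8192, 512, 5)
        + conv1d_params(512, 128, 3)
        + conv1d_params(128, 1, 1))

# ASSUMPTION: dimensions from the public Llama-3.2-1B config
# (hidden 2048, 16 layers, 8 KV heads x 64 head dim -> v_proj out 512).
HIDDEN, N_LAYERS, KV_DIM, R = 2048, 16, 512, 32
lora = N_LAYERS * (lora_params(HIDDEN, HIDDEN, R)    # q_proj
                   + lora_params(HIDDEN, KV_DIM, R)) # v_proj

print(f"head: {head:,}  lora: {lora:,}  ratio: {head / lora:.1f}x")
# head: 21,168,897  lora: 3,407,872  ratio: 6.2x
```

Under these assumptions the head is roughly six times the size of the adapter, and almost all of it sits in the first convolution.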

Usage

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

BASE = "meta-llama/Llama-3.2-1B"
REPO = "cometadata/funding-parsing-token-probe-Llama-3.2-1B-lora"
N_HIDDEN_LAYERS = 4
THRESHOLD = -4.0  # tuned on validation; raise for more precision, lower for more recall

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
peft_model = PeftModel.from_pretrained(base, REPO)

class FundingProbe(nn.Module):
    def __init__(self, peft_model, n_hidden_layers=N_HIDDEN_LAYERS):
        super().__init__()
        self.model = peft_model
        self.n_hidden_layers = n_hidden_layers
        h = self.model.config.hidden_size
        input_dim = h * n_hidden_layers
        self.classifier = nn.Sequential(
            nn.Conv1d(input_dim, 512, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(512, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.model(
            input_ids=input_ids, attention_mask=attention_mask,
            output_hidden_states=True,
        )
        layers = outputs.hidden_states[-self.n_hidden_layers:]
        hidden = torch.cat(layers, dim=-1).float()
        # mask padding BEFORE conv (otherwise it learns padding patterns)
        hidden = hidden * attention_mask.unsqueeze(-1).float()
        logits = self.classifier(hidden.transpose(1, 2)).squeeze(1)  # (B, T)
        return logits

model = FundingProbe(peft_model)
classifier_path = hf_hub_download(repo_id=REPO, filename="classifier.pt")
model.classifier.load_state_dict(
    torch.load(classifier_path, map_location="cpu", weights_only=True)
)
model.eval()

Predicting on a document

Funding statements typically appear late in academic articles, so when an article exceeds MAX_LENGTH tokens, drop tokens from the front and keep the tail:

MAX_LENGTH = 4096

def predict_spans(article_text, threshold=THRESHOLD):
    enc = tokenizer(
        article_text, return_offsets_mapping=True,
        add_special_tokens=False, truncation=False,
    )
    input_ids = enc["input_ids"]
    offsets = enc["offset_mapping"]
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[-MAX_LENGTH:]
        offsets = offsets[-MAX_LENGTH:]

    ids = torch.tensor([input_ids])
    mask = torch.ones_like(ids)
    with torch.no_grad():
        logits = model(ids, mask)[0].cpu().numpy()
    preds = logits > threshold

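    # Merge runs of consecutive positive tokens into character spans.
    spans, in_span, start_char = [], False, 0
    for i, (p, (s, _)) in enumerate(zip(preds, offsets)):
        if p and not in_span:
            # span opens at the first character of this token
            in_span, start_char = True, s
        elif not p and in_span:
            # span closes at the end of the previous (last positive) token
            spans.append(article_text[start_char:offsets[i-1][1]])
            in_span = False
    if in_span:  # span runs to the end of the window
        spans.append(article_text[start_char:offsets[-1][1]])
    return spans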
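Note that the threshold is compared against raw logits, not probabilities. The following minimal sketch shows the conversion in both directions using the standard sigmoid; the helper names are hypothetical and not part of the released code.

```python
import math

def logit_to_prob(logit):
    """Standard sigmoid: maps a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def prob_to_logit(p):
    """Inverse sigmoid, useful for picking a threshold from a target probability."""
    return math.log(p / (1.0 - p))

print(f"{logit_to_prob(-4.0):.4f}")  # 0.0180
```

A logit threshold of -4.0 therefore corresponds to a probability cutoff of about 1.8%, which is why recall stays so high at that setting.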

Long documents: sliding window

A 4096-token window covers only ~33% of the average article. For full coverage, run a sliding window with 50% overlap and union the predicted spans (the full implementation lives at extraction_probe/predict_sliding.py in the source repo).
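The windowing and union steps can be sketched as follows. This is not the contents of predict_sliding.py; it is an illustrative outline that assumes each window yields (start_char, end_char) pairs rather than strings. All function names and the predict_window callback are hypothetical.

```python
def window_starts(n_tokens, window=4096, overlap=0.5):
    """Start indices of windows covering [0, n_tokens) with the given overlap."""
    if n_tokens <= window:
        return [0]
    stride = max(1, int(window * (1 - overlap)))
    starts = list(range(0, n_tokens - window, stride))
    starts.append(n_tokens - window)  # last window sits flush with the end
    return starts

def merge_char_spans(spans):
    """Union overlapping or touching (start, end) character spans."""
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def sliding_spans(n_tokens, predict_window, window=4096, overlap=0.5):
    """Run predict_window(lo, hi) on each token window and union the results.

    predict_window is a hypothetical callback returning a list of
    (start_char, end_char) pairs for tokens lo..hi of the document.
    """
    spans = []
    for lo in window_starts(n_tokens, window, overlap):
        spans.extend(predict_window(lo, min(lo + window, n_tokens)))
    return merge_char_spans(spans)
```

With 50% overlap every token away from the document edges is seen by two windows, so a span cut at one window boundary is seen whole by the neighbouring window, and the merge step collapses the duplicate predictions.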

Training recipe

  • Base: meta-llama/Llama-3.2-1B (frozen)
  • LoRA: r=32, α=32, dropout=0.05, target=[q_proj, v_proj]
  • Head: 3-layer Conv1d over concatenated last-4 hidden-layer states (~21M trainable head params)
  • Loss: asymmetric focal loss (ASL; Ridnik et al.) with gamma_neg=4, gamma_pos=0, pos_weight=50, plus a soft-IoU term (weight 1.0). ASL aggressively down-weights easy negatives; pos_weight up-weights positives to offset the ~1% positive rate.
  • Truncation: no truncation at tokenization time; take the last 4096 tokens, so the funding section is preserved (it sits around position 0.78 of the token stream on average).
  • Padding: hidden states are zeroed at padding positions before the Conv1d so the head doesn't learn padding-specific features.
  • Single-GPU training: DataParallel caused subtle train/inference discrepancies on logit magnitudes, so training and inference are both single-GPU.
  • Optimizer: AdamW with differential learning rates: classifier head at LR=1e-3, LoRA at LR=1e-4 (0.1× multiplier), 20-step warmup.
  • Schedule: 5 epochs, batch 8, grad-accum 16 (effective batch 128), bf16 mixed precision.
  • Target mask: for each training row, the gold funding statement is located in the source markdown (exact substring) and converted to a per-token 0/1 span mask. This yields noisy labels when the markdown contains OCR artifacts, but works well in practice.
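The target-mask step in the last bullet can be sketched roughly as below. This is an assumption-laden illustration, not the training code: the function name is hypothetical, and the rule that any character overlap marks a token positive is a guess at how boundary tokens are handled. The offsets argument has the same shape as the tokenizer's offset_mapping; the example uses whitespace tokens as a stand-in.

```python
import re

def span_to_token_mask(text, statement, offsets):
    """Return one 0/1 label per token: 1 where the token overlaps the statement.

    ASSUMPTION: any character overlap counts as positive; the real
    training code may treat boundary tokens differently.
    """
    start = text.find(statement) if statement else -1  # exact substring match
    if start < 0:
        return [0] * len(offsets)  # statement not found: no positive labels
    end = start + len(statement)
    return [1 if s < end and e > start else 0 for s, e in offsets]

# Stand-in for tokenizer offsets: one (start, end) pair per whitespace token.
text = "Intro. Funded by NSF. End."
offsets = [m.span() for m in re.finditer(r"\S+", text)]
mask = span_to_token_mask(text, "Funded by NSF.", offsets)
print(mask)  # [0, 1, 1, 1, 0]
```

When the exact substring is missing, for example because of OCR artifacts, this sketch returns an all-zero mask, which is one way the noisy labels mentioned above can arise.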

Test-set results (held-out data/test.jsonl, threshold sweep)

threshold  precision  recall  F1
−6.0       0.660      1.000   0.795
−5.0       0.661      0.998   0.795
−4.0       0.666      0.990   0.796
−3.0       0.666      0.964   0.788
−2.0       0.667      0.905   0.768
−1.0       0.671      0.838   0.745
Best F1 = 0.796 at threshold = −4.0 (recall 0.99, precision 0.67), evaluated on the full 3084 test spans from 2034 positive articles.
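The F1 column is the harmonic mean of the precision and recall columns. A short check using only the numbers reported in the sweep (the variable names are illustrative):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per threshold, copied from the sweep.
sweep = {
    -6.0: (0.660, 1.000), -5.0: (0.661, 0.998), -4.0: (0.666, 0.990),
    -3.0: (0.666, 0.964), -2.0: (0.667, 0.905), -1.0: (0.671, 0.838),
}
best = max(sweep, key=lambda t: f1(*sweep[t]))
print(best, round(f1(*sweep[best]), 3))  # -4.0 0.796
```

Precision barely moves across the sweep while recall drops steadily, so the F1 differences between −6.0 and −4.0 are within rounding of each other.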

Citation / provenance

Training code: see the extraction_probe/ directory of the funding-statement-identification repo.

Trained by the comet-data / funding extraction effort, 2026-04.
