funding-parsing-token-probe-Llama-3.2-1B-lora
Token-level funding-statement classifier built on top of
meta-llama/Llama-3.2-1B.
Given an article, the model emits a per-token probability that the token is
inside a funding-acknowledgment span.
Repository contents
adapter_config.json # PEFT LoRA config (base: Llama-3.2-1B)
adapter_model.safetensors # LoRA weights (q_proj, v_proj)
classifier.pt # Conv1d head on top of last-4-layer hidden states
README.md
Architecture
Llama-3.2-1B (frozen base)
  + LoRA(r=32, α=32, target=[q_proj, v_proj], dropout=0.05)
  └── concat(hidden_states[-4:], dim=-1) → shape (B, T, 8192)

Conv1d head (classifier.pt, flat nn.Sequential keys 0/2/4):
  Conv1d(8192, 512, k=5, padding=2) + GELU
  Conv1d( 512, 128, k=3, padding=1) + GELU
  Conv1d( 128,   1, k=1)
  → per-token logits (B, T)
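The head has ~21M parameters (the first convolution alone holds 8192 × 512 × 5 ≈ 21.0M weights), substantially more than the LoRA adapter. This is deliberate: the probe learns rich span-boundary features from the concatenated hidden states of the last four layers of the LoRA-adapted base.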
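The head size can be checked by counting parameters directly from the layer shapes listed above. The sketch below does that in plain Python and, for comparison, estimates the LoRA adapter size. The Llama-3.2-1B dimensions used for the adapter estimate (hidden size 2048, key/value projection width 512, 16 layers) are assumptions not stated in this README, and `conv1d_params` and `lora_params` are hypothetical helper names.

```python
def conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights (c_out * c_in * k) plus one bias per output channel."""
    return c_out * c_in * k + c_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) for each adapted linear layer."""
    return r * (d_in + d_out)


# Head shapes exactly as listed in the architecture section.
head = (
    conv1d_params(8192, 512, 5)
    + conv1d_params(512, 128, 3)
    + conv1d_params(128, 1, 1)
)

# ASSUMED Llama-3.2-1B dimensions (not given in this README).
HIDDEN, KV_DIM, N_LAYERS, RANK = 2048, 512, 16, 32
adapter = N_LAYERS * (
    lora_params(HIDDEN, HIDDEN, RANK)    # q_proj: 2048 -> 2048
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj: 2048 -> 512
)

print(f"head: {head:,} params, adapter: {adapter:,} params")
```

Under these assumptions the head is roughly six times larger than the adapter, and almost all of it sits in the first convolution.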
Usage
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from huggingface_hub import hf_hub_download

BASE = "meta-llama/Llama-3.2-1B"
REPO = "cometadata/funding-parsing-token-probe-Llama-3.2-1B-lora"
N_HIDDEN_LAYERS = 4
THRESHOLD = -4.0  # tuned on validation; raise for more precision, lower for more recall

tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE)
peft_model = PeftModel.from_pretrained(base, REPO)


class FundingProbe(nn.Module):
    def __init__(self, peft_model, n_hidden_layers=N_HIDDEN_LAYERS):
        super().__init__()
        self.model = peft_model
        self.n_hidden_layers = n_hidden_layers
        h = self.model.config.hidden_size
        input_dim = h * n_hidden_layers
        self.classifier = nn.Sequential(
            nn.Conv1d(input_dim, 512, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(512, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.model(
            input_ids=input_ids, attention_mask=attention_mask,
            output_hidden_states=True,
        )
        layers = outputs.hidden_states[-self.n_hidden_layers:]
        hidden = torch.cat(layers, dim=-1).float()
        # mask padding BEFORE conv (otherwise it learns padding patterns)
        hidden = hidden * attention_mask.unsqueeze(-1).float()
        logits = self.classifier(hidden.transpose(1, 2)).squeeze(1)  # (B, T)
        return logits


model = FundingProbe(peft_model)
classifier_path = hf_hub_download(repo_id=REPO, filename="classifier.pt")
model.classifier.load_state_dict(
    torch.load(classifier_path, map_location="cpu", weights_only=True)
)
model.eval()
Predicting on a document
Funding statements typically appear late in academic articles, so when a document exceeds MAX_LENGTH tokens, drop tokens from the front and keep the tail:
MAX_LENGTH = 4096


def predict_spans(article_text, threshold=THRESHOLD):
    enc = tokenizer(
        article_text, return_offsets_mapping=True,
        add_special_tokens=False, truncation=False,
    )
    input_ids = enc["input_ids"]
    offsets = enc["offset_mapping"]
    # keep the LAST MAX_LENGTH tokens; offsets still index into article_text
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[-MAX_LENGTH:]
        offsets = offsets[-MAX_LENGTH:]
    ids = torch.tensor([input_ids])
    mask = torch.ones_like(ids)
    with torch.no_grad():
        logits = model(ids, mask)[0].cpu().numpy()
    preds = logits > threshold
    # group consecutive positive tokens into character spans
    spans, in_span, start_char = [], False, 0
    for i, (p, (s, e)) in enumerate(zip(preds, offsets)):
        if p and not in_span:
            in_span, start_char = True, s
        elif not p and in_span:
            spans.append(article_text[start_char:offsets[i-1][1]])
            in_span = False
    if in_span:
        spans.append(article_text[start_char:offsets[-1][1]])
    return spans
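The threshold above is applied to raw logits rather than to probabilities. Assuming the head's outputs are pre-sigmoid scores (the usual convention for a binary per-token classifier, though it is not stated explicitly here), a logit can be mapped to a probability and back as in this minimal sketch. `logit_to_prob` and `prob_to_logit` are hypothetical helpers, not part of the repository.

```python
import math


def logit_to_prob(logit: float) -> float:
    """Sigmoid, written so that exp() never overflows for large |logit|."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def prob_to_logit(p: float) -> float:
    """Inverse sigmoid; p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


print(f"{logit_to_prob(-4.0):.4f}")
```

Under that assumption, a threshold of -4.0 corresponds to a per-token probability of roughly 0.018, which fits the recall-oriented operating point.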
Long documents: sliding window
A 4096-token window covers only ~33% of the average article. For full
coverage, run a sliding window with 50% overlap and union the predicted
spans (the full implementation lives at extraction_probe/predict_sliding.py
in the source repo).
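The two bookkeeping steps of such a sliding window, choosing window positions and unioning the resulting character spans, can be sketched as follows. This is an illustrative sketch only, not the contents of predict_sliding.py. `window_starts` and `union_spans` are hypothetical names, and each window is assumed to be run through the model separately to produce `(start_char, end_char)` pairs.

```python
from typing import List, Tuple


def window_starts(n_tokens: int, window: int = 4096, overlap: float = 0.5) -> List[int]:
    """Start indices of windows that together cover tokens [0, n_tokens)."""
    if n_tokens <= window:
        return [0]
    stride = max(1, int(window * (1.0 - overlap)))
    starts = list(range(0, n_tokens - window, stride))
    starts.append(n_tokens - window)  # final window is flush with the end
    return starts


def union_spans(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching (start_char, end_char) intervals."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


print(window_starts(10_000))
print(union_spans([(10, 20), (15, 30), (40, 50)]))
```

Making the last window flush with the end of the document means the tail, where funding statements usually sit, is always seen with a full window of left context.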
Training recipe
- Base: meta-llama/Llama-3.2-1B (frozen)
- LoRA: r=32, α=32, dropout=0.05, target=[q_proj, v_proj]
- Head: 3-layer Conv1d over concatenated last-4 hidden-layer states (~21M trainable head params)
- Loss: asymmetric focal loss (ASL; Ridnik et al.) with gamma_neg=4, gamma_pos=0, pos_weight=50, plus a soft-IoU term (weight 1.0). ASL aggressively down-weights easy negatives; pos_weight balances the ~1% positive rate.
- Truncation: no truncation at tokenization time; take the last 4096 tokens so the funding section is preserved (on average it sits at relative position 0.78 in the token stream).
- Padding: hidden states are zeroed at padding positions before the Conv1d so the head doesn't learn padding-specific features.
- Single-GPU training: DataParallel caused subtle train/inference discrepancies in logit magnitudes, so training and inference are both single-GPU.
- Optimizer: AdamW with differential learning rates: classifier head at LR=1e-3, LoRA at LR=1e-4 (0.1× multiplier), 20-step warmup.
- Schedule: 5 epochs, batch 8, grad-accum 16, bf16 mixed precision.
- Target mask: for each training row, the gold funding statement is located in the source markdown (exact substring) and converted to a per-token 0/1 span mask. This yields noisy labels when the markdown contains OCR artifacts, but works well in practice.
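The target-mask step in the recipe above can be sketched as follows, assuming character offsets like those returned by a tokenizer called with `return_offsets_mapping=True`. `span_to_token_mask` is a hypothetical helper; how the actual training code treats partially overlapping tokens, repeated matches, or rows with no exact match is not specified here, so those choices are assumptions.

```python
from typing import List, Sequence, Tuple


def span_to_token_mask(
    text: str, statement: str, offsets: Sequence[Tuple[int, int]]
) -> List[int]:
    """Return 1 for tokens overlapping the first exact match of `statement`."""
    start = text.find(statement) if statement else -1
    if start < 0:
        return [0] * len(offsets)  # ASSUMPTION: no match gives an all-negative row
    end = start + len(statement)
    # A token is positive if its character range [s, e) overlaps [start, end).
    return [1 if s < end and e > start and e > s else 0 for s, e in offsets]


text = "Intro. Funded by NSF. End"
# Hypothetical whitespace-level offsets standing in for real tokenizer output.
offsets = [(0, 6), (7, 13), (14, 16), (17, 21), (22, 25)]
print(span_to_token_mask(text, "Funded by NSF.", offsets))
```

Because matching is by exact substring, a statement that differs from the markdown by a single OCR artifact produces an all-zero mask in this sketch, which is one way the label noise mentioned above can arise.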
Test-set results (held-out data/test.jsonl, threshold sweep)
| threshold | precision | recall | F1 |
|---|---|---|---|
| -6.0 | 0.660 | 1.000 | 0.795 |
| -5.0 | 0.661 | 0.998 | 0.795 |
| -4.0 | 0.666 | 0.990 | 0.796 |
| -3.0 | 0.666 | 0.964 | 0.788 |
| -2.0 | 0.667 | 0.905 | 0.768 |
| -1.0 | 0.671 | 0.838 | 0.745 |
Best F1 = 0.796 at threshold = -4.0 (recall 0.99, precision 0.67), evaluated on the full 3084 test spans from 2034 positive articles.
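The F1 column follows from precision and recall as F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it from the reported values and picks the best threshold. The numbers are copied from the table; the variable names are illustrative.

```python
# (threshold, precision, recall) copied from the threshold sweep table.
rows = [
    (-6.0, 0.660, 1.000),
    (-5.0, 0.661, 0.998),
    (-4.0, 0.666, 0.990),
    (-3.0, 0.666, 0.964),
    (-2.0, 0.667, 0.905),
    (-1.0, 0.671, 0.838),
]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


scores = [(t, f1(p, r)) for t, p, r in rows]
best_threshold, best_f1 = max(scores, key=lambda item: item[1])

for t, score in scores:
    print(f"{t:5.1f}  F1={score:.3f}")
```

Precision is nearly flat across the sweep, so the choice of threshold is driven almost entirely by recall.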
Citation / provenance
Training code: see the extraction_probe/ directory of the
funding-statement-identification repo.
Trained by the comet-data / funding extraction effort, 2026-04.