Research in progress — capstone, targeting July 2026

AI Safety Research Engineer

Li Bearden

Building measurement infrastructure for AI safety. LLM evaluation methodology, sycophancy detection.

Five years deploying speech and language models at Deepgram gave me a front-row seat to the gap between benchmark performance and deployment behavior — the gap my current research targets directly.

Read Research · Get in Touch

Evaluations that surface real failure modes

Most evaluation literature optimizes for accuracy on a benchmark. That tells you almost nothing about how a system fails in deployment. My research focuses on what evaluations miss (sycophancy, distributional drift, context-dependent failure) and on what kind of measurement infrastructure would surface those failure modes early enough to matter.

The gap between measurement and meaning

An evaluation has to defend a claim about what a system actually does. Most current eval claims can't. The methodological question I keep returning to: which datasets, which metrics, which contexts give us evidence we can actually trust—and where are we mistaking benchmark performance for behavioral guarantee?

Demographic and contextual validity

Evaluations that work on the median user often fail systematically for users outside the training distribution. This isn't a fairness add-on—it's a measurement validity problem. If an eval's signal is tight on some users and useless on others, the eval is broken regardless of the headline score. Capstone research in progress.

Building for what's hard to measure

Some failure modes resist easy measurement: sycophancy, sandbagging, capability concealment, deployment-context drift. The field has built strong measurement infrastructure for benchmarks; it has built much less for behaviors that emerge under deployment. My research focus sits with the behaviors where current eval frameworks give us the most false confidence.

The through-line: measurement is where the hard work is. If you're working on sycophancy, eval validity, or the measurement infrastructure for alignment claims, I'd like to hear about it — especially if you're thinking about how eval results generalize across demographic and deployment contexts.

LLM Evaluation & Measurement Infrastructure

2024–2026 in progress

Replicating Published Sycophancy Benchmarks

Abstract forthcoming. A replication audit of published LLM sycophancy benchmarks. Most sycophancy results report uncertainty from a single generation run, which can make an effect look more stable than it is. This work re-runs published benchmarks across many independent generations (K-run replication) to separate findings that survive honest across-run uncertainty from artifacts of single-run measurement.
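The contrast between single-run and across-run uncertainty can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the audit's actual code: the follow rates are simulated, the run-to-run shift of ±0.08 is invented for the example, and the names `wilson_interval` and `across_run_interval` are hypothetical.

```python
import math
import random
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value for K=5 runs (df=4)


def wilson_interval(successes, n, z=1.96):
    """Within-run 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def across_run_interval(rates, t_crit):
    """t-interval on the mean of K per-run rates (across-run uncertainty)."""
    mean = statistics.mean(rates)
    se = statistics.stdev(rates) / math.sqrt(len(rates))
    return mean - t_crit * se, mean + t_crit * se


# Hypothetical data: each run has its own underlying follow rate, so
# run-to-run variation exists on top of per-item binomial noise.
rng = random.Random(0)
N_ITEMS, K = 200, 5
run_counts = []
for _ in range(K):
    p_run = 0.6 + rng.uniform(-0.08, 0.08)  # assumed run-level shift
    run_counts.append(sum(rng.random() < p_run for _ in range(N_ITEMS)))
run_rates = [c / N_ITEMS for c in run_counts]

single_run_ci = wilson_interval(run_counts[0], N_ITEMS)   # what one run reports
across_run_ci = across_run_interval(run_rates, T_CRIT_DF4)  # what K runs support

print("per-run follow rates:", run_rates)
print("single-run Wilson CI:", single_run_ci)
print("across-run t CI:     ", across_run_ci)
```

The single-run interval only reflects sampling noise over items within that run. The across-run interval is built from the spread of the K per-run rates, so it widens whenever generations differ from run to run.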

Methodology: K-run (across-run) replication of published benchmarks. PARROT serves as a positive control: its follow-rate findings reproduce cleanly, with model and per-domain rankings stable across runs (Spearman 0.95–0.99). SycEval is the primary target: its published confidence intervals are within-run binomial only, computed from a single generation per item, which likely understates true uncertainty.
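As a rough illustration of the rank-stability check, the sketch below computes a Spearman correlation between per-model follow rates from two runs, using only the standard library. The follow rates are made-up placeholders, and the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical rather than taken from the audit harness.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model follow rates from two independent runs.
run_a = [0.62, 0.41, 0.55, 0.30, 0.48]
run_b = [0.60, 0.43, 0.57, 0.28, 0.44]

rho = spearman(run_a, run_b)
print("Spearman rank correlation between runs:", rho)
```

With K runs, the same function can be applied to every pair of runs and summarized as a range. A value near 1 means the model ordering is stable even if the individual rates move.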

Stack: Python · HuggingFace · vLLM · multi-provider inference (Together, DeepInfra, OpenAI, Anthropic) · W&B · Docker

Target venues: arXiv preprint · SafeAI@AAAI · SoLaR · NeurIPS workshop

Preprint expected late August 2026.

↑ Replicating Published Sycophancy Benchmarks

Jun 2026 Published

Run-Level Reproducibility Audit of PARROT

An independent audit testing whether single-run confidence intervals understate uncertainty in a published LLM sycophancy benchmark — run at K=5 across 5 models, 1,302 items. Shipped as a one-command offline deterministic replay harness with research-integrity gates enforced in code: no hardcoded expected values, failures surface rather than suppress, fixture-precision limits documented rather than fudged.

Python stdlib · JSON Schema · Git · arXiv:2511.17220

GitHub →

2025 shipped

Bench

Research operations infrastructure for LLM safety work. Tracks focus states, initiative progress, and open research threads. Integrates with Claude via MCP for AI-assisted analysis of evaluation results.

Python · MCP · Claude API

GitHub →

Adjacent research interests

Demographic and contextual validity of alignment evaluations
Institutional dynamics of frontier AI safety review
Measurement infrastructure for behaviors that resist easy measurement (sandbagging, capability concealment, deployment-context drift)

Writing on evaluation methodology is forthcoming; follow on Substack for updates.

Background

I'm Elliot “Li” Bearden, an AI safety research engineer focused on LLM evaluation methodology. My capstone research investigates sycophancy detection and demographic variation in eval validity — specifically, how current evaluation infrastructure misses the failure modes that matter most in production deployment. MSCS completion July 2026.

At Deepgram, I built and owned the evaluation and custom training infrastructure, including the pipeline that cut custom model delivery from 14 days to 1.

I take on selected applied evaluation engagements — details here.

Areas of Focus

Research

LLM Evaluation · Sycophancy · Eval Validity · Measurement Infrastructure · Demographic Generalization

Technical

Python · PyTorch · HuggingFace · vLLM · Ray · Kubernetes

Based in Chiang Mai, Thailand · US citizen · Remote-first, working from outside the US · Open to short in-person stints (up to ~3 months) outside the US.

View CV · Consulting →

Let’s talk

Open to conversations about research collaboration, LLM safety hiring, funding and fellowship opportunities, and consulting engagements.

linkedin.com/in/elliot-bearden

Send a Message →