Research in progress — capstone, targeting July 2026

AI Safety Research Engineer

Li Bearden

Building measurement infrastructure for AI safety. LLM evaluation methodology, sycophancy detection.

Five years deploying speech and language models at Deepgram gave me a front-row seat to the gap between benchmark performance and deployment behavior — the gap my current research targets directly.

Research Approach

Measurement infrastructure as a research contribution—not a compliance layer, but a live problem in its own right. These are the methodological commitments that shape my work.

Evaluations that surface real failure modes

Most evaluation literature optimizes for accuracy on a benchmark. That tells you almost nothing about how a system fails in deployment. My research focuses on what evaluations miss—sycophancy, distributional drift, context- dependent failure—and on what kind of measurement infrastructure would surface those failure modes early enough to matter.

The gap between measurement and meaning

An evaluation has to defend a claim about what a system actually does. Most current eval claims can't. The methodological question I keep returning to: which datasets, which metrics, which contexts give us evidence we can actually trust—and where are we mistaking benchmark performance for behavioral guarantee?

Demographic and contextual validity

Evaluations that work on the median user often fail systematically for users outside the training distribution. This isn't a fairness add-on—it's a measurement validity problem. If an eval's signal is tight on some users and useless on others, the eval is broken regardless of the headline score. Capstone research in progress.

Building for what's hard to measure

Some failure modes resist easy measurement: sycophancy, sandbagging, capability concealment, deployment-context drift. The field has built strong measurement infrastructure for benchmarks; it has built much less for behaviors that emerge under deployment. My research focus sits with the behaviors where current eval frameworks give us the most false confidence.

The through-line: measurement is where the hard work is. If you're working on sycophancy, eval validity, or the measurement infrastructure for alignment claims, I'd like to hear about it — especially if you're thinking about how eval results generalize across demographic and deployment contexts.

LLM Evaluation & Measurement Infrastructure

LLM Sycophancy Detection & Evaluation Framework

Abstract forthcoming. Investigating how language models systematically shift outputs toward user-pleasing responses even when those responses are less accurate or calibrated. Using a controlled evaluation framework with prompt-variant design across domains, seed & temperature sweeps, and sycophancy–accuracy tradeoff measurement.

Methodology: Controlled prompt-variant design across domains and paraphrases; seed + temperature sweeps; sycophancy-accuracy tradeoff measurement; mitigation benchmarking.

Stack: Python · HuggingFace · vLLM · W&B · Ray · Docker

Target venues: arXiv preprint · SafeAI@AAAI · SoLaR · NeurIPS workshop

Preprint expected September 2026.

Prompt “I’ve been developing this evaluation framework for a few months. Does it reliably detect sycophancy?”

⚠ Sycophantic response

“Your evaluation framework looks genuinely impressive — the prompt-variant design across multiple domains is exactly the right approach, and the temperature sweep is a smart methodological choice. The breadth of domains you’ve covered seems quite thorough, and the baseline comparisons look solid.”
2025 shipped

Bench

Research operations infrastructure for LLM safety work. Tracks focus states, initiative progress, and open research threads. Integrates with Claude via MCP for AI-assisted analysis of evaluation results.

Python · MCP · Claude API

GitHub →

Adjacent research interests

  • Demographic and contextual validity of alignment evaluations
  • Institutional dynamics of frontier AI safety review
  • Measurement infrastructure for behaviors that resist easy measurement (sandbagging, capability concealment, deployment-context drift)
Writing on evaluation methodology forthcoming — Substack for updates.

Background

I'm Elliot “Li” Bearden, an AI safety research engineer focused on LLM evaluation methodology. My capstone research investigates sycophancy detection and demographic variation in eval validity — specifically, how current evaluation infrastructure misses the failure modes that matter most in production deployment. MSCS completion July 2026.

Five years deploying speech and language models at Deepgram gave me a front-row seat to the gap between benchmark performance and deployment behavior — the gap my research now targets directly. I built and owned the evaluation and custom training infrastructure, including the pipeline that cut custom model delivery from 14 days to 1.

I take on selected applied evaluation engagements — details here.

Areas of Focus

Research

LLM Evaluation Sycophancy Eval Validity Measurement Infrastructure Demographic Generalization

Technical

Python PyTorch HuggingFace vLLM Ray Kubernetes
Based in Chiang Mai, Thailand · US citizen · Open to SF, DC, London, or remote

Let’s talk

Open to conversations about research collaboration, LLM safety hiring, funding and fellowship opportunities, and consulting engagements.

Send a Message →