<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://vijay-vankadaru.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://vijay-vankadaru.github.io/" rel="alternate" type="text/html" /><updated>2026-03-20T02:00:07+00:00</updated><id>https://vijay-vankadaru.github.io/feed.xml</id><title type="html">Vijay Vankadaru / Portfolio</title><subtitle>ML Researcher @ DASION | Graduate Researcher @ UC Berkeley | CTO @ AGMNT</subtitle><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><entry><title type="html">Hallucination is Structural, Not Accidental</title><link href="https://vijay-vankadaru.github.io/posts/2026/03/hallucination-is-structural/" rel="alternate" type="text/html" title="Hallucination is Structural, Not Accidental" /><published>2026-03-05T00:00:00+00:00</published><updated>2026-03-05T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2026/03/hallucination-is-structural</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2026/03/hallucination-is-structural/"><![CDATA[<p>The standard framing of LLM hallucination is as a bug. Train better. Prompt better. Retrieve better. The bug will eventually go away.</p>

<p>Working on our survey on hallucination in medical LLMs changed how I think about this. The more carefully you look at the literature, the more it becomes clear that hallucination isn’t an engineering failure waiting to be fixed — it’s a structural property of how autoregressive language models work.</p>

<p>Here’s the core issue: these models are trained to predict the most likely next token given context. Not to ensure factual correctness. Not to model uncertainty. Just to predict likely continuations. A model that has learned the statistical structure of medical text will generate fluent, coherent, medically plausible text. Whether that text is <em>true</em> is a separate question — one the training objective doesn’t directly address.</p>
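
<p>To make that concrete, here’s a minimal sketch of the objective, assuming a generic decoder-only model (the function and shapes are illustrative, not from any particular codebase):</p>

<pre><code class="language-python">import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    # logits: (batch, seq_len, vocab) -- model predictions at each position
    # token_ids: (batch, seq_len) -- the observed text
    # Shift by one so position t is scored against the token at t+1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    # Cross-entropy rewards assigning high probability to the observed
    # continuation. Nothing in this loss asks whether the continuation is true.
    return F.cross_entropy(pred, target)
</code></pre>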

<p>Theoretical work has started to formalize this. You can show that under standard assumptions, no language model can guarantee zero hallucination. This isn’t a statement about capability limits or insufficient data. It’s a statement about what the objective function is actually optimizing.</p>

<p>In healthcare, this matters in a specific way. The dangerous hallucinations in medicine are rarely obvious. They’re not “the patient has four kidneys.” They’re subtly outdated drug dosages, plausible but nonexistent clinical trial citations, and disease progression sequences that are almost right but causally reversed. These pass surface-level plausibility checks. They look like good answers. They’re the ones that hurt patients.</p>

<p>The implication is that the question “how do we eliminate hallucination” is probably the wrong question. The right question is “how do we build systems that remain safe in the presence of unavoidable hallucination.” That’s a systems engineering problem, not a model training problem. It requires layered detection, uncertainty quantification, human-in-the-loop escalation paths, and honest acknowledgment that the model will sometimes be wrong in ways you can’t fully anticipate.</p>
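
<p>A hypothetical sketch of that system-level shape, just to fix ideas; every name here (<code>detect_hallucination</code>, <code>estimate_uncertainty</code>, <code>escalate_to_clinician</code>) is a placeholder, not a real API:</p>

<pre><code class="language-python">def answer_clinical_query(query, model, verifier, threshold=0.2):
    draft = model.generate(query)
    # Layer 1: runtime detection, e.g. checking claims against retrieved evidence.
    flags = verifier.detect_hallucination(draft, query)
    # Layer 2: uncertainty quantification on the draft itself.
    uncertainty = verifier.estimate_uncertainty(draft)
    # Layer 3: human-in-the-loop escalation when either layer fires.
    if flags or uncertainty > threshold:
        return escalate_to_clinician(query, draft, flags, uncertainty)
    return draft
</code></pre>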

<p>The field is slowly moving in this direction. Detection methods, RAG pipelines, runtime verification systems like CHECK — these are all partial answers to the right question. What’s still missing is end-to-end evaluation that reflects actual clinical risk rather than benchmark accuracy, and deployment frameworks that treat hallucination as a lifecycle problem rather than a pre-deployment checkbox.</p>

<p>That’s what we tried to map out in the survey. More work coming.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="LLMs" /><category term="hallucination" /><category term="medical AI" /><category term="research" /><summary type="html"><![CDATA[The standard framing of LLM hallucination is as a bug. Train better. Prompt better. Retrieve better. The bug will eventually go away.]]></summary></entry><entry><title type="html">Why Multi-Instance Learning is Actually Beautiful</title><link href="https://vijay-vankadaru.github.io/posts/2026/02/why-mil-is-beautiful/" rel="alternate" type="text/html" title="Why Multi-Instance Learning is Actually Beautiful" /><published>2026-02-10T00:00:00+00:00</published><updated>2026-02-10T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2026/02/why-multi-instance-learning-is-beautiful</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2026/02/why-mil-is-beautiful/"><![CDATA[<p>There’s a class of problems in ML that most supervised learning frameworks can’t handle cleanly: you know the label for a <em>group</em> of examples, but not for any individual one.</p>

<p>This is Multi-Instance Learning (MIL), and it shows up everywhere once you start looking. In clinical settings, a patient is either depressed or not — but you don’t have labels for each individual moment in their conversation. In pathology, a slide is cancerous or not — but individual patches might be ambiguous. In document classification, a document belongs to a topic — but not every sentence does.</p>

<p>The standard approach would be to just pool the instances (average them, max them) and train on the aggregate. It works. It’s also throwing away a lot of information.</p>

<p>What makes MIL genuinely interesting is what happens when you let the model learn <em>which instances matter</em>. This is where attention comes in — not as a black box trick, but as a principled mechanism for the model to express: “bag-level label is positive, and here’s which instances I’m basing that on.”</p>
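
<p>Here’s a minimal sketch of that mechanism in PyTorch, in the spirit of standard attention-based MIL pooling (illustrative, not our exact architecture):</p>

<pre><code class="language-python">import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        # Scores each instance embedding; softmax turns scores into weights.
        self.score = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, instances):
        # instances: (num_instances, dim) -- the embeddings of one bag
        weights = torch.softmax(self.score(instances), dim=0)  # (num_instances, 1)
        bag = (weights * instances).sum(dim=0)                 # (dim,)
        # The weights are the interpretability signal: which instances
        # the bag-level prediction is actually based on.
        return bag, weights
</code></pre>

<p>A bag-level classifier head on top of <code>bag</code> trains with only the bag label; the per-instance <code>weights</code> come for free.</p>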

<p>In our depression detection work, this means the model can say: “this interview is classified as depressed, and here are the specific response turns that drove that prediction.” Those turns tend to correspond to moments of flattened affect, extended pause patterns, and reduced prosodic variation — which is exactly what clinical literature would predict. The interpretability isn’t bolted on. It falls out of the architecture.</p>

<p>The aggregation rules matter too. A soft attention aggregation (weighted mean) treats every instance as contributing. A noisy-or rule says: if <em>any</em> instance is positive, the bag is positive. These encode different assumptions about how the label distributes across instances, and getting that choice right is part of the research.</p>
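
<p>The contrast is easy to state in code (a sketch; <code>instance_probs</code> is assumed to be per-instance positive probabilities from some instance-level head):</p>

<pre><code class="language-python">import torch

def noisy_or(instance_probs):
    # P(bag positive) = 1 - prod_i (1 - p_i): a single confident positive
    # instance is enough to make the whole bag positive.
    return 1.0 - torch.prod(1.0 - instance_probs)

# noisy_or(torch.tensor([0.05, 0.05, 0.90])) is roughly 0.91, dominated by
# the one confident instance; an unweighted mean of the same probabilities
# sits near 0.33 unless attention concentrates on that instance.
</code></pre>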

<p>What I find elegant about MIL is that it’s a natural fit for a huge class of real-world problems where strong supervision is expensive but weak supervision is free. You don’t need to annotate every frame of a video, every patch of a slide, every utterance in a conversation. You just need the outcome label — and then you let the model figure out the structure.</p>

<p>That’s not a limitation. That’s the design.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="machine learning" /><category term="multi-instance learning" /><category term="attention mechanisms" /><category term="research" /><summary type="html"><![CDATA[There’s a class of problems in ML that most supervised learning frameworks can’t handle cleanly: you know the label for a group of examples, but not for any individual one.]]></summary></entry><entry><title type="html">CTC Alignment and Why Temporal Correspondence Matters in Multimodal Learning</title><link href="https://vijay-vankadaru.github.io/posts/2026/01/ctc-temporal-alignment/" rel="alternate" type="text/html" title="CTC Alignment and Why Temporal Correspondence Matters in Multimodal Learning" /><published>2026-01-20T00:00:00+00:00</published><updated>2026-01-20T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2026/01/ctc-and-temporal-alignment</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2026/01/ctc-temporal-alignment/"><![CDATA[<p>When you fuse audio and text representations, the obvious approach is to encode both independently and then concatenate or cross-attend. It works. But it misses something important: the correspondence between what was said and how it was said, at the same moment in time.</p>

<p>In speech, a word doesn’t just have semantic content. It has prosody — the pitch, energy, rhythm, and timing with which it was spoken. “I’m fine” said with flat intonation and long pauses carries very different information than “I’m fine” said quickly and with natural affect. If you encode audio and text separately and then merge them, you lose the ability to model that correspondence explicitly. The model might learn it implicitly, but you’re not giving it the right structure to reason about it.</p>

<p>This is where CTC — Connectionist Temporal Classification — becomes useful in a non-obvious way.</p>

<p>CTC was originally developed for sequence-to-sequence problems where the alignment between input and output is unknown (speech recognition being the canonical case: you know the audio frames and the transcript, but not which frame corresponds to which character). It lets you train without frame-level labels by marginalizing over all valid alignments.</p>
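
<p>The standard PyTorch interface makes the setup concrete (values here are illustrative):</p>

<pre><code class="language-python">import torch
import torch.nn as nn

# T audio frames, N sequences in the batch, C classes (with blank at index 0).
T, N, C = 200, 4, 30
log_probs = torch.randn(T, N, C).log_softmax(dim=2)   # per-frame class log-probs
targets = torch.randint(1, C, (N, 12))                # transcripts, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

# The loss marginalizes over every frame-to-character alignment that is
# consistent with the transcript -- no frame-level labels required.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
</code></pre>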

<p>In our depression detection work, we use it differently: to create explicit temporal correspondence between Wav2Vec 2.0 audio features and token-level text representations. The idea is to use the alignment CTC learns to map audio frame sequences back to word-level timestamps, and then align those with the text embeddings from MT5/RoBERTa at the token level. The result is a representation where each word has both its semantic embedding and the acoustic properties of how it was spoken at that moment.</p>
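
<p>In sketch form, the fusion step looks something like this, assuming (for simplicity) one token per word and word-level frame spans already extracted from the CTC alignment; none of these names are from our actual codebase:</p>

<pre><code class="language-python">import torch

def fuse_word_level(frame_feats, word_spans, token_embeds):
    # frame_feats: (num_frames, d_audio), e.g. Wav2Vec 2.0 frame outputs
    # word_spans: list of (start_frame, end_frame) per word, derived from
    #             the CTC alignment (hypothetical representation)
    # token_embeds: (num_words, d_text), text-encoder token embeddings
    pooled = torch.stack([frame_feats[s:e].mean(dim=0) for s, e in word_spans])
    # Each word now carries its semantic embedding plus the acoustics of
    # how it was spoken at that moment: (num_words, d_text + d_audio).
    return torch.cat([token_embeds, pooled], dim=-1)
</code></pre>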

<p>This matters for depression detection because the clinical signal is partly carried by the mismatch between semantic and acoustic content. Someone who says neutral words in flat, monotone speech, with extended pauses, is showing a different pattern than the words alone would suggest. You need both modalities, and you need them aligned.</p>

<p>The broader point is that multimodal fusion isn’t just about combining sources of information — it’s about which level of abstraction you fuse at, and whether the fusion preserves the structure that’s actually informative. Cross-modal attention without temporal grounding is better than nothing. Temporally aligned fusion is better still.</p>

<p>Getting the alignment right is tedious and architecturally nontrivial. It’s also, in my experience, the part that actually moves the metrics.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="machine learning" /><category term="multimodal learning" /><category term="CTC" /><category term="audio" /><category term="research" /><summary type="html"><![CDATA[When you fuse audio and text representations, the obvious approach is to encode both independently and then concatenate or cross-attend. It works. But it misses something important: the correspondence between what was said and how it was said, at the same moment in time.]]></summary></entry><entry><title type="html">What It Felt Like to Finish a Paper</title><link href="https://vijay-vankadaru.github.io/posts/2026/01/first-paper/" rel="alternate" type="text/html" title="What It Felt Like to Finish a Paper" /><published>2026-01-05T00:00:00+00:00</published><updated>2026-01-05T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2026/01/first-paper</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2026/01/first-paper/"><![CDATA[<p>The hallucination survey went live on MetaArXiv in March. I want to write about what the process was actually like before the memory fades.</p>

<p>The hardest part wasn’t the writing. It was the argument. A survey paper isn’t just a literature review — it’s a claim about what the literature means. Ours was: hallucination in medical LLMs isn’t a bug, it’s a structural property of the generation process, and the field has been asking the wrong question by trying to eliminate it rather than designing systems that manage it. That argument had to be earned, not just asserted. Every section needed to build toward it.</p>

<p>I rewrote my sections more times than I expected. Not because the facts were wrong, but because the framing kept being slightly off — too focused on what individual papers found rather than what they collectively revealed. Prof. Roosta was patient about this in a way that I found both reassuring and instructive. The feedback was always about the argument, not the execution.</p>

<p>The other thing I didn’t anticipate was how much the writing clarified my own thinking. I thought I understood the hallucination literature reasonably well before we started. I understood it much better after I had tried to explain it to a reader who knew nothing. The act of writing forced precision in a way that reading didn’t. There’s something about having to construct a coherent narrative out of fifty-plus papers that reveals gaps in your understanding you didn’t know were there.</p>

<p>What I’m left with is a paper I’m proud of and a much clearer sense of what I want to work on next. The follow-on work on mitigation methods is already underway. The survey mapped the problem. Now we get to work on part of the solution.</p>

<p>One paper down.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="research" /><category term="reflection" /><category term="publications" /><category term="academia" /><summary type="html"><![CDATA[The hallucination survey went live on MetaArXiv in March. I want to write about what the process was actually like before the memory fades.]]></summary></entry><entry><title type="html">Industry Experience Is a Research Asset, Not a Gap to Apologize For</title><link href="https://vijay-vankadaru.github.io/posts/2025/11/industry-experience-as-asset/" rel="alternate" type="text/html" title="Industry Experience Is a Research Asset, Not a Gap to Apologize For" /><published>2025-11-18T00:00:00+00:00</published><updated>2025-11-18T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2025/11/industry-experience-as-asset</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2025/11/industry-experience-as-asset/"><![CDATA[<p>There’s a version of the story I told about myself for a while that went like this: I spent years doing applied work in industry before getting serious about research. That framing treated the industry work as a detour — something to acknowledge and move past.</p>

<p>I don’t think that’s right anymore.</p>

<p>The thing about building ML systems in a clinical context is that you encounter failure modes that don’t show up in benchmarks. A model that achieves strong held-out accuracy can still be wrong in ways that matter — systematically worse on patient subgroups, brittle to distribution shift between clinical sites, confidently incorrect in the edge cases that clinicians actually care about. You only know this if you’ve watched a system fail in a real setting and had to explain it to someone who trusted it.</p>

<p>That experience is directly relevant to research. It tells you which problems are actually hard, as opposed to which problems look hard from a benchmark perspective. It gives you a calibrated sense of what “good enough” means in practice — and therefore what improvement is genuinely worth pursuing versus what’s incremental.</p>

<p>When I started working on the hallucination survey, I kept returning to this. The literature treats hallucination primarily as a model problem — something to be fixed through training or retrieval. But from a deployment perspective, the question is never “does this model hallucinate” (it does) — the question is “what happens when it does, and does the system around it catch it.” That’s a different problem, and it requires a different kind of solution.</p>

<p>I don’t think I would have arrived at that framing from first principles. I arrived at it because I’ve been on the other side of the deployment, watching what breaks.</p>

<p>The research question I’m most interested in now — when does ML actually work, and what breaks it — is directly shaped by that experience. The years I spent building systems weren’t a detour. They were where I learned what to study.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="research" /><category term="reflection" /><category term="machine learning" /><category term="healthcare AI" /><summary type="html"><![CDATA[There’s a version of the story I told about myself for a while that went like this: I spent years doing applied work in industry before getting serious about research. That framing treated the industry work as a detour — something to acknowledge and move past.]]></summary></entry><entry><title type="html">Learning to Actually Read Papers</title><link href="https://vijay-vankadaru.github.io/posts/2025/11/learning-to-read-papers/" rel="alternate" type="text/html" title="Learning to Actually Read Papers" /><published>2025-11-10T00:00:00+00:00</published><updated>2025-11-10T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2025/11/learning-to-read-papers</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2025/11/learning-to-read-papers/"><![CDATA[<p>Nobody teaches you how to read a paper. You’re expected to figure it out, and most people do eventually, but the path is inefficient and kind of humbling.</p>

<p>My first instinct was to read papers the way I read documentation — linearly, from abstract to conclusion, treating each section as equally important. This is wrong. I was spending an hour on papers that should take fifteen minutes, and coming out with a blurry sense of the contribution rather than a clear model of what the authors actually did and why it mattered.</p>

<p>What changed things was a piece of advice I got early in the semester: read the abstract, then the conclusion, then the figures, then decide if you need the methods. Most of the time you don’t, at least not on the first pass. The figures are the argument. The methods are the evidence. Read them in that order.</p>

<p>The second thing that changed things was keeping a running log of how papers position themselves relative to each other. Every paper sketches a picture of the state of the field and then argues that picture is wrong or incomplete. Once you notice that structure, you start building a map of where the disputes actually are — which assumptions are contested, which baselines are considered fair, which results are genuinely new versus incremental.</p>

<p>I didn’t have this map when I started. I had deep familiarity with a few specific techniques and a vague sense of the broader field from following arXiv. That’s not the same as understanding where the open problems are and why they’re hard. Building that map is what the first semester has mostly been about.</p>

<p>The honest assessment at the end of three months: I’m faster and more selective, but I’m still not reading papers the way I’ll need to for original research. I can navigate a literature, but I can’t yet see the gaps in it. That’s the next thing to learn.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="research" /><category term="reflection" /><category term="academia" /><summary type="html"><![CDATA[Nobody teaches you how to read a paper. You’re expected to figure it out, and most people do eventually, but the path is inefficient and kind of humbling.]]></summary></entry><entry><title type="html">What Research Meetings Actually Feel Like</title><link href="https://vijay-vankadaru.github.io/posts/2025/09/first-research-meetings/" rel="alternate" type="text/html" title="What Research Meetings Actually Feel Like" /><published>2025-09-05T00:00:00+00:00</published><updated>2025-09-05T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2025/09/first-research-meetings</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2025/09/first-research-meetings/"><![CDATA[<p>I had my first real research meeting with Prof. Paulik in late August. Not a class, not office hours — a working meeting about a project I was contributing to. I want to write down what it felt like before I forget.</p>

<p>The thing that surprised me most was how much of the meeting was questions rather than answers. I came in expecting to be given direction — a task, a dataset, a baseline to run. Instead, the first twenty minutes was her asking me what I thought the problem actually was, what the existing work got wrong, where I saw the gap.</p>

<p>I wasn’t ready for that. I had read the relevant papers. I knew the MedRAG architecture. But I hadn’t formed an opinion about what it got wrong or where the opportunity was. I had been preparing to execute, not to think.</p>

<p>That’s the gap I keep running into. Industry work trains you to be a good executor — take a well-defined problem and solve it efficiently. Research requires you to first argue that the problem is real and that your framing of it is the right one. That argument comes before any code gets written.</p>

<p>The other thing I noticed: Prof. Paulik asked questions she didn’t know the answer to. That sounds obvious but it was disorienting at first. In industry, most questions in meetings are checks — does this person know what they’re doing. In a research meeting, the questions are genuine. We were both trying to figure something out. That dynamic changes everything about how you show up.</p>

<p>I left the meeting with more questions than I arrived with. That felt like failure at first. Now I think it means it was a good meeting.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="research" /><category term="reflection" /><category term="academia" /><category term="Berkeley" /><summary type="html"><![CDATA[I had my first real research meeting with Prof. Paulik in late August. Not a class, not office hours — a working meeting about a project I was contributing to. I want to write down what it felt like before I forget.]]></summary></entry><entry><title type="html">The Gap Between Building ML Systems and Doing ML Research</title><link href="https://vijay-vankadaru.github.io/posts/2025/08/industry-to-research/" rel="alternate" type="text/html" title="The Gap Between Building ML Systems and Doing ML Research" /><published>2025-08-20T00:00:00+00:00</published><updated>2025-08-20T00:00:00+00:00</updated><id>https://vijay-vankadaru.github.io/posts/2025/08/industry-to-research</id><content type="html" xml:base="https://vijay-vankadaru.github.io/posts/2025/08/industry-to-research/"><![CDATA[<p>I started at DASION in 2021 as a high school intern. By the time I enrolled at Berkeley this month, I had spent three years building ML systems that actually ran in clinical settings — models that processed real patient data, infrastructure that stayed up at 99.9%, pipelines that clinicians depended on. I thought that experience would translate directly to research.</p>

<p>It doesn’t. Or at least, not in the way I expected.</p>

<p>The difference is subtle but it matters enormously. In industry, the question is: does this work? Can we ship it? Is it reliable enough? The success condition is the system running without breaking. In research, the question is: <em>why</em> does this work? What does it tell us about the problem? Where does it fail, and what does that reveal?</p>

<p>I spent three years optimizing for the first set of questions. Getting to 93% diagnostic accuracy was a win. Understanding <em>why</em> the model failed on the other 7% — what those cases had in common, what the failure mode revealed about the representation — that wasn’t the priority. We had clinical partners waiting.</p>

<p>The shift I’m trying to make at Berkeley is genuinely hard. It requires slowing down in a way that feels unproductive. Sitting with a failure instead of patching it. Asking “what does this mean” instead of “how do we fix this.” It’s a different muscle and I’m aware I haven’t built it yet.</p>

<p>What I’m hoping is that the industry experience doesn’t go away — it just gets reframed. I’ve seen what breaks in deployment. I know which failure modes matter and which ones are theoretical. That’s not nothing. But turning observation into contribution requires a different kind of work, and I’m just starting to understand what that looks like.</p>]]></content><author><name>Vijay Vankadaru</name><email>vankadaruvijay@berkeley.edu</email></author><category term="research" /><category term="reflection" /><category term="academia" /><category term="machine learning" /><summary type="html"><![CDATA[I started at DASION in 2021 as a high school intern. By the time I enrolled at Berkeley this month, I had spent three years building ML systems that actually ran in clinical settings — models that processed real patient data, infrastructure that stayed up at 99.9%, pipelines that clinicians depended on. I thought that experience would translate directly to research.]]></summary></entry></feed>