Why Multi-Instance Learning is Actually Beautiful

There’s a class of problems in ML that most supervised learning frameworks can’t handle cleanly: you know the label for a group of examples, but not for any individual one.

This is Multi-Instance Learning (MIL), and it shows up everywhere once you start looking. In clinical settings, a patient is either depressed or not — but you don’t have labels for each individual moment in their conversation. In pathology, a slide is cancerous or not — but individual patches might be ambiguous. In document classification, a document belongs to a topic — but not every sentence does.

The standard approach is to just pool the instances (average them, take the max) and train on the aggregate. It works. It also throws away a lot of information: in particular, any sense of which instances carried the signal.
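For concreteness, here is a minimal sketch of the two standard poolings in PyTorch. The bag tensor and its shapes are illustrative, not taken from any particular pipeline:

```python
import torch

# One bag: N instances, each a D-dimensional embedding (shapes illustrative).
bag = torch.randn(12, 64)  # N=12 instances, D=64

mean_pooled = bag.mean(dim=0)        # (64,) every instance contributes equally
max_pooled = bag.max(dim=0).values   # (64,) only the per-dimension extremes survive
```

Both collapse the bag to a single vector before the classifier ever sees it, which is exactly where the information loss happens.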

What makes MIL genuinely interesting is what happens when you let the model learn which instances matter. This is where attention comes in — not as a black-box trick, but as a principled mechanism for the model to express: "the bag-level label is positive, and here are the instances I'm basing that on."
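A minimal sketch of what that can look like as a PyTorch module. The scoring network, hidden size, and shapes are illustrative choices, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Soft attention over a bag: learned weights that sum to 1 across instances."""

    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        # Small scoring network: one unnormalized relevance score per instance.
        self.score = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, bag: torch.Tensor):
        # bag: (N, D) instance embeddings for a single bag
        logits = self.score(bag)                # (N, 1) per-instance scores
        weights = torch.softmax(logits, dim=0)  # (N, 1) normalized over the bag
        pooled = (weights * bag).sum(dim=0)     # (D,) attention-weighted mean
        return pooled, weights.squeeze(-1)      # weights double as the explanation
```

The returned weights are the interpretability readout: sorting instances by weight gives you exactly the "here are the instances" statement above.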

In our depression detection work, this means the model can say: “this interview is classified as depressed, and here are the specific response turns that drove that prediction.” Those turns tend to correspond to moments of flattened affect, extended pause patterns, and reduced prosodic variation — which is exactly what clinical literature would predict. The interpretability isn’t bolted on. It falls out of the architecture.

The aggregation rules matter too. A soft-attention aggregation (a weighted mean) assumes every instance contributes something to the label. A noisy-or rule assumes the opposite: if any single instance is positive, the whole bag is positive. These encode different assumptions about how the label distributes across instances, and getting that choice right is part of the research.
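To make the contrast concrete, a short sketch. The instance probabilities here are made up; in a real model they would come from a per-instance classifier head:

```python
import torch

p = torch.tensor([0.05, 0.10, 0.90, 0.02])  # hypothetical per-instance P(positive)

# Noisy-or: the bag is positive unless every instance is negative.
bag_noisy_or = 1 - torch.prod(1 - p)  # ~0.92, dominated by the one strong instance

# A plain mean would instead dilute that one strong instance across the bag.
bag_mean = p.mean()  # ~0.27
```

If the assumption is that one clearly positive instance is enough to label the bag, noisy-or matches it; if the evidence is expected to be diffuse, the attention-weighted mean does.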

What I find elegant about MIL is that it's a natural fit for a huge class of real-world problems where strong supervision is expensive but weak supervision comes nearly for free. You don't need to annotate every frame of a video, every patch of a slide, every utterance in a conversation. You just need the outcome label — and then you let the model figure out the structure.

That’s not a limitation. That’s the design.