Apr 12, 2026 · 14 min read · Agents · Multimodal

Embodied Conversational Agents and the Promise of Foundation Models

For two decades, embodied conversational agents lived in the gap between two communities that rarely talked. Foundation models, almost by accident, are now closing that gap — but the interesting questions are not the ones the headlines suggest.

The classical embodied conversational agent (ECA) was, in the original Cassell-and-colleagues sense, a humanlike interface: a body, a face, a voice, and the dialogue management to coordinate them. The dream was a system you could talk to, that would talk back the way people do — with gaze, gesture, and a stake in the shared physical scene. The reality, for a long time, was a brittle pipeline: ASR into NLU into a dialogue manager into a behavior planner into an animation engine, with each stage assuming the previous one was tidier than it really was.
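
For readers who never had to maintain one of these pipelines, here is a rough sketch of the shape. Every stage name, type, and field is illustrative rather than drawn from any particular system; the point is that each stage consumes a fixed schema and quietly assumes the stage before it filled that schema in correctly.

```python
# Illustrative sketch of the classical ECA pipeline; names and schemas are hypothetical.
from dataclasses import dataclass, field

@dataclass
class NLUResult:
    intent: str                                   # e.g. "move_object"
    slots: dict = field(default_factory=dict)     # e.g. {"object": "red cup"}
    confidence: float = 1.0                       # rarely propagated downstream in practice

@dataclass
class DialogueAct:
    act: str                                      # e.g. "confirm", "execute"
    content: dict = field(default_factory=dict)

def asr(audio: bytes) -> str:
    # Speech -> text. Recognition errors here cascade through every later stage.
    return "put the red cup on the shelf"         # placeholder transcript

def nlu(text: str) -> NLUResult:
    # Text -> intent/slots against a hand-built schema; brittle to paraphrase.
    return NLUResult(intent="move_object", slots={"object": "red cup", "goal": "shelf"})

def dialogue_manager(parse: NLUResult, state: dict) -> DialogueAct:
    # Picks the next move, assuming the parse is correct and complete.
    return DialogueAct(act="execute", content=parse.slots)

def behavior_planner(act: DialogueAct) -> list[str]:
    # Dialogue act -> coordinated speech, gaze, and gesture (BML-style tags in classical systems).
    obj = act.content.get("object", "object")
    return [f"speech: okay, moving the {obj}", "gaze: look at target", "gesture: point"]

def run_turn(audio: bytes, state: dict) -> list[str]:
    # Each hand-off is a hand-built schema boundary where information gets lost.
    return behavior_planner(dialogue_manager(nlu(asr(audio)), state))
```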

Foundation models did not solve this problem. What they did was something more interesting: they made the cost of trying close to zero, and they unified representations across modalities that we used to treat as separate engineering disciplines.

What the field actually wants

When researchers say “embodied agent,” they tend to mean one of three things, and conflating them is the source of most disagreement at workshops:

  • Situated dialogue. An agent that reasons about a shared scene — “the red cup, not that one” — and grounds language in perception.
  • Task-following robots. A system that turns natural-language instructions into low-level control on real hardware.
  • Social agents. Avatars and characters whose embodiment is the point: tutors, companions, customer-facing kiosks.

These three live on different timescales, with different evaluation cultures, and they should probably stop being treated as the same field. But they share a problem that foundation models do address: the bottleneck of grounded common sense.

[Hero figure: perception → dialogue → action loop]
Figure 1. The classical ECA loop, redrawn for the foundation-model era. The boxes are the same; what changed is that adjacent boxes can now share a representation rather than negotiating across hand-built schemas.

The quiet shift: shared representations

The headline story about LLMs in agents is that they replaced the dialogue manager. That is true, but it is the boring version of the story. The more important shift is that vision encoders, speech encoders, and language models now sit close enough in representation space that the seams between modules are smaller than the modules themselves.

This matters because the hardest problem in embodied dialogue was never parsing — it was reference resolution under partial observation. When a user says “put it on the shelf next to the one I just moved,” a deployable system needs to maintain a joint belief over linguistic history, scene state, action history, and user intent. We used to model each of those with its own bespoke representation and pay translation costs at every interface. We don't have to anymore.
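
To make "joint belief" concrete, here is a minimal sketch under the assumption that one shared encoder can score scene objects against linguistic context. The `embed` function is a hash-seeded placeholder so the example runs, not a call into any real model, and the dataclass fields simply mirror the four ingredients listed above.

```python
from dataclasses import dataclass, field
import numpy as np

def embed(x: str) -> np.ndarray:
    # Placeholder pseudo-embedding so the sketch runs end to end;
    # stands in for a real shared text/vision encoder.
    rng = np.random.default_rng(abs(hash(x)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

@dataclass
class SceneObject:
    object_id: str
    description: str             # e.g. "red cup on the table"
    recently_moved: bool = False

@dataclass
class BeliefState:
    utterances: list[str] = field(default_factory=list)      # linguistic history
    scene: list[SceneObject] = field(default_factory=list)   # current scene state
    actions: list[str] = field(default_factory=list)         # action history
    goal: str = ""                                            # working guess at user intent

def resolve_reference(utterance: str, belief: BeliefState) -> SceneObject:
    # Score every candidate object against the utterance plus recent context,
    # rather than against the utterance alone.
    context = " ".join(belief.utterances[-3:] + belief.actions[-3:] + [belief.goal, utterance])
    query = embed(context)
    return max(belief.scene, key=lambda obj: float(query @ embed(obj.description)))
```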

The interesting research question is no longer “can the model do it?” but “what does the model still get systematically wrong, and is it the same thing every time?”

An aside on benchmarks

Most embodied-agent benchmarks measure either task success in simulation or single-turn grounding accuracy. Neither captures what makes deployed agents fail. In our own work on multi-turn interactions, the dominant failure mode is not factual error — it is context drift: the model's working representation of the conversation slowly diverges from the user's, and recovery becomes impossible without a hard reset.
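
One crude but serviceable way to make that failure mode measurable: keep a per-turn summary of the conversation from the model's side and from a reference side, embed both, and track the gap as it accumulates. The sketch below is illustrative only; it is not the measure from any particular paper, and the threshold is arbitrary.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + 1e-9)

def drift_curve(model_summaries: list[np.ndarray],
                reference_summaries: list[np.ndarray],
                reset_threshold: float = 0.5) -> list[float]:
    # Per-turn divergence between the model's working state and the reference.
    # Turns past the threshold are where recovery without a hard reset gets unlikely.
    curve = []
    for turn, (m, r) in enumerate(zip(model_summaries, reference_summaries)):
        d = cosine_distance(m, r)
        curve.append(d)
        if d > reset_threshold:
            print(f"turn {turn}: drift {d:.2f} past threshold, flag for reset")
    return curve
```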

Figure 2. Stylized illustration of context drift accumulating over a multi-turn interaction. Two model variants are shown against a human reference; absolute scale is illustrative.

What foundation models do not give us

Three things, at least:

  1. Calibrated uncertainty over the physical scene. Vision-language models will happily describe what they see; they are much less willing to say what they are unsure about, and the failure mode is hallucinating a referent rather than declining to resolve one.
  2. Temporal commitments. A useful agent has to maintain commitments across turns — “okay, I'll watch for that” — and act on them later. Stateless prompting punts this to the application layer (a minimal sketch of one workaround follows this list); nobody has a clean answer for it yet.
  3. A theory of the user. The model knows the world; it does not know this person, here, now, with their goals and constraints. The interesting alignment work is at this layer.
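
On the second point, the usual stopgap is an application-layer commitment store that re-surfaces open promises into the prompt on every turn. A minimal sketch, with names and fields that are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Commitment:
    description: str    # what the agent agreed to do
    trigger: str        # the condition to watch for, in plain language
    made_at_turn: int
    fulfilled: bool = False

@dataclass
class CommitmentStore:
    items: list[Commitment] = field(default_factory=list)

    def add(self, description: str, trigger: str, turn: int) -> None:
        self.items.append(Commitment(description, trigger, turn))

    def pending(self) -> list[Commitment]:
        return [c for c in self.items if not c.fulfilled]

    def inject_into_prompt(self) -> str:
        # Re-surface open commitments every turn so the model has a chance to act on them.
        lines = [f"- {c.description} (when: {c.trigger})" for c in self.pending()]
        return ("Open commitments:\n" + "\n".join(lines)) if lines else ""
```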

Where I think this goes

The next round of progress will not come from larger models. It will come from designing training distributions that look like the deployment distribution: long, multi-turn, grounded, partial, and full of the small repairs that make human conversation work. The benchmarks we use today reward fluency at a single turn; they punish almost nothing that matters in deployment.

If you are an undergrad or master's student at UIUC interested in any of this, my door is open — there is plenty to do that does not require a 70B-parameter cluster.

References

  1. Cassell, J. Embodied Conversational Agents. MIT Press, 2000.
  2. Bisk, Y. et al. “Experience Grounds Language.” EMNLP, 2020.
  3. Padmakumar, A. et al. “TEACh: Task-driven Embodied Agents that Chat.” AAAI, 2022.
  4. Dongre, V. et al. “Drift No More? Context Equilibria in Multi-Turn LLM Interactions.” AAAI Workshop, 2026.