Generating motion that is grounded in the surrounding scene is a defining challenge for research in embodied agents. This challenge spans embodiments from virtual avatars to physical robots. A successful solution would enable agents that are aware of the physical world around them, whether virtual or real, and support more plausible scene interactions. While the community has made significant strides in modelling human–object and human–scene interactions, these efforts remain largely fragmented across sub-communities in graphics, vision, and robotics, each tackling the problem from different angles with limited interdisciplinary exchange.
The workshop will host a challenge based on the recently released MM-Conv dataset [1], which captures multimodal conversational interactions in 3D environments, pairing spoken utterances with motion capture of co-speech gestures and scene representations. Its annotations for referential gestures provide a unique testbed for evaluating whether generated motion is temporally aligned with speech and spatially grounded in the environment.
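To make the pairing of modalities concrete, the sketch below shows one way a single sample could be organized: an utterance with its dialogue context, the accompanying motion capture, the 3D scene, and referential-gesture annotations. The field names and structure are illustrative assumptions for exposition, not the actual MM-Conv schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class SceneObject:
    # Hypothetical object representation; the dataset's real scene format may differ.
    object_id: str
    category: str                       # e.g. "sofa", "lamp"
    bbox_center: np.ndarray             # (3,) center in scene coordinates
    bbox_extent: np.ndarray             # (3,) half-extents along each axis


@dataclass
class ConversationSample:
    # Hypothetical per-sample layout pairing speech, motion, and scene.
    utterance_text: str                 # transcript of the spoken utterance
    audio_path: str                     # waveform file for the utterance
    context: List[str]                  # preceding utterances in the dialogue
    motion: np.ndarray                  # (T, J, 3) mocap joint positions over T frames
    scene_objects: List[SceneObject]    # 3D scene representation
    referent_ids: List[str]             # objects indicated by referential gestures
    gesture_spans: List[Tuple[int, int]]  # (start_frame, end_frame) of those gestures
```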
Task: Given a spoken utterance, conversational context, and a virtual 3D scene, participants must generate co-speech gestures that are both communicatively appropriate and spatially consistent with the referenced objects. This task requires jointly reasoning about language semantics, scene geometry, and motion dynamics.
Evaluation: Submissions will be evaluated on motion quality (Fréchet Gesture Distance and diversity) and spatial grounding accuracy, measuring whether generated referential gestures correctly indicate the target objects in the scene. Baselines and evaluation code will be released alongside the challenge.
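As an illustration of how spatial grounding could be scored, the sketch below checks whether a pointing ray (elbow through wrist) indicates the annotated target object, either by angular proximity to the object's bounding-box center or by ray–box intersection. This is a hedged example under assumed input structures, not the challenge's official metric or code.

```python
import numpy as np


def pointing_ray_hits_target(elbow, wrist, target_center, target_extent,
                             angle_thresh_deg=15.0):
    """Return True if the elbow->wrist ray indicates the target object.

    Illustrative check only: accepts the gesture if the angle between the
    pointing direction and the direction to the box center is below a
    threshold, or if the ray intersects the axis-aligned bounding box.
    """
    direction = wrist - elbow
    direction = direction / (np.linalg.norm(direction) + 1e-8)

    # Angular test against the bounding-box center.
    to_target = target_center - wrist
    to_target = to_target / (np.linalg.norm(to_target) + 1e-8)
    cos_angle = np.clip(float(np.dot(direction, to_target)), -1.0, 1.0)
    if np.degrees(np.arccos(cos_angle)) <= angle_thresh_deg:
        return True

    # Slab test: forward ray vs axis-aligned bounding box.
    box_min = target_center - target_extent
    box_max = target_center + target_extent
    t_min, t_max = 0.0, np.inf
    for axis in range(3):
        if abs(direction[axis]) < 1e-8:
            if wrist[axis] < box_min[axis] or wrist[axis] > box_max[axis]:
                return False
            continue
        t1 = (box_min[axis] - wrist[axis]) / direction[axis]
        t2 = (box_max[axis] - wrist[axis]) / direction[axis]
        t_min = max(t_min, min(t1, t2))
        t_max = min(t_max, max(t1, t2))
    return t_min <= t_max


def grounding_accuracy(gestures, scene_objects):
    """Fraction of referential gestures whose pointing ray indicates the target.

    `gestures`: list of dicts with 'elbow', 'wrist' (3,) arrays and 'target_id'.
    `scene_objects`: dict mapping object_id -> (center, extent) arrays.
    Both structures are hypothetical, chosen for this sketch.
    """
    hits = 0
    for g in gestures:
        center, extent = scene_objects[g["target_id"]]
        if pointing_ray_hits_target(g["elbow"], g["wrist"], center, extent):
            hits += 1
    return hits / max(len(gestures), 1)
```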