New work out now in TMLR: Evaluating Compositional Scene Understanding in Multimodal Generative Models. We find that multimodal models have improved in their abilities to both generate and interpret compositional visual scenes, but still lag behind humans.
arxiv.org/abs/2503.23125
Evaluating Compositional Scene Understanding in Multimodal Generative Models
The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit ...