Taylor Webb: New work out now in TMLR: Evaluating Compositional Scene Understanding in Multimodal Generative Models. We find that multimodal models have improved in their abilities to both generate and interpret compositional visual scenes, but still lag behind humans. arxiv.org/abs/2503.23125

See full post

Taylor Webb taylorwwebb.bsky.social
New work out now in TMLR: Evaluating Compositional Scene Understanding in Multimodal Generative Models. We find that multimodal models have improved in their abilities to both generate and interpret compositional visual scenes, but still lag behind humans. arxiv.org/abs/2503.23125
Evaluating Compositional Scene Understanding in Multimodal Generative Models

The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit ...

arxiv.org
Apr 3, 2025 20:04
0 reposts 0 quotes 0 likes

View on Bluesky Show all post labels

An unhandled error has occurred. Reload 🗙