CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
Published in arXiv preprint, 2025
Moritz Böhle*, Amélie Royer*, Juliette Marrie*, Edouard Grave, Patrick Pérez (*equal contribution)
CASA is a novel vision-language modeling technique that builds on, and improves, cross-attention for multimodal fusion.
Paper · Project page · Code
