U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning
Md Kaykobad Reza, Niki Nezakati, Ameya Patil, Mashhour Solh, M. Salman Asif
January 2025
Abstract
Multimodal learning systems often rely on specialized architectures and complex training procedures to achieve strong performance. In this work, we introduce Unified Unimodal Adaptation (U2A), a simple and efficient framework that jointly adapts pretrained unimodal encoders using low-rank adaptation (LoRA) for a wide range of multimodal tasks. U2A substantially reduces the number of trainable parameters and removes the need for strategies such as alternating updates, gradient manipulation, or unimodal pre-fine-tuning.
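To make the source of the parameter savings concrete, the following is a minimal pure-Python sketch of the standard LoRA formulation, in which a frozen weight W is augmented with a trainable low-rank product scaled by alpha / r. It is not the authors' implementation: the class name LoRALinear, the helper matmul, the initialization scale, and the toy dimensions are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen W (d_out x d_in) plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the adapted layer
        # initially behaves exactly like the pretrained one.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """x is a batch of input vectors (batch x d_in); returns batch x d_out."""
        delta = matmul(self.B, self.A)
        w_eff = [[w + self.scale * d for w, d in zip(w_row, d_row)]
                 for w_row, d_row in zip(self.weight, delta)]
        w_eff_t = [list(col) for col in zip(*w_eff)]  # transpose to d_in x d_out
        return matmul(x, w_eff_t)

    def trainable_params(self):
        """Only A and B are trained: rank * (d_in + d_out) values."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a layer with d_out = 4, d_in = 6, and rank 2, only 2 * (6 + 4) = 20 values are trainable instead of the 24 in the full weight. The gap grows quickly with layer size, since the trainable count scales with r * (d_in + d_out) rather than d_in * d_out.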
To address scenarios with missing modalities during training or inference, we propose Mask Tokens (MT), lightweight tokens that synthesize representations of unavailable modalities using information from the available ones. This unified mechanism avoids task-specific estimation modules or prompt-tuning techniques.
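The abstract does not specify how Mask Tokens are computed, so the sketch below shows only one possible reading: a learnable token for the missing modality attends over the tokens of the available modality, and the attended output stands in for the missing feature. The single-head attention, the mean pooling, the concatenation-based fusion, and the names attend, mean_pool, and fuse are all assumptions for illustration, not the method described in the paper.

```python
import math


def attend(query, tokens):
    """Single-head dot-product attention of one query over a list of token vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) / total
            for i in range(len(query))]


def mean_pool(tokens):
    """Average a list of token vectors into a single feature vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def fuse(image_tokens, text_tokens, mask_tokens):
    """Concatenate image and text features, estimating whichever one is missing.

    mask_tokens is a hypothetical dict with one learnable vector per modality,
    e.g. {"image": [...], "text": [...]}.
    """
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality must be available")
    if image_tokens is None:
        image_feat = attend(mask_tokens["image"], text_tokens)
    else:
        image_feat = mean_pool(image_tokens)
    if text_tokens is None:
        text_feat = attend(mask_tokens["text"], image_tokens)
    else:
        text_feat = mean_pool(text_tokens)
    return image_feat + text_feat  # list concatenation
```

Under this reading, the same code path serves both training and inference with any subset of modalities, which is the property the abstract highlights: no separate estimation module per task is needed, only one small learnable vector per modality.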
Experiments across diverse datasets demonstrate that U2A matches or exceeds the performance of state-of-the-art methods in both full-modality and missing-modality settings, while remaining parameter-efficient and easy to train.
Publication
CoRR (arXiv preprint)