This paper proposes a new paradigm called Open-vocabulary Multimodal Emotion Recognition (OV-MER) that removes the fixed label set, allowing a model to predict any number of emotion labels from an open vocabulary and advancing emotion recognition from basic categories to more nuanced emotional states.
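To make the paradigm shift concrete, here is a minimal sketch (not the paper's actual protocol) of how open-vocabulary predictions might be scored: since neither the number nor the vocabulary of labels is fixed, predicted and reference label sets are compared after mapping free-form labels to canonical groups. The `SYNONYMS` table and `set_f1` helper are illustrative assumptions, not the paper's matching strategy.

```python
# Hypothetical synonym grouping; a real system might use a lexicon or an LLM.
SYNONYMS = {
    "happy": {"happy", "joyful", "cheerful"},
    "angry": {"angry", "irritated", "annoyed"},
}

def canon(label: str) -> str:
    """Map a free-form label to a canonical group when one is known."""
    for head, group in SYNONYMS.items():
        if label.lower() in group:
            return head
    return label.lower()

def set_f1(predicted: list[str], reference: list[str]) -> float:
    """Set-level F1 between open-vocabulary label sets of any size."""
    pred, ref = {canon(p) for p in predicted}, {canon(r) for r in reference}
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(set_f1(["joyful", "nervous"], ["happy", "anxious", "nervous"]))  # 0.8
```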
RACC, a framework that learns to compress and aggregate retrieved contexts, achieves state-of-the-art performance on knowledge-based visual question answering tasks while significantly reducing inference latency and storage requirements.
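A minimal sketch of the compress-and-aggregate idea, under the assumption that it resembles learned-query compression: each retrieved context is condensed into a few summary tokens via cross-attention, and the per-context summaries are pooled with learned relevance weights into one compact representation for the answer generator. The `ContextCompressor` module and all sizes are illustrative, not RACC's actual design.

```python
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    """Compress each retrieved context into a few tokens, then aggregate."""

    def __init__(self, dim: int = 256, n_summary: int = 4, n_heads: int = 4):
        super().__init__()
        self.summary = nn.Parameter(torch.randn(n_summary, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.scorer = nn.Linear(dim, 1)  # relevance score per context

    def forward(self, contexts: torch.Tensor) -> torch.Tensor:
        """contexts: (batch, n_ctx, ctx_len, dim) -> (batch, n_summary, dim)."""
        b, n_ctx, ctx_len, dim = contexts.shape
        # Learned summary tokens attend over each context's token sequence.
        queries = self.summary.expand(b * n_ctx, -1, -1)
        flat = contexts.reshape(b * n_ctx, ctx_len, dim)
        compressed, _ = self.attn(queries, flat, flat)
        compressed = compressed.reshape(b, n_ctx, -1, dim)
        # Aggregate across contexts with softmax relevance weights.
        weights = self.scorer(compressed.mean(dim=2)).softmax(dim=1)  # (b, n_ctx, 1)
        return (compressed * weights.unsqueeze(-1)).sum(dim=1)

compressor = ContextCompressor()
retrieved = torch.randn(2, 10, 32, 256)  # 10 retrieved passages of 32 tokens each
print(compressor(retrieved).shape)       # torch.Size([2, 4, 256])
```

Storing only the few compressed tokens per question, rather than full retrieved passages, is what would yield the latency and storage savings the summary describes.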
Omni-SMoLA is an efficient architecture that uses a soft mixture of many low-rank multimodal experts to improve the performance of generalist large multimodal models across a wide range of vision-and-language tasks, often matching or outperforming specialized models.
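A minimal sketch of a soft mixture of low-rank experts, assuming a design in the spirit of Omni-SMoLA: each expert is a LoRA-style rank-r update on a frozen base projection, and every token combines all experts through softmax router weights rather than hard top-k routing. The `SoftMoLALinear` name, shapes, and initialization choices are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SoftMoLALinear(nn.Module):
    """Frozen base linear layer plus a softly mixed set of low-rank experts."""

    def __init__(self, dim: int = 512, n_experts: int = 8, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)           # stands in for a frozen pretrained weight
        self.router = nn.Linear(dim, n_experts)   # per-token mixing weights
        self.down = nn.Parameter(torch.randn(n_experts, dim, rank) * 0.02)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, dim))  # zero-init, LoRA-style

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq, dim) -> (batch, seq, dim)."""
        # Soft mixture: every token weights ALL experts, no top-k selection.
        weights = self.router(x).softmax(dim=-1)                      # (b, s, n_experts)
        # Per-expert low-rank update: x @ down_e @ up_e for each expert e.
        delta = torch.einsum("bsd,edr,erk->bsek", x, self.down, self.up)
        return self.base(x) + torch.einsum("bsek,bse->bsk", delta, weights)

layer = SoftMoLALinear()
tokens = torch.randn(2, 16, 512)  # e.g., a fused vision + text token sequence
print(layer(tokens).shape)        # torch.Size([2, 16, 512])
```

Because each expert adds only 2 * dim * rank parameters, many experts can be mixed per layer at modest cost, which is what lets one generalist model cover tasks that would otherwise need specialized fine-tunes.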