Basic Concepts
Efficiently reducing visual tokens while maintaining model performance is crucial for large multimodal models.
Summary
The study presents PruMerge, an adaptive visual token reduction approach for large multimodal models. It examines how computational cost grows with the number of visual tokens and proposes a method to reduce them efficiently. The content covers the abstract, introduction, related work, method overview, experiments, efficiency analysis, ablation studies, conclusion, limitations, future work, and acknowledgments.
Abstract:
- Large Multimodal Models (LMMs) face increased computational costs due to the number of visual tokens.
- PruMerge is introduced as an adaptive token reduction approach to address this issue efficiently.
Introduction:
- LMMs pair LLM backbones with visual encoders such as CLIP-ViT to generate text grounded in images.
- Computational costs are high due to the large number of input tokens fed into LLM backbones.
Related Work:
- Previous works reduce computation costs by replacing LLM backbones with smaller models or by applying quantization.
- Efficient LMMs like Gemini and MobileVLM explore compact models suitable for low-memory devices.
Method Overview:
- PruMerge involves Adaptive Important Token Selection (AITS) via Outlier Detection and Token Supplement (TS) via Similar Key Clustering.
- The process includes selecting important visual tokens based on similarity and merging them strategically.
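The two stages above can be sketched in heavily simplified NumPy. The IQR-based outlier rule, the nearest-key cluster assignment, and the attention-weighted merge are illustrative assumptions standing in for the paper's exact formulation:

```python
import numpy as np

def prumerge_sketch(keys, attn_scores):
    """Simplified sketch of PruMerge's two stages (not the paper's exact math).

    keys:        (N, d) visual-token key vectors from the ViT
    attn_scores: (N,) class-token attention score per visual token
    """
    # Stage 1: Adaptive Important Token Selection via outlier detection.
    # Keep tokens whose attention score is an upper IQR outlier.
    q1, q3 = np.percentile(attn_scores, [25, 75])
    keep = attn_scores > q3 + 1.5 * (q3 - q1)
    kept_idx = np.where(keep)[0]
    pruned_idx = np.where(~keep)[0]
    if len(kept_idx) == 0:  # degenerate case: no outliers, keep everything
        return keys

    # Stage 2: Token Supplement via similar-key clustering.
    # Assign each pruned token to its most similar kept token, then
    # replace each kept token with the attention-weighted mean of its cluster.
    sims = keys[pruned_idx] @ keys[kept_idx].T   # (P, K) key similarity
    assign = sims.argmax(axis=1)                 # nearest kept token per pruned token
    merged = []
    for j, k in enumerate(kept_idx):
        cluster = np.concatenate([[k], pruned_idx[assign == j]])
        w = attn_scores[cluster] / attn_scores[cluster].sum()
        merged.append((w[:, None] * keys[cluster]).sum(axis=0))
    return np.stack(merged)
```

The output has one token per selected outlier, so downstream sequence length shrinks from N to the (data-dependent) number of important tokens.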
Experiments:
- PruMerge applied to LLaVA-1.5 compresses visual tokens by 14.4 times on average while maintaining performance across tasks.
- Efficiency analysis shows significant reductions in FLOPs and memory usage with PruMerge.
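As a back-of-envelope check on the FLOPs claim, the compression ratio can be plugged into a standard transformer prefill cost model. The 576-token count (a 24×24 patch grid, as in the LLaVA-1.5 setup) and the per-layer FLOP approximation below are assumptions for illustration, not figures from the paper:

```python
def prefill_flops(n_tokens, d_model=4096, n_layers=32):
    """Approximate prefill FLOPs for a decoder-only transformer.

    Per layer: ~24*n*d^2 for the attention projections and 4x-wide MLP,
    plus ~4*n^2*d for the attention score/value computation.
    """
    per_layer = 24 * n_tokens * d_model**2 + 4 * n_tokens**2 * d_model
    return n_layers * per_layer

full = prefill_flops(576)                     # all visual tokens
pruned = prefill_flops(round(576 / 14.4))     # ~40 tokens after 14.4x compression
print(f"{full / pruned:.1f}x fewer prefill FLOPs for the visual tokens")
```

Because the projection/MLP term is linear in sequence length and attention is quadratic, the FLOP savings slightly exceed the raw 14.4x token compression.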
Ablation Studies:
- Comparisons show that PruMerge outperforms sequential and spatial sampling strategies consistently.
- Adding each PruMerge module progressively improves downstream performance.
Conclusion:
- PruMerge demonstrates significant computational savings without compromising reasoning capabilities in LMMs.
- Further exploration into efficiency-performance balance in LMMs is encouraged.
Limitations and Future Work:
- Challenges include achieving lossless compression and validating scalability to larger-scale models like LLaVA-Next.
Acknowledgement:
Acknowledges support from NSF CAREER IIS2150012, other grants funded by the Korea government (MSIT), and the Microsoft Accelerate Foundation Models Research Program.
Statistics
Empirically, when applied to LLaVA-1.5 [Liu et al., 2023a], our approach can compress the visual tokens by 14.4 times on average.
Quotes
"Our work demonstrates the effectiveness of building efficient large multimodal models from the perspective of visual token pruning."
"By leveraging spatial redundancy in visual tokens, we proposed a plug-and-play token reduction module."