Different fusion model designs
BRAVE [16] and MouSi [11] use sequence concatenation, LLaVA-HR [33] employs an MR-Adapter, Mini-Gemini [23] uses cross-attention, and EAGLE [40] utilizes channel concatenation.
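To make the two concatenation schemes concrete, below is a minimal sketch contrasting sequence concatenation (stacking tokens from each encoder along the token axis, as in BRAVE/MouSi) with channel concatenation (stacking along the feature axis, as in EAGLE). The encoder names, feature dimensions, and projection layers are illustrative assumptions, not the papers' exact implementations.

```python
# Illustrative comparison of sequence vs. channel concatenation for two
# vision encoders. Shapes and projections are assumed for the sketch only.
import torch
import torch.nn as nn

B, N, D1, D2, D_llm = 2, 196, 1024, 1536, 4096
feats_a = torch.randn(B, N, D1)   # e.g. CLIP-style encoder tokens (assumed)
feats_b = torch.randn(B, N, D2)   # e.g. DINO/SAM-style encoder tokens (assumed)

# Sequence concatenation (BRAVE/MouSi style): project each stream to the
# LLM width, then stack along the token axis -> (B, 2N, D_llm).
proj_a = nn.Linear(D1, D_llm)
proj_b = nn.Linear(D2, D_llm)
seq_tokens = torch.cat([proj_a(feats_a), proj_b(feats_b)], dim=1)

# Channel concatenation (EAGLE style): align token counts, stack along the
# feature axis, then project the fused channels once -> (B, N, D_llm).
chan_proj = nn.Linear(D1 + D2, D_llm)
chan_tokens = chan_proj(torch.cat([feats_a, feats_b], dim=-1))

print(seq_tokens.shape, chan_tokens.shape)  # (2, 392, 4096) (2, 196, 4096)
```

Sequence concatenation doubles the number of visual tokens fed to the LLM, while channel concatenation keeps the token count fixed at the cost of requiring spatially aligned features from the encoders.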
1. LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models
2. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
3. BRAVE: Broadening the Visual Encoding of Vision-Language Models
4. EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
5. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models