预计时长:6-8周
目标:理解视觉-语言模型,掌握CLIP、BLIP等多模态架构,完成图文理解任务
多模态学习:处理和融合多种类型数据(模态)的机器学习方法
常见模态:
- 视觉:图像、视频
- 语言:文本、语音
- 结构化数据:表格、图谱
- 传感器数据:雷达、激光雷达(LiDAR)
多模态AI应用:
- 图像描述生成(Image Captioning)
- 视觉问答(VQA)
- 图文检索(Image-Text Retrieval)
- 文生图(Text-to-Image)
- 视频理解
- 多模态对话
| 问题 | 说明 | 挑战 |
|---|---|---|
| 表示学习 | 如何表示不同模态 | 模态间语义差异 |
| 对齐 | 建立模态间对应关系 | 细粒度匹配 |
| 融合 | 整合多模态信息 | 融合策略选择 |
| 协同学习 | 模态间知识迁移 | 弱模态利用 |
| 生成 | 跨模态生成 | 保真度和多样性 |
1. 早期融合(Early Fusion)
- 在输入层拼接多模态特征
- 优点:简单直接
- 缺点:要求模态在输入层对齐,且难以处理模态缺失
Image ──┐
├──► Concat ──► Model ──► Output
Text ──┘
2. 晚期融合(Late Fusion)
- 各模态独立编码,在决策层融合
- 优点:模态独立
- 缺点:交互不足
Image ──► Image Encoder ──┐
├──► Fusion ──► Output
Text ──► Text Encoder ──┘
3. 混合融合(Hybrid Fusion)
- 多层次交互
- 优点:充分交互
- 缺点:复杂度高
Image ──► ┌─────────────────────────┐
│ Cross-Modal Attention │ ──► Output
Text ──► └─────────────────────────┘
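下面用一个极简的PyTorch示意对比早期融合与晚期融合的代码形态(特征维度、网络结构均为假设,仅作演示;混合融合可参考后文的交叉注意力实现):

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """早期融合:在输入层拼接图像/文本特征,再送入同一个模型"""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes)
        )

    def forward(self, img_feat, txt_feat):
        return self.model(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """晚期融合:各模态独立打分,在决策层(logits)上融合"""
    def __init__(self, img_dim=512, txt_dim=512, num_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return (self.img_head(img_feat) + self.txt_head(txt_feat)) / 2

# 用随机特征演示两种融合方式的调用(batch=4)
img_feat, txt_feat = torch.randn(4, 512), torch.randn(4, 512)
print(EarlyFusion()(img_feat, txt_feat).shape)  # torch.Size([4, 10])
print(LateFusion()(img_feat, txt_feat).shape)   # torch.Size([4, 10])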
对比学习核心思想:
- 拉近相似样本(正样本对)
- 推远不相似样本(负样本对)
InfoNCE Loss:
L = -log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ))
其中:
- z_i, z_j: 正样本对的表示
- τ: 温度参数
- sim: 相似度函数(通常是余弦相似度)
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
"""对比学习损失"""
# 归一化
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
# 计算相似度矩阵
logits = torch.matmul(image_features, text_features.T) / temperature
# 标签:对角线为正样本
batch_size = image_features.shape[0]
labels = torch.arange(batch_size, device=logits.device)
# 双向对比损失
loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2

- 理解多模态学习的基本概念
- 理解模态融合的不同策略
- 理解对比学习的原理
CLIP (Contrastive Language-Image Pre-training)
- OpenAI发布(2021)
- 4亿图文对训练
- 强大的零样本迁移能力
核心创新:
1. 大规模图文对比学习
2. 自然语言作为监督信号
3. 无需任务特定训练即可迁移
Image Text
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Image Encoder │ │ Text Encoder │
│ (ViT/ResNet) │ │ (Transformer) │
└───────┬────────┘ └───────┬────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Projection │ │ Projection │
└───────┬────────┘ └───────┬────────┘
│ │
▼ ▼
I₁ I₂ I₃ ... Iₙ T₁ T₂ T₃ ... Tₙ
│ │
└─────────────┬──────────────┘
▼
Contrastive Loss
(对角线最大化)
┌─────────────────────────────────┐
│ T₁ T₂ T₃ ... Tₙ │
│ I₁ [✓] [✗] [✗] [✗] │
│ I₂ [✗] [✓] [✗] [✗] │
│ I₃ [✗] [✗] [✓] [✗] │
│ ... │
│ Iₙ [✗] [✗] [✗] [✓] │
└─────────────────────────────────┘
import torch
import clip
from PIL import Image
# 加载模型
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# 准备图像和文本
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)
# 提取特征
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# 归一化
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# 计算相似度
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probs: {similarity[0]}") # [0.95, 0.03, 0.02]def zero_shot_classification(image, class_names, model, preprocess):
"""零样本图像分类"""
# 创建文本提示
text_prompts = [f"a photo of a {name}" for name in class_names]
# 编码
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = clip.tokenize(text_prompts).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
text_features = model.encode_text(text_inputs)
# 归一化
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# 计算相似度
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# 获取预测
probs, indices = similarity[0].topk(5)
return [(class_names[idx], prob.item()) for idx, prob in zip(indices, probs)]
# 使用示例
class_names = ["dog", "cat", "car", "airplane", "ship"]
predictions = zero_shot_classification(image, class_names, model, preprocess)
print(predictions)

class CLIPRetrieval:
"""使用CLIP进行图文检索"""
def __init__(self, model, preprocess):
self.model = model
self.preprocess = preprocess
self.image_features = None
self.images = []
def build_index(self, image_paths):
"""构建图像索引"""
self.images = image_paths
features_list = []
for path in image_paths:
image = self.preprocess(Image.open(path)).unsqueeze(0).to(device)
with torch.no_grad():
features = self.model.encode_image(image)
features = features / features.norm(dim=-1, keepdim=True)
features_list.append(features)
self.image_features = torch.cat(features_list, dim=0)
def search_by_text(self, query_text, top_k=5):
"""文本检索图像"""
text = clip.tokenize([query_text]).to(device)
with torch.no_grad():
text_features = self.model.encode_text(text)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features @ self.image_features.T).squeeze(0)
values, indices = similarity.topk(top_k)
return [(self.images[idx], val.item()) for idx, val in zip(indices, values)]
def search_by_image(self, query_image_path, top_k=5):
"""图像检索图像"""
image = self.preprocess(Image.open(query_image_path)).unsqueeze(0).to(device)
with torch.no_grad():
query_features = self.model.encode_image(image)
query_features = query_features / query_features.norm(dim=-1, keepdim=True)
similarity = (query_features @ self.image_features.T).squeeze(0)
values, indices = similarity.topk(top_k)
        return [(self.images[idx], val.item()) for idx, val in zip(indices, values)]

import open_clip
# 加载OpenCLIP(开源实现,更多模型选择)
model, _, preprocess = open_clip.create_model_and_transforms(
'ViT-B-32',
pretrained='laion2b_s34b_b79k' # 更大的预训练数据集
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')
# 可用模型
# - ViT-B-32, ViT-B-16, ViT-L-14
# - ViT-H-14, ViT-g-14 (更大模型)
# - 中文CLIP模型
# 使用中文CLIP(注:'chinese_clip_base' 仅为示意名称;开源中文CLIP通常通过 cn_clip 库
# 或 Hugging Face 上的 OFA-Sys/chinese-clip-* 权重加载,
# open_clip 实际可用的 (模型, 权重) 组合可用 open_clip.list_pretrained() 查看)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16',
    pretrained='chinese_clip_base'  # 示意名称,请替换为实际可用的权重
)

- 理解CLIP的对比学习训练
- 使用CLIP进行零样本分类
- 实现图文检索系统
- 了解OpenCLIP和中文CLIP
BLIP (Bootstrapping Language-Image Pre-training)
三个预训练任务:
1. Image-Text Contrastive Learning (ITC)
2. Image-Text Matching (ITM)
3. Language Modeling (LM)
┌─────────────────────────────────────────┐
│ Image Encoder │
│ (ViT) │
└──────────────────┬──────────────────────┘
│
┌──────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ ITC │ │ ITM │ │ LM │
│ (对比学习) │ │ (匹配分类) │ │ (文本生成) │
│ │ │ │ │ │
│ Text Encoder │ │ Image-grounded│ │ Image-grounded│
│ (Unimodal) │ │ Text Encoder │ │ Text Decoder │
└──────────────┘ └──────────────┘ └──────────────┘
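下面用一个极简示意展示 ITC 与 ITM 两个目标的计算形式(假设已分别得到投影后的图像/文本特征;itm_head 与"拼接特征"均为示意,真实 BLIP 的 ITM 使用图文交叉注意力后的 [CLS] 表示;LM 目标即标准的自回归交叉熵,此处省略):

import torch
import torch.nn as nn
import torch.nn.functional as F

def itc_loss(image_feat, text_feat, temperature=0.07):
    """ITC:与CLIP相同的图文双向对比损失(特征已投影到同一维度)"""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.T / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# ITM:二分类头判断图文是否匹配(这里用拼接特征代替跨模态[CLS]表示,仅为示意)
itm_head = nn.Linear(256 * 2, 2)

def itm_loss(fused_image_feat, fused_text_feat, match_labels):
    """match_labels: 1 表示图文匹配,0 表示不匹配(通常从难负样本中采样)"""
    logits = itm_head(torch.cat([fused_image_feat, fused_text_feat], dim=-1))
    return F.cross_entropy(logits, match_labels)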
from transformers import BlipProcessor, BlipForConditionalGeneration
# 加载模型
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
# 生成描述
image = Image.open("image.jpg")
inputs = processor(image, return_tensors="pt")
# 无条件生成
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"Caption: {caption}")
# 条件生成(提供前缀)
inputs = processor(image, text="a photo of", return_tensors="pt")
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)

BLIP-2: 使用Q-Former连接视觉编码器和LLM
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Frozen Image │ │ Q-Former │ │ Frozen LLM │
│ Encoder │────►│ (Lightweight) │────►│ (OPT/FlanT5) │
│ (ViT-g) │ │ Cross-Attention│ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
冻结 可训练 冻结
Q-Former特点:
- 少量可学习查询向量
- 通过交叉注意力提取视觉信息
- 高效连接视觉和语言模型
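下面是一个极简的 Q-Former 风格示意,仅演示"少量可学习查询 + 交叉注意力压缩视觉token"这一核心思想(真实 Q-Former 还包含自注意力、FFN 和文本分支,数值均为假设):

import torch
import torch.nn as nn

num_queries, dim = 32, 768
query_tokens = nn.Parameter(torch.randn(1, num_queries, dim))      # 可学习查询向量
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

visual_tokens = torch.randn(2, 257, dim)       # 假设ViT输出257个patch token(batch=2)
queries = query_tokens.expand(2, -1, -1)
compressed, _ = cross_attn(queries, visual_tokens, visual_tokens)  # 查询提取视觉信息
print(compressed.shape)                        # torch.Size([2, 32, 768]),再送入LLM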
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# 加载BLIP-2
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16
)
model.to("cuda")
# 视觉问答
image = Image.open("image.jpg")
prompt = "Question: What is in this image? Answer:"
inputs = processor(image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

LLaVA (Large Language and Vision Assistant)
架构:
- 视觉编码器:CLIP ViT-L/14
- 语言模型:Vicuna (LLaMA微调)
- 连接层:线性投影(LLaVA-1.5 改用两层MLP投影)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIP ViT │────►│ Projector │────►│ Vicuna │
│ (Frozen) │ │ (Linear) │ │ (LoRA) │
└─────────────┘ └─────────────┘ └─────────────┘
训练策略:
Stage 1: 预训练连接层(图文对)
Stage 2: 指令微调(视觉指令数据)
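LLaVA的核心是把投影后的视觉特征当作若干"视觉token"拼进LLM的输入嵌入序列。下面是一个示意性的最小片段(projector、各维度均为假设,真实实现请参考LLaVA代码库):

import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)         # LLaVA-1.5中为两层MLP

image_features = torch.randn(1, 576, vision_dim)   # 假设CLIP ViT-L/14@336输出576个patch特征
text_embeds = torch.randn(1, 20, llm_dim)          # 假设已由LLM的embedding层得到文本嵌入

image_embeds = projector(image_features)                       # (1, 576, llm_dim)
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # 作为inputs_embeds送入LLM
print(inputs_embeds.shape)                                     # torch.Size([1, 596, 4096])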
from transformers import LlavaProcessor, LlavaForConditionalGeneration
# 加载LLaVA
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained(
"llava-hf/llava-1.5-7b-hf",
torch_dtype=torch.float16
)
model.to("cuda")
# 对话
image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=200)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

from transformers import AutoModelForCausalLM, AutoTokenizer
# 加载Qwen-VL
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat",
device_map="cuda",
trust_remote_code=True
).eval()
# 单图对话
query = tokenizer.from_list_format([
{'image': 'image.jpg'},
{'text': '描述这张图片的内容'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 多轮对话
response, history = model.chat(tokenizer, '图中有几个人?', history=history)
print(response)
# 多图对话
query = tokenizer.from_list_format([
{'image': 'image1.jpg'},
{'image': 'image2.jpg'},
{'text': '这两张图有什么区别?'},
])
response, history = model.chat(tokenizer, query=query, history=None)

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
"""视觉Transformer"""
def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12):
super().__init__()
num_patches = (image_size // patch_size) ** 2
patch_dim = 3 * patch_size ** 2
# Patch嵌入
self.patch_embed = nn.Sequential(
nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
nn.Flatten(2), # (B, dim, num_patches)
)
# 位置编码
self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim))
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
# Transformer层
self.transformer = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
num_layers=depth
)
self.norm = nn.LayerNorm(dim)
def forward(self, x):
# Patch嵌入
x = self.patch_embed(x).transpose(1, 2) # (B, num_patches, dim)
# 添加CLS token
cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)
x = torch.cat([cls_tokens, x], dim=1)
# 添加位置编码
x = x + self.pos_embed
# Transformer
x = self.transformer(x)
x = self.norm(x)
        return x[:, 0]  # 返回CLS token

- 理解BLIP的多任务预训练
- 使用BLIP进行图像描述
- 理解LLaVA的架构设计
- 使用多模态模型进行视觉问答
多模态大模型的核心问题:如何将视觉信息注入LLM
常见方案:
1. Adapter方案(LLaVA, MiniGPT-4)
- 线性投影或MLP连接视觉和语言
- 简单高效
2. Q-Former方案(BLIP-2, InstructBLIP)
- 可学习查询提取视觉信息
- 压缩视觉token数量
3. Perceiver方案
- 交叉注意力处理任意模态
- 统一的多模态处理
4. 原生多模态(Gemini)
- 从头训练的多模态模型
- 更深度的模态融合
MiniGPT-4架构:
- 视觉编码器:BLIP-2的ViT + Q-Former(冻结)
- 语言模型:Vicuna-7B/13B(冻结)
- 连接层:单层线性投影(训练)
训练策略:
Stage 1: 预训练(大规模图文对)
Stage 2: 指令微调(高质量对话数据)
GPT-4V核心能力:
1. 图像理解
- 场景描述
- 物体识别
- 文字识别(OCR)
- 图表理解
2. 视觉推理
- 空间关系
- 数量统计
- 因果推理
3. 多图理解
- 图像对比
- 时序理解
- 关联分析
4. 创意生成
- 基于图像写作
- 代码生成
- 多语言描述
局限性:
- 空间推理仍有不足
- 细粒度识别能力有限
- 可能产生幻觉
class VisualAdapter(nn.Module):
"""视觉适配器:连接视觉编码器和LLM"""
def __init__(self, visual_dim, llm_dim, num_query_tokens=32):
super().__init__()
# 方案1: 简单线性投影
self.linear_proj = nn.Linear(visual_dim, llm_dim)
# 方案2: MLP投影
self.mlp_proj = nn.Sequential(
nn.Linear(visual_dim, llm_dim),
nn.GELU(),
nn.Linear(llm_dim, llm_dim)
)
# 方案3: Q-Former风格
self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, llm_dim))
self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
def forward_linear(self, visual_features):
"""线性投影"""
return self.linear_proj(visual_features)
def forward_mlp(self, visual_features):
"""MLP投影"""
return self.mlp_proj(visual_features)
def forward_qformer(self, visual_features):
"""Q-Former风格"""
batch_size = visual_features.shape[0]
query_tokens = self.query_tokens.expand(batch_size, -1, -1)
# 交叉注意力
output, _ = self.cross_attn(query_tokens, visual_features, visual_features)
        return output

# 多模态提示模板
# 1. 图像描述
prompt_caption = """
<image>
Describe this image in detail. Include:
- Main subjects and their positions
- Colors and lighting
- Actions or movements
- Mood and atmosphere
"""
# 2. 视觉问答
prompt_vqa = """
<image>
Question: {question}
Please answer based on what you see in the image.
Answer:
"""
# 3. 图像对比
prompt_compare = """
<image1> <image2>
Compare these two images and describe:
1. Similarities
2. Differences
3. Any notable changes
"""
# 4. OCR和文档理解
prompt_ocr = """
<image>
Extract all text from this image and organize it in a structured format.
If it's a table, present it in markdown format.
"""
# 5. 推理任务
prompt_reasoning = """
<image>
Based on this image, answer the following:
1. What time of day is it likely to be?
2. What season might this be?
3. What might happen next?
Explain your reasoning.
"""- 理解多模态大模型的架构设计
- 了解不同视觉适配器方案
- 掌握多模态Prompt设计
- 能使用开源多模态模型
目标:构建基于CLIP的图文检索系统
要求:
- 使用CLIP提取图像和文本特征
- 构建向量索引(Faiss)
- 实现文本搜图和以图搜图
- 部署Web界面
import faiss
import numpy as np
class CLIPSearchEngine:
def __init__(self, model, preprocess, index_dim=512):
self.model = model
self.preprocess = preprocess
self.index = faiss.IndexFlatIP(index_dim) # 内积相似度
self.image_paths = []
def add_images(self, image_paths):
"""添加图像到索引"""
features = []
for path in image_paths:
image = self.preprocess(Image.open(path)).unsqueeze(0)
with torch.no_grad():
feat = self.model.encode_image(image)
feat = feat / feat.norm(dim=-1, keepdim=True)
features.append(feat.cpu().numpy())
features = np.vstack(features).astype('float32')
self.index.add(features)
self.image_paths.extend(image_paths)
def search(self, query, k=10):
"""搜索"""
if isinstance(query, str):
# 文本查询
text = clip.tokenize([query])
with torch.no_grad():
query_feat = self.model.encode_text(text)
else:
# 图像查询
image = self.preprocess(query).unsqueeze(0)
with torch.no_grad():
query_feat = self.model.encode_image(image)
query_feat = query_feat / query_feat.norm(dim=-1, keepdim=True)
query_feat = query_feat.cpu().numpy().astype('float32')
scores, indices = self.index.search(query_feat, k)
results = []
for score, idx in zip(scores[0], indices[0]):
results.append({
'path': self.image_paths[idx],
'score': float(score)
})
        return results

目标:构建交互式视觉问答系统
要求:
- 使用LLaVA或BLIP-2
- 支持多轮对话
- 支持多图输入
- 部署Gradio界面
import gradio as gr
def create_vqa_interface(model, processor):
"""创建VQA Gradio界面"""
def answer_question(image, question, history):
# 构建提示
if history:
context = "\n".join([f"Q: {q}\nA: {a}" for q, a in history])
prompt = f"{context}\nQ: {question}\nA:"
else:
prompt = f"Q: {question}\nA:"
# 生成回答
inputs = processor(image, text=prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=200)
answer = processor.decode(output[0], skip_special_tokens=True)
# 更新历史
history.append((question, answer))
return answer, history
with gr.Blocks() as demo:
gr.Markdown("# 视觉问答系统")
with gr.Row():
with gr.Column():
image_input = gr.Image(type="pil", label="上传图片")
question_input = gr.Textbox(label="提问")
submit_btn = gr.Button("提交")
with gr.Column():
answer_output = gr.Textbox(label="回答")
history_state = gr.State([])
submit_btn.click(
answer_question,
inputs=[image_input, question_input, history_state],
outputs=[answer_output, history_state]
)
    return demo

目标:生成高质量图像描述
要求:
- 使用BLIP或BLIP-2
- 支持中英文描述
- 支持不同风格(简洁/详细)
- 批量处理能力
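一个批量生成描述的参考实现(沿用上文的BLIP模型;batch_size、生成参数为示意,中文/不同风格描述可通过更换模型或调整 prompt 前缀实现):

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def batch_caption(image_paths, batch_size=8, prompt=None, max_new_tokens=50):
    """批量图像描述:prompt 可作为前缀控制风格,例如 'a detailed photo of'(示意)"""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    ).to(device)

    captions = []
    for i in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        texts = [prompt] * len(images) if prompt else None
        inputs = processor(images=images, text=texts, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        captions.extend(processor.batch_decode(out, skip_special_tokens=True))
    return captions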
目标:结合图文特征进行内容推荐
要求:
- 提取商品图像和描述特征
- 用户偏好建模
- 跨模态相似度计算
- 推荐结果排序
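一个极简的思路示意(假设商品的图像/文本特征已用CLIP提取;用户偏好用其历史交互商品表示的均值刻画,按余弦相似度排序即得推荐;alpha、函数名均为示意):

import torch
import torch.nn.functional as F

def build_item_embeddings(image_feats, text_feats, alpha=0.5):
    """商品多模态表示:图像与文本特征归一化后加权融合(alpha为示意超参)"""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return F.normalize(alpha * img + (1 - alpha) * txt, dim=-1)

def recommend(user_history_ids, item_embeddings, top_k=10):
    """用户偏好 = 历史交互商品表示的均值;过滤已交互商品后按相似度取top_k"""
    user_vec = F.normalize(item_embeddings[user_history_ids].mean(dim=0, keepdim=True), dim=-1)
    scores = (user_vec @ item_embeddings.T).squeeze(0)
    scores[user_history_ids] = -float("inf")   # 不重复推荐已交互商品
    values, indices = scores.topk(top_k)
    return list(zip(indices.tolist(), values.tolist()))

# 使用示例(随机特征演示:100个商品,用户交互过其中3个)
item_emb = build_item_embeddings(torch.randn(100, 512), torch.randn(100, 512))
print(recommend(torch.tensor([1, 5, 7]), item_emb, top_k=5))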
完成以下任务后,进入阶段八(视频理解与生成):
- 多模态基础
  - 理解模态融合策略
  - 理解对比学习原理
- CLIP
  - 使用CLIP进行零样本分类
  - 实现图文检索
- 视觉语言模型
  - 使用BLIP进行图像描述
  - 使用LLaVA进行视觉问答
- 多模态大模型
  - 了解视觉适配器设计
  - 掌握多模态Prompt
- 实践项目
  - 完成图像检索或视觉问答项目
完成本阶段后,进入阶段八:视频理解与生成