
Stage 7: Multimodal Learning

Estimated duration: 6-8 weeks
Goal: understand vision-language models, master multimodal architectures such as CLIP and BLIP, and complete image-text understanding tasks


7.1 Multimodal Learning Basics (1 week)

7.1.1 What Is Multimodal Learning

Multimodal learning: machine learning methods that process and fuse multiple types of data (modalities)

Common modalities:
- Vision: images, video
- Language: text, speech
- Structured data: tables, graphs
- Sensor data: radar, LiDAR

Multimodal AI applications:
- Image captioning
- Visual question answering (VQA)
- Image-text retrieval
- Text-to-image generation
- Video understanding
- Multimodal dialogue

7.1.2 Core Problems in Multimodal Learning

| Problem | Description | Challenge |
|---------|-------------|-----------|
| Representation learning | How to represent different modalities | Semantic gap between modalities |
| Alignment | Establishing correspondences between modalities | Fine-grained matching |
| Fusion | Integrating multimodal information | Choosing a fusion strategy |
| Co-learning | Transferring knowledge between modalities | Exploiting weak modalities |
| Generation | Cross-modal generation | Fidelity and diversity |

7.1.3 Modality Fusion Strategies

Three common strategies are summarized below; a minimal code sketch of early and late fusion follows the diagrams.

1. Early fusion
   - Concatenate multimodal features at the input layer
   - Pros: simple and direct
   - Cons: requires alignment

   Image ──┐
           ├──► Concat ──► Model ──► Output
   Text  ──┘

2. Late fusion
   - Encode each modality independently, fuse at the decision layer
   - Pros: modality independence
   - Cons: limited cross-modal interaction

   Image ──► Image Encoder ──┐
                             ├──► Fusion ──► Output
   Text  ──► Text Encoder  ──┘

3. Hybrid fusion
   - Interaction at multiple levels
   - Pros: rich interaction
   - Cons: higher complexity

   Image ──► ┌─────────────────────────┐
             │ Cross-Modal Attention   │ ──► Output
   Text  ──► └─────────────────────────┘
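
A minimal PyTorch sketch (not from any specific paper; all dimensions and classifier heads are illustrative) contrasting early and late fusion:

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_classes=10):
        super().__init__()
        # Fuse by concatenating features at the input, then one shared model
        self.model = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        return self.model(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, num_classes=10):
        super().__init__()
        # Each modality gets its own head; fuse only at the decision level
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return self.img_head(img_feat) + self.txt_head(txt_feat)

img, txt = torch.randn(4, 512), torch.randn(4, 512)
print(EarlyFusion()(img, txt).shape, LateFusion()(img, txt).shape)  # both (4, 10)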

7.1.4 Contrastive Learning Basics

Core idea of contrastive learning:
- Pull similar samples (positive pairs) together
- Push dissimilar samples (negative pairs) apart

InfoNCE loss:
L = -log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ))

where:
- z_i, z_j: representations of a positive pair
- τ: temperature parameter
- sim: similarity function (usually cosine similarity)

import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Contrastive (InfoNCE) loss over a batch of image-text pairs"""
    # Normalize features to unit length
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity matrix between all images and all texts
    logits = torch.matmul(image_features, text_features.T) / temperature

    # Labels: the diagonal entries are the positive pairs
    batch_size = image_features.shape[0]
    labels = torch.arange(batch_size, device=logits.device)

    # Symmetric loss: image-to-text and text-to-image
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)

    return (loss_i2t + loss_t2i) / 2
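
A quick sanity check (illustrative shapes): with random, uncorrelated features the loss should sit near log(batch_size):

img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt).item())  # ≈ log(8) ≈ 2.08 for random features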

Learning objectives

  • Understand the basic concepts of multimodal learning
  • Understand the different modality fusion strategies
  • Understand the principles of contrastive learning

7.2 The CLIP Model (2 weeks)

7.2.1 CLIP Overview

CLIP (Contrastive Language-Image Pre-training)
- Released by OpenAI (2021)
- Trained on 400 million image-text pairs
- Strong zero-shot transfer ability

Core innovations:
1. Large-scale image-text contrastive learning
2. Natural language as the supervision signal
3. Transfers to downstream tasks without task-specific training
7.2.2 CLIP Architecture

            Image                        Text
              │                            │
              ▼                            ▼
     ┌────────────────┐          ┌────────────────┐
     │ Image Encoder  │          │  Text Encoder  │
     │ (ViT/ResNet)   │          │ (Transformer)  │
     └───────┬────────┘          └───────┬────────┘
             │                            │
             ▼                            ▼
     ┌────────────────┐          ┌────────────────┐
     │   Projection   │          │   Projection   │
     └───────┬────────┘          └───────┬────────┘
             │                            │
             ▼                            ▼
         I₁ I₂ I₃ ... Iₙ          T₁ T₂ T₃ ... Tₙ
             │                            │
             └─────────────┬──────────────┘
                           ▼
                    Contrastive Loss
                     (maximize the diagonal)
                    
        ┌─────────────────────────────────┐
        │      T₁    T₂    T₃    ...  Tₙ  │
        │ I₁  [✓]   [✗]   [✗]        [✗]  │
        │ I₂  [✗]   [✓]   [✗]        [✗]  │
        │ I₃  [✗]   [✗]   [✓]        [✗]  │
        │ ...                              │
        │ Iₙ  [✗]   [✗]   [✗]        [✓]  │
        └─────────────────────────────────┘
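
The training objective ties the diagram together: both encoders project into a shared space, and the symmetric contrastive loss maximizes the diagonal. A minimal sketch of one training step, with stand-in linear encoders and illustrative dimensions (contrastive_loss is the function defined in 7.1.4):

import torch
import torch.nn as nn

image_encoder = nn.Linear(2048, 768)   # stand-in for ViT/ResNet features
text_encoder = nn.Linear(512, 768)     # stand-in for the text Transformer
img_proj = nn.Linear(768, 512)         # projections into the shared space
txt_proj = nn.Linear(768, 512)

images = torch.randn(16, 2048)         # a batch of image features
texts = torch.randn(16, 512)           # the matching batch of text features

loss = contrastive_loss(img_proj(image_encoder(images)),
                        txt_proj(text_encoder(texts)))
loss.backward()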

7.2.3 Using CLIP

import torch
import clip
from PIL import Image

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Prepare image and text
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)

# Extract features
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Compute similarity
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(f"Label probs: {similarity[0]}")  # e.g. [0.95, 0.03, 0.02]

7.2.4 Zero-shot Image Classification

def zero_shot_classification(image, class_names, model, preprocess):
    """Zero-shot image classification"""
    # Build text prompts
    text_prompts = [f"a photo of a {name}" for name in class_names]

    # Encode
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_inputs = clip.tokenize(text_prompts).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_inputs)

        # Normalize
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)

        # Compute similarity
        similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    # Top-5 predictions (capped at the number of classes)
    probs, indices = similarity[0].topk(min(5, len(class_names)))

    return [(class_names[idx], prob.item()) for idx, prob in zip(indices, probs)]


# Usage example (pass a PIL image, not a preprocessed tensor)
image = Image.open("image.jpg")
class_names = ["dog", "cat", "car", "airplane", "ship"]
predictions = zero_shot_classification(image, class_names, model, preprocess)
print(predictions)

7.2.5 Image-Text Retrieval

class CLIPRetrieval:
    """Image-text retrieval with CLIP"""
    def __init__(self, model, preprocess):
        self.model = model
        self.preprocess = preprocess
        self.image_features = None
        self.images = []

    def build_index(self, image_paths):
        """Build the image index"""
        self.images = image_paths
        features_list = []

        for path in image_paths:
            image = self.preprocess(Image.open(path)).unsqueeze(0).to(device)
            with torch.no_grad():
                features = self.model.encode_image(image)
                features = features / features.norm(dim=-1, keepdim=True)
                features_list.append(features)

        self.image_features = torch.cat(features_list, dim=0)

    def search_by_text(self, query_text, top_k=5):
        """Retrieve images from a text query"""
        text = clip.tokenize([query_text]).to(device)

        with torch.no_grad():
            text_features = self.model.encode_text(text)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

            similarity = (text_features @ self.image_features.T).squeeze(0)
            values, indices = similarity.topk(top_k)

        return [(self.images[idx], val.item()) for idx, val in zip(indices, values)]

    def search_by_image(self, query_image_path, top_k=5):
        """Retrieve images from an image query"""
        image = self.preprocess(Image.open(query_image_path)).unsqueeze(0).to(device)

        with torch.no_grad():
            query_features = self.model.encode_image(image)
            query_features = query_features / query_features.norm(dim=-1, keepdim=True)

            similarity = (query_features @ self.image_features.T).squeeze(0)
            values, indices = similarity.topk(top_k)

        return [(self.images[idx], val.item()) for idx, val in zip(indices, values)]
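
Illustrative usage with placeholder paths (assumes the CLIP model, preprocess, and device from 7.2.3 are loaded):

retriever = CLIPRetrieval(model, preprocess)
retriever.build_index(["cat.jpg", "dog.jpg", "car.jpg"])  # placeholder paths
print(retriever.search_by_text("a photo of a dog", top_k=2))
print(retriever.search_by_image("query.jpg", top_k=2))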

7.2.6 OpenCLIP

import open_clip

# Load OpenCLIP (an open-source implementation with many more checkpoints)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='laion2b_s34b_b79k'  # trained on the larger LAION-2B dataset
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Available models (see open_clip.list_pretrained() for the full list)
# - ViT-B-32, ViT-B-16, ViT-L-14
# - ViT-H-14, ViT-g-14 (larger models)

# Chinese CLIP checkpoints come from the separate Chinese-CLIP project
# (the cn_clip package) rather than from open_clip tags, e.g.:
# from cn_clip.clip import load_from_name
# model, preprocess = load_from_name("ViT-B-16", device=device)
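
A minimal inference sketch with the ViT-B-32 model, preprocess, and tokenizer loaded above ("image.jpg" is a placeholder):

import torch
from PIL import Image

image = preprocess(Image.open("image.jpg")).unsqueeze(0)
text = tokenizer(["a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)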

Learning objectives

  • Understand CLIP's contrastive pre-training
  • Use CLIP for zero-shot classification
  • Implement an image-text retrieval system
  • Know about OpenCLIP and Chinese CLIP

7.3 Vision-Language Pre-trained Models (2 weeks)

7.3.1 BLIP / BLIP-2

BLIP architecture

BLIP (Bootstrapping Language-Image Pre-training)

Three pre-training objectives:
1. Image-Text Contrastive Learning (ITC)
2. Image-Text Matching (ITM)
3. Language Modeling (LM)

            ┌─────────────────────────────────────────┐
            │              Image Encoder               │
            │              (ViT)                       │
            └──────────────────┬──────────────────────┘
                               │
            ┌──────────────────┼──────────────────────┐
            │                  │                       │
            ▼                  ▼                       ▼
    ┌──────────────┐  ┌──────────────┐      ┌──────────────┐
    │ ITC          │  │ ITM          │      │ LM           │
    │ (contrastive)│  │ (match/cls)  │      │ (generation) │
    │              │  │              │      │              │
    │ Text Encoder │  │ Image-grounded│      │ Image-grounded│
    │ (Unimodal)   │  │ Text Encoder │      │ Text Decoder │
    └──────────────┘  └──────────────┘      └──────────────┘

Using BLIP for image captioning

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load the model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Generate a caption
image = Image.open("image.jpg")
inputs = processor(image, return_tensors="pt")

# Unconditional generation
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"Caption: {caption}")

# Conditional generation (with a text prefix)
inputs = processor(image, text="a photo of", return_tensors="pt")
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)

BLIP-2 architecture

BLIP-2: a Q-Former connects a frozen vision encoder to a frozen LLM

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Frozen Image   │     │    Q-Former     │     │   Frozen LLM    │
│    Encoder      │────►│  (Lightweight)  │────►│  (OPT/FlanT5)   │
│   (ViT-g)       │     │  Cross-Attention│     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
      frozen                 trainable                frozen

Q-Former characteristics:
- A small set of learnable query vectors
- Extracts visual information via cross-attention
- Efficiently bridges the vision and language models

import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image

# Load BLIP-2
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16
)
model.to("cuda")

# Visual question answering
image = Image.open("image.jpg")
prompt = "Question: What is in this image? Answer:"

inputs = processor(image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

7.3.2 LLaVA

LLaVA (Large Language and Vision Assistant)

Architecture:
- Vision encoder: CLIP ViT-L/14
- Language model: Vicuna (a fine-tuned LLaMA)
- Connector: a simple linear projection

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ CLIP ViT    │────►│  Projector  │────►│   Vicuna    │
│ (Frozen)    │     │  (Linear)   │     │   (LoRA)    │
└─────────────┘     └─────────────┘     └─────────────┘

Training strategy:
Stage 1: pre-train the connector (image-text pairs)
Stage 2: instruction tuning (visual instruction data)

from transformers import LlavaProcessor, LlavaForConditionalGeneration

# Load LLaVA
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16
)
model.to("cuda")

# Chat
image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=200)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

7.3.3 Qwen-VL

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Qwen-VL
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat",
    device_map="cuda",
    trust_remote_code=True
).eval()

# Single-image chat
query = tokenizer.from_list_format([
    {'image': 'image.jpg'},
    {'text': 'Describe the content of this image'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Multi-turn chat
response, history = model.chat(tokenizer, 'How many people are in the image?', history=history)
print(response)

# Multi-image chat
query = tokenizer.from_list_format([
    {'image': 'image1.jpg'},
    {'image': 'image2.jpg'},
    {'text': 'What are the differences between these two images?'},
])
response, history = model.chat(tokenizer, query=query, history=None)

7.3.4 Vision Encoder Architectures

Vision Transformer (ViT)

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """Vision Transformer"""
    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12):
        super().__init__()

        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: a strided conv splits the image into patches
        self.patch_embed = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
            nn.Flatten(2),  # (B, dim, num_patches)
        )

        # Positional embedding and class token
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

        # Transformer layers
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=depth
        )

        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # Patch embedding
        x = self.patch_embed(x).transpose(1, 2)  # (B, num_patches, dim)

        # Prepend the CLS token
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)

        # Add positional embedding
        x = x + self.pos_embed

        # Transformer
        x = self.transformer(x)
        x = self.norm(x)

        return x[:, 0]  # return the CLS token
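
A quick shape check with a dummy batch (a shallow depth keeps it fast):

vit = VisionTransformer(depth=2)
images = torch.randn(2, 3, 224, 224)
print(vit(images).shape)  # torch.Size([2, 768])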

Learning objectives

  • Understand BLIP's multi-task pre-training
  • Use BLIP for image captioning
  • Understand LLaVA's architecture design
  • Use multimodal models for visual question answering

7.4 Multimodal Large Models (1-2 weeks)

7.4.1 Multimodal Large Model Architectures

The core problem for multimodal large models: how to inject visual information into an LLM

Common approaches (a minimal sketch of the adapter approach follows this list):

1. Adapter approach (LLaVA, MiniGPT-4)
   - A linear projection or MLP connects vision and language
   - Simple and efficient

2. Q-Former approach (BLIP-2, InstructBLIP)
   - Learnable queries extract visual information
   - Compresses the number of visual tokens

3. Perceiver approach
   - Cross-attention over arbitrary modalities
   - Unified multimodal processing

4. Natively multimodal (Gemini)
   - Multimodal models trained from scratch
   - Deeper modality fusion
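
As referenced above, a minimal sketch of the adapter approach (not any specific model's code; dimensions and vocabulary size are illustrative): project vision features into the LLM's embedding space and prepend them to the text token embeddings.

import torch
import torch.nn as nn

visual_dim, llm_dim, vocab = 1024, 4096, 1000
projector = nn.Linear(visual_dim, llm_dim)       # the trainable adapter
token_embed = nn.Embedding(vocab, llm_dim)       # stands in for the LLM's embeddings

vision_feats = torch.randn(1, 256, visual_dim)   # e.g. 256 patch features from a ViT
text_ids = torch.randint(0, vocab, (1, 32))      # tokenized prompt

visual_tokens = projector(vision_feats)          # (1, 256, llm_dim)
text_tokens = token_embed(text_ids)              # (1, 32, llm_dim)

# The LLM then consumes this combined sequence (e.g. via inputs_embeds)
inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
print(inputs_embeds.shape)  # (1, 288, 4096)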

7.4.2 MiniGPT-4

MiniGPT-4 architecture:
- Vision encoder: BLIP-2's ViT + Q-Former (frozen)
- Language model: Vicuna-7B/13B (frozen)
- Connector: a single linear projection layer (trained)

Training strategy:
Stage 1: pre-training (large-scale image-text pairs)
Stage 2: instruction tuning (high-quality dialogue data)

7.4.3 GPT-4V Capability Analysis

Core GPT-4V capabilities:
1. Image understanding
   - Scene description
   - Object recognition
   - Text recognition (OCR)
   - Chart understanding

2. Visual reasoning
   - Spatial relationships
   - Counting
   - Causal reasoning

3. Multi-image understanding
   - Image comparison
   - Temporal understanding
   - Relationship analysis

4. Creative generation
   - Writing based on images
   - Code generation
   - Multilingual description

Limitations:
- Spatial reasoning is still weak
- Limited fine-grained recognition
- May hallucinate

7.4.4 Visual Adapter Design

import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Visual adapter: connects a vision encoder to an LLM"""
    def __init__(self, visual_dim, llm_dim, num_query_tokens=32):
        super().__init__()

        # Option 1: simple linear projection
        self.linear_proj = nn.Linear(visual_dim, llm_dim)

        # Option 2: MLP projection
        self.mlp_proj = nn.Sequential(
            nn.Linear(visual_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim)
        )

        # Option 3: Q-Former style (learnable queries + cross-attention);
        # kdim/vdim let the keys/values keep the vision encoder's dimension
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, llm_dim))
        self.cross_attn = nn.MultiheadAttention(
            llm_dim, num_heads=8, batch_first=True,
            kdim=visual_dim, vdim=visual_dim
        )

    def forward_linear(self, visual_features):
        """Linear projection"""
        return self.linear_proj(visual_features)

    def forward_mlp(self, visual_features):
        """MLP projection"""
        return self.mlp_proj(visual_features)

    def forward_qformer(self, visual_features):
        """Q-Former style: queries attend over the visual features"""
        batch_size = visual_features.shape[0]
        query_tokens = self.query_tokens.expand(batch_size, -1, -1)

        # Cross-attention
        output, _ = self.cross_attn(query_tokens, visual_features, visual_features)
        return output
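
A shape check for the three variants (1024 and 4096 are illustrative dimensions):

adapter = VisualAdapter(visual_dim=1024, llm_dim=4096)
feats = torch.randn(2, 256, 1024)            # (batch, visual tokens, visual_dim)
print(adapter.forward_linear(feats).shape)   # (2, 256, 4096)
print(adapter.forward_mlp(feats).shape)      # (2, 256, 4096)
print(adapter.forward_qformer(feats).shape)  # (2, 32, 4096) — compressed to 32 queries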

7.4.5 Multimodal Prompt Engineering

# Multimodal prompt templates

# 1. Image captioning
prompt_caption = """
<image>
Describe this image in detail. Include:
- Main subjects and their positions
- Colors and lighting
- Actions or movements
- Mood and atmosphere
"""

# 2. Visual question answering
prompt_vqa = """
<image>
Question: {question}
Please answer based on what you see in the image.
Answer:
"""

# 3. Image comparison
prompt_compare = """
<image1> <image2>
Compare these two images and describe:
1. Similarities
2. Differences
3. Any notable changes
"""

# 4. OCR and document understanding
prompt_ocr = """
<image>
Extract all text from this image and organize it in a structured format.
If it's a table, present it in markdown format.
"""

# 5. Reasoning tasks
prompt_reasoning = """
<image>
Based on this image, answer the following:
1. What time of day is it likely to be?
2. What season might this be?
3. What might happen next?
Explain your reasoning.
"""

Learning objectives

  • Understand the architecture design of multimodal large models
  • Know the different visual adapter approaches
  • Master multimodal prompt design
  • Be able to use open-source multimodal models

7.5 Hands-on Projects (1-2 weeks)

Project 1: CLIP Image Retrieval System

Goal: build a CLIP-based image-text retrieval system

Requirements

  1. Extract image and text features with CLIP
  2. Build a vector index (Faiss)
  3. Implement text-to-image and image-to-image search
  4. Deploy a web interface

import faiss
import numpy as np

class CLIPSearchEngine:
    def __init__(self, model, preprocess, index_dim=512):
        self.model = model
        self.preprocess = preprocess
        self.index = faiss.IndexFlatIP(index_dim)  # inner-product similarity
        self.image_paths = []

    def add_images(self, image_paths):
        """Add images to the index"""
        features = []
        for path in image_paths:
            image = self.preprocess(Image.open(path)).unsqueeze(0)
            with torch.no_grad():
                feat = self.model.encode_image(image)
                feat = feat / feat.norm(dim=-1, keepdim=True)
                features.append(feat.cpu().numpy())

        features = np.vstack(features).astype('float32')
        self.index.add(features)
        self.image_paths.extend(image_paths)

    def search(self, query, k=10):
        """Search by text (str) or by PIL image"""
        if isinstance(query, str):
            # Text query
            text = clip.tokenize([query])
            with torch.no_grad():
                query_feat = self.model.encode_text(text)
        else:
            # Image query
            image = self.preprocess(query).unsqueeze(0)
            with torch.no_grad():
                query_feat = self.model.encode_image(image)

        query_feat = query_feat / query_feat.norm(dim=-1, keepdim=True)
        query_feat = query_feat.cpu().numpy().astype('float32')

        scores, indices = self.index.search(query_feat, k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            results.append({
                'path': self.image_paths[idx],
                'score': float(score)
            })

        return results
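
Illustrative usage with placeholder paths; note that add_images as written keeps inputs on the CPU, so this assumes a CPU-loaded CLIP model:

engine = CLIPSearchEngine(model, preprocess)
engine.add_images(["cat.jpg", "dog.jpg", "beach.jpg"])  # placeholder paths
print(engine.search("a dog playing outside", k=2))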

Project 2: Visual Question Answering System

Goal: build an interactive visual question answering system

Requirements

  1. Use LLaVA or BLIP-2
  2. Support multi-turn dialogue
  3. Support multi-image input
  4. Deploy a Gradio interface

import gradio as gr

def create_vqa_interface(model, processor):
    """Create a VQA Gradio interface"""

    def answer_question(image, question, history):
        # Build the prompt from the dialogue history
        if history:
            context = "\n".join([f"Q: {q}\nA: {a}" for q, a in history])
            prompt = f"{context}\nQ: {question}\nA:"
        else:
            prompt = f"Q: {question}\nA:"

        # Generate an answer
        inputs = processor(image, text=prompt, return_tensors="pt").to(device)
        output = model.generate(**inputs, max_new_tokens=200)
        answer = processor.decode(output[0], skip_special_tokens=True)

        # Update the history
        history.append((question, answer))

        return answer, history

    with gr.Blocks() as demo:
        gr.Markdown("# Visual Question Answering")

        with gr.Row():
            with gr.Column():
                image_input = gr.Image(type="pil", label="Upload image")
                question_input = gr.Textbox(label="Question")
                submit_btn = gr.Button("Submit")

            with gr.Column():
                answer_output = gr.Textbox(label="Answer")
                history_state = gr.State([])

        submit_btn.click(
            answer_question,
            inputs=[image_input, question_input, history_state],
            outputs=[answer_output, history_state]
        )

    return demo
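
Wiring it up, e.g. with the BLIP-2 model and processor loaded in 7.3.1:

demo = create_vqa_interface(model, processor)
demo.launch()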

Project 3: Image Caption Generation

Goal: generate high-quality image captions

Requirements (a batch-captioning sketch follows this list)

  1. Use BLIP or BLIP-2
  2. Support captions in Chinese and English
  3. Support different styles (concise / detailed)
  4. Batch processing capability
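
As referenced above, a minimal batch-captioning sketch (paths are placeholders) using the BLIP captioning model and processor from 7.3.1; the processor accepts a list of PIL images for batching:

from PIL import Image

def caption_batch(image_paths, model, processor, batch_size=8):
    """Caption images in mini-batches"""
    captions = []
    for i in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        inputs = processor(images=images, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        captions.extend(processor.batch_decode(out, skip_special_tokens=True))
    return captions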

Project 4: Multimodal Content Recommendation

Goal: combine image and text features for content recommendation

Requirements (see the sketch after this list)

  1. Extract features from product images and descriptions
  2. Model user preferences
  3. Compute cross-modal similarity
  4. Rank recommendation results
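
As referenced above, a minimal sketch of the pipeline with synthetic features (real features would come from CLIP encoders as in 7.2): items are represented by averaged image and text features, the user profile is the mean of liked-item vectors, and recommendations are ranked by cosine similarity.

import torch
import torch.nn.functional as F

num_items = 100
img_feats = F.normalize(torch.randn(num_items, 512), dim=-1)  # stand-in CLIP image features
txt_feats = F.normalize(torch.randn(num_items, 512), dim=-1)  # stand-in CLIP text features
item_vecs = F.normalize((img_feats + txt_feats) / 2, dim=-1)  # fused item representation

liked = [3, 17, 42]  # items the user interacted with (illustrative)
user_vec = F.normalize(item_vecs[liked].mean(dim=0), dim=-1)

scores = item_vecs @ user_vec            # cosine similarity (all unit vectors)
top_scores, top_items = scores.topk(10)  # ranked recommendations
print(top_items.tolist())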

Stage 7 Checklist

After completing the following tasks, move on to Stage 8 (Video Understanding and Generation):

  • Multimodal basics

    • Understand modality fusion strategies
    • Understand the principles of contrastive learning
  • CLIP

    • Use CLIP for zero-shot classification
    • Implement image-text retrieval
  • Vision-language models

    • Use BLIP for image captioning
    • Use LLaVA for visual question answering
  • Multimodal large models

    • Know visual adapter designs
    • Master multimodal prompting
  • Hands-on projects

    • Complete the image retrieval or visual question answering project

Next Steps

After completing this stage, move on to Stage 8: Video Understanding and Generation