预计时长:6-8周
目标:理解视觉-语言模型,掌握CLIP、BLIP等多模态架构,完成图文理解任务
多模态学习:处理和融合多种类型数据(模态)的机器学习方法
常见模态:
- 视觉:图像、视频
- 语言:文本、语音
- 结构化数据:表格、图谱
- 传感器数据:雷达、激光雷达(LiDAR)
多模态AI应用:
- 图像描述生成(Image Captioning)
- 视觉问答(VQA)
- 图文检索(Image-Text Retrieval)
- 文生图(Text-to-Image)
- 视频理解
- 多模态对话
| 问题 | 说明 | 挑战 |
|---|---|---|
| 表示学习 | 如何表示不同模态 | 模态间语义差异 |
| 对齐 | 建立模态间对应关系 | 细粒度匹配 |
| 融合 | 整合多模态信息 | 融合策略选择 |
| 协同学习 | 模态间知识迁移 | 弱模态利用 |
| 生成 | 跨模态生成 | 保真度和多样性 |
1. 早期融合(Early Fusion)
- 在输入层拼接多模态特征
- 优点:简单直接
- 缺点:要求模态在输入层对齐,且难以处理模态缺失
Image ──┐
├──► Concat ──► Model ──► Output
Text ──┘
2. 晚期融合(Late Fusion)
- 各模态独立编码,在决策层融合
- 优点:模态独立
- 缺点:交互不足
Image ──► Image Encoder ──┐
├──► Fusion ──► Output
Text ──► Text Encoder ──┘
3. 混合融合(Hybrid Fusion)
- 多层次交互
- 优点:充分交互
- 缺点:复杂度高
Image ──► ┌─────────────────────────┐
│ Cross-Modal Attention │ ──► Output
Text ──► └─────────────────────────┘
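下面用一个极简的PyTorch示意对比早期融合与晚期融合的代码形态(特征维度、网络结构均为假设,仅作演示;混合融合可参考后文的交叉注意力实现):

import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """早期融合:在输入层拼接图像/文本特征,再送入同一个模型"""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes)
        )

    def forward(self, img_feat, txt_feat):
        return self.model(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusion(nn.Module):
    """晚期融合:各模态独立打分,在决策层(logits)上融合"""
    def __init__(self, img_dim=512, txt_dim=512, num_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        return (self.img_head(img_feat) + self.txt_head(txt_feat)) / 2

# 用随机特征演示两种融合方式的调用(batch=4)
img_feat, txt_feat = torch.randn(4, 512), torch.randn(4, 512)
print(EarlyFusion()(img_feat, txt_feat).shape)  # torch.Size([4, 10])
print(LateFusion()(img_feat, txt_feat).shape)   # torch.Size([4, 10])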
对比学习核心思想:
- 拉近相似样本(正样本对)
- 推远不相似样本(负样本对)
InfoNCE Loss:
L = -log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ))
其中:
- z_i, z_j: 正样本对的表示
- τ: 温度参数
- sim: 相似度函数(通常是余弦相似度)
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
"""对比学习损失"""
# 归一化
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
# 计算相似度矩阵
logits = torch.matmul(image_features, text_features.T) / temperature
# 标签:对角线为正样本
batch_size = image_features.shape[0]
labels = torch.arange(batch_size, device=logits.device)
# 双向对比损失
loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2

- 理解多模态学习的基本概念
- 理解模态融合的不同策略
- 理解对比学习的原理
CLIP (Contrastive Language-Image Pre-training)
- OpenAI发布(2021)
- 4亿图文对训练
- 强大的零样本迁移能力
核心创新:
1. 大规模图文对比学习
2. 自然语言作为监督信号
3. 无需任务特定训练即可迁移
Image Text
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Image Encoder │ │ Text Encoder │
│ (ViT/ResNet) │ │ (Transformer) │
└───────┬────────┘ └───────┬────────┘
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Projection │ │ Projection │
└───────┬────────┘ └───────┬────────┘
│ │
▼ ▼
I₁ I₂ I₃ ... Iₙ T₁ T₂ T₃ ... Tₙ
│ │
└─────────────┬──────────────┘
▼
Contrastive Loss
(对角线最大化)
┌─────────────────────────────────┐
│ T₁ T₂ T₃ ... Tₙ │
│ I₁ [✓] [✗] [✗] [✗] │
│ I₂ [✗] [✓] [✗] [✗] │
│ I₃ [✗] [✗] [✓] [✗] │
│ ... │
│ Iₙ [✗] [✗] [✗] [✓] │
└─────────────────────────────────┘
import torch
import clip
from PIL import Image
# 加载模型
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# 准备图像和文本
image = preprocess(Image.open("image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)
# 提取特征
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# 归一化
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# 计算相似度
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probs: {similarity[0]}") # [0.95, 0.03, 0.02]def zero_shot_classification(image, class_names, model, preprocess):
"""零样本图像分类"""
# 创建文本提示
text_prompts = [f"a photo of a {name}" for name in class_names]
# 编码
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = clip.tokenize(text_prompts).to(device)
with torch.no_grad():
image_features = model.encode_image(image_input)
text_features = model.encode_text(text_inputs)
# 归一化
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# 计算相似度
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# 获取预测
probs, indices = similarity[0].topk(5)
return [(class_names[idx], prob.item()) for idx, prob in zip(indices, probs)]
# 使用示例
class_names = ["dog", "cat", "car", "airplane", "ship"]
predictions = zero_shot_classification(image, class_names, model, preprocess)
print(predictions)

class CLIPRetrieval:
"""使用CLIP进行图文检索"""
def __init__(self, model, preprocess):
self.model = model
self.preprocess = preprocess
self.image_features = None
self.images = []
def build_index(self, image_paths):
"""构建图像索引"""
self.images = image_paths
features_list = []
for path in image_paths:
image = self.preprocess(Image.open(path)).unsqueeze(0).to(device)
with torch.no_grad():
features = self.model.encode_image(image)
features = features / features.norm(dim=-1, keepdim=True)
features_list.append(features)
self.image_features = torch.cat(features_list, dim=0)
def search_by_text(self, query_text, top_k=5):
"""文本检索图像"""
text = clip.tokenize([query_text]).to(device)
with torch.no_grad():
text_features = self.model.encode_text(text)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = (text_features @ self.image_features.T).squeeze(0)
values, indices = similarity.topk(top_k)
return [(self.images[idx], val.item()) for idx, val in zip(indices, values)]
def search_by_image(self, query_image_path, top_k=5):
"""图像检索图像"""
image = self.preprocess(Image.open(query_image_path)).unsqueeze(0).to(device)
with torch.no_grad():
query_features = self.model.encode_image(image)
query_features = query_features / query_features.norm(dim=-1, keepdim=True)
similarity = (query_features @ self.image_features.T).squeeze(0)
values, indices = similarity.topk(top_k)
        return [(self.images[idx], val.item()) for idx, val in zip(indices, values)]

import open_clip
# 加载OpenCLIP(开源实现,更多模型选择)
model, _, preprocess = open_clip.create_model_and_transforms(
'ViT-B-32',
pretrained='laion2b_s34b_b79k' # 更大的预训练数据集
)
tokenizer = open_clip.get_tokenizer('ViT-B-32')
# 可用模型
# - ViT-B-32, ViT-B-16, ViT-L-14
# - ViT-H-14, ViT-g-14 (更大模型)
# - 中文CLIP模型
# 使用中文CLIP(注:'chinese_clip_base' 仅为示意名称;开源中文CLIP通常通过 cn_clip 库
# 或 Hugging Face 上的 OFA-Sys/chinese-clip-* 权重加载,
# open_clip 实际可用的 (模型, 权重) 组合可用 open_clip.list_pretrained() 查看)
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16',
    pretrained='chinese_clip_base'  # 示意名称,请替换为实际可用的权重
)

- 理解CLIP的对比学习训练
- 使用CLIP进行零样本分类
- 实现图文检索系统
- 了解OpenCLIP和中文CLIP
BLIP (Bootstrapping Language-Image Pre-training)
三个预训练任务:
1. Image-Text Contrastive Learning (ITC)
2. Image-Text Matching (ITM)
3. Language Modeling (LM)
┌─────────────────────────────────────────┐
│ Image Encoder │
│ (ViT) │
└──────────────────┬──────────────────────┘
│
┌──────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ ITC │ │ ITM │ │ LM │
│ (对比学习) │ │ (匹配分类) │ │ (文本生成) │
│ │ │ │ │ │
│ Text Encoder │ │ Image-grounded│ │ Image-grounded│
│ (Unimodal) │ │ Text Encoder │ │ Text Decoder │
└──────────────┘ └──────────────┘ └──────────────┘
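下面用一个极简示意展示 ITC 与 ITM 两个目标的计算形式(假设已分别得到投影后的图像/文本特征;itm_head 与"拼接特征"均为示意,真实 BLIP 的 ITM 使用图文交叉注意力后的 [CLS] 表示;LM 目标即标准的自回归交叉熵,此处省略):

import torch
import torch.nn as nn
import torch.nn.functional as F

def itc_loss(image_feat, text_feat, temperature=0.07):
    """ITC:与CLIP相同的图文双向对比损失(特征已投影到同一维度)"""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = image_feat @ text_feat.T / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

# ITM:二分类头判断图文是否匹配(这里用拼接特征代替跨模态[CLS]表示,仅为示意)
itm_head = nn.Linear(256 * 2, 2)

def itm_loss(fused_image_feat, fused_text_feat, match_labels):
    """match_labels: 1 表示图文匹配,0 表示不匹配(通常从难负样本中采样)"""
    logits = itm_head(torch.cat([fused_image_feat, fused_text_feat], dim=-1))
    return F.cross_entropy(logits, match_labels)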
from transformers import BlipProcessor, BlipForConditionalGeneration
# 加载模型
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
# 生成描述
image = Image.open("image.jpg")
inputs = processor(image, return_tensors="pt")
# 无条件生成
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
print(f"Caption: {caption}")
# 条件生成(提供前缀)
inputs = processor(image, text="a photo of", return_tensors="pt")
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)

BLIP-2: 使用Q-Former连接视觉编码器和LLM
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Frozen Image │ │ Q-Former │ │ Frozen LLM │
│ Encoder │────►│ (Lightweight) │────►│ (OPT/FlanT5) │
│ (ViT-g) │ │ Cross-Attention│ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
冻结 可训练 冻结
Q-Former特点:
- 少量可学习查询向量
- 通过交叉注意力提取视觉信息
- 高效连接视觉和语言模型
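下面是一个极简的 Q-Former 风格示意,仅演示"少量可学习查询 + 交叉注意力压缩视觉token"这一核心思想(真实 Q-Former 还包含自注意力、FFN 和文本分支,数值均为假设):

import torch
import torch.nn as nn

num_queries, dim = 32, 768
query_tokens = nn.Parameter(torch.randn(1, num_queries, dim))      # 可学习查询向量
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

visual_tokens = torch.randn(2, 257, dim)       # 假设ViT输出257个patch token(batch=2)
queries = query_tokens.expand(2, -1, -1)
compressed, _ = cross_attn(queries, visual_tokens, visual_tokens)  # 查询提取视觉信息
print(compressed.shape)                        # torch.Size([2, 32, 768]),再送入LLM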
from transformers import Blip2Processor, Blip2ForConditionalGeneration
# 加载BLIP-2
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16
)
model.to("cuda")
# 视觉问答
image = Image.open("image.jpg")
prompt = "Question: What is in this image? Answer:"
inputs = processor(image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=50)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

LLaVA (Large Language and Vision Assistant)
架构:
- 视觉编码器:CLIP ViT-L/14
- 语言模型:Vicuna (LLaMA微调)
- 连接层:线性投影(LLaVA-1.5 改用两层MLP投影)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ CLIP ViT │────►│ Projector │────►│ Vicuna │
│ (Frozen) │ │ (Linear) │ │ (LoRA) │
└─────────────┘ └─────────────┘ └─────────────┘
训练策略:
Stage 1: 预训练连接层(图文对)
Stage 2: 指令微调(视觉指令数据)
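LLaVA的核心是把投影后的视觉特征当作若干"视觉token"拼进LLM的输入嵌入序列。下面是一个示意性的最小片段(projector、各维度均为假设,真实实现请参考LLaVA代码库):

import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096
projector = nn.Linear(vision_dim, llm_dim)         # LLaVA-1.5中为两层MLP

image_features = torch.randn(1, 576, vision_dim)   # 假设CLIP ViT-L/14@336输出576个patch特征
text_embeds = torch.randn(1, 20, llm_dim)          # 假设已由LLM的embedding层得到文本嵌入

image_embeds = projector(image_features)                       # (1, 576, llm_dim)
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # 作为inputs_embeds送入LLM
print(inputs_embeds.shape)                                     # torch.Size([1, 596, 4096])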
from transformers import LlavaProcessor, LlavaForConditionalGeneration
# 加载LLaVA
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained(
"llava-hf/llava-1.5-7b-hf",
torch_dtype=torch.float16
)
model.to("cuda")
# 对话
image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=200)
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]

from transformers import AutoModelForCausalLM, AutoTokenizer
# 加载Qwen-VL
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen-VL-Chat",
device_map="cuda",
trust_remote_code=True
).eval()
# 单图对话
query = tokenizer.from_list_format([
{'image': 'image.jpg'},
{'text': '描述这张图片的内容'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
# 多轮对话
response, history = model.chat(tokenizer, '图中有几个人?', history=history)
print(response)
# 多图对话
query = tokenizer.from_list_format([
{'image': 'image1.jpg'},
{'image': 'image2.jpg'},
{'text': '这两张图有什么区别?'},
])
response, history = model.chat(tokenizer, query=query, history=None)

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
"""视觉Transformer"""
def __init__(self, image_size=224, patch_size=16, dim=768, depth=12, heads=12):
super().__init__()
num_patches = (image_size // patch_size) ** 2
patch_dim = 3 * patch_size ** 2
# Patch嵌入
self.patch_embed = nn.Sequential(
nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
nn.Flatten(2), # (B, dim, num_patches)
)
# 位置编码
self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim))
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
# Transformer层
self.transformer = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
num_layers=depth
)
self.norm = nn.LayerNorm(dim)
def forward(self, x):
# Patch嵌入
x = self.patch_embed(x).transpose(1, 2) # (B, num_patches, dim)
# 添加CLS token
cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)
x = torch.cat([cls_tokens, x], dim=1)
# 添加位置编码
x = x + self.pos_embed
# Transformer
x = self.transformer(x)
x = self.norm(x)
        return x[:, 0]  # 返回CLS token

- 理解BLIP的多任务预训练
- 使用BLIP进行图像描述
- 理解LLaVA的架构设计
- 使用多模态模型进行视觉问答
多模态大模型的核心问题:如何将视觉信息注入LLM
常见方案:
1. Adapter方案(LLaVA, MiniGPT-4)
- 线性投影或MLP连接视觉和语言
- 简单高效
2. Q-Former方案(BLIP-2, InstructBLIP)
- 可学习查询提取视觉信息
- 压缩视觉token数量
3. Perceiver方案
- 交叉注意力处理任意模态
- 统一的多模态处理
4. 原生多模态(Gemini)
- 从头训练的多模态模型
- 更深度的模态融合
MiniGPT-4架构:
- 视觉编码器:BLIP-2的ViT + Q-Former(冻结)
- 语言模型:Vicuna-7B/13B(冻结)
- 连接层:单层线性投影(训练)
训练策略:
Stage 1: 预训练(大规模图文对)
Stage 2: 指令微调(高质量对话数据)
GPT-4V核心能力:
1. 图像理解
- 场景描述
- 物体识别
- 文字识别(OCR)
- 图表理解
2. 视觉推理
- 空间关系
- 数量统计
- 因果推理
3. 多图理解
- 图像对比
- 时序理解
- 关联分析
4. 创意生成
- 基于图像写作
- 代码生成
- 多语言描述
局限性:
- 空间推理仍有不足
- 细粒度识别能力有限
- 可能产生幻觉
class VisualAdapter(nn.Module):
"""视觉适配器:连接视觉编码器和LLM"""
def __init__(self, visual_dim, llm_dim, num_query_tokens=32):
super().__init__()
# 方案1: 简单线性投影
self.linear_proj = nn.Linear(visual_dim, llm_dim)
# 方案2: MLP投影
self.mlp_proj = nn.Sequential(
nn.Linear(visual_dim, llm_dim),
nn.GELU(),
nn.Linear(llm_dim, llm_dim)
)
# 方案3: Q-Former风格
self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, llm_dim))
self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
def forward_linear(self, visual_features):
"""线性投影"""
return self.linear_proj(visual_features)
def forward_mlp(self, visual_features):
"""MLP投影"""
return self.mlp_proj(visual_features)
def forward_qformer(self, visual_features):
"""Q-Former风格"""
batch_size = visual_features.shape[0]
query_tokens = self.query_tokens.expand(batch_size, -1, -1)
# 交叉注意力
output, _ = self.cross_attn(query_tokens, visual_features, visual_features)
        return output

# 多模态提示模板
# 1. 图像描述
prompt_caption = """
<image>
Describe this image in detail. Include:
- Main subjects and their positions
- Colors and lighting
- Actions or movements
- Mood and atmosphere
"""
# 2. 视觉问答
prompt_vqa = """
<image>
Question: {question}
Please answer based on what you see in the image.
Answer:
"""
# 3. 图像对比
prompt_compare = """
<image1> <image2>
Compare these two images and describe:
1. Similarities
2. Differences
3. Any notable changes
"""
# 4. OCR和文档理解
prompt_ocr = """
<image>
Extract all text from this image and organize it in a structured format.
If it's a table, present it in markdown format.
"""
# 5. 推理任务
prompt_reasoning = """
<image>
Based on this image, answer the following:
1. What time of day is it likely to be?
2. What season might this be?
3. What might happen next?
Explain your reasoning.
"""- 理解多模态大模型的架构设计
- 了解不同视觉适配器方案
- 掌握多模态Prompt设计
- 能使用开源多模态模型
目标:构建基于CLIP的图文检索系统
要求:
- 使用CLIP提取图像和文本特征
- 构建向量索引(Faiss)
- 实现文本搜图和以图搜图
- 部署Web界面
import faiss
import numpy as np
class CLIPSearchEngine:
def __init__(self, model, preprocess, index_dim=512):
self.model = model
self.preprocess = preprocess
self.index = faiss.IndexFlatIP(index_dim) # 内积相似度
self.image_paths = []
def add_images(self, image_paths):
"""添加图像到索引"""
features = []
for path in image_paths:
image = self.preprocess(Image.open(path)).unsqueeze(0)
with torch.no_grad():
feat = self.model.encode_image(image)
feat = feat / feat.norm(dim=-1, keepdim=True)
features.append(feat.cpu().numpy())
features = np.vstack(features).astype('float32')
self.index.add(features)
self.image_paths.extend(image_paths)
def search(self, query, k=10):
"""搜索"""
if isinstance(query, str):
# 文本查询
text = clip.tokenize([query])
with torch.no_grad():
query_feat = self.model.encode_text(text)
else:
# 图像查询
image = self.preprocess(query).unsqueeze(0)
with torch.no_grad():
query_feat = self.model.encode_image(image)
query_feat = query_feat / query_feat.norm(dim=-1, keepdim=True)
query_feat = query_feat.cpu().numpy().astype('float32')
scores, indices = self.index.search(query_feat, k)
results = []
for score, idx in zip(scores[0], indices[0]):
results.append({
'path': self.image_paths[idx],
'score': float(score)
})
        return results

目标:构建交互式视觉问答系统
要求:
- 使用LLaVA或BLIP-2
- 支持多轮对话
- 支持多图输入
- 部署Gradio界面
import gradio as gr
def create_vqa_interface(model, processor):
"""创建VQA Gradio界面"""
def answer_question(image, question, history):
# 构建提示
if history:
context = "\n".join([f"Q: {q}\nA: {a}" for q, a in history])
prompt = f"{context}\nQ: {question}\nA:"
else:
prompt = f"Q: {question}\nA:"
# 生成回答
inputs = processor(image, text=prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=200)
answer = processor.decode(output[0], skip_special_tokens=True)
# 更新历史
history.append((question, answer))
return answer, history
with gr.Blocks() as demo:
gr.Markdown("# 视觉问答系统")
with gr.Row():
with gr.Column():
image_input = gr.Image(type="pil", label="上传图片")
question_input = gr.Textbox(label="提问")
submit_btn = gr.Button("提交")
with gr.Column():
answer_output = gr.Textbox(label="回答")
history_state = gr.State([])
submit_btn.click(
answer_question,
inputs=[image_input, question_input, history_state],
outputs=[answer_output, history_state]
)
    return demo

目标:生成高质量图像描述
要求:
- 使用BLIP或BLIP-2
- 支持中英文描述
- 支持不同风格(简洁/详细)
- 批量处理能力
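一个批量生成描述的参考实现(沿用上文的BLIP模型;batch_size、生成参数为示意,中文/不同风格描述可通过更换模型或调整 prompt 前缀实现):

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def batch_caption(image_paths, batch_size=8, prompt=None, max_new_tokens=50):
    """批量图像描述:prompt 可作为前缀控制风格,例如 'a detailed photo of'(示意)"""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base"
    ).to(device)

    captions = []
    for i in range(0, len(image_paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in image_paths[i:i + batch_size]]
        texts = [prompt] * len(images) if prompt else None
        inputs = processor(images=images, text=texts, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        captions.extend(processor.batch_decode(out, skip_special_tokens=True))
    return captions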
目标:结合图文特征进行内容推荐
要求:
- 提取商品图像和描述特征
- 用户偏好建模
- 跨模态相似度计算
- 推荐结果排序
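一个极简的思路示意(假设商品的图像/文本特征已用CLIP提取;用户偏好用其历史交互商品表示的均值刻画,按余弦相似度排序即得推荐;alpha、函数名均为示意):

import torch
import torch.nn.functional as F

def build_item_embeddings(image_feats, text_feats, alpha=0.5):
    """商品多模态表示:图像与文本特征归一化后加权融合(alpha为示意超参)"""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return F.normalize(alpha * img + (1 - alpha) * txt, dim=-1)

def recommend(user_history_ids, item_embeddings, top_k=10):
    """用户偏好 = 历史交互商品表示的均值;过滤已交互商品后按相似度取top_k"""
    user_vec = F.normalize(item_embeddings[user_history_ids].mean(dim=0, keepdim=True), dim=-1)
    scores = (user_vec @ item_embeddings.T).squeeze(0)
    scores[user_history_ids] = -float("inf")   # 不重复推荐已交互商品
    values, indices = scores.topk(top_k)
    return list(zip(indices.tolist(), values.tolist()))

# 使用示例(随机特征演示:100个商品,用户交互过其中3个)
item_emb = build_item_embeddings(torch.randn(100, 512), torch.randn(100, 512))
print(recommend(torch.tensor([1, 5, 7]), item_emb, top_k=5))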
完成以下任务后,进入阶段八(视频理解与生成):
- 多模态基础
  - 理解模态融合策略
  - 理解对比学习原理
- CLIP
  - 使用CLIP进行零样本分类
  - 实现图文检索
- 视觉语言模型
  - 使用BLIP进行图像描述
  - 使用LLaVA进行视觉问答
- 多模态大模型
  - 了解视觉适配器设计
  - 掌握多模态Prompt
- 实践项目
  - 完成图像检索或视觉问答项目
完成本阶段后,进入阶段八:视频理解与生成