hot-comments

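Aggregates highly upvoted comments from Chinese and international platforms and automatically classifies them by topic and sentiment. Ships as a CLI plus a local Web UI.

Given a single keyword, it fetches highly upvoted comments from the past year on Hacker News, Reddit, YouTube, Xiaohongshu, Douyin, Weibo, and Zhihu, classifies them by topic and sentiment, and writes the results to the terminal, a SQLite database, and several export formats.

Feature overview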

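Capability	Description
Multi-platform aggregation	HN / Reddit / YouTube / Xiaohongshu / Douyin / Weibo / Zhihu
Automatic classification	Rule-based word list (zero cost) / HuggingFace model / LLM / hybrid mode
Sentiment analysis	Positive / negative / neutral, based on SnowNLP
Persistent storage	SQLite, with automatic deduplication across runs
Multi-format export	CSV / JSON / Markdown
Local Web UI	FastAPI + single-page HTML for visual browsing and filtering
Pure CLI workflow	Three subcommands: `search` / `show` / `web`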
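
The cross-run deduplication listed in the table could be built on a uniqueness constraint in SQLite. The following is a minimal sketch, not the project's actual schema: the table name `comments`, its columns, and the `(platform, comment_id)` key are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: the real layout in storage/sqlite_store.py may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS comments (
    platform   TEXT NOT NULL,
    comment_id TEXT NOT NULL,
    keyword    TEXT NOT NULL,
    text       TEXT NOT NULL,
    likes      INTEGER NOT NULL,
    PRIMARY KEY (platform, comment_id)
)
"""


def save_comments(conn, rows):
    """Insert rows, skipping ones already stored; return the number added."""
    conn.execute(SCHEMA)
    before = conn.total_changes
    # INSERT OR IGNORE drops any row whose primary key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO comments "
        "(platform, comment_id, keyword, text, likes) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before
```

Running the same search twice would then add only the comments that were not stored by the first run.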

Platform access methods

Platform	Access method	Key required?
Hacker News	Public Algolia API	No
Reddit	Official API (PRAW)	Yes
YouTube	YouTube Data API v3	Yes
Xiaohongshu / Douyin / Weibo / Zhihu	MediaCrawler subprocess	QR-code login required
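
To illustrate the key-free Hacker News path, here is a sketch of how a comment search against the public Algolia endpoint might be assembled. The function name and its parameters are hypothetical, and the adapter in platforms/hackernews.py may build its requests differently; the endpoint and the `tags`, `numericFilters`, and `hitsPerPage` parameters are part of the Algolia HN Search API.

```python
import time
from urllib.parse import urlencode

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search"


def build_hn_comment_query(keyword, days, limit, now=None):
    """Return a search URL for comments matching `keyword` from the last `days` days."""
    now = int(time.time()) if now is None else int(now)
    since = now - days * 86400  # Unix-timestamp lower bound
    params = {
        "query": keyword,
        "tags": "comment",  # restrict hits to comments
        "numericFilters": f"created_at_i>{since}",
        "hitsPerPage": limit,
    }
    return f"{HN_SEARCH_URL}?{urlencode(params)}"
```

Fetching the returned URL (for example with `urllib.request.urlopen`) yields JSON whose `hits` list holds the matching comments.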

Requirements

Python 3.10+
(Chinese platforms) Node.js + Playwright, installed automatically by setup_mediacrawler.sh

Installation

git clone https://github.com/yourname/hot-comments.git
cd hot-comments
python -m venv .venv && source .venv/bin/activate
pip install -e .

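For LLM fallback classification:

pip install -e ".[llm]"

For HuggingFace model classification:

pip install -e ".[hf]"

Quick start (Hacker News only, zero configuration)

cp config.example.yaml config.yaml
# The default config enables hackernews and disables the other platforms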

hotcomments search "AI" --min-likes 50 --days 365

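When the run finishes, the terminal shows the category, platform, and sentiment distributions plus a table of the top 20 most-upvoted comments, and the data is written to data/hotcomments.db and data/exports/.

Configuration

Copy config.example.yaml to config.yaml, then fill in what you need: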

# ===== International platforms =====
hackernews:
  enabled: true          # public API, no key needed

reddit:
  enabled: true
  client_id: "${REDDIT_CLIENT_ID}"
  client_secret: "${REDDIT_CLIENT_SECRET}"
  user_agent: "hot-comments/0.1 by your_username"

youtube:
  enabled: true
  api_key: "${YOUTUBE_API_KEY}"

# ===== Fetch parameters =====
search:
  default_days: 365       # how many days back to fetch
  min_likes: 100          # minimum-likes filter
  per_platform_limit: 200 # maximum comments per platform

# ===== Classifier =====
classifier:
  mode: "rule"            # rule | hf | llm | hybrid
  taxonomy_path: "config/taxonomy.yaml"

# ===== Storage =====
storage:
  sqlite_path: "data/hotcomments.db"
  export_dir: "data/exports"

Inject sensitive keys through environment variables so that they are never committed to git:

export REDDIT_CLIENT_ID=xxx
export REDDIT_CLIENT_SECRET=xxx
export YOUTUBE_API_KEY=xxx
export ANTHROPIC_API_KEY=xxx   # required for llm/hybrid mode
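
The `${ENV_VAR}` placeholders in config.yaml are resolved when the config is loaded. A minimal sketch of such a substitution pass is shown below; `expand_env` is a hypothetical helper, and the behavior for unset variables (empty string here) is an assumption, since the loader in config.py may handle that case differently.

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${NAME} placeholders in strings inside dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # Unset variables become empty strings; a stricter loader could raise instead.
        return _ENV_PATTERN.sub(lambda m: env.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value  # booleans, numbers, and None pass through unchanged
```

Applied to the parsed YAML, this would turn `client_id: "${REDDIT_CLIENT_ID}"` into the value exported in the shell.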

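Getting a Reddit API key

Log in to Reddit and open https://www.reddit.com/prefs/apps
Create an app of type script
Copy the client_id (the short string under the app name) and the client_secret

Getting a YouTube API key

Open the Google Cloud Console
Create a project and enable YouTube Data API v3
Create an API key and copy it

Connecting Chinese platforms (Xiaohongshu / Douyin / Weibo / Zhihu)

The Chinese platforms have no public API, so comments are fetched by MediaCrawler, which drives a real browser.

Step 1: Install MediaCrawler in one command

bash scripts/setup_mediacrawler.sh ~/code/MediaCrawler

The script clones the repository and installs its Python dependencies and Chromium.

Step 2: First login (run once per platform)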

cd ~/code/MediaCrawler
source .venv/bin/activate
python main.py --platform xhs --lt qrcode --type search --keywords "测试" --get_comment yes

Once the QR code has been scanned, the login state is cached, and later runs can switch back to headless mode.

Step 3: Enable it in config.yaml

mediacrawler:
  enabled: true
  repo_path: "~/code/MediaCrawler"
  python: "~/code/MediaCrawler/.venv/bin/python"
  platforms: [xhs, dy, wb, zhihu]
  login_type: "qrcode"
  headless: true          # set to true once the first login is done
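
The settings above end up as arguments to a MediaCrawler subprocess. The sketch below shows one way the bridge might build that command from the `mediacrawler` config section, reusing the flags from the login command in step 2. `build_mediacrawler_command` is a hypothetical helper; how `headless` is passed to MediaCrawler is not shown in this README, so it is left out.

```python
import os


def build_mediacrawler_command(cfg, platform, keyword):
    """Build the argv list for one MediaCrawler search run on one platform."""
    python = os.path.expanduser(cfg["python"])  # resolve a leading ~
    return [
        python, "main.py",
        "--platform", platform,
        "--lt", cfg["login_type"],
        "--type", "search",
        "--keywords", keyword,
        "--get_comment", "yes",
    ]
```

The list can be handed to `subprocess.run(cmd, cwd=os.path.expanduser(cfg["repo_path"]), check=True)`, once per entry in `platforms`.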

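Step 4: Search

hotcomments search "情感" --days 365 --min-likes 200

Risk notice: these platforms explicitly prohibit crawling, and accounts may be rate-limited or banned. Use a secondary or dedicated account, keep concurrency low, and leave at least a few tens of seconds between fetches. Use the collected data for personal research only; do not republish it or use it commercially.

Command reference

`hotcomments search`: fetch + classify + store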

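# Basic search
hotcomments search "情感"

# Specify the time range and minimum likes
hotcomments search "AI" --days 90 --min-likes 50

# Limit the comments per platform and show the top 30
hotcomments search "考研" --limit 100 --top 30

# Run only the given platforms (the option can be repeated)
hotcomments search "职场" --platform reddit --platform hackernews

# Choose the export format (all/csv/json/md/none)
hotcomments search "生活" --export csv

# Debug mode
hotcomments -v search "AI"

Full options:

Option	Default	Description
`--days N`	From config file	Fetch the last N days
`--min-likes N`	From config file	Minimum-likes filter
`--limit N`	From config file	Maximum comments per platform
`--platform P`	All enabled platforms	Run only the given platform; can be repeated
`--export FORMAT`	`all`	Export format: all/csv/json/md/none
`--top N`	20	Show the top N comments in the terminal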
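
The interaction of `--days`, `--min-likes`, and `--limit` can be sketched as a single filtering pass. This is an illustration under assumptions: the `Comment` fields below are hypothetical (the real model lives in models.py), and the actual adapters may apply some of these filters at request time instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Comment:
    platform: str
    text: str
    likes: int
    created_at: datetime


def select_comments(comments, days, min_likes, limit, now=None):
    """Apply the --days, --min-likes, and --limit semantics to fetched comments."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    kept = [c for c in comments if c.created_at >= cutoff and c.likes >= min_likes]
    kept.sort(key=lambda c: c.likes, reverse=True)
    per_platform = {}
    result = []
    for c in kept:
        # Enforce the per-platform cap, keeping the most-liked comments first.
        count = per_platform.get(c.platform, 0)
        if count < limit:
            per_platform[c.platform] = count + 1
            result.append(c)
    return result
```

Slicing the result with `[:top]` would then give the rows that `--top N` prints in the terminal.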

`hotcomments show`: view fetched data

# List every keyword that has been fetched
hotcomments show

# Show the top 30 comments for a keyword
hotcomments show "情感" --top 30

# Filter by category and platform
hotcomments show "情感" --category 失恋 --platform xiaohongshu

# Only highly upvoted comments (500+)
hotcomments show "AI" --min-likes 500
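
The `show` filters map naturally onto a parameterized SQLite query. The sketch below assumes a `comments` table with `keyword`, `platform`, `category`, `likes`, and `text` columns; those names are illustrative and may not match the schema used by storage/sqlite_store.py.

```python
import sqlite3


def build_show_query(keyword, category=None, platform=None, min_likes=None, top=20):
    """Return (sql, params) for listing stored comments, most-liked first."""
    clauses, params = ["keyword = ?"], [keyword]
    # Each optional filter adds one parameterized condition.
    for column, value in (("category", category), ("platform", platform)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    if min_likes is not None:
        clauses.append("likes >= ?")
        params.append(min_likes)
    sql = (
        "SELECT platform, category, likes, text FROM comments "
        f"WHERE {' AND '.join(clauses)} ORDER BY likes DESC LIMIT ?"
    )
    return sql, params + [top]


def show(conn: sqlite3.Connection, keyword, **filters):
    """Run the query and return rows of (platform, category, likes, text)."""
    sql, params = build_show_query(keyword, **filters)
    return conn.execute(sql, params).fetchall()
```

Only values are interpolated through `?` placeholders; the column names come from a fixed tuple, so user input never reaches the SQL text.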

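`hotcomments web`: start the local Web UI

hotcomments web                      # default: http://127.0.0.1:8765
hotcomments web --port 9000          # custom port
hotcomments web --host 0.0.0.0       # allow access from the local network

Once the page is open in a browser, you can filter by keyword, platform, category, and sentiment and browse the stored comments live.

Classification

Rule mode (default)

classifier.mode: rule uses jieba tokenization plus word-list matching (config/taxonomy.yaml).

Pros: fast, zero cost, explainable; the word list can be refined by hand
Cons: word-list coverage is limited, so many comments fall into the "其他" (Other) category

Example word-list structure (config/taxonomy.yaml):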

情感:
  爱情: [爱情, 喜欢, 恋爱, 暗恋]
  失恋: [失恋, 分手, 前任, 复合]
  婚姻: [结婚, 离婚, 婚姻, 彩礼]

职场:
  跳槽: [跳槽, 离职, 辞职, offer]
  加班: [加班, 996, 内卷]

Extend the word list as needed; new categories take effect immediately.
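
To make the word-list matching concrete, here is a simplified sketch of a rule classifier over the taxonomy structure above. It is not the code in classifier/rule.py: plain substring matching stands in for jieba tokenization, the taxonomy is inlined rather than loaded from YAML, and returning a (topic, subcategory) pair is an assumption.

```python
TAXONOMY = {
    "情感": {
        "爱情": ["爱情", "喜欢", "恋爱", "暗恋"],
        "失恋": ["失恋", "分手", "前任", "复合"],
    },
    "职场": {
        "跳槽": ["跳槽", "离职", "辞职", "offer"],
        "加班": ["加班", "996", "内卷"],
    },
}

OTHER = "其他"


def classify(text, taxonomy=TAXONOMY):
    """Return the (topic, subcategory) with the most keyword hits, or (OTHER, OTHER)."""
    lowered = text.lower()
    best, best_hits = (OTHER, OTHER), 0
    for topic, subcategories in taxonomy.items():
        for subcategory, keywords in subcategories.items():
            # Substring matching stands in for jieba tokenization in this sketch.
            hits = sum(1 for kw in keywords if kw.lower() in lowered)
            if hits > best_hits:
                best, best_hits = (topic, subcategory), hits
    return best
```

Because the classifier only reads the mapping, adding a new subcategory or keyword to the word list changes the results on the next run without any code change.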

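Hybrid mode

classifier.mode: hybrid runs the rule classifier first and calls the LLM only as a fallback for comments left in the "其他" (Other) category.

Requires: pip install -e ".[llm]" + ANTHROPIC_API_KEY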
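
The rule-first, LLM-fallback flow can be sketched independently of any particular LLM client. In the example below, `rule_classify` and `llm_classify` are hypothetical callables supplied by the caller; the real implementation in classifier/llm.py talks to the Anthropic API, which is not reproduced here.

```python
OTHER = "其他"


def classify_hybrid(texts, rule_classify, llm_classify):
    """Label texts with the rule classifier, sending only OTHER results to the LLM."""
    labels = [rule_classify(t) for t in texts]
    pending = [i for i, label in enumerate(labels) if label == OTHER]
    if pending:
        # One batched call keeps the number of paid LLM requests low.
        fallback = llm_classify([texts[i] for i in pending])
        for i, label in zip(pending, fallback):
            labels[i] = label
    return labels
```

The cost of hybrid mode therefore scales with the share of comments the word list fails to cover, not with the total number of comments.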

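LLM mode

classifier.mode: llm sends every comment to the LLM; this is the most accurate mode but also the most expensive.

HuggingFace mode

classifier.mode: hf uses a local HuggingFace model (sentiment classification).

Requires: pip install -e ".[hf]"

Project structure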

hot-comments/
├── config.example.yaml          # Config template; copy to config.yaml and edit
├── config/
│   └── taxonomy.yaml            # Topic taxonomy word list; extend freely
├── scripts/
│   └── setup_mediacrawler.sh    # One-command MediaCrawler installer
├── data/
│   ├── hotcomments.db           # SQLite database
│   └── exports/                 # Export directory
├── src/hotcomments/
│   ├── cli.py                   # CLI entry point (Click)
│   ├── orchestrator.py          # Main pipeline coordination (concurrent fetching + classification)
│   ├── models.py                # Data models: Comment / ClassifiedComment / SearchRequest
│   ├── config.py                # YAML config loading (supports ${ENV_VAR} substitution)
│   ├── platforms/               # Platform adapters
│   │   ├── hackernews.py        # Algolia API
│   │   ├── reddit.py            # PRAW
│   │   ├── youtube.py           # YouTube Data API
│   │   └── mediacrawler.py      # MediaCrawler subprocess bridge
│   ├── classifier/              # Classifiers
│   │   ├── rule.py              # Rules + jieba
│   │   ├── llm.py               # Anthropic API
│   │   ├── sentiment.py         # SnowNLP sentiment analysis
│   │   └── base.py              # Abstract base class
│   ├── storage/
│   │   ├── sqlite_store.py      # SQLite reads/writes + deduplication
│   │   └── exporters.py         # CSV / JSON / Markdown export
│   └── web/app.py               # FastAPI Web UI
└── tests/
    ├── test_classifier.py
    ├── test_storage.py
    └── test_hn_integration.py

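Development

# Install development dependencies
pip install -e ".[dev]"

# Run the tests
pytest -q

# Lint
ruff check src tests

# Run in debug mode
hotcomments -v search "AI" --min-likes 10 --days 7

FAQ

Q: Why do so many fetched comments land in the "其他" (Other) category?

A: The word list (config/taxonomy.yaml) does not cover enough keywords. Add keywords by hand, or switch to hybrid mode so that the LLM handles the "其他" comments.

Q: MediaCrawler fails to start?

A: Confirm that the first QR-code login has been completed and that repo_path and python in config.yaml point to the right locations. If the login state has expired, scan the QR code again.

Q: Reddit/YouTube returns 0 results?

A: Check that the API keys are correct and are injected through environment variables (run echo $REDDIT_CLIENT_ID to verify that the variable is set).

Q: How do I fetch from a single platform?

A: Use the --platform option, for example hotcomments search "AI" --platform hackernews.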

License

MIT
