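SparrowRecSys is a movie recommender system. The name SparrowRecSys (Sparrow Recommender System) comes from the Chinese saying "a sparrow may be small, but it has all the vital organs". It is a Maven-based, mixed-language project that brings together the different modules of a recommender system, including TensorFlow, Spark, and Jetty Server. We hope you can use SparrowRecSys to learn about recommender systems and have the chance to help improve it.
Note: the recommendation algorithms and models in this project are for demonstrating a recommender system only; their accuracy in real-world applications is not guaranteed.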
######## create network ########
docker network create --driver bridge demo-recsys-net
######## start data container ########
docker run -dti --network demo-recsys-net \
--name demo-recsys-data \
yuzhiyu3/demo-recsys-data:v1.1
# wait for HDFS startup
sleep 120
######## start tensorflow container ########
docker run -dti --network demo-recsys-net \
--name demo-recsys-tensorflow \
yuzhiyu3/demo-recsys-tensorflow:v1.1
######## start spark & flink container ########
docker run -dti --network demo-recsys-net \
--name demo-recsys-spark-flink \
-p 18088:8088 -p 18081:8081 -p 18042:8042 \
yuzhiyu3/demo-recsys-spark-flink:v1.1
# wait for Spark job (EmbeddingLSH)
sleep 180
######## start tensorflow_serving container ########
docker run -dti --network demo-recsys-net \
--name demo-recsys-tensorflow-serving \
-e MODEL_NAME=sparrow_recsys_widedeep \
-p 18501:8501 \
yuzhiyu3/demo-recsys-tensorflow-serving:v1.1
######## start web server container ########
docker run -dti --network demo-recsys-net \
--name demo-recsys-web \
-p 18010:8010 \
yuzhiyu3/demo-recsys-web:v1.1
View the demo website in a browser:
http://ip_of_your_host:18010/
View the status of the Spark jobs in a browser:
http://ip_of_your_host:18088/
View the status of the Flink jobs in a browser:
http://ip_of_your_host:18081/
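The startup commands rely on fixed `sleep` waits, which may be too short or too long on a given machine. A minimal sketch of an alternative is to poll a published port until it accepts TCP connections. The function name `wait_for_port` and its defaults are illustrative, not part of the project, and an open port only shows that a process is listening, not that HDFS or a Spark job has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=180.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if `timeout` seconds pass without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("ip_of_your_host", 18010)` would wait up to three minutes for the web server port, with `ip_of_your_host` replaced by the real host.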
Check that TensorFlow Serving responds to inference requests:
curl -X POST \
http://ip_of_your_host:18501/v1/models/sparrow_recsys_widedeep:predict \
-H 'cache-control: no-cache' \
-H 'content-type: application/json' \
-d '{
"instances":
[
{
"movieGenre2": "",
"userAvgRating": 4,
"movieGenre1": "Drama",
"movieRatingStddev": 0.89,
"userRatingStddev": 1.1,
"userGenre4": "War",
"movieId": 501,
"userGenre5": "Drama",
"userGenre2": "Adventure",
"userId": 55,
"userGenre3": "Romance",
"userGenre1": "Action",
"movieAvgRating": 3.6,
"userRatedMovie1": 858,
"movieRatingCount": 5,
"userRatingCount": 6,
"releaseYear": 1993,
"movieGenre3": ""
}
]
}'
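The same inference check can be scripted. The sketch below builds the request shown above using only the Python standard library; the helper names `build_predict_request` and `predict` are illustrative, and it assumes the response carries a top-level `predictions` field, as TensorFlow Serving's REST API does for the row (`instances`) format.

```python
import json
import urllib.request

# Placeholder host, as in the curl example; replace with your own.
SERVING_URL = "http://ip_of_your_host:18501/v1/models/sparrow_recsys_widedeep:predict"


def build_predict_request(instances, url=SERVING_URL):
    """Build (without sending) a POST request with an {"instances": [...]} body."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, url=SERVING_URL, timeout=10):
    """Send the request and return the "predictions" list from the response."""
    request = build_predict_request(instances, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["predictions"]


# The feature values from the curl example above.
sample = {
    "movieGenre2": "", "userAvgRating": 4, "movieGenre1": "Drama",
    "movieRatingStddev": 0.89, "userRatingStddev": 1.1, "userGenre4": "War",
    "movieId": 501, "userGenre5": "Drama", "userGenre2": "Adventure",
    "userId": 55, "userGenre3": "Romance", "userGenre1": "Action",
    "movieAvgRating": 3.6, "userRatedMovie1": 858, "movieRatingCount": 5,
    "userRatingCount": 6, "releaseYear": 1993, "movieGenre3": "",
}
```

Calling `predict([sample])` sends the request; several instances can be scored in one call by passing a longer list.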
Add a new movie:
curl --location --request POST 'http://ip_of_your_host:18010/createmovie' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'title=Test Movie (2022)' \
--data-urlencode 'genres=1,5'
Add a new user rating:
curl --location --request POST 'http://ip_of_your_host:18010/createrating' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'userId=888' \
--data-urlencode 'movieId=1001' \
--data-urlencode 'rating=4.8'
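Both endpoints above accept form-encoded POST bodies, so the two calls can be sketched with one small helper. The helper name `build_form_post` is illustrative, and the field values are just the sample values from the curl commands; the format of the server's response is not assumed here.

```python
import urllib.parse
import urllib.request

# Placeholder host, as in the curl examples; replace with your own.
BASE_URL = "http://ip_of_your_host:18010"


def build_form_post(path, fields, base_url=BASE_URL):
    """Build (without sending) an application/x-www-form-urlencoded POST request."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


new_movie = build_form_post("/createmovie", {"title": "Test Movie (2022)", "genres": "1,5"})
new_rating = build_form_post("/createrating", {"userId": 888, "movieId": 1001, "rating": 4.8})

# To actually send a request: urllib.request.urlopen(new_rating, timeout=10)
```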
- Java 8
- Scala 2.11
- Python 3.6+
- TensorFlow 2.0+
- MySQL
- Redis
- Kafka
- Hadoop 2.7+
- Spark 2.4+
- Flink 1.12+
- Docker
- Python packages:
tensorflow, tensorflow_hub, tensorflow_text, redis, kafka-python
First, change the URLs in the code that reference demo-recsys-data and demo-recsys-tensorflow-serving to match your actual configuration.
Build
mvn clean package
Import the initial data
# Create the MySQL database with sql/db.sql
# Import the movies and ratings data into MySQL (the data is in src/main/resources/webroot/sampledata/)
# Import the MySQL data into HDFS
sh ./bin/mysql_to_hdfs.sh
Train the movie embeddings
python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/HDFSMoviesBERTEmbedding.py
Start the parameter servers and workers
export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"worker","index":0}}'
nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &
export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"worker","index":1}}'
nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &
export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"ps","index":0}}'
nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &
export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"ps","index":1}}'
nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &
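The four `TF_CONFIG` values above differ only in the `task` entry, so they can be generated rather than copied by hand. The sketch below reproduces the same JSON from a single cluster definition; the helper names `tf_config` and `server_tasks` are illustrative and not part of the project.

```python
import json

# Cluster layout copied from the export commands above.
CLUSTER = {
    "worker": ["localhost:12345", "localhost:12346"],
    "ps": ["localhost:23456", "localhost:23457"],
    "chief": ["localhost:34567"],
}


def tf_config(task_type, index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {index}")
    config = {"cluster": cluster, "task": {"type": task_type, "index": index}}
    # Compact separators give the same whitespace-free form as the commands above.
    return json.dumps(config, separators=(",", ":"))


def server_tasks(cluster=CLUSTER):
    """Yield (task_type, index) for every worker and ps task.

    The chief is excluded because it runs the training script instead.
    """
    for task_type in ("worker", "ps"):
        for index in range(len(cluster[task_type])):
            yield task_type, index


for task_type, index in server_tasks():
    print(f"export TF_CONFIG='{tf_config(task_type, index)}'")
```

Each printed line corresponds to one of the `export` commands above and would be followed by the same `nohup python ... TFServer.py &` command.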
Schedule the following jobs to run at a regular interval:
# movie embedding
./bin/spark-submit --name EmbeddingLSH --master yarn --deploy-mode cluster --class com.sparrowrecsys.offline.spark.embedding.EmbeddingLSH ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar
# feature engineering
./bin/spark-submit --name FeatureEngineering --master yarn --deploy-mode cluster --class com.sparrowrecsys.offline.spark.featureeng.FeatureEngForRecModel ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar
# training
export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"chief","index":0}}'; python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/WideNDeep.py
Start the web server
java -jar target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar
Start TensorFlow Serving
docker run -t --rm -p 8501:8501 \
-v "$HOME/work/recsys/SparrowRecSys/tmp_model/widendeep:/models/sparrow_recsys_widedeep" \
-e MODEL_NAME=sparrow_recsys_widedeep \
tensorflow/serving &
Start the following real-time stream processing jobs
python TFRecModel/src/com/sparrowrecsys/nearline/tensorflow/KafkaMoviesBERTEmbedding.py
./bin/flink run -p 2 -c com.sparrowrecsys.nearline.flink.NewMovieHandler ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar
./bin/flink run -p 2 -c com.sparrowrecsys.nearline.flink.NewRatingHandler ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar
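# Stop the web server and TensorFlow Serving
# Stop the scheduled jobs and the real-time streaming jobs
# Drop the sparrow_recsys database in MySQL
# Delete the keys in Redis that start with sparrow_recsys
# Delete hdfs:///sparrow_recsys/*
# Delete ./kafka-movie-embeddings.csv ./tmp_model ./tmp_sampledata
# Delete the Kafka log data, usually located in /tmp/kafka-logs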
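The Redis cleanup step can be sketched as a prefix-based delete. The function below assumes a redis-py style client (one exposing `scan_iter` and `delete`) and that all project keys share the `sparrow_recsys` prefix, as the step states; the function name and batch size are illustrative.

```python
def delete_keys_with_prefix(client, prefix="sparrow_recsys", batch_size=500):
    """Delete every key starting with `prefix`; return how many were removed.

    `client` is expected to behave like redis.Redis: scan_iter() iterates
    over matching keys incrementally, and delete() returns the number of
    keys it removed.
    """
    deleted = 0
    batch = []
    for key in client.scan_iter(match=prefix + "*", count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            # Delete in batches to keep each command small.
            deleted += client.delete(*batch)
            batch = []
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

With redis-py installed, a call might look like `delete_keys_with_prefix(redis.Redis(host="localhost", port=6379))`, where the host and port are assumptions about the local setup.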
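The project data comes from the open movie dataset MovieLens. The bundled dataset is a reduced version of MovieLens that keeps only 1,000 movies and the associated ratings and user data. The full dataset can be downloaded from the official MovieLens website; the MovieLens 20M Dataset is recommended.
The SparrowRecSys architecture follows the classic industrial deep learning recommender system design, with modules for offline data processing, model training, nearline stream processing, online model serving, and front-end display of recommendation results. The architecture diagram of SparrowRecSys is shown below: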

- Word2vec (Item2vec)
- DeepWalk (Random Walk based Graph Embedding)
- Embedding MLP
- Wide&Deep
- Neural CF
- Two Towers
- DeepFM
- DIN(Deep Interest Network)
- [FFM] Field-aware Factorization Machines for CTR Prediction (Criteo 2016)
- [GBDT+LR] Practical Lessons from Predicting Clicks on Ads at Facebook (Facebook 2014)
- [LS-PLM] Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction (Alibaba 2017)
- [FM] Fast Context-aware Recommendations with Factorization Machines (UKON 2011)
- [DCN] Deep & Cross Network for Ad Click Predictions (Stanford 2017)
- [Deep Crossing] Deep Crossing - Web-Scale Modeling without Manually Crafted Combinatorial Features (Microsoft 2016)
- [PNN] Product-based Neural Networks for User Response Prediction (SJTU 2016)
- [DIN] Deep Interest Network for Click-Through Rate Prediction (Alibaba 2018)
- [ESMM] Entire Space Multi-Task Model - An Effective Approach for Estimating Post-Click Conversion Rate (Alibaba 2018)
- [Wide & Deep] Wide & Deep Learning for Recommender Systems (Google 2016)
- [xDeepFM] xDeepFM - Combining Explicit and Implicit Feature Interactions for Recommender Systems (USTC 2018)
- [Image CTR] Image Matters - Visually modeling user behaviors using Advanced Model Server (Alibaba 2018)
- [AFM] Attentional Factorization Machines - Learning the Weight of Feature Interactions via Attention Networks (ZJU 2017)
- [DIEN] Deep Interest Evolution Network for Click-Through Rate Prediction (Alibaba 2019)
- [DSSM] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data (UIUC 2013)
- [FNN] Deep Learning over Multi-field Categorical Data (UCL 2016)
- [DeepFM] A Factorization-Machine based Neural Network for CTR Prediction (HIT-Huawei 2017)
- [NFM] Neural Factorization Machines for Sparse Predictive Analytics (NUS 2017)
