iyupeng/SparrowRecSys

 
 

SparrowRecSys

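SparrowRecSys is a movie recommender system. Its name ("Sparrow Recommender System") comes from the Chinese saying "small as a sparrow is, it has all the vital organs". The project is a Maven-based, mixed-language project that brings together the different modules of a recommender system, including TensorFlow, Spark, and a Jetty server. We hope you can use SparrowRecSys to learn about recommender systems, and perhaps help improve it.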

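Note: the recommendation algorithms and models in this project are for demonstrating a recommender system only. Their accuracy in real applications is not guaranteed.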

Option 1: Run in Docker

Overall architecture of the Docker deployment

[Figure: architecture of the Docker deployment]

Start the containers

######## create network ########
docker network create --driver bridge demo-recsys-net

######## start data container ########
docker run -dti --network demo-recsys-net \
    --name demo-recsys-data \
    yuzhiyu3/demo-recsys-data:v1.1
# wait for HDFS startup
sleep 120

######## start tensorflow container ########
docker run -dti --network demo-recsys-net \
    --name demo-recsys-tensorflow \
    yuzhiyu3/demo-recsys-tensorflow:v1.1

######## start spark & flink container ########
docker run -dti --network demo-recsys-net \
    --name demo-recsys-spark-flink \
    -p 18088:8088 -p 18081:8081 -p 18042:8042 \
    yuzhiyu3/demo-recsys-spark-flink:v1.1
# wait for Spark job (EmbeddingLSH)
sleep 180

######## start tensorflow_serving container ########
docker run -dti --network demo-recsys-net \
    --name demo-recsys-tensorflow-serving \
    -e MODEL_NAME=sparrow_recsys_widedeep \
    -p 18501:8501 \
    yuzhiyu3/demo-recsys-tensorflow-serving:v1.1

######## start web server container ########
docker run -dti --network demo-recsys-net \
    --name demo-recsys-web \
    -p 18010:8010 \
    yuzhiyu3/demo-recsys-web:v1.1
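
The fixed `sleep 120` and `sleep 180` pauses above can be complemented by polling a published port before moving on. The following is a minimal sketch, assuming a plain TCP connect is an acceptable readiness signal; `wait_for_port` is a hypothetical helper, not part of the project. A listening port only shows that a service has started; it does not prove that HDFS is fully up or that the EmbeddingLSH job has finished, so the sleeps may still be needed.

```python
import socket
import time


def wait_for_port(host, port, timeout=180.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("ip_of_your_host", 18088)` could be called after starting the spark & flink container, since port 18088 is published by that `docker run` command.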

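Check the running status

View the demo website in a browser:
http://ip_of_your_host:18010/

View the status of the Spark jobs in a browser:
http://ip_of_your_host:18088/

View the status of the Flink jobs in a browser:
http://ip_of_your_host:18081/

Check that TensorFlow Serving responds to inference requests: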
curl -X POST \
  http://ip_of_your_host:18501/v1/models/sparrow_recsys_widedeep:predict \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "instances":
    [
        {
            "movieGenre2": "",
            "userAvgRating": 4,
            "movieGenre1": "Drama",
            "movieRatingStddev": 0.89,
            "userRatingStddev": 1.1,
            "userGenre4": "War",
            "movieId": 501,
            "userGenre5": "Drama",
            "userGenre2": "Adventure",
            "userId": 55,
            "userGenre3": "Romance",
            "userGenre1": "Action",
            "movieAvgRating": 3.6,
            "userRatedMovie1": 858,
            "movieRatingCount": 5,
            "userRatingCount": 6,
            "releaseYear": 1993,
            "movieGenre3": ""
        }
    ]
}'
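
The same inference request can be sent from Python with only the standard library. This is a sketch under the assumption that the serving container is reachable on port 18501 as in the `docker run` command above; `build_predict_request`, `predict`, and `SAMPLE_INSTANCE` are hypothetical names introduced for illustration. The feature values are copied from the curl example.

```python
import json
import urllib.request

# Feature values copied from the curl example above.
SAMPLE_INSTANCE = {
    "movieGenre2": "", "userAvgRating": 4, "movieGenre1": "Drama",
    "movieRatingStddev": 0.89, "userRatingStddev": 1.1, "userGenre4": "War",
    "movieId": 501, "userGenre5": "Drama", "userGenre2": "Adventure",
    "userId": 55, "userGenre3": "Romance", "userGenre1": "Action",
    "movieAvgRating": 3.6, "userRatedMovie1": 858, "movieRatingCount": 5,
    "userRatingCount": 6, "releaseYear": 1993, "movieGenre3": "",
}


def build_predict_request(host, instance, port=18501,
                          model="sparrow_recsys_widedeep"):
    """Build a POST request for the TensorFlow Serving REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": [instance]}).encode("utf-8")
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"}, method="POST")


def predict(host, instance):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(host, instance)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `predict("ip_of_your_host", SAMPLE_INSTANCE)` should return a JSON object; for requests in the `instances` format, TensorFlow Serving normally places the results under a `predictions` key.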

Add a new movie:
curl --location --request POST 'http://ip_of_your_host:18010/createmovie' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'title=Test Movie (2022)' \
    --data-urlencode 'genres=1,5'

Add a new user rating:
curl --location --request POST 'http://ip_of_your_host:18010/createrating' \
    --header 'Content-Type: application/x-www-form-urlencoded' \
    --data-urlencode 'userId=888' \
    --data-urlencode 'movieId=1001' \
    --data-urlencode 'rating=4.8'
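
Both form posts above can also be issued from Python. The sketch below only builds the requests; `build_form_post` and `BASE_URL` are hypothetical names, and the endpoint paths and field names are taken from the curl commands.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://ip_of_your_host:18010"  # placeholder host, as in the curl examples


def build_form_post(base_url, path, fields):
    """Build a form-encoded POST request (hypothetical helper)."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST")


# Same fields as the createmovie and createrating curl commands.
movie_request = build_form_post(
    BASE_URL, "/createmovie", {"title": "Test Movie (2022)", "genres": "1,5"})
rating_request = build_form_post(
    BASE_URL, "/createrating", {"userId": 888, "movieId": 1001, "rating": 4.8})
```

Passing either request to `urllib.request.urlopen` sends it to the web server once `BASE_URL` points at a real host.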

Option 2: Install and run on a Linux host

Requirements

  • Java 8
  • Scala 2.11
  • Python 3.6+
  • TensorFlow 2.0+
  • MySQL
  • Redis
  • Kafka
  • Hadoop 2.7+
  • Spark 2.4+
  • Flink 1.12+
  • Docker
  • Python packages: tensorflow, tensorflow_hub, tensorflow_text, redis, kafka-python

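Startup steps

First, change every URL in the code that refers to demo-recsys-data or demo-recsys-tensorflow-serving to match your actual configuration.

Build

mvn clean package

Import the initial data

# Create the MySQL database with sql/db.sql

# Import the movies and ratings data into MySQL (the data is in src/main/resources/webroot/sampledata/)

# Import the MySQL data into HDFS
sh ./bin/mysql_to_hdfs.sh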

Train the movie embeddings

python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/HDFSMoviesBERTEmbedding.py

Start the parameter servers and workers

export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"worker","index":0}}'

nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &



export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"worker","index":1}}'

nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &



export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"ps","index":0}}'

nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &



export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"ps","index":1}}'

nohup python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/TFServer.py &
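
The four `TF_CONFIG` values above differ only in the `task` entry, so they can be generated rather than copied by hand. A minimal sketch, assuming the same cluster layout as the exports above; `tf_config` and `server_tasks` are hypothetical helper names.

```python
import json

# Cluster layout copied from the TF_CONFIG exports above.
CLUSTER = {
    "worker": ["localhost:12345", "localhost:12346"],
    "ps": ["localhost:23456", "localhost:23457"],
    "chief": ["localhost:34567"],
}


def tf_config(task_type, index, cluster=CLUSTER):
    """Return the TF_CONFIG JSON string for one task of the cluster."""
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {index}")
    config = {"cluster": cluster, "task": {"type": task_type, "index": index}}
    # Compact separators reproduce the format used in the exports above.
    return json.dumps(config, separators=(",", ":"))


def server_tasks(cluster=CLUSTER):
    """List the (type, index) pairs that each need a TFServer.py process."""
    return [(t, i) for t in ("worker", "ps") for i in range(len(cluster[t]))]
```

Each pair from `server_tasks()` could then be launched with `subprocess.Popen`, passing `env=dict(os.environ, TF_CONFIG=tf_config(task_type, index))` so that every process sees its own task entry.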

Schedule the following jobs to run periodically, at a frequency that suits your setup:

# movie embedding
./bin/spark-submit --name EmbeddingLSH --master yarn --deploy-mode cluster --class com.sparrowrecsys.offline.spark.embedding.EmbeddingLSH ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar

# feature engineering
./bin/spark-submit --name FeatureEngineering --master yarn --deploy-mode cluster --class com.sparrowrecsys.offline.spark.featureeng.FeatureEngForRecModel ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar

# training
export TF_CONFIG='{"cluster":{"worker":["localhost:12345","localhost:12346"],"ps":["localhost:23456","localhost:23457"],"chief":["localhost:34567"]},"task":{"type":"chief","index":0}}'; python TFRecModel/src/com/sparrowrecsys/offline/tensorflow/WideNDeep.py
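
The two `spark-submit` commands above share every argument except the job name and main class. The sketch below builds the argument list so that a scheduler of your choice (cron, for instance, which is an assumption and not something the project prescribes) can invoke both jobs the same way; `spark_submit_args`, `SPARK_JOBS`, and `JAR` are hypothetical names.

```python
import os

# Jar path copied from the commands above, with ~ expanded for subprocess use.
JAR = os.path.expanduser(
    "~/work/recsys/SparrowRecSys/target/"
    "SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar")

# Job name and main class pairs taken from the commands above.
SPARK_JOBS = [
    ("EmbeddingLSH",
     "com.sparrowrecsys.offline.spark.embedding.EmbeddingLSH"),
    ("FeatureEngineering",
     "com.sparrowrecsys.offline.spark.featureeng.FeatureEngForRecModel"),
]


def spark_submit_args(name, main_class, jar=JAR):
    """Return the spark-submit argument list for one offline job."""
    return ["./bin/spark-submit", "--name", name,
            "--master", "yarn", "--deploy-mode", "cluster",
            "--class", main_class, jar]
```

Running `subprocess.run(spark_submit_args(name, cls), check=True, cwd=spark_home)` for each entry of `SPARK_JOBS` would submit the jobs, where `spark_home` stands for your Spark installation directory.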

Start the web server

java -jar target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar

Start TensorFlow Serving

docker run -t --rm -p 8501:8501 \
  -v "$HOME/work/recsys/SparrowRecSys/tmp_model/widendeep:/models/sparrow_recsys_widedeep" \
  -e MODEL_NAME=sparrow_recsys_widedeep \
  tensorflow/serving &
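
Once the container is up, the TensorFlow Serving REST API exposes a model status endpoint at `/v1/models/<model name>`. The sketch below interprets its response; `model_status_url` and `is_model_available` are hypothetical helpers, and the response shape in the docstring follows the TensorFlow Serving documentation as best understood here, so treat it as an assumption to verify against your version.

```python
def model_status_url(host, port=8501, model="sparrow_recsys_widedeep"):
    """Return the REST URL that reports the status of a served model."""
    return f"http://{host}:{port}/v1/models/{model}"


def is_model_available(status):
    """Return True if any model version reports the state AVAILABLE.

    `status` is the decoded JSON body, assumed to look like:
    {"model_version_status": [{"version": "1", "state": "AVAILABLE", ...}]}
    """
    versions = status.get("model_version_status", [])
    return any(version.get("state") == "AVAILABLE" for version in versions)
```

The status body can be fetched with `json.load(urllib.request.urlopen(model_status_url("localhost")))` and passed to `is_model_available` before sending prediction requests.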

Start the following real-time stream processing jobs

python TFRecModel/src/com/sparrowrecsys/nearline/tensorflow/KafkaMoviesBERTEmbedding.py

./bin/flink run -p 2 -c com.sparrowrecsys.nearline.flink.NewMovieHandler  ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar

./bin/flink run -p 2 -c com.sparrowrecsys.nearline.flink.NewRatingHandler  ~/work/recsys/SparrowRecSys/target/SparrowRecSys-1.0-SNAPSHOT-jar-with-dependencies.jar

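Cleanup steps

# Stop the web server and TensorFlow Serving

# Stop the scheduled jobs and the real-time streaming jobs

# Drop the sparrow_recsys database in MySQL

# Delete the Redis keys that start with sparrow_recsys

# Delete hdfs:///sparrow_recsys/*

# Delete ./kafka-movie-embeddings.csv ./tmp_model ./tmp_sampledata

# Delete the Kafka log data, usually located in /tmp/kafka-logs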
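
The Redis cleanup step can be scripted with the redis package listed in the requirements. This is a sketch, assuming the keys really share the literal prefix `sparrow_recsys`; `delete_keys_with_prefix` is a hypothetical helper that accepts any client object offering redis-py's `scan_iter` and `delete` methods.

```python
def delete_keys_with_prefix(client, prefix="sparrow_recsys", batch_size=500):
    """Delete every key that starts with `prefix`; return how many were removed.

    Uses SCAN (via scan_iter) instead of KEYS so the server is not blocked,
    and deletes in batches to limit the size of each DEL command.
    """
    deleted = 0
    batch = []
    for key in client.scan_iter(match=prefix + "*"):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        deleted += client.delete(*batch)
    return deleted
```

With a real server this would be called as `delete_keys_with_prefix(redis.Redis(host="localhost", port=6379))`, where the host and port are placeholders for your own Redis instance.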

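Project data

The project data comes from MovieLens, an open movie dataset. The bundled dataset is a reduced version of MovieLens that keeps only 1,000 movies and the related ratings and user data. The full dataset can be downloaded from the official MovieLens website; the MovieLens 20M Dataset is recommended.

SparrowRecSys technical architecture

The SparrowRecSys architecture follows the classic industrial deep learning recommender system design. It includes modules for offline data processing, model training, nearline stream processing, online model serving, and front-end display of recommendation results.

[Figure: SparrowRecSys architecture diagram]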

Deep learning models implemented in SparrowRecSys

  • Word2vec (Item2vec)
  • DeepWalk (Random Walk based Graph Embedding)
  • Embedding MLP
  • Wide&Deep
  • Neural CF
  • Two Towers
  • DeepFM
  • DIN(Deep Interest Network)
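
To illustrate what "Random Walk based Graph Embedding" means for DeepWalk in the list above, here is a generic sketch of the walk-generation step in plain Python. It is not the project's own implementation; `build_transition_graph` and `random_walk` are hypothetical names, and using consecutive items in each user's sequence as weighted edges is an assumption made for the example. The generated walks would then be fed to a Word2vec-style model as if they were sentences.

```python
import random
from collections import defaultdict


def build_transition_graph(sequences):
    """Count item-to-item transitions across all user behavior sequences."""
    graph = defaultdict(lambda: defaultdict(int))
    for sequence in sequences:
        for source, target in zip(sequence, sequence[1:]):
            graph[source][target] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Sample a walk of up to `length` items, weighted by transition counts."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: the current item has no outgoing edges
        items = list(neighbors)
        weights = [neighbors[item] for item in items]
        walk.append(rng.choices(items, weights=weights, k=1)[0])
    return walk
```

For example, `random_walk(build_transition_graph([[1, 2, 3], [1, 2, 4]]), 1, 3, random.Random(0))` starts at item 1, moves to item 2, and then picks item 3 or item 4.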

Related papers

Other related resources
