####a closer look at the recommendation engine
In the last chapter we built our first recommendation system and its recommendations seemed to match what we expected. But how did it really come up with those recommendations?
Recall that when we built our recommendation system we based it on a template:
~$ pio template get PredictionIO/template-scala-parallel-recommendation musicRecommender
Please enter author's name: Ron Zacharski
Please enter the template's Scala package name (e.g. com.mycompany): org.zacharski
Please enter author's e-mail address: ron.zacharski@gmail.com
Would you like to be informed about new bug fixes and security updates of this template? (Y/n) ...
Engine template PredictionIO/template-scala-parallel-recommendation is now ready at musicRecommender
This command downloads the recommender template, template-scala-parallel-recommendation-0.3.0, and places the files that make up this template in a directory called musicRecommender. In that directory we see a file titled engine.json, which we edited to add our appName, MusicApp:
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.zacharski.RecommendationEngine",
  "datasource": {
    "params" : {
      "appName": "MusicApp"
    }
  },
  "algorithms": [
    {
      "name": "als",
      "params": {
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
}
One interesting part of this file is that the recommender uses the ALS algorithm (alternating least squares), and that we pass that algorithm several parameters, such as rank and number of iterations. Let's check out how this ALS algorithm works.
Let's say we work in a vinyl record and CD store downtown. A customer comes in and asks us for a recommendation. We ask the obvious question "What do you like?" and she responds with "I like Taylor Swift and Carrie Underwood." There are several ways we might come up with a recommendation for her. One is to reflect on regular customers to our store who bought albums from those artists and think what else they bought. Perhaps we notice that recently, people who bought Taylor Swift CDs also bought CDs by Miranda Lambert and we recommend Miranda Lambert to the new customer. This is a two step process. First, we determine previous customers who are most similar to the person standing in front of us and second, we look at what those previous customers bought and then use that information to make recommendations to our current customer.
Another way we might come up with a recommendation is as follows. We know Taylor Swift and Carrie Underwood CDs share certain features. They both, obviously, have prominent female vocals. They both feature singer-songwriters. They both have country influences and no PBR&B influences (the term PBR&B, aka hipster R&B and R-neg-B, is a portmanteau of PBR--Pabst Blue Ribbon, the hipster beer of choice--and R&B). Then we think "Hey, Miranda Lambert CDs also have prominent female vocals and have country influences but no PBR&B influences," and we recommend Miranda Lambert to our new customer. With this recommendation method we extract a set of features from CDs this person likes and then think about what other CDs share those features.
Let's check this out a bit further. We will restrict ourselves to two features (country and PBR&B influences), five artists (Taylor Swift, Miranda Lambert, Carrie Underwood, Jhené Aiko, and The Weeknd) and two customers (Jake and Ann). As owners of the vinyl record store we have gone through and meticulously rated artists on these features.
| Artist | Country | PBR&B |
|---|---|---|
| Taylor Swift | 0.90 | 0.05 |
| Miranda Lambert | 0.98 | 0.00 |
| Carrie Underwood | 0.95 | 0.03 |
| Jhené Aiko | 0.01 | 0.99 |
| The Weeknd | 0.03 | 0.98 |
So Taylor Swift exudes a lot of country influences (0.90) but little PBR&B (0.05).
When customers come into our store we ask them on a scale of 0 to 5 how well they like country and how well they like PBR&B:
| Customer | Country | PBR&B |
|---|---|---|
| Jake | 5 | 1 |
| Ann | 0 | 5 |
Suppose Jake comes into the store, has never heard of Miranda Lambert, and we are trying to predict how he might rate her. Jake rated country music a 5 and Miranda is 0.98 country, so we multiply those numbers together: 5 × 0.98 = 4.9.
We do the same for the PBR&B numbers (1 × 0.00 = 0) and add the two products together to get our estimate of Jake's rating of Miranda Lambert: 4.9 + 0 = 4.9.
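This multiply-and-add can be sketched in a few lines of Python (a tiny illustration; the helper name predict_rating is my own, not anything from the template):

```python
# Estimate a rating: multiply the customer's taste for each feature by the
# artist's amount of that feature, then add up the products.
def predict_rating(customer_tastes, artist_features):
    return sum(t * f for t, f in zip(customer_tastes, artist_features))

jake = [5, 1]            # how much Jake likes country and PBR&B
miranda = [0.98, 0.00]   # Miranda Lambert's country and PBR&B influences
print(predict_rating(jake, miranda))  # 5 * 0.98 + 1 * 0.00 = 4.9
```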
####Question What are Ann's predicted ratings of Taylor Swift and Jhené Aiko?
####a slight change Let's change this scenario a bit. Suppose we still rate our artists as above, but this time instead of asking customers about how well they like country and PBR&B we ask them how well they like various artists and we get something like the following (a question mark indicates that that customer has not rated that artist):
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 5 | ? | 5 | 2 | 2 |
| Clara | 2 | ? | ? | 4 | 5 |
| Kelsey | 5 | 5 | 5 | 2 | ? |
| Ann | 2 | 3 | ? | 5 | 5 |
| Jessica | 2 | 1 | ? | 5 | ? |
Based on the table of the country and PBR&B influences of various artists and our customer artist rating table we would like to predict how well our customers like country and PBR&B:
| Customer | Country | PBR&B |
|---|---|---|
| Jake | ? | ? |
| Clara | ? | ? |
| Kelsey | ? | ? |
| Ann | ? | ? |
| Jessica | ? | ? |
####Take a moment Can you fill in the table above with just rough guesses?
My feeling is that you can guess these values very accurately. For example, you probably gave Jake a 5 for country and a 2 for PBR&B and Jessica a 1.5 for country and a 5 for PBR&B.
So now we have estimates for how well our customers like country and PBR&B. Jessica comes into our store. She hasn't rated Carrie Underwood or The Weeknd and we are trying to decide which of those two we should recommend to her. Here are the tables we will need:
| Customer | Country | PBR&B |
|---|---|---|
| Jake | 5 | 2 |
| Clara | 2 | 4.5 |
| Kelsey | 5 | 2 |
| Ann | 2.5 | 5 |
| Jessica | 1.5 | 5 |
| Artist | Country | PBR&B |
|---|---|---|
| Taylor Swift | 0.90 | 0.05 |
| Miranda Lambert | 0.98 | 0.00 |
| Carrie Underwood | 0.95 | 0.03 |
| Jhené Aiko | 0.01 | 0.99 |
| The Weeknd | 0.03 | 0.98 |
So, doing the math:

Carrie Underwood: 1.5 × 0.95 + 5 × 0.03 = 1.425 + 0.15 = 1.575

The Weeknd: 1.5 × 0.03 + 5 × 0.98 = 0.045 + 4.9 = 4.945

So we would recommend The Weeknd to Jessica.
I am going to call the table of ratings R:
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 5 | ? | 5 | 2 | 2 |
| Clara | 2 | ? | ? | 4 | 5 |
| Kelsey | 5 | 5 | 5 | 2 | ? |
| Ann | 2 | 3 | ? | 5 | 5 |
| Jessica | 2 | 1 | ? | 5 | ? |
the table of how well users like various features P:
| Customer | Country | PBR&B |
|---|---|---|
| Jake | 5 | 2 |
| Clara | 2 | 4.5 |
| Kelsey | 5 | 2 |
| Ann | 2.5 | 5 |
| Jessica | 1.5 | 5 |
and the table matching artists to various features Q:
| Artist | Country | PBR&B |
|---|---|---|
| Taylor Swift | 0.90 | 0.05 |
| Miranda Lambert | 0.98 | 0.00 |
| Carrie Underwood | 0.95 | 0.03 |
| Jhené Aiko | 0.01 | 0.99 |
| The Weeknd | 0.03 | 0.98 |
We've seen that if we have P and Q we can figure out R. And we have just seen that if we have Q and R we can figure out P. Can we figure out Q, the table indicating the country and PBR&B influences of the artists from R the table of customers rating artists and P the table of how well customers like country and PBR&B?
####Take a moment and decide Take a look at the above tables and see if that is true.
It is the case that if we know how customers rated artists (R) and how well customers like country and PBR&B music (P) we can figure out the country and PBR&B influences of the artists (Q).
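We can see this concretely with a short numpy sketch (my own illustration, not the engine's code): given R and Q, we solve an ordinary least-squares problem for each customer, using only the artists that customer actually rated.

```python
import numpy as np

# Ratings matrix R (np.nan marks a missing rating) and artist-feature matrix Q.
R = np.array([[5, np.nan, 5, 2, 2],
              [2, np.nan, np.nan, 4, 5],
              [5, 5, 5, 2, np.nan],
              [2, 3, np.nan, 5, 5],
              [2, 1, np.nan, 5, np.nan]])
Q = np.array([[0.90, 0.05],   # Taylor Swift: country, PBR&B
              [0.98, 0.00],   # Miranda Lambert
              [0.95, 0.03],   # Carrie Underwood
              [0.01, 0.99],   # Jhené Aiko
              [0.03, 0.98]])  # The Weeknd

# For each customer, find the taste vector p that makes Q[rated] @ p
# as close as possible to that customer's known ratings.
P = np.zeros((R.shape[0], Q.shape[1]))
for i, row in enumerate(R):
    rated = ~np.isnan(row)
    P[i], *_ = np.linalg.lstsq(Q[rated], row[rated], rcond=None)

print(np.round(P, 1))  # each row ≈ (country taste, PBR&B taste)
```

The recovered rows come out close to the rough guesses we made above, for instance Jake lands near 5 for country and near 2 for PBR&B.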
You may be wondering what this has to do with the topic we started this chapter with, namely the ALS algorithm, since in the previous chapter when we built our recommender we didn't specify any features.
For matrix factorization we don't tell the algorithm a preset list of features (female vocal, country, PBR&B, etc.). Instead we give the algorithm a chart (matrix) like the following:
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 5 | ? | 5 | 2 | 2 |
| Clara | 2 | ? | ? | 4 | 5 |
| Kelsey | 5 | 5 | 5 | 2 | ? |
| Ann | 2 | 3 | ? | 5 | 5 |
| Jessica | 2 | 1 | ? | 5 | ? |
and ask the algorithm to extract a set of features from this data. To anthropomorphize even more, it is like asking, "Okay algorithm, given 2 features (or some other number of features), call them feature 1 and feature 2, can you come up with the P and Q matrices?" These extracted features are not going to be something like 'female vocals' or 'country influence'. In fact, we don't care what these features represent; we are simply asking the algorithm to extract features that are hidden in the table above. To make this sound a bit fancier than 'hidden features', data scientists use the Latin verb for 'to lie hidden', latere, and call these latent features.
The inputs to the matrix factorization algorithm are the data in the chart shown above and the number of latent features to use (for example, 2). Our eventual goal is to calculate
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 4.92 | **4.78** | 4.94 | 1.79 | 2.17 |
| Clara | 2.16 | **2.97** | **1.64** | 4.30 | 4.62 |
| Kelsey | 4.98 | 4.91 | 4.96 | 2.12 | **2.52** |
| Ann | 2.04 | 2.99 | **1.45** | 4.79 | 5.13 |
| Jessica | 1.79 | 2.80 | **1.16** | 4.89 | **5.22** |
The bolded numbers are those that were blank in the original chart but predicted by our algorithm. The unbolded numbers are predicted values that have an actual value in the original table. From our original data we see that Jake gave a rating of 5 to both Taylor Swift and Carrie Underwood and we see that the algorithm's estimates for those are 4.92 and 4.94---pretty good! To get these predicted values we use latent features as an intermediary. Let's say we have two features: feature 1 and feature 2. And, to keep things simple, let's just look at how to get Jake's rating of Taylor Swift. Jake's rating is based solely on these two features and for Jake, these features are not equal in importance but are weighed differently. For example, Jake might weigh these features:
| - | Feature 1 | Feature 2 |
|---|---|---|
| Jake | 0.717 | 2.309 |
So feature 2 is much more influential in Jake's rating than feature 1 is.
We are going to have these feature weights for all our users and, again, by convention we call the resulting matrix, P:
| - | Feature 1 | Feature 2 |
|---|---|---|
| Jake | 0.717 | 2.309 |
| Clara | 1.875 | 0.437 |
| Kelsey | 0.861 | 2.288 |
| Ann | 2.10 | 0.295 |
| Jessica | 2.14 | 0.145 |
The word 'matrix' just means a table of numbers, just like we have above.
The other thing we need is how these features are represented in Taylor Swift--how much "Feature 1-iness" is Taylor Swift? So we need a table of weights for the artists and, again by convention, we call this matrix Q:
| Artist | Feature 1 | Feature 2 |
|---|---|---|
| Taylor Swift | 0.705 | 1.913 |
| Miranda Lambert | 1.189 | 1.700 |
| Carrie Underwood | 0.407 | 2.015 |
| Jhené Aiko | 2.276 | 0.072 |
| The Weeknd | 2.419 | 0.191 |
Now back to our task of predicting how Jake will rate Taylor Swift ...
If we want to know how Jake will rate Taylor Swift we take Jake's weights for these features
| - | Feature 1 | Feature 2 |
|---|---|---|
| Jake | 0.717 | 2.309 |
and Taylor Swift's:
| Artist | Feature 1 | Feature 2 |
|---|---|---|
| Taylor Swift | 0.705 | 1.913 |
Multiply together Jake's and Taylor Swift's values for each feature:
| - | Feature 1 | Feature 2 |
|---|---|---|
| Jake | 0.717 | 2.309 |
| Taylor Swift | 0.705 | 1.913 |
| Product | 0.505 | 4.417 |
Then add those products up to get the predicted rating: r = 0.505 + 4.417 = 4.92.
This operation is called the dot product. A list of numbers, for example, Jake's weights for the features: [0.717, 2.309] is called a vector. A dot product is performed on two vectors of equal length and produces a single value. It is defined as follows:
Let A and B be two vectors of equal length n. Then

$$A \cdot B = \sum_{i=1}^{n} a_i b_i$$

So, for example, if

$$A = [0.717, 2.309]$$

and

$$B = [0.705, 1.913]$$

then the dot product of A and B is

$$A \cdot B = 0.717 \times 0.705 + 2.309 \times 1.913 = 0.505 + 4.417 = 4.92$$

So above we determined Jake's rating of Taylor Swift by getting the dot product of Jake, J, and Taylor Swift, S:

$$r = J \cdot S = 4.92$$
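In code, the dot product that produced Jake's predicted rating is a one-liner (a small numpy illustration):

```python
import numpy as np

# Jake's weights for the two latent features, and Taylor Swift's
jake = np.array([0.717, 2.309])
taylor = np.array([0.705, 1.913])

# The dot product multiplies matching entries and sums the products
rating = np.dot(jake, taylor)  # 0.717*0.705 + 2.309*1.913
print(round(float(rating), 2))  # 4.92
```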
And, since I am giving things fancy names in this section: the table of users' weights for the different features is matrix P, and the table of artists' weights is matrix Q. Once we have P and Q it is easy to make predictions.
Great. We now have an estimate of how Jake will rate Taylor Swift. Now we want to do this for all user-artist pairs. To do that we take the artist matrix Q:
| Artist | Feature 1 | Feature 2 |
|---|---|---|
| Taylor Swift | 0.705 | 1.913 |
| Miranda Lambert | 1.189 | 1.700 |
| Carrie Underwood | 0.407 | 2.015 |
| Jhené Aiko | 2.276 | 0.072 |
| The Weeknd | 2.419 | 0.191 |
and flip it so the artists run across the top:
| feature: | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| 1 | 0.705 | 1.189 | 0.407 | 2.276 | 2.419 |
| 2 | 1.913 | 1.700 | 2.015 | 0.072 | 0.191 |
This flipping of the table (or matrix) is called transposing the matrix. If the original matrix is called Q, the transpose is written $Q^T$. To get all the estimated ratings at once we multiply P by $Q^T$:

$$\hat{R} = P \times Q^T$$

or in our case of customers and artists:
$$\hat{R} =\begin{bmatrix} 0.717 & 2.309 \\ 1.875 & 0.437 \\ 0.861 & 2.288 \\ 2.100 & 0.295 \\ 2.140 & 0.145 \end{bmatrix} \times \begin{bmatrix} 0.705 & 1.189 & 0.407 & 2.276 & 2.419 \\ 1.913 & 1.700 & 2.015 & 0.072 & 0.191 \end{bmatrix}$$ and when we do this multiplication we will get the filled-in version of our estimated ratings table:
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | - | - | - | - | - |
| Clara | - | - | - | - | - |
| Kelsey | - | - | - | - | - |
| Ann | - | - | - | - | - |
| Jessica | - | - | - | - | - |
Here is how we multiply matrices P and $Q^T$. To get the estimated value for row one, column one (Jake's rating of Taylor Swift) we take the dot product of the first row of P and the first column of $Q^T$:
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 4.92 | - | - | - | - |
| Clara | - | - | - | - | - |
| Kelsey | - | - | - | - | - |
| Ann | - | - | - | - | - |
| Jessica | - | - | - | - | - |
To get the estimated value for row one, column two (Jake's rating of Miranda Lambert) we take the dot product of the first row of P and the second column of $Q^T$:
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 4.92 | 4.78 | - | - | - |
| Clara | - | - | - | - | - |
| Kelsey | - | - | - | - | - |
| Ann | - | - | - | - | - |
| Jessica | - | - | - | - | - |
and so on.
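In fact one matrix multiplication computes every dot product at once. A short numpy sketch, with values copied from the P and Q tables above (tiny differences from the printed table come from rounding the weights shown):

```python
import numpy as np

P = np.array([[0.717, 2.309],   # Jake
              [1.875, 0.437],   # Clara
              [0.861, 2.288],   # Kelsey
              [2.100, 0.295],   # Ann
              [2.140, 0.145]])  # Jessica
Q = np.array([[0.705, 1.913],   # Taylor Swift
              [1.189, 1.700],   # Miranda Lambert
              [0.407, 2.015],   # Carrie Underwood
              [2.276, 0.072],   # Jhené Aiko
              [2.419, 0.191]])  # The Weeknd

# Entry (i, j) of P @ Q.T is row i of P dotted with row j of Q
R_hat = P @ Q.T
print(np.round(R_hat, 2))
```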
Once we have P and Q it is easy to generate estimated ratings. But how do we get these matrices?
###How do we get Matrices P and Q? There are several common ways to derive these matrices. One method is called stochastic gradient descent. The basic idea is this. We are going to randomly select values for P and Q. For example, we would randomly select initial values for Jake:
Jake = [0.03, 0.88]
and randomly select initial values for Taylor Swift:
Taylor = [ 0.73, 0.49]
So with those initial values we get a prediction of

0.03 × 0.73 + 0.88 × 0.49 = 0.45

which is a particularly bad guess considering Jake really gave Taylor Swift a '5'. So we adjust those values. We underestimated Jake's rating of Taylor Swift so we boost the numbers, maybe something like:
Jake = [0.12, 0.83]
Taylor = [ 0.80, 0.47]
and now we get:

0.12 × 0.80 + 0.83 × 0.47 = 0.49

That is better than before but we still underestimated, so we adjust and try again. And adjust and try again. We repeat this process thousands of times until our predicted values get close to the actual values. The general algorithm is
- generate random values for the P and Q matrices
- using these P and Q matrices estimate the ratings (for ex., Jake's rating of Taylor Swift).
- compute the error between the actual rating and our estimated rating (for example, Jake actually gave Taylor Swift a '5' but using P and Q we estimated the rating to be 0.45, so our error was 4.55).
- using this error adjust P and Q to improve our estimate
- If our total error rate is small enough or we have gone through a bunch of iterations (for ex., 4000) terminate the algorithm. Else go to step 2.
For how simple this algorithm is, it works surprisingly well.
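The five steps above can be sketched in plain Python (an illustration only; the learning rate, iteration count, and variable names are my own choices, not anything from the template):

```python
import random

# R holds the known ratings; None marks a missing rating.
R = [[5, None, 5, 2, 2],
     [2, None, None, 4, 5],
     [5, 5, 5, 2, None],
     [2, 3, None, 5, 5],
     [2, 1, None, 5, None]]
n_features = 2
lr = 0.01  # learning rate: how big each adjustment step is

# step 1: generate random values for P and Q
random.seed(3)
P = [[random.random() for _ in range(n_features)] for _ in range(len(R))]
Q = [[random.random() for _ in range(n_features)] for _ in range(len(R[0]))]

for _ in range(4000):  # step 5: repeat until "good enough"
    for u in range(len(R)):
        for i in range(len(R[0])):
            if R[u][i] is None:
                continue
            # step 2: estimate the rating; step 3: compute the error
            err = R[u][i] - sum(P[u][f] * Q[i][f] for f in range(n_features))
            # step 4: nudge P and Q in the direction that shrinks the error
            for f in range(n_features):
                p_old = P[u][f]
                P[u][f] += lr * err * Q[i][f]
                Q[i][f] += lr * err * p_old

# Jake's predicted rating of Taylor Swift should now be close to his actual 5
jake_taylor = sum(P[0][f] * Q[0][f] for f in range(n_features))
print(round(jake_taylor, 2))
```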
The other method of estimating P and Q is called alternating least squares, or ALS, and this is the one our PredictionIO recommendation engine uses.
##Alternating Least Squares
Let's remind ourselves of the task. We are given matrix R and we would like to estimate P and Q. Recall that earlier in the chapter we determined that if we had 2 of the matrices we could determine the third. The example used in that section was that if we had the matrix of users rating different artists
| Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | The Weeknd |
|---|---|---|---|---|---|
| Jake | 5 | ? | 5 | 2 | 2 |
| Clara | 2 | ? | ? | 4 | 5 |
| Kelsey | 5 | 5 | 5 | 2 | ? |
| Ann | 2 | 3 | ? | 5 | 5 |
| Jessica | 2 | 1 | ? | 5 | ? |
and the matrix of how much country and PBR&B influences those artists exhibit
| Artist | Country | PBR&B |
|---|---|---|
| Taylor Swift | 0.90 | 0.05 |
| Miranda Lambert | 0.98 | 0.00 |
| Carrie Underwood | 0.95 | 0.03 |
| Jhené Aiko | 0.01 | 0.99 |
| The Weeknd | 0.03 | 0.98 |
we could determine how much each customer liked country and PBR&B. We also saw that if we had the matrix of users rating different artists and a matrix representing how much each customer liked country and PBR&B, then we could determine the country and PBR&B influences for each artist. So as long as we had two of the matrices we could determine the third. Unfortunately we only have one of the matrices, not two. It seems we are stuck. But there is a way to bootstrap our way out of this problem. We simply randomly guess the values of one of the other two matrices. Suppose we guess at the values of Q. Now we have two matrices, R and Q, and we can determine P. Now we have all three matrices but for Q we just took a random guess, so it is a bit dodgy. But that is okay because now we have P, and with P and R we can determine a better guess for Q. So we do that.
Now P is a bit dodgy as well since it was based on our original wild guess for Q, but now that we have a better value for Q we can use that to recompute P. So the algorithm is this.
- compute random values for Q
- use that and R to compute P
- use that and R to compute Q
- while we haven't reached the max number of iterations go to 2.
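The steps above can be sketched with numpy's least-squares solver (an illustration of the idea, not PredictionIO's actual implementation, and without the regularization a production ALS would add):

```python
import numpy as np

# Known ratings; np.nan marks a missing rating.
R = np.array([[5, np.nan, 5, 2, 2],
              [2, np.nan, np.nan, 4, 5],
              [5, 5, 5, 2, np.nan],
              [2, 3, np.nan, 5, 5],
              [2, 1, np.nan, 5, np.nan]])
n_features = 2

rng = np.random.default_rng(3)
Q = rng.random((R.shape[1], n_features))   # step 1: random values for Q
P = np.zeros((R.shape[0], n_features))

for _ in range(20):                        # like numIterations
    for u in range(R.shape[0]):            # step 2: fix Q, solve for P
        rated = ~np.isnan(R[u])
        P[u], *_ = np.linalg.lstsq(Q[rated], R[u, rated], rcond=None)
    for i in range(R.shape[1]):            # step 3: fix P, solve for Q
        rated = ~np.isnan(R[:, i])
        Q[i], *_ = np.linalg.lstsq(P[rated], R[rated, i], rcond=None)

print(np.round(P @ Q.T, 2))  # estimated ratings; known entries fit closely
```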
The word alternating in alternating least squares refers to us alternating our computations of P and Q. Now you may be wondering what the least squares bit means. We are given the matrix R and initially we compute random values for Q. Then, using R and Q, we are going to compute the best possible P. But what do we mean by best possible? Let's say R, the matrix of ratings customers gave artists, is:
| Customer | Taylor Swift | Miranda Lambert | Jhené Aiko | The Weeknd |
|---|---|---|---|---|
| Jake | 5 | ? | 2 | 2 |
| Ann | 2 | ? | 5 | ? |
and using our candidate P and Q matrices we get the estimates
| Customer | Taylor Swift | Miranda Lambert | Jhené Aiko | The Weeknd |
|---|---|---|---|---|
| Jake | 2.5 | ? | 1.0 | 3.5 |
| Ann | 5.0 | ? | 3.5 | ? |
So how good of a guess is it? Well, let's see. Jake really gave Taylor Swift a 5 and our algorithm predicted a 2.5. That is 2.5 off. And Jake gave Jhené Aiko a 2 and our algorithm predicted a 1. That is 1 off. It sounds like a good idea might be to add up all these differences. The one twist is that we square each difference. So for each artist we subtract what our algorithm predicted from Jake's actual rating and square that. Then we sum those up:

$$(5 - 2.5)^2 + (2 - 1.0)^2 + (2 - 3.5)^2 = 6.25 + 1 + 2.25 = 9.5$$

The smaller this number, the closer our estimate is to Jake's real ratings. This sum of squared errors is what least squares refers to: each time we recompute P (or Q) we solve for the values that make this number as small as possible.
Recall that when we looked at the engine.json file we saw several parameters being sent to the ALS algorithm:
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.zacharski.RecommendationEngine",
  "datasource": {
    "params" : {
      "appName": "MusicApp"
    }
  },
  "algorithms": [
    {
      "name": "als",
      "params": {
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
}
#####rank The rank parameter specifies how many latent features to use. As you can see, when we built our recommender in chapter one we used 10 latent features. The more latent features, the better the accuracy of the recommender. For example, one seminal paper in this area, Large-scale Parallel Collaborative Filtering for the Netflix Prize by Zhou et al., used 1,000 features. However, the more latent features you use, the more memory your recommender will take. In most cases a value between 10 and 500 is reasonable.
#####numIterations As you can guess, this parameter is the number of iterations (repetitions of steps 2 and 3 above) the algorithm executes. You might think that more iterations give better results, but surprisingly the algorithm needs very few iterations to produce good values for the P and Q matrices. Generally, a value between 10 and 20 will be sufficient.
#####lambda In many machine learning algorithms, we train the algorithm on some set of data (called the training set) and, once it is trained, use it on new data. There is a danger that the algorithm becomes so highly tuned to the training data that it performs poorly on the new data. This is called overfitting. In the ALS algorithm, one potential cause of overfitting is latent features taking on extreme values, and lambda controls how heavily we penalize those extreme values. The greater the lambda, the more the algorithm dislikes solutions that contain extreme values.
#####seed Recall that the first step of the algorithm is to generate random values for one of the matrices. Generally, each time we run the algorithm we get different initial values, and because these initial values differ, the final results can vary as well. Often when we are testing, we want the algorithm to be deterministic and produce exactly the same results each time we run it. To do so, we pass in a value for this seed parameter. In the example above, we passed the value 3 as the seed.
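A quick sketch of what seeding buys us, in plain Python (just for illustration):

```python
import random

# Seeding the random number generator makes the "random" initial values,
# and therefore the whole run, reproducible.
random.seed(3)
first_run = [round(random.random(), 3) for _ in range(4)]
random.seed(3)
second_run = [round(random.random(), 3) for _ in range(4)]
print(first_run == second_run)  # True: same seed, same initial values
```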
Before moving on, let's try building a recommender with a moderate sized dataset. The MovieLens Latest dataset contains slightly over 20 million ratings of about 27,000 movies by 229,000 users. You already know all the steps to build a recommender. Can you do so with this data? The ratings.csv file contains lines like

1,122,5,838985046

which means that person 1 gave movie 122 a rating of 5 at Unix timestamp 838985046 (Friday, August 2, 1996 at 11:24 GMT). The file movies.csv matches those movie IDs with actual movie titles (movie 122 in this case is Boomerang). We could import our data as before using a similar Python script, but there is a faster way to load large datasets. We will first use the following Python script to convert the movie data into a JSON file:
import csv
import predictionio

exporter = predictionio.FileExporter(file_name="my_events.json")

def readMovieInfo(filename):
    # Build a map from movie id to movie title. We use the csv module so
    # that titles containing commas (which are quoted in movies.csv) parse
    # correctly.
    movies = {}
    with open(filename) as f:
        for entry in csv.reader(f):
            movies[entry[0]] = entry[1]
    return movies

def import_events():
    movieInfo = readMovieInfo('ml-latest/movies.csv')
    count = -1   # start at -1 so the header line of ratings.csv is skipped
    with open('ml-latest/ratings.csv') as f:
        for line in f:
            if count >= 0:
                data = line.rstrip('\r\n').split(',')
                exporter.create_event(
                    event="rate",
                    entity_type="user",
                    entity_id=data[0],
                    target_entity_type="item",
                    target_entity_id=movieInfo[data[1]],
                    properties={"rating": float(data[2])}
                )
            count += 1
            if count % 1000 == 0:
                print(count)
    print("%s events are imported." % count)
    exporter.close()

import_events()
The contents of the output json file look like:
{"eventTime": "2015-05-11T02:08:42.917+0000", "entityType": "user", "targetEntityType": "item", "event": "rate", "entityId": "1", "targetEntityId": "Interview with the Vampire: The Vampire Chronicles (1994)", "properties": {"rating": 3.0}}
{"eventTime": "2015-05-11T02:08:42.918+0000", "entityType": "user", "targetEntityType": "item", "event": "rate", "entityId": "1", "targetEntityId": "Jaws (1975)", "properties": {"rating": 3.0}}
{"eventTime": "2015-05-11T02:08:42.918+0000", "entityType": "user", "targetEntityType": "item", "event": "rate", "entityId": "1", "targetEntityId": "Scream (1996)", "properties": {"rating": 5.0}}
{"eventTime": "2015-05-11T02:08:42.918+0000", "entityType": "user", "targetEntityType": "item", "event": "rate", "entityId": "1", "targetEntityId": "Scream 2 (1997)", "properties": {"rating": 5.0}}
Now we can load the data into PredictionIO with the command
pio import --appid 1 --input my_events.json
(or whatever the appid is in your case). On my, not particularly fast, machine it took slightly under 15 minutes.
As before we create a recommendation Engine:

- execute pio template get PredictionIO/template-scala-parallel-recommendation movieRecommender
- edit the engine.json file to add the app name
- build the system by executing pio build
Before we train our system we will do one additional thing.
####Increasing the memory allocated to training By default the amount of memory allocated for training is 512MB. Let's bump that up a bit. I am going to allocate 24GB, but use a number that is right for your system.
- Within the PredictionIO directory there is the directory vendors/spark-1.3.0/conf. Within that folder make a copy of spark-defaults.conf.template and call it spark-defaults.conf.
- Edit the spark-defaults.conf file by uncommenting the spark.driver.memory line and changing the value from 5g to 24g.
- Add the line spark.executor.memory 24g.
- Save the file.
The resulting file should look like
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 24g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.executor.memory 24g
Let's go ahead and train our system.
pio train
Depending on your machine training can take a long time. On my machine it took a bit under 15 minutes.
So we are sitting at a bar with our friends, all of whom also built recommenders for the MovieLens data, and we get to arguing about whose recommender is better. Is there an objective measure we could use to settle the argument? A bit ago we talked about using a distance measure: if Jake's real rating of Taylor Swift was a 5, our recommender predicted a 4.5, and Mary's recommender predicted a 4.75, then Mary's recommendation is better since it is closer to the real rating. But maybe Jake's rating of Taylor Swift was a fluke and our recommender would do better on some other rating, say Clara's rating of Jhené Aiko. So we will test on a set of data where we know the ratings. Plus, we will test on data that hasn't been seen by the recommender. And we will compute the average distance between our estimates and the real values. This is called root-mean-square deviation (RMSD) and the formula is as follows:

$$RMSD = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(r_i - \hat{r}_i)^2}$$

where n is the number of ratings in our test data, $r_i$ is an actual rating, and $\hat{r}_i$ is our recommender's prediction of that rating.
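As a quick sketch of the computation (the function name rmsd is my own):

```python
from math import sqrt

# Root-mean-square deviation: average the squared differences between
# actual and predicted ratings, then take the square root.
def rmsd(actual, predicted):
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# The bar argument: Jake really rated Taylor Swift a 5; our recommender
# said 4.5 and Mary's said 4.75. Mary's is closer, so her RMSD is lower.
print(rmsd([5], [4.5]))   # 0.5
print(rmsd([5], [4.75]))  # 0.25
```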
####The effect of the number of latent features A bit ago when we were looking at the engine.json file:
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.zacharski.RecommendationEngine",
  "datasource": {
    "params" : {
      "appName": "MusicApp"
    }
  },
  "algorithms": [
    {
      "name": "als",
      "params": {
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
}
Here, rank refers to the number of latent features. I mentioned that the more latent features, the more accurate the recommender. Let's see if this is the case with a small experiment. Before I loaded the MovieLens data I removed 10 ratings, so when the recommender was trained it did not see these ratings. My plan was to train recommenders that use 2, 10, 50, and 100 latent features and see which does better using RMSD. Here is the raw data of the results:
| User | Movie | Actual | 2 | 10 | 50 | 100 |
|---|---|---|---|---|---|---|
| 2 | Minority Report | 4 | 3.504 | 3.705 | 3.638 | 3.148 |
| 10 | Star Wars IV | 5 | 3.760 | 3.858 | 4.650 | 5.061 |
| 33 | KungFu Panda | 4 | 4.272 | 4.160 | 4.042 | 4.363 |
| 58 | Gravity | 2.5 | 3.597 | 3.293 | 3.265 | 3.043 |
| 89 | Interstellar | 5 | 5.121 | 4.867 | 4.876 | 5.020 |
| 101 | Harry Potter -Chamber | 4 | 4.079 | 4.280 | 3.802 | 4.145 |
| 125 | Mulholland Drive | 3 | 2.622 | 2.410 | 2.962 | 1.877 |
| 156 | Frozen | 5 | 3.281 | 3.703 | 3.618 | 4.645 |
| 211 | Edge of Tomorrow | 4 | 3.476 | 3.044 | 3.598 | 3.819 |
| 872 | Eternal Sunshine | 4 | 4.247 | 4.393 | 3.623 | 3.718 |
And here is the resulting RMSD:
| latent features | 2 | 10 | 50 | 100 |
|---|---|---|---|---|
| RMSD | 0.80755 | 0.7239 | 0.5576 | 0.5174 |
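As a sanity check, we can recompute the 2-feature RMSD from the raw table with a small Python sketch (the tiny difference from the reported 0.80755 presumably comes from the predictions being rounded to three decimal places in the table):

```python
from math import sqrt

# Actual ratings and the 2-latent-feature predictions from the table above
actual    = [4, 5, 4, 2.5, 5, 4, 3, 5, 4, 4]
predicted = [3.504, 3.760, 4.272, 3.597, 5.121, 4.079,
             2.622, 3.281, 3.476, 4.247]

rmsd = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(round(rmsd, 4))
```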
So our tiny experiment supports the previous substantial evidence that the more latent features the better the accuracy. Keep in mind that our experiment with only 10 test cases is tiny. It may offer support in a bar discussion, or may be a good late-night test to make sure we are not doing something screwy, but as we will see later, there are better ways to evaluate algorithm performance.