In this tutorial we are going to build a simple movie recommendation service using Azure Machine Learning Studio. If you are not familiar with Azure Machine Learning Studio, read the Getting Started with Azure Machine Learning Studio tutorial to learn a little about machine learning and how to use the Azure Machine Learning Studio service.
What is the Train Matchbox Recommender
We are going to use the Train Matchbox Recommender. The Train Matchbox Recommender module reads a dataset of user-item-rating triples and, optionally, some user and item features. It returns a trained Matchbox recommender. You can then use the trained model to generate recommendations, find related users, or find related items, by using the Score Matchbox Recommender module.
The main aim of a recommendation system is to recommend one or more items to users of the system. Examples of an item could be a movie, restaurant, book, or song. A user could be a person, group of persons, or other entity with item preferences.
There are two principal approaches to recommender systems.
- The first is the content-based approach, which makes use of features for both users and items. Users may be described by properties such as age and gender, and items may be described by properties such as author and manufacturer. Typical examples of content-based recommendation systems can be found on social matchmaking sites.
- The second approach is collaborative filtering, which uses only identifiers of the users and the items and obtains implicit information about these entities from a (sparse) matrix of ratings given by the users to the items. We can learn about a user from the items they have rated and from other users who have rated the same items.
The Matchbox recommender combines these approaches, using collaborative filtering with a content-based approach. It is therefore considered a hybrid recommender.
How this works: When a user is relatively new to the system, predictions are improved by making use of the feature information about the user, thus addressing the well-known “cold-start” problem. However, once you have collected a sufficient number of ratings from a particular user, it is possible to make fully personalized predictions for them based on their specific ratings rather than on their features alone. Hence, there is a smooth transition from content-based recommendations to recommendations based on collaborative filtering. Even if user or item features are not available, Matchbox will still work in its collaborative filtering mode.
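The smooth transition described above can be pictured with a small sketch. This is not how Matchbox is implemented; it is only an illustration in which the function name blended_prediction, the weighting formula, and the constant k are all assumptions made up for this example.

```python
def blended_prediction(feature_based, personalized, n_ratings, k=5.0):
    """Blend two rating estimates for one user-item pair.

    feature_based: estimate derived from user/item features only.
    personalized:  estimate derived from the user's own ratings.
    n_ratings:     how many ratings the user has given so far.
    k:             hypothetical constant controlling how fast we trust
                   the personalized estimate (an assumption, not a
                   Matchbox parameter).
    """
    weight = n_ratings / (n_ratings + k)
    return (1.0 - weight) * feature_based + weight * personalized


# A brand-new user relies entirely on features; a heavy rater
# relies almost entirely on their own rating history.
new_user = blended_prediction(3.0, 5.0, n_ratings=0)
active_user = blended_prediction(3.0, 5.0, n_ratings=1000)
```

With no ratings the result equals the feature-based estimate, and as the number of ratings grows it approaches the personalized one.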
We are going to use the IMDB sample datasets provided in Azure ML Studio for the recommendation service, so all you have to do is to open a browser and get started.
Create a new experiment
Open the Azure Machine Learning Studio and create a new blank experiment.
This is the first image you will see:
I renamed the experiment CodeStories Recommender; you can give your experiment any name you like.
Then, in the Datasets section of the left column, find the two datasets IMDB Movie Titles and Movie Ratings and drag and drop them into the experiment area.
To see the dataset contents, right-click the dataset and click Visualize.
Then you can see the dataset content.
Next, add the Edit Metadata module and connect it to the first dataset as in the following image. To connect the dataset, hover over the dot you want to connect, then click and drag to the destination dot on the Edit Metadata module.
This module edits metadata associated with columns in a dataset. Typical metadata changes might include:
- Treating Boolean or numeric columns as categorical values
- Indicating which column contains the class label, or the values you want to categorize or predict
- Marking columns as features
- Changing date/time values to a numeric value, or vice versa
- Renaming columns
Use Edit Metadata any time you need to modify the definition of a column, typically to meet requirements for a downstream module. For example, some modules can work only with specific data types, or require flags on the columns, such as IsFeature or IsCategorical. After performing the required operation, you can reset the metadata to its original state. Here we are going to use the Edit Metadata to convert the Rating Column into an integer, so it can be used by the machine learning algorithm.
Click on the module and then click the Launch column selector button in the pane that loads on the right. Choose Rating and close the window. Then, in the first dropdown of the right pane, choose Data type → Integer.
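For readers who think in code, the effect of this step is roughly the conversion sketched below. It assumes the ratings arrive as text or floating-point values that represent whole numbers; the helper name rating_to_int is hypothetical, and the studio module may treat fractional values differently.

```python
def rating_to_int(value):
    """Convert a rating stored as text or float into an integer.

    Raises ValueError for fractional ratings instead of silently
    truncating them (a choice made for this sketch only).
    """
    number = float(value)
    if not number.is_integer():
        raise ValueError(f"rating {value!r} is not a whole number")
    return int(number)


# Hypothetical raw values as they might appear in a ratings column.
raw_ratings = ["8", "10", 7.0]
int_ratings = [rating_to_int(r) for r in raw_ratings]
```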
Add the Join Data module, which joins two datasets. In the right pane choose MovieId from Movie Ratings and Movie ID from IMDB Movie Titles as the join keys, so the results can show the title instead of the movie ID.
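Conceptually, the join behaves like the lookup sketched below. The sketch assumes an inner join (ratings without a matching title are dropped); the identifiers m1, m2, m9 and the user IDs are invented sample values, not rows from the IMDB datasets.

```python
def join_ratings_with_titles(ratings, titles):
    """Inner join: keep each rating whose MovieId has a matching title row."""
    name_by_id = {t["Movie ID"]: t["Movie Name"] for t in titles}
    joined = []
    for r in ratings:
        if r["MovieId"] in name_by_id:
            # Copy the rating row and attach the movie title to it.
            joined.append({**r, "Movie Name": name_by_id[r["MovieId"]]})
    return joined


# Hypothetical sample rows for illustration only.
ratings = [
    {"UserId": "u1", "MovieId": "m1", "Rating": 9},
    {"UserId": "u2", "MovieId": "m2", "Rating": 8},
    {"UserId": "u3", "MovieId": "m9", "Rating": 5},  # no matching title
]
titles = [
    {"Movie ID": "m1", "Movie Name": "The Shawshank Redemption"},
    {"Movie ID": "m2", "Movie Name": "The Godfather"},
]
joined = join_ratings_with_titles(ratings, titles)
```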
It is very important that the input data used for training contain the right type of data in the correct format:
- The first column must contain user identifiers.
- The second column must contain item identifiers.
- The third column contains the rating for the user-item pair. Rating values must be either numeric or categorical.
During training, the rating values cannot all be the same. Moreover, if numeric, the difference between the minimum and the maximum rating values must be less than 100, and ideally not greater than 20.
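The rating constraints above can be checked up front with a few lines of code. This is a minimal sketch, assuming the data is available as (user, item, rating) tuples; the function name validate_ratings is hypothetical.

```python
def validate_ratings(triples):
    """Return a list of problems found in (user, item, rating) triples."""
    ratings = [rating for _user, _item, rating in triples]
    problems = []

    # Training fails if every rating has the same value.
    if len(set(ratings)) < 2:
        problems.append("fewer than two distinct rating values")

    # For numeric ratings, the range must stay below 100.
    if ratings and all(isinstance(r, (int, float)) for r in ratings):
        spread = max(ratings) - min(ratings)
        if spread >= 100:
            problems.append(f"rating range {spread} is not less than 100")

    return problems


# Hypothetical triples: ratings from 1 to 10 pass both checks.
ok = validate_ratings([("u1", "m1", 1), ("u2", "m1", 10)])
```

An empty result means both checks passed; each string in the list names one violated constraint.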
Add the Select Columns in Dataset Module to select the proper columns for training. Select the columns UserId, Movie Name, Rating.
Then add the Remove Duplicate Rows module to remove duplicates, because each user-item pair must be unique: a user cannot have more than one rating for the same movie. Use the combination of UserId and Movie Name in the column selector as shown below.
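In code terms, removing duplicates on a combination of columns looks roughly like this. The sketch keeps the first row seen for each key, which is an assumption about how you configure the module; the sample rows are invented.

```python
def remove_duplicate_rows(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: u1 rated the same movie twice.
rows = [
    {"UserId": "u1", "Movie Name": "The Godfather", "Rating": 9},
    {"UserId": "u1", "Movie Name": "The Godfather", "Rating": 7},
    {"UserId": "u2", "Movie Name": "The Godfather", "Rating": 8},
]
deduplicated = remove_duplicate_rows(rows, ["UserId", "Movie Name"])
```

Passing only ["UserId"] as the key would instead keep one row per user.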
Next we need to divide the source data into training and testing datasets. We use the training dataset to train our recommender model, and then we use the testing dataset to test and score the results. To accomplish this we are going to use the Split Data module. This module is particularly useful when you need to separate data into training and testing sets. You can customize the way the data is divided as well. Some options support randomization of data; others are tailored for a certain data type or model type.
Add the Split Data module and use the Recommender Split option.
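To give a feel for what a recommender-oriented split tries to achieve, here is a simplified sketch. It is not the algorithm behind the Recommender Split option; it only illustrates one common idea, holding out part of each user's ratings so that every user in the test set also appears in the training set. The function name, the 0.75 fraction, and the sample data are assumptions.

```python
import random
from collections import defaultdict


def split_per_user(triples, train_fraction=0.75, seed=0):
    """Split (user, item, rating) triples per user into train and test."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for triple in triples:
        by_user[triple[0]].append(triple)

    train, test = [], []
    for user_ratings in by_user.values():
        shuffled = user_ratings[:]
        rng.shuffle(shuffled)
        # Always keep at least one rating per user for training.
        cut = max(1, round(len(shuffled) * train_fraction))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Hypothetical data: u1 has four ratings, u2 has one.
triples = [
    ("u1", "m1", 9), ("u1", "m2", 8), ("u1", "m3", 6), ("u1", "m4", 7),
    ("u2", "m1", 10),
]
train, test = split_per_user(triples)
```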
The next step is to add the Train Matchbox Recommender module to train our recommender model. You can leave the options at their defaults for now.
To test and score our recommendation engine we need to use the testing dataset as well. The testing dataset holds the users we want to suggest movies to, so we need to remove the ratings from it. All we need are the UserIds that the engine will suggest movies to.
So drag and drop another Remove Duplicate Rows module, this time using only the UserId column.
Then add the Partition and Sample Module. It creates multiple partitions of a dataset based on sampling.
Sampling is an important tool in machine learning because it lets you reduce the size of a dataset while maintaining the same ratio of values. This module supports several related tasks that are important in machine learning:
- Dividing your data into multiple subsections of the same size.
You might use the partitions for cross-validation, or to assign cases to random groups.
- Separating data into groups and then working with data from a specific group.
After randomly assigning cases to different groups, you might need to modify the features that are associated with only one group.
You can extract a percentage of the data, apply random sampling, or choose a column to use for balancing the dataset and perform stratified sampling on its values.
- Creating a smaller dataset for testing.
If you have a lot of data, you might want to use only the first n rows while setting up the experiment, and then switch to using the full dataset when you build your model. You can also use sampling to create a smaller dataset for use in development.
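Two of the tasks listed above, random sampling and dividing data into partitions of equal size, can be sketched with the Python standard library. The function names and the fixed seed are assumptions for this illustration, not settings of the Partition and Sample module.

```python
import random


def sample_rows(rows, rate, seed=0):
    """Return a random sample containing roughly rate * len(rows) rows."""
    rng = random.Random(seed)
    k = round(len(rows) * rate)
    return rng.sample(rows, k)


def assign_to_folds(rows, n_folds, seed=0):
    """Shuffle rows and deal them into n_folds groups of near-equal size."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


# Hypothetical list of ten user IDs.
user_ids = [f"u{i}" for i in range(10)]
small_sample = sample_rows(user_ids, rate=0.3)
folds = assign_to_folds(user_ids, n_folds=3)
```

Fixing the seed makes the sample reproducible between runs, which helps when comparing experiments.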
Then use the Select Columns in Dataset module to select only the UserId column, as we did above. Finally, add the Score Matchbox Recommender module using the following options:
- Item Recommendation
- From All Items
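Item recommendation from all items amounts to ranking every item by its predicted rating for a user and returning the best ones. The sketch below shows that ranking step only; the predicted scores are invented numbers standing in for the trained model's output, and the function name recommend_top_n is hypothetical.

```python
def recommend_top_n(predicted_ratings, n=5):
    """Map each user to their n highest-scoring items, best first.

    predicted_ratings: {user: {item: predicted score}}
    """
    recommendations = {}
    for user, scores in predicted_ratings.items():
        ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
        recommendations[user] = [item for item, _score in ranked[:n]]
    return recommendations


# Hypothetical predicted scores for a single user.
predicted = {
    "u1": {
        "The Godfather": 8.7,
        "The Shawshank Redemption": 9.1,
        "Movie C": 6.0,
    },
}
top_two = recommend_top_n(predicted, n=2)
```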
View and Evaluate Results
From the bottom ribbon choose SAVE to save your experiment and RUN to run it. A tick next to a module means that step has executed successfully. Otherwise Azure ML Studio will show you the error and a link to its explanation and troubleshooting steps.
After the experiment has run successfully, right-click the Score Matchbox Recommender module and choose Scored dataset > Visualize to review the results.
Now we can see the results on the Testing Dataset. What has our model suggested to those users?
We can see that our model will most likely suggest The Shawshank Redemption as a first choice and then The Godfather. If we are not satisfied, we can add more inputs to tailor the recommender so that it suggests more relevant content to our users. For example, we can add another dataset with user demographics as input to personalize each user's suggestions further. You can retrain your model and customize it as much as you want.
Once you are happy with your model the next step is to publish it as a web service. Click the following link to go to the third part of this series