Spatial similarity as a dimension in Recommender Systems

Location can often play a significant role in a consumer’s lifestyle and her purchasing decisions; so, it forms a vital input to how better we can recommend products or services to her. Perhaps, not so much when it comes to recommending movies or mass consumer products. But, for those companies that sell home furnishings, custom-design goods, design merchandise for homes, goods for outdoors and related activities, location of the consumer matters. For instance, homes in New York, San Francisco Bay Area, Seattle, and Texas differ in their style and space; these factors and also the weather certainly influence the moods and tastes for furnishing the homes. Even in a country like India with diverse cultures and lifestyles, the everyday dresses and jewelry that women wear vastly differ from state to state and location to location. At the eCommerce company that I have been part of, our analyses showed that a geographical location determined not only the number of users, number of orders, and amount revenue, but also the type of merchandize bought (color, style, size, etc.). In this context, collaborative filtering based recommending of products or services that are also considered (viewed/liked/bought) by other users in a nearby location has an upside.

So, how can we build recommender systems like “people similar to you and in your vicinity also viewed these” or “people similar to you and in your vicinity also liked these“, that take spatial similarity (location proximity) between two users into account?

The location of a user can be obtained from a GeoIP technologies, where the user’s IP address is used to obtain the location with a precision that can depend on the SLA with a service provider (external link). That is, the user’s precise latitude-longitude (lat,long) coordinates can be used to build a greater accuracy into the recommender system, or can choose to use the lat-long of the center of user’s city/zipcode which is an approximation of the user’s actual location.

To keep the discussion on point, we assume that you already have the user-item preference matrix built for existing recommender systems (for starters, the user-item matrix can be written with all observation data with each observation having a tuple <user_id, item_id, preference_score>; the data file format can depend on what tool you use to build recommender systems: for example, Apache Mahout consumes a csv file). The matrix is of size n x m, with n number of users and m number of items. The preference score can be built either from user’s implicit actions (view, like, add-to-cart, etc.) or explicit feedback (ratings, reviews, etc.), where most often the former matrix is less sparse than the latter.

Given the user-item matrix and each user’s location (lat,long), here we discuss two possible approaches to build the recommendation model with spatial similarity. The approaches are discussed here using a matrix factorization method; Alternate Least Squares (ALS) method is a preferred choice for matrix factorization (external link) among the latent factor model based collaborative filtering techniques, especially for the less sparse implicit data matrix.

1. Using the ALS matrix factorization method, the n x m user-item matrix U can first be factorized into two vectors, a n x k user vector P and a m x k item vector Q, where k is the number of dimensions, such that U = P * Transpose(Q). The user vector P has n rows, with each row defining a user with k attributes/dimensions. In the conventional method, for a given user the most similar user among remaining n-1 users is the one with nearest values of the k attributes (for simplicity, it can be the Euclidean distance between the two sets of k values). Now, we can factor the k-value distance with the geographical distance between the two users (again, that can be the Euclidean distance using the lat-long coordinates of the two users). The larger the geographical distance between the two users the more the k-value distance is inflated pushing farther in similarity. A smaller geographical distance between the two users will have the opposite effect. Thus, the weighted value ingests the spatial similarity between two users.

The above method can be challenging for one main reason. The same scalability reason that user-user collaborative filtering models are less preferred compared to the item-item collaborative filtering techniques (in addition to the fact that user properties are less static than item properties). That is, for most organizations the number of users will be in the order of millions while the number of items will be in the orders of tens or hundreds of thousands, and so user-user computations are less scalable than the item-item computations.

2. To overcome the above computational scalability problem, another approach is to create clusters of users based on the lat-long data. Once the user base is divided into C clusters (using k-means or other preferred techniques), one can create a user-item matrix separately for each cluster. Then for each matrix, perform the matrix factorization to derive the user vector and the item vector. The user vector can now be used to find a most similar user for any given user. This approach alleviates the scalability problem, as the user-user computations are performed on the matrices of reduced size. Of course, the size of the matrices can depend on the number of clusters chosen and the actual geographical distribution of the user base.

The former approach first derives the k attributes and then applies geographical distance as a weighted factor. On the other hand, the latter approach first applies the spatial proximity by clustering all the users that are geographically close to each other, and then derives the k attributes for each group separately.

As in any problem in data science, the building of a good model for recommender system with spatial similarity will depend on a lot of experimentation, rigorous off-line testing, tuning the model, conducting suitable A/B testing, and then repeating for continuous improvement. The above approaches are some of possible methodologies only to build the model; the final model will depend on the actual data and also the experimentation that involves the tuning of various parameters including the number of dimensions k in both approaches, and also number of clusters C in the second approach. One has to define the metrics such as precision, recall, RMSE, to measure the performance of the model and have the patience to keep experimenting.

Leave a comment