Not to be confused with Real-Time Analytics. It is different.

As a product manager in the domain of predictive analytics, I own the responsibility to build predictive analytics capabilities for consumer facing and/or enterprise platforms; the business applications vary among item recommendations for consumers, prediction of event outcomes based on classification models, demand forecasting for supply optimization, and so on. We usually see the applications where the predictive model built using machine learning technique(s) is leveraged to score the new set of data, and that new set of data is most often fed to the model on-demand as a batch.

However, the more exciting aspect of my recent work has been in the realm of real-time predictive analytics, where each single observation (raw data point) has to be used to compute the predicted outcome; note that this is a continuous process, as the stream of new observations arrives continuously and the business decisions based on the predicted outcomes have to be made in real-time. A classic use case for such a scenario is credit card fraud detection: when a credit card swipe occurs, all the data relevant to the nature of the transaction is fed to a pre-built predictive model in order to classify whether the transaction is fraudulent, and if so deny it; all this has to happen in a split second, at scale (millions of transactions each second), in real-time. Another exciting use case is preventive maintenance in the Internet of Things (IoT), where continuous streaming data from thousands or millions of smart devices has to be leveraged to predict any possible failure in advance to prevent or reduce downtime.

Let me address some of the common questions that I often receive in the context of real-time predictive analytics.

What exactly is real-time predictive analytics – does that mean we can build the predictive model in real-time? A data scientist requires an aggregated mass of data which forms the historical basis over which the predictive model can be built. The model building exercise is a deep subject by itself and we can have a separate discussion about that; however, the main point to note is that model building for better predictive performance involves rigorous experimentation, requires sufficient historical data, and is a time consuming process. So, a predictive model cannot be built in “real-time” in its true sense.

Can the predictive model be updated in real-time? Again, model building is an iterative process with rigorous experimentation. So, if the premise is to update the model on each new observation arriving in real-time, it is not practical to do so from multiple perspectives. One, retraining the model involves feeding it the base data set including the new observation data point (choosing either to drop older data points in order to keep the data set size the same, or not to drop them and keep growing the data set size), and so requires rebuilding the model. There is no practical way of “incrementally updating the model” with each new observation, unless the model is a simple rule-based one. For example: predict “fail” if the observation falls outside two standard deviations from the sample mean. In such a simple model, it is possible to recompute and update the mean and standard deviation of the sample data by including the new observation, even while the outcome for the current observation is being predicted. But for our discussion on predictive analytics here, we are considering more complex machine learning or statistical techniques.

Second, even if technologies make it possible to feed a large volume of data including the new observation each time to rebuild the model in a split second, there is no tangible benefit in doing so. The model does not change much with just one more data point. Drawing an analogy, if one wants to measure how much weight has been lost from an intensive running program, it is common sense that the needle does not move much if measured after every mile run. One has to accumulate a considerable number of miles before experiencing any tangible change in weight! The same is true in data science. Rebuild the model only after aggregating a considerable volume of data, so that there is a tangible difference in the model.
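The simple rule-based exception mentioned above (predict “fail” outside two standard deviations of the sample mean) can be sketched as follows. This is only an illustration under my own assumptions: the class and function names are hypothetical, the incremental update uses Welford's method (my choice, not something prescribed above), and the rule to return “pass” until two observations exist is an arbitrary placeholder.

```python
import math


class RunningStats:
    """Incrementally tracks the mean and sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Sample standard deviation; reported as 0.0 with fewer than two points.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def predict_and_update(stats, x, n_sigmas=2.0):
    """Score x against the current statistics, then fold x into them."""
    if stats.n < 2:
        outcome = "pass"  # assumption: too little history to flag anything
    else:
        outcome = "fail" if abs(x - stats.mean) > n_sigmas * stats.std else "pass"
    stats.update(x)
    return outcome


if __name__ == "__main__":
    stats = RunningStats()
    for reading in [10, 12, 11, 9, 10, 11, 30]:
        print(reading, predict_and_update(stats, reading))
```

Note that this only works because the "model" is two summary statistics; nothing comparable is implied here for more complex models.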

(Even the recent developments, such as Cloudera Oryx, that are making efforts to move forward from Apache Mahout and similar tools (limited to batch processing for both model building and prediction) are focused on real-time prediction and yet, rightly so, on batch-based model building. For example, Oryx has a computational layer and a serving layer, where the former performs model building/updates periodically on aggregated data at a batch level in the back-end, and the latter serves queries to the model in real-time via an HTTP REST API.)

Then, what is real-time predictive analytics? It is when a predictive model (built/fitted on a set of aggregated data) is deployed to perform run-time prediction on a continuous stream of event data to enable decision making in real-time. In order to achieve this, there are two aspects involved. One, the predictive model built by a Data Scientist via a stand-alone tool (R, SAS, SPSS, etc.) has to be exported in a consumable format (PMML is a preferred method across machine learning environments these days; we have done this and also via other formats). Second, a streaming operational analytics platform has to consume the model (PMML or other format) and translate it into the necessary predictive function (via open-source jPMML or Cascading Pattern or Zementis’ commercial licensed UPPI or other interfaces), and also feed the processed streaming event data (via a stream processing component in CEP or similar) to compute the predicted outcome.

This deployment of a complex predictive model, from its parent machine learning environment to an operational analytics environment, is one possible route in order to successfully achieve a continuous run-time prediction on streaming event data in real-time.
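The split between offline model building and run-time scoring can be illustrated with a minimal sketch. Everything specific here is assumed for illustration: the dictionary is a hypothetical stand-in for a model exported from the data scientist's tool (a real deployment would parse PMML or another format through a scoring engine), the feature names and coefficients are invented, and a logistic model is used only because it is easy to show.

```python
import math

# Hypothetical stand-in for an exported, already-fitted model.
# In practice this would be read from PMML or another interchange format.
EXPORTED_MODEL = {
    "intercept": -4.0,
    "coefficients": {"amount_usd": 0.002, "is_foreign": 1.5, "night_time": 0.8},
    "threshold": 0.5,
}


def make_scorer(model):
    """Translate the exported parameters into a plain scoring function."""
    intercept = model["intercept"]
    coefs = model["coefficients"]
    threshold = model["threshold"]

    def score(event):
        z = intercept + sum(w * event.get(name, 0.0) for name, w in coefs.items())
        probability = 1.0 / (1.0 + math.exp(-z))
        return probability, ("deny" if probability >= threshold else "approve")

    return score


def score_stream(events, score):
    """Score events one at a time as they arrive; the model is never refit here."""
    for event in events:
        probability, decision = score(event)
        yield event["id"], round(probability, 3), decision


if __name__ == "__main__":
    stream = [
        {"id": "t1", "amount_usd": 50.0, "is_foreign": 0, "night_time": 0},
        {"id": "t2", "amount_usd": 2000.0, "is_foreign": 1, "night_time": 1},
    ]
    for result in score_stream(stream, make_scorer(EXPORTED_MODEL)):
        print(result)
```

The point of the design is that the scoring path holds only fixed parameters, so it can run per event, while retraining happens elsewhere on aggregated data and simply replaces the exported model.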

Six out of the eight games in the elimination round of 16 in the on-going FIFA Soccer World Cup were decided by a margin of one goal or less. Of the six, two were decided with penalty goals after the extra time plus the 30-minute over time, and three were decided after the extra time (these include the heartbreaking loss in the USA-Belgium game!). In the quarter-finals, all the four games were decided either by a margin of one goal or by penalty kicks after the over time. That means, all these games in the elimination rounds were fought hard until the last second and certainly were nail-biting finishes.

As the teams advance to the higher stages, the quality of the teams increases and so does the quality of the contests. As demonstrated in the elimination rounds, more than in the group stage, the teams have to bring much more than skill and expertise to win the games. They need the stomach and stamina to fight until the last, the constant focus not to let their guard down even for a second, the perseverance to keep attacking, and the ability to hold their nerve until the final whistle is blown.

This is akin to what we experience in our careers as we advance to the higher levels. What certainly helps us outperform our peers and grow in the initial stages of the career is our IQ (here, for argument's sake, I am including one’s breadth and depth of knowledge, subject matter expertise, and intellectual capabilities, all under IQ). However, once we are at the higher stages, where similarly intellectually capable individuals have also arrived, what helps us hold an edge over others is emotional intelligence (EI, or popularly referred to as EQ).

As proposed by Daniel Goleman in his book “Leadership: The Power of Emotional Intelligence”, and as widely researched for its influence on leadership abilities, EQ is what matters more in the higher stages of the career. There is no substitute for the ability to hold one's nerve under pressure, stand steady against the odds, persevere in spite of failures, and stay focused until completion for a greater success.

Location can often play a significant role in a consumer’s lifestyle and her purchasing decisions; so, it forms a vital input to how well we can recommend products or services to her. Perhaps not so much when it comes to recommending movies or mass consumer products. But for companies that sell home furnishings, custom-design goods, design merchandise for homes, or goods for the outdoors and related activities, the location of the consumer matters. For instance, homes in New York, the San Francisco Bay Area, Seattle, and Texas differ in their style and space; these factors, and also the weather, certainly influence the moods and tastes for furnishing the homes. Even within a country like India, with its diverse cultures and lifestyles, the everyday dresses and jewelry that women wear differ vastly from state to state and location to location. At the eCommerce company that I have been part of, our analyses showed that geographical location determined not only the number of users, the number of orders, and the amount of revenue, but also the type of merchandise bought (color, style, size, etc.). In this context, collaborative filtering based recommendation of products or services that are also considered (viewed/liked/bought) by other users in a nearby location has an upside.

So, how can we build recommender systems like “people similar to you and in your vicinity also viewed these” or “people similar to you and in your vicinity also liked these”, that take spatial similarity (location proximity) between two users into account?

The location of a user can be obtained from GeoIP technologies, where the user’s IP address is used to obtain the location with a precision that can depend on the SLA with a service provider. That is, the user’s precise latitude-longitude (lat,long) coordinates can be used to build greater accuracy into the recommender system, or one can choose to use the lat-long of the center of the user’s city/zipcode, which is an approximation of the user’s actual location.

To keep the discussion on point, we assume that you already have the user-item preference matrix built for existing recommender systems (for starters, the user-item matrix can be written with all observation data with each observation having a tuple <user_id, item_id, preference_score>; the data file format can depend on what tool you use to build recommender systems: for example, Apache Mahout consumes a csv file). The matrix is of size n x m, with n number of users and m number of items. The preference score can be built either from user’s implicit actions (view, like, add-to-cart, etc.) or explicit feedback (ratings, reviews, etc.), where most often the former matrix is less sparse than the latter.
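As a rough sketch of how such tuples could be derived from implicit actions, the snippet below aggregates an action log into <user_id, item_id, preference_score> rows and writes them as CSV. The action weights and function names are assumptions made for illustration only; suitable weights would have to come from experimentation.

```python
import csv
import io
from collections import defaultdict

# Assumed weights for implicit actions (hypothetical values).
ACTION_WEIGHTS = {"view": 1.0, "like": 2.0, "add_to_cart": 3.0}


def preference_tuples(action_log):
    """Aggregate (user_id, item_id, action) rows into (user_id, item_id, score) tuples."""
    scores = defaultdict(float)
    for user_id, item_id, action in action_log:
        scores[(user_id, item_id)] += ACTION_WEIGHTS.get(action, 0.0)
    return [(u, i, s) for (u, i), s in sorted(scores.items())]


def to_csv(tuples):
    """Serialize the tuples as CSV text, one observation per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(tuples)
    return buffer.getvalue()


if __name__ == "__main__":
    log = [
        ("u1", "i1", "view"),
        ("u1", "i1", "like"),
        ("u2", "i3", "add_to_cart"),
    ]
    print(to_csv(preference_tuples(log)))
```

Only observed user-item pairs are stored, which is the usual way to hold a sparse n x m matrix.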

Given the user-item matrix and each user’s location (lat,long), here we discuss two possible approaches to build the recommendation model with spatial similarity. The approaches are discussed here using a matrix factorization method; the Alternating Least Squares (ALS) method is a preferred choice for matrix factorization among the latent factor model based collaborative filtering techniques, especially for the less sparse implicit data matrix.

1. Using the ALS matrix factorization method, the n x m user-item matrix U can first be factorized into two factor matrices, an n x k user matrix P and an m x k item matrix Q (the user and item vectors stacked row-wise), where k is the number of dimensions, such that U is approximately P * Transpose(Q). The user matrix P has n rows, with each row defining a user with k attributes/dimensions. In the conventional method, for a given user the most similar user among the remaining n-1 users is the one with the nearest values of the k attributes (for simplicity, this can be the Euclidean distance between the two sets of k values). Now, we can weight the k-value distance by the geographical distance between the two users (again, that can be the Euclidean distance using the lat-long coordinates of the two users). The larger the geographical distance between the two users, the more the k-value distance is inflated, pushing them farther apart in similarity. A smaller geographical distance between the two users will have the opposite effect. Thus, the weighted value incorporates the spatial similarity between two users.

The above method can be challenging for one main reason: scalability, the same reason that user-user collaborative filtering models are less preferred than item-item collaborative filtering techniques (in addition to the fact that user properties are less static than item properties). That is, for most organizations the number of users will be on the order of millions while the number of items will be on the order of tens or hundreds of thousands, and so user-user computations are less scalable than item-item computations.

2. To overcome the above computational scalability problem, another approach is to create clusters of users based on the lat-long data. Once the user base is divided into C clusters (using k-means or other preferred techniques), one can create a user-item matrix separately for each cluster. Then for each matrix, perform the matrix factorization to derive the user vector and the item vector. The user vector can now be used to find a most similar user for any given user. This approach alleviates the scalability problem, as the user-user computations are performed on the matrices of reduced size. Of course, the size of the matrices can depend on the number of clusters chosen and the actual geographical distribution of the user base.

The former approach first derives the k attributes and then applies geographical distance as a weighted factor. On the other hand, the latter approach first applies the spatial proximity by clustering all the users that are geographically close to each other, and then derives the k attributes for each group separately.
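Both approaches can be sketched in a few lines, assuming the k-dimensional user factors have already been produced by a matrix factorization step that is not shown. The names, the weighting formula latent_distance * (1 + alpha * km), and the alpha value are my own assumptions; I also use the haversine great-circle distance rather than the plain Euclidean distance on lat-long mentioned above, and the second part assumes cluster centroids are already available (for example from k-means) instead of computing them.

```python
import math
from collections import defaultdict


def haversine_km(loc_a, loc_b):
    """Great-circle distance in km between two (lat, long) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*loc_a, *loc_b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Approach 1: inflate the latent (k-value) distance by geographical distance.
def spatial_distance(p_a, p_b, loc_a, loc_b, alpha=0.01):
    latent = math.dist(p_a, p_b)  # Euclidean distance over the k attributes
    return latent * (1.0 + alpha * haversine_km(loc_a, loc_b))


def most_similar_user(user, factors, locations, alpha=0.01):
    """Return the other user with the smallest geo-weighted latent distance."""
    others = (u for u in factors if u != user)
    return min(others, key=lambda u: spatial_distance(
        factors[user], factors[u], locations[user], locations[u], alpha))


# Approach 2: group users by nearest centroid, then split the preference tuples
# so that each cluster gets its own, smaller user-item matrix to factorize.
def assign_clusters(locations, centroids):
    return {user: min(centroids, key=lambda c: haversine_km(loc, centroids[c]))
            for user, loc in locations.items()}


def split_by_cluster(tuples, assignment):
    per_cluster = defaultdict(list)
    for user_id, item_id, score in tuples:
        per_cluster[assignment[user_id]].append((user_id, item_id, score))
    return dict(per_cluster)


if __name__ == "__main__":
    factors = {"a": [1.0, 0.0], "b": [1.05, 0.0], "c": [1.1, 0.0]}
    locations = {"a": (40.71, -74.01), "b": (37.77, -122.42), "c": (40.74, -74.17)}
    print(most_similar_user("a", factors, locations, alpha=0.0))   # latent only
    print(most_similar_user("a", factors, locations, alpha=0.01))  # geo-weighted
    centroids = {"east": (40.7, -74.0), "west": (37.8, -122.4)}
    print(assign_clusters(locations, centroids))
```

The first approach still scans all other users for each query, which is where the scalability concern comes from; the second trades that for per-cluster models whose quality depends on how the clusters are drawn.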

As with any problem in data science, building a good model for a recommender system with spatial similarity will depend on a lot of experimentation, rigorous off-line testing, tuning the model, conducting suitable A/B testing, and then repeating for continuous improvement. The above approaches are only some of the possible methodologies to build the model; the final model will depend on the actual data and also on experimentation that involves tuning various parameters, including the number of dimensions k in both approaches and the number of clusters C in the second approach. One has to define metrics such as precision, recall, and RMSE to measure the performance of the model, and have the patience to keep experimenting.
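For reference, the three metrics named above can be computed as in the sketch below. The function names are hypothetical, and precision and recall are shown in their common top-k form for recommendation lists, which is one of several reasonable definitions.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual preference scores."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended items against a relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    print(rmse([3.0, 4.0], [3.0, 2.0]))
    print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=3))
```

In off-line testing these would be computed on held-out observations for each candidate setting of k and C.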

In the past year we have mulled various options and avenues to provide quality services to individuals and professionals, be it in continued education and training, or in solving business problems. One of the first outcomes was the launching of Kaugment Solutions that is focused on providing training to working professionals via Kaugment Project Labs (project management, product management, business analysis, etc.) and Kaugment Data Labs (big data management, data analytics, and data visualization & reporting).

As 2014 began, finally we came around to launch Kmine Analytics by bringing together a team of business strategists, data scientists, data architects, and solution architects, with an end-to-end expertise in Data Management (Big Data, Hadoop, ETL), Data Analytics (Statistical Analysis, Data Mining, Machine Learning, Predictive Modeling) and Visualization (Tableau, MicroStrategy, etc.).

Our team’s deep expertise in data science and direct experience across industries allow us to quickly understand the business problem and data needs, and propose a project plan with estimated schedules and costs. Based on client needs, we can start with a pilot project, demonstrate the business value of the solution, and then chart the course to develop a full-blown analytics solution. We even have the necessary expertise to work with the client’s software developers and IT teams to develop a fully consumable solution or integrate the analytics with their existing products/dashboards.

Our clients can enjoy the advantages we bring with such wide expertise. Even if they are not sure how their business problem(s) relate to the data aggregated, our team has business strategists and data experts with the expertise to navigate in the right direction, accurately define the business problem, and marry it to the right data for precise insights. Besides, even if they can provide us only the raw data, our in-house data architects and ETL experts have years of experience working with clients to acquire, set up, transform, integrate, and load the data for analytics. And even if they are not sure how to integrate analytics solutions with their existing products, services, or dashboards, our team has the necessary data/solution integration expertise, with years of product development/management experience, to work with the client’s software developers and IT teams to achieve the right outcomes.

Last week we launched Kaugment Solutions, with a combination of in-class training and an on-line portal for adaptive learning, to help working professionals achieve growth in their careers.

We at Kaugment have designed the training program with understanding of principles of continued education and leveraging lessons from our own experiences. For example, we all have experienced in our careers that simply obtaining a degree or a certification in a subject will not automatically guarantee professional success. Sustainable success comes only if we truly master the subject to be able to confidently speak the right language/terminology as a professional expert (in the job or next interview) and also apply the knowledge in our day-to-day tasks.

So, at Kaugment we have designed the Project Management Professional (PMP) certification training program for our candidates to absorb the content comprehensively to achieve all the three objectives (pass the exam, confidently speak the language, and apply the knowledge in real world).

Also, we know that working professionals prefer experiential learning and learning through real world examples. We also fully understand that whatever they choose to learn, they often select based on an immediate purpose or application for that knowledge or certification, and for this reason they are also mostly self-directed. Keeping these attributes of continued education of working professionals in mind, we have designed an adaptive training program for project management. The in-class lectures by highly qualified instructors are augmented with the 24/7 on-line portal containing recorded video lectures and practice questions for our students to listen, learn, practice, and evaluate at their own pace and convenience. The instruction in-class and the content on-line is embedded with several real world cases of project management in varied industries. The on-line portal is also equipped with capabilities for students to interact/collaborate with instructors and fellow students anytime and from any place.

And, we at Kaugment are continuously striving to find ways to make high quality continued education affordable to all working professionals.


As the business case is constructed to develop a data analytics platform product, as the product sponsor and other primary stakeholders give a go ahead based on estimated market potential, ROI, and other economic factors, the wheels are set in motion for the actual product development and its commercialization. From my experience, here are the steps that effective product managers follow through the development life cycle:

1. Gather Use Case Scenarios: While this is a continuous as well as an iterative process through the life cycle, the product manager (PM) has to have an anchor to initiate the product development. The PM identifies key use cases of the product (possible to have some already identified in business case) that will define its essential features and functionality. A good practice is to have at least 3-5 such business applications for the product to start with the development. A rolling wave planning can be adopted as the product development progresses, and more clarity on the end product applications is obtained. Considering an example product for discussion here, if the overall intent of the data analytics platform being conceived is for personalization of content and/or contextualized search, the few use cases can be in the space of “conversion of customer intent to transaction”, “personalized product/service recommendations”, “delivering competency-based adaptive content”, etc., depending on the client’s industry/vertical application.

2. Identify Key Features/Functions: This is the stage in which the product scope and its roadmap are defined, with its essential features and functions identified from the primary use case applications and other stakeholder requirements. The PM works closely with the architect and the stakeholders, and if necessary the development team, through this process for an effective depiction of the product scope at the business and engineering architecture levels. For the example analytics platform discussed above, the capabilities can include product/service recommendations to a customer based on her location or shopping/purchase history, recommendations based on interests/hobbies/habits, content access based on specific skills, etc.

3. Gather Data Requirements, Specifications, and Formats: The applications and functionality of the product lead the PM and team to identify the required data (e.g., customer transactions, inventory, price logs, resource utilization data, customer traffic, etc.), specifications (volume/size, variety, velocity/streaming, structured/unstructured, etc.), and formats (numerical, text, voice, video, image, etc.). Through this process, the PM also has to identify internal and external sources for the required data, the barriers and costs to acquire the data, the potential challenges in integrating the source APIs with the product platform input APIs, and all other related issues.

4. Develop Data Warehousing Methods: The PM should have a good understanding of the capabilities available to warehouse, ingest, and manage the data, as well as the extended capabilities to be built for the purpose. The warehousing of the data includes the extraction, transformation, and loading of the data for subsequent analyses. The ingestion of business/structured data is usually done with the traditional enterprise data warehouse (EDW), while unstructured data, such as customer activity (on the web-store, social media, etc.) written into log files, is warehoused in Hadoop. A good knowledge of the star schema for the traditional EDW helps the PM better handle issues during the architecture, design, and development phases. The best practice starts with domain modeling, contextual modeling, and data modeling at the logical and physical layers. Success of the platform also depends on choosing the right EDW tool for the right job: the PM and team usually have the choice among IBM’s DB2 or IBM’s Netezza appliance (or both combined), an MPP system such as Teradata, or a columnar database such as Vertica.

Besides, a good grasp of the requirements for building a suitable Hadoop cluster (capacity, number of nodes, block replication, number of users, etc.) will help the PM work with the architect and the product development teams to build appropriate data warehouse systems. The PM’s knowledge of the capabilities required to facilitate the analysis of data across the traditional EDW and Hadoop will further augment the team’s expertise (for example, a cross analysis of customer segment data in the EDW and customer activity in the Hadoop cluster, to derive a behavior pattern for a particular customer category, requires such a capability and is a common requirement nowadays).

5. Develop Data Mining Techniques: As the data warehousing systems and methods are established, the PM and team of data scientists can identify and develop necessary data mining techniques. For the example analytics platform discussed above, some of the data mining techniques include Frequency analysis, Collaborative filtering, Causal analysis, Matrix factorization, Association rule mining, Time series analysis, K-Means or Hierarchical clustering, Regression analysis, Bayesian networks, etc.

6. Develop Reporting Layer: As the platform with analytics engines is built, the insights from analytics have to be displayed/reported and the interface for display depends on the users. In case the users are data scientists internal to the organization, usually the delivery ends at providing the platform, data, reporting & visualization tools (Tableau, Cognos, Datameer, etc.), and plugins/connectors to statistical analysis tools such as R and S, that are highly popular with data scientists. Data curation is also vital to deliver the right data and insights in the right format/structure at the right time to the right stakeholders, so that timely business action is taken. On the other hand, if the end users are individual consumers, then appropriate efforts are to be made to build a sleek interface to deliver the user specific content.

The product development life cycle does not end here. Once the product is launched, information on its adoption, feature usage, and user experiences is gathered and fed back for further evolving the product. And so, the Plan-Do-Check-Act cycle continues.


Thanks for reading. Comments to improve the content and enrich the discussion are most welcome.

Each of us, as an individual, has values, beliefs, ambitions, aspirations, dreams, ideas, thoughts, opinions, plans, emotions, behaviors, habits, desires, and needs, and then we have actions associated with all these attributes. The proliferation of various channels (physical, mobile, social, etc.) has enabled several communication mediums that leave a digital footprint of these actions, leading to a deluge of data, which in turn has made it possible for someone to characterize the individual and then possibly monetize that characterization.

The increasing commercialization of capabilities to manage and analyze the big data has brought in sight the holy grail of marketing – the capability to target and serve each individual customer at a highly personalized level, called by experts as “extreme personalization” or “micro-segmentation” or “a segment of one” or as I call it the “nano-segmentation” due to its sheer scale at a few billion individuals.

As an Innovation Strategist working closely with world-class technologists, I see enormous progress being made by researchers and data scientists to build analytics that map the attributes of each individual to qualitative and quantitative models of personality for businesses to leverage. For now, though, I see that these mappings correlate with actual personalities at a lower rate of accuracy, possibly because both the mapping algorithms and the sociographic models are still evolving; even so, the research is gathering tremendous pace to address the shortcomings.

As businesses increasingly search various B2C ways to leverage these social media analytics in order to serve their customers better and deliver an enhanced experience, I see a significant B2E opportunity via these capabilities for enterprises to achieve growth and profitability – an opportunity to understand each employee at a highly personalized level.

As with their customers, no two employees of an enterprise are the same, and each has their own attributes, not to mention the variations in level or degree within the same attribute. Enterprises tend to manage a batch of employees with similar responsibilities in a set pattern and give them the same kind of training as a group. When businesses can clearly see the benefits of serving their customers at a highly personalized level, to transcend the transactional relationship and build a long-lasting loyalty-based relationship, it is time the same businesses deploy these capabilities to serve their employees through effective mentoring, training, and management, with personalized data and insights in hand.

When done successfully, it would invariably lead to greater productivity and growth for each employee and for the organization as a whole.

