I attended the Kellogg Alumni event on Tuesday evening at the Facebook campus here in the Bay Area. While I got the opportunity to meet some old friends and make some new acquaintances, I was thrilled to watch an expert panel discuss an exciting subject entitled “Big Data does not Make Decisions – Leaders Do”; thrilled simply because the entire discussion was about what I live every day in my work, and what I strive to achieve with my team and my organization.

The panel was composed of highly accomplished Kellogg alumni (Facebook’s CMO, Marketo’s CMO, and a Senior Managing Partner at a leading venture firm) plus two world-renowned Kellogg faculty. They all agreed, and in fact strongly promoted, that managers as well as executives (CIOs, CTOs, and even CEOs) have to have some “working knowledge of data science” to be able to ask the right questions and steer the data science and data analytics teams in the right direction so as to solve the most relevant business problems. The right culture and mindset have to be developed: Big Data (however you define it) by itself, or the software and/or analytics tools by themselves, are not the end solutions; it takes leaders who recognize the importance of domain knowledge combined with the ability to connect the market problems for a given vertical with the right set of data.

Another key agreement, closest to what I live and breathe every day, is that “Product Management” is a critical role in bridging the data science world with the market/business requirements; a good product manager is essential to connect the business problems with the data science findings. It was observed that, even while there is a shortage of good data scientists, there is an even more acute shortage of good product managers who understand both the business domain and the data science well enough to make that connection successfully.

As I often say, “Data Science is about Discovery, Product Management is about Innovation”; the primary focus of a product manager should be market-centric innovation. Data scientists (and engineers) are often driven by precision (and perfection), while good product managers know the level of accuracy that is sufficient to take the product to market at the right time. Here are the three things I recommend you do to be a successful product manager in the data analytics space:

  1. Take time to learn the science: if you are looking for some quick learning, many resources are out there (webinars, MOOC classes, etc.) to teach you the subject; even paid training classes (2-day or 3-day) such as those from Cloudera are worth the time and money. If you can, and are looking for longer-term learning, many schools have undergraduate and graduate level programs in data science and analytics. A product manager with a strong grasp of the algorithms, models, techniques, and tools in data science will not only enjoy the support of the data science teams he/she works with, but will also be able to steer the product roadmap by applying the right models/techniques, combined with the right data set, to the right business problems and market requirements.
  2. Learn by spreading the knowledge: it is a given that the product manager has to play a highly proactive role in building the product with the help of data scientists, data engineers, data architects, and other stakeholders on all sides, and also in taking it to the market. But go beyond that expectation and play a highly interactive role in educating the stakeholders and the managers about the data science behind the analytics. Invite managers, executives, engineers, and others for a 30-minute seminar once in a while on a new data science topic; that in itself will be an enormous learning experience for you. After you present the data science model and technique, facilitate a discussion among the audience on the model’s possible application to various business use cases. As part of the lunch seminar series I have been doing at my company, I recently presented on the topic of “Hidden Markov Models”, about which I did not have much prior knowledge. As I took up the task, I did some reading, found the necessary R packages and relevant data sets to build the model, and did the predictions for a real-world use case. Though the preparation took many hours over two weeks, when I presented I did not have all the answers to the questions from the intelligent people in the room (our CTO, solution architects, engineers, and other product managers) about nuances of the algorithm and the model. But that’s fine, because the key outcome is that I learned something new, and the dialogue enriched my understanding of how this model can be leveraged for a particular use case!
  3. Engage the cross-functional teams to stay ahead in innovation: the product marketing (for B2B) or consumer marketing (for B2C) and sales people may not attend your seminars, whether for lack of interest or background or because they are often in the field; but they are also key stakeholders in your ecosystem. Marketing and sales people are the eyes and ears of the company, providing the ground-level intelligence that is critical for formulating the company’s strategies and tactics; engage them to tap into their field knowledge about what customers are really looking for. Often customers and clients understand only the tip of the data science, and what they really need may be different from what they ask for; so have the empathy and patience to help the stakeholders understand the difference between descriptive analytics (BI tools and the like) and more advanced analytics (prescriptive and predictive analytics with data science). Make efforts to collaborate with the marketing teams to generate content, neither too technical nor too superficial, that appeals to and can be understood by a wider audience. All these efforts will help you garner deeper market knowledge to stay ahead, help you avoid drowning in the necessary day-to-day agile/scrum activities for near-term product development, and actually help build your mid-term and long-term product roadmap.

What are your thoughts and recommendations as a successful product manager in this space? Share with us….

Real-time predictive analytics is not to be confused with real-time analytics; the two are different.

As a product manager in the domain of predictive analytics, I own the responsibility to build predictive analytics capabilities for consumer-facing and/or enterprise platforms; the business applications range from item recommendations for consumers, to prediction of event outcomes based on classification models, to demand forecasting for supply optimization, and so on. We usually see applications where the predictive model built using machine learning technique(s) is leveraged to score a new set of data, and that new set of data is most often fed to the model on demand as a batch.

However, the more exciting aspect of my recent work has been in the realm of real-time predictive analytics, where each single observation (raw data point) has to be used to compute the predicted outcome; note that this is a continuous process, as the stream of new observations arrives continuously and the business decisions based on the predicted outcomes have to be made in real-time. A classic use case for such a scenario is credit card fraud detection: when a credit card swipe occurs, all the data relevant to the nature of the transaction is fed to a pre-built predictive model in order to classify whether the transaction is fraudulent, and if so deny it; all this has to happen in a split second, at scale (millions of transactions each second), in real-time. Another exciting use case is preventive maintenance in the Internet of Things (IoT), where continuous streaming data from thousands/millions of smart devices has to be leveraged to predict any possible failure in advance, to prevent/reduce downtime.

Let me address some of the common questions that I often receive in the context of real-time predictive analytics.

What exactly is real-time predictive analytics – does that mean we can build the predictive model in real-time? A data scientist requires an aggregated mass of data, which forms the historical basis over which the predictive model can be built. The model building exercise is a deep subject by itself, and we can have a separate discussion about that; the main point to note is that model building for better predictive performance involves rigorous experimentation, requires sufficient historical data, and is a time-consuming process. So, a predictive model cannot be built in “real-time” in its true sense.

Can the predictive model be updated in real-time? Again, model building is an iterative process with rigorous experimentation. So, if the premise is to update the model on each new observation arriving in real-time, it is not practical to do so, from multiple perspectives. One, retraining the model involves feeding the base data set including the new observation data point (choosing either to drop older data points in order to keep the data set size the same, or to keep growing the data set) and so requires rebuilding the model. There is no practical way of “incrementally updating the model” with each new observation, unless the model is simple and rule-based; for example: predict “fail” if the observation falls outside two standard deviations from the sample mean. In such a simple model, it is possible to recompute and update the mean and standard deviation of the sample data to include the new observation, even while the outcome for the current observation is being predicted. But for our discussion on predictive analytics here, we are considering more complex machine learning or statistical techniques.
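As a concrete illustration of such a simple rule-based model, here is a minimal Python sketch (the class and names are hypothetical, not from any particular product) that maintains a running mean and standard deviation incrementally via Welford's online algorithm, classifying each new observation before folding it into the statistics:

```python
import math

class RunningThreshold:
    """Incrementally maintained two-standard-deviation rule: flag an
    observation as "fail" if it falls outside two standard deviations
    of the running mean. Welford's online algorithm lets us update the
    statistics per observation without storing the raw history."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def score(self, x):
        """Classify x against the current statistics, then fold it in."""
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            outcome = "fail" if abs(x - self.mean) > 2 * std else "ok"
        else:
            outcome = "ok"  # not enough history yet to judge
        # Welford update: recompute mean and variance incrementally
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return outcome
```

Note that complex models (ensembles, matrix factorizations, etc.) admit no such cheap per-observation update, which is exactly the point above.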

Second, even if technologies made it possible to feed a large volume of data, including the new observation, each time to rebuild the model in a split second, there is no tangible benefit in doing so. The model does not change much with just one more data point. Drawing an analogy, if one wants to measure how much weight has been lost through an intensive running program, it is common sense that the needle does not move much if measured after every mile run. One has to accumulate a considerable number of miles before experiencing any tangible change in weight! The same is true in data science: rebuild the model only after aggregating a considerable volume of new data, to see a tangible difference in the model.

(Even the recent developments, such as Cloudera Oryx, that are making efforts to move beyond Apache Mahout and similar tools (limited to batch processing for both model building and prediction), focus on real-time prediction and yet, rightly so, on batch-based model building. For example, Oryx has a computation layer and a serving layer, where the former performs model building/updating periodically on aggregated data at a batch level in the back-end, and the latter serves queries to the model in real-time via an HTTP REST API.)

Then, what is real-time predictive analytics? It is when a predictive model (built/fitted on a set of aggregated data) is deployed to perform run-time prediction on a continuous stream of event data, to enable decision making in real-time. There are two aspects involved in achieving this. One, the predictive model built by a data scientist in a stand-alone tool (R, SAS, SPSS, etc.) has to be exported in a consumable format (PMML is a preferred method across machine learning environments these days; we have done this via PMML and also via other formats). Two, a streaming operational analytics platform has to consume the model (PMML or another format) and translate it into the necessary predictive function (via the open-source jPMML or Cascading Pattern, or Zementis’ commercially licensed UPPI, or other interfaces), and also feed the processed streaming event data (via a stream processing component in CEP or similar) to compute the predicted outcome.

This deployment of a complex predictive model, from its parent machine learning environment to an operational analytics environment, is one possible route in order to successfully achieve a continuous run-time prediction on streaming event data in real-time.
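To make the serving side concrete, here is a minimal Python sketch of that run-time scoring pattern. The model artifact, feature names, and weights are hypothetical stand-ins for what would in practice be parsed from an exported PMML file; the point is only that the model is built once in the training environment, loaded once, and then applied to each event as it streams in:

```python
import math

# Hypothetical stand-in for a model exported from the training
# environment (in practice this would be parsed from a PMML file
# by jPMML, Cascading Pattern, or a similar interface).
MODEL = {"intercept": -4.0,
         "weights": {"amount_thousands": 2.0, "foreign": 2.5, "night": 1.0}}

def score(event, model=MODEL):
    """Apply the pre-built logistic-regression model to one event."""
    z = model["intercept"] + sum(w * event.get(name, 0.0)
                                 for name, w in model["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))  # estimated probability of fraud

def serve(stream, threshold=0.5):
    """The serving loop: decide on each transaction as it arrives."""
    for event in stream:
        yield "deny" if score(event) > threshold else "approve"
```

The model-rebuild cycle discussed earlier then runs periodically in the back-end, swapping in a freshly exported artifact without interrupting this loop.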

Six out of the eight games in the round of 16 in the ongoing FIFA Soccer World Cup were decided by a margin of one goal or less. Of the six, two were decided by penalty shootouts after regulation plus the 30 minutes of extra time, and three were decided in extra time (these include the heartbreaking loss in the USA-Belgium game!). In the quarter-finals, all four games were decided either by a margin of one goal or by penalty kicks after extra time. That means all these games in the elimination rounds were fought hard until the last second and certainly were nail-biting finishes.

As the teams advance to the higher stages, the quality of the teams increases and so does the quality of the contests. As demonstrated in the elimination rounds, more than in the group stage, the teams have to bring much more than skill and expertise to win the games. They have to have the stomach and stamina to fight until the last, the constant focus to not let their guard down even for a second, the perseverance to keep attacking, and the nerve to hold on until the final whistle is blown.

This is akin to what we experience in our careers as we advance to higher levels. What certainly helps us outperform our peers and grow in the initial stages of a career is our IQ (here, for argument’s sake, I am including one’s breadth and depth of knowledge, subject matter expertise, and intellectual capabilities, all in the IQ). However, once we are at the higher stages, which similarly capable individuals have also reached, what helps us hold an edge over others is emotional intelligence (EI, popularly referred to as EQ).

As proposed by Daniel Goleman in his book “Leadership: The Power of Emotional Intelligence”, and as widely researched for its influence on leadership abilities, EQ is what matters more in the higher stages of a career. There are no substitutes for the abilities to hold one’s nerve under pressure, stand steady against the odds, persevere in spite of failures, and stay focused until completion for a greater success.

Location can often play a significant role in a consumer’s lifestyle and her purchasing decisions; so it forms a vital input to how well we can recommend products or services to her. Perhaps not so much when it comes to recommending movies or mass consumer products; but for those companies that sell home furnishings, custom-design goods, design merchandise for homes, or goods for outdoors and related activities, the location of the consumer matters. For instance, homes in New York, the San Francisco Bay Area, Seattle, and Texas differ in their style and space; these factors, and also the weather, certainly influence the moods and tastes for furnishing the homes. Even in a country like India, with its diverse cultures and lifestyles, the everyday dresses and jewelry that women wear differ vastly from state to state and location to location. At the eCommerce company that I have been part of, our analyses showed that geographical location determined not only the number of users, number of orders, and amount of revenue, but also the type of merchandise bought (color, style, size, etc.). In this context, collaborative filtering that recommends products or services also considered (viewed/liked/bought) by other users in a nearby location has an upside.

So, how can we build recommender systems like “people similar to you and in your vicinity also viewed these” or “people similar to you and in your vicinity also liked these”, that take the spatial similarity (location proximity) between two users into account?

The location of a user can be obtained from GeoIP technologies, where the user’s IP address is used to obtain the location with a precision that can depend on the SLA with a service provider. That is, the user’s precise latitude-longitude (lat, long) coordinates can be used to build greater accuracy into the recommender system, or one can choose to use the lat-long of the center of the user’s city/zipcode as an approximation of the user’s actual location.
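Once coordinates are available, the geographical distance between two users is better computed as a great-circle distance than as a plain Euclidean distance on raw lat-long values; a small Python sketch of the standard haversine formula:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, long)
    points on the Earth, using the haversine formula."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

For users within the same city, a plain Euclidean distance on the coordinates is often a good-enough approximation and cheaper to compute.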

To keep the discussion on point, we assume that you already have the user-item preference matrix built for existing recommender systems (for starters, the user-item matrix can be written out from all observation data, with each observation forming a tuple <user_id, item_id, preference_score>; the data file format depends on the tool you use to build recommender systems: for example, Apache Mahout consumes a csv file). The matrix is of size n x m, with n users and m items. The preference score can be built either from the user’s implicit actions (view, like, add-to-cart, etc.) or from explicit feedback (ratings, reviews, etc.); most often the former matrix is less sparse than the latter.
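For instance, implicit-feedback preference scores can be aggregated from raw event logs along these lines (the action weights here are purely illustrative; in practice they are tuned):

```python
import csv
from collections import defaultdict

# Illustrative weights for implicit user actions (hypothetical values)
ACTION_WEIGHTS = {"view": 1.0, "like": 3.0, "add_to_cart": 5.0}

def build_preferences(events):
    """Aggregate raw (user_id, item_id, action) events into
    <user_id, item_id, preference_score> tuples."""
    prefs = defaultdict(float)
    for user_id, item_id, action in events:
        prefs[(user_id, item_id)] += ACTION_WEIGHTS.get(action, 0.0)
    return [(u, i, s) for (u, i), s in sorted(prefs.items())]

def write_csv(prefs, path):
    """Write the tuples in the csv layout Mahout-style recommenders consume."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(prefs)
```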

Given the user-item matrix and each user’s location (lat, long), here we discuss two possible approaches to build the recommendation model with spatial similarity. The approaches are discussed here using a matrix factorization method; the Alternating Least Squares (ALS) method is a preferred choice for matrix factorization among the latent factor model based collaborative filtering techniques, especially for the less sparse implicit data matrix.

1. Using the ALS matrix factorization method, the n x m user-item matrix U can first be factorized into two matrices, an n x k user matrix P and an m x k item matrix Q, where k is the number of dimensions, such that U = P * Transpose(Q). The user matrix P has n rows, with each row defining a user with k attributes/dimensions. In the conventional method, for a given user the most similar user among the remaining n-1 users is the one with the nearest values of the k attributes (for simplicity, this can be the Euclidean distance between the two sets of k values). Now, we can weight the k-value distance by the geographical distance between the two users (again, this can be the Euclidean distance using the lat-long coordinates of the two users). The larger the geographical distance between the two users, the more the k-value distance is inflated, pushing them farther apart in similarity; a smaller geographical distance has the opposite effect. Thus, the weighted value ingests the spatial similarity between two users.
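A minimal Python sketch of this weighting idea (the blending factor alpha is hypothetical and would have to be tuned experimentally, as would the choice of multiplicative inflation):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def geo_weighted_distance(p_u, p_v, loc_u, loc_v, alpha=0.1):
    """Inflate the k-dimensional latent-factor distance between two
    users by their geographical distance, so that users who are far
    apart on the map also look less similar in factor space."""
    latent = euclidean(p_u, p_v)   # distance between two k-value rows of P
    geo = euclidean(loc_u, loc_v)  # simplified lat-long distance
    return latent * (1.0 + alpha * geo)
```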

The above method can be challenging for one main reason: the same scalability reason that user-user collaborative filtering models are less preferred compared to item-item collaborative filtering techniques (in addition to the fact that user properties are less static than item properties). That is, for most organizations the number of users will be in the order of millions while the number of items will be in the order of tens or hundreds of thousands, and so user-user computations are less scalable than item-item computations.

2. To overcome the above computational scalability problem, another approach is to create clusters of users based on the lat-long data. Once the user base is divided into C clusters (using k-means or another preferred technique), one can create a user-item matrix separately for each cluster. Then, for each matrix, perform the matrix factorization to derive the user matrix and the item matrix. The user matrix can now be used to find the most similar user for any given user. This approach alleviates the scalability problem, as the user-user computations are performed on matrices of reduced size. Of course, the size of the matrices depends on the number of clusters chosen and the actual geographical distribution of the user base.
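A minimal stdlib-only sketch of the clustering step (seeded deterministically with the first C points; in practice one would use a library implementation such as scikit-learn's KMeans):

```python
def dist2(a, b):
    """Squared Euclidean distance between two (lat, long) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def centroid(cluster):
    return (sum(p[0] for p in cluster) / len(cluster),
            sum(p[1] for p in cluster) / len(cluster))

def kmeans(points, c, iters=20):
    """Tiny k-means over (lat, long) points, seeded with the first c
    points so the result is deterministic."""
    centers = list(points[:c])
    for _ in range(iters):
        clusters = [[] for _ in range(c)]
        for p in points:
            clusters[min(range(c), key=lambda j: dist2(p, centers[j]))].append(p)
        centers = [centroid(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def assign_clusters(user_locations, c):
    """Map each user to its nearest geographic cluster; each cluster's
    users then get their own user-item matrix for factorization."""
    centers = kmeans(list(user_locations.values()), c)
    return {user: min(range(c), key=lambda j: dist2(loc, centers[j]))
            for user, loc in user_locations.items()}
```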

The former approach first derives the k attributes and then applies geographical distance as a weighted factor. On the other hand, the latter approach first applies the spatial proximity by clustering all the users that are geographically close to each other, and then derives the k attributes for each group separately.

As in any problem in data science, building a good model for a recommender system with spatial similarity will depend on a lot of experimentation, rigorous off-line testing, tuning the model, conducting suitable A/B testing, and then repeating for continuous improvement. The above approaches are only some of the possible methodologies to build the model; the final model will depend on the actual data and also on the experimentation that involves tuning various parameters, including the number of dimensions k in both approaches and the number of clusters C in the second approach. One has to define metrics such as precision, recall, and RMSE to measure the performance of the model, and have the patience to keep experimenting.

In the past year we have mulled various options and avenues to provide quality services to individuals and professionals, be it in continued education and training, or in solving business problems. One of the first outcomes was the launch of Kaugment Solutions, focused on providing training to working professionals via Kaugment Project Labs (project management, product management, business analysis, etc.) and Kaugment Data Labs (big data management, data analytics, and data visualization & reporting).

As 2014 began, we finally came around to launching Kmine Analytics, bringing together a team of business strategists, data scientists, data architects, and solution architects with end-to-end expertise in Data Management (Big Data, Hadoop, ETL), Data Analytics (Statistical Analysis, Data Mining, Machine Learning, Predictive Modeling), and Visualization (Tableau, MicroStrategy, etc.).

Our team’s deep expertise in data science and direct experience across industries allow us to quickly understand the business problem and data needs, and propose a project plan with estimated schedules and costs. Based on client needs, we can start with a pilot project, demonstrate the business value of the solution, and then chart the course to develop a full-blown analytics solution. We even have the necessary expertise to work with the client’s software developers and IT teams to develop a fully consumable solution or integrate the analytics with their existing products/dashboards.

Our clients enjoy the advantages we bring with such wide expertise. Even if they are not sure how their business problem(s) relate to the data aggregated, our team has business strategists and data experts with the expertise to navigate in the right direction, accurately define the business problem, and marry it with the right data for precise insights. Even if they can provide us only the raw data, our in-house data architects and ETL experts have years of experience working with clients to acquire, set up, transform, integrate, and load the data for analytics. And even if they are not sure how to integrate analytics solutions with their existing products, services, or dashboards, our team has the necessary data/solution integration expertise, with years of product development/management experience, to work with the client’s software developers and IT teams to achieve the right outcomes.

Last week we launched Kaugment Solutions, a combination of in-class training and an on-line portal for adaptive learning, for working professionals to achieve growth in their careers.

We at Kaugment have designed the training program with an understanding of the principles of continued education, leveraging lessons from our own experiences. For example, we have all experienced in our careers that simply obtaining a degree or a certification in a subject does not automatically guarantee professional success. Sustainable success comes only if we truly master the subject, to be able to confidently speak the right language/terminology as a professional expert (on the job or in the next interview) and also apply the knowledge in our day-to-day tasks.

So, at Kaugment we have designed the Project Management Professional (PMP) certification training program for our candidates to absorb the content comprehensively and achieve all three objectives (pass the exam, confidently speak the language, and apply the knowledge in the real world).

Also, we know that working professionals prefer experiential learning and learning through real-world examples. We also fully understand that whatever they choose to learn, they often select it based on an immediate purpose or application for that knowledge or certification, and for this reason they are also mostly self-directed. Keeping these attributes of continued education for working professionals in mind, we have designed an adaptive training program for project management. The in-class lectures by highly qualified instructors are augmented with a 24/7 on-line portal containing recorded video lectures and practice questions, for our students to listen, learn, practice, and evaluate at their own pace and convenience. The in-class instruction and the on-line content are embedded with several real-world cases of project management in varied industries. The on-line portal is also equipped with capabilities for students to interact and collaborate with instructors and fellow students anytime, from any place.

And we at Kaugment are continuously striving to find ways to make high-quality continued education affordable to all working professionals.

(See presentation on this topic at http://www.slideshare.net/RamSangireddy/data-analyticsproduct-practices)

As the business case is constructed to develop a data analytics platform product, and as the product sponsor and other primary stakeholders give the go-ahead based on estimated market potential, ROI, and other economic factors, the wheels are set in motion for the actual product development and its commercialization. From my experience, here are the steps that effective product managers follow through the development life cycle:

1. Gather Use Case Scenarios: While this is a continuous as well as iterative process through the life cycle, the product manager (PM) has to have an anchor to initiate the product development. The PM identifies key use cases of the product (some are possibly already identified in the business case) that will define its essential features and functionality. A good practice is to have at least 3-5 such business applications for the product before starting the development. Rolling wave planning can be adopted as the product development progresses and more clarity on the end product applications is obtained. Considering an example product for discussion here: if the overall intent of the data analytics platform being conceived is personalization of content and/or contextualized search, a few use cases can be in the space of “conversion of customer intent to transaction”, “personalized product/service recommendations”, “delivering competency-based adaptive content”, etc., depending on the client’s industry/vertical application.

2. Identify Key Features/Functions: This is the stage in which the product scope and its roadmap are defined, with the essential features and functions identified from the primary use case applications and other stakeholder requirements. The PM works closely with the architect and the stakeholders, and if necessary the development team, through this process for an effective depiction of the product scope at the business and engineering architecture levels. For the example analytics platform discussed above, the capabilities can include product/service recommendations to a customer based on her location or shopping/purchase history, recommendations based on interests/hobbies/habits, content access based on specific skills, etc.

3. Gather Data Requirements, Specifications, and Formats: The applications and functionality of the product lead the PM and team to identify the required data (e.g., customer transactions, inventory, price logs, resource utilization data, customer traffic, etc.), specifications (volume/size, variety, velocity/streaming, structured/unstructured, etc.), and formats (numerical, text, voice, video, image, etc.). Through this process, the PM also has to identify internal and external sources for the required data, the barriers and costs to acquire the data, the potential challenges in integrating the source APIs with the product platform’s input APIs, and all other related issues.

4. Develop Data Warehousing Methods: The PM should have a good understanding of the capabilities available to warehouse, ingest, and manage the data, as well as the extended capabilities to be built for the purpose. The warehousing of the data includes the extraction, transformation, and loading of the data for subsequent analyses. The ingestion of business/structured data is usually done with a traditional enterprise data warehouse (EDW), while unstructured data, such as customer activity (on the web-store, social media, etc.) written into log files, is warehoused in Hadoop. A good knowledge of the star schema for the traditional EDW helps the PM better handle issues during the architecture, design, and development phases. The best practice starts with domain modeling, contextual modeling, and data modeling at the logical and physical layers. Success of the platform also depends on choosing the right EDW tool for the right job: the PM and team usually have a choice among IBM’s DB2, IBM’s Netezza appliance (or both combined), an MPP system such as Teradata, or a columnar database such as Vertica.

Besides, a good grasp of the requirements for building a suitable Hadoop cluster (capacity, number of nodes, block replication, number of users, etc.) will help the PM work with the architect and the product development teams to build appropriate data warehouse systems. The PM’s knowledge of the capabilities required to facilitate analyses of data across the traditional EDW and Hadoop will further augment the team’s expertise (for example, a cross-analysis of customer segment data in the EDW and customer activity in the Hadoop cluster, to derive a behavior pattern for a particular customer category, requires such a capability and is a common requirement nowadays).

5. Develop Data Mining Techniques: As the data warehousing systems and methods are established, the PM and team of data scientists can identify and develop necessary data mining techniques. For the example analytics platform discussed above, some of the data mining techniques include Frequency analysis, Collaborative filtering, Causal analysis, Matrix factorization, Association rule mining, Time series analysis, K-Means or Hierarchical clustering, Regression analysis, Bayesian networks, etc.

6. Develop Reporting Layer: As the platform with analytics engines is built, the insights from analytics have to be displayed and reported, and the interface for display depends on the users. If the users are data scientists internal to the organization, the delivery usually ends at providing the platform, data, reporting & visualization tools (Tableau, Cognos, Datameer, etc.), and plugins/connectors to statistical analysis tools such as R and S, which are highly popular with data scientists. Data curation is also vital, to deliver the right data and insights in the right format/structure at the right time to the right stakeholders, so that timely business action is taken. On the other hand, if the end users are individual consumers, then appropriate efforts are to be made to build a sleek interface to deliver user-specific content.

The product development life cycle does not end here. Once the product is launched, information on its adoption, feature usage, and user experiences is gathered and fed back for further evolving the product. And so, the Plan-Do-Check-Act cycle continues.

For a presentation based discussion on this topic, please visit http://www.slideshare.net/RamSangireddy/data-analyticsproduct-practices

Thanks for reading. Comments to improve the content and enrich the discussion are most welcome.

