Hi. My name is Sri Raghavan. I have been in the analytics profession for over 20 years, working with hundreds of companies to deliver analytic implementations across a number of verticals and use cases. Today, I’m going to take you on a deeper dive into the power of Vantage analytic functions. One of the big things about Vantage that you are going to keep hearing over and over again is the ability to deliver impactful business outcomes. After all, what’s the point of doing analytics if you can’t deliver outcomes that make a difference in the lives of people and businesses? So here’s one example of a business outcome: customer churn. Retailers and many other industries are always interested in acquiring more customers and keeping the customers they have. So here’s the example I’m going to walk through today to show how an analyst deals with the issue of customer churn. You’ll see customers engaging in a number of activities, and in this case the retailer wants to understand three things. First, which behaviors lead up to members cancelling their membership. Second, whether certain kinds of sentiments are being expressed during the cancellation process. And third, whether they can model this somehow, so they can predict who is going to cancel, and when. Before we get into the meat of the use case, let me share a couple of important graphics and visualizations you can deliver through Vantage. Here’s an example of a Sankey diagram, which you can produce through native Vantage AppCenter applications. It shows customers churning; here’s a membership cancellation. Now, before cancellation occurs, the path analysis you see here visualized in the form of the Sankey indicates that some complaint call has been registered.
Or in some cases some webchat has occurred, or other kinds of activities. The path analysis is used to determine the behavior. Great, you can see the different paths. Next up, what the analyst, be it the data scientist or the business analyst, wants to understand is what exactly has been expressed by way of sentiments. Remember, webchat and online feedback have been provided. Here is an example of a word cloud generated with Vantage Analyst on all of the sentiments that have been expressed. You see words like confused, tired, expensive, frustrated. So clearly, it indicates that in addition to certain paths, certain patterns of behavior, specific sentiments have been expressed within those patterns. Okay, that’s number two: first you did the paths, now you’ve done the sentiment. Number three. Now the analyst, and the retailer, wants to understand: how can I model this behavior? How can I predict, before somebody actually cancels the membership, that they’re indeed going to cancel? So here is an example of a Vantage Analyst application which brings in a machine learning model, in this case one called Decision Forest, to predict the risk that a customer is going to churn, and as part of that it brings in a number of different activities. Not only are they able to model the behavior that leads to churn, they’re also able to evaluate the performance of the model. Here the model indicates 92 percent accuracy, which is actually pretty good. And it also tells you which areas contribute most towards churn. Here, for instance, you see that things like customer days, or the age of that particular customer, seem to make a difference, along with other variables. What’s my point here?
I went through three different apps, all available in Vantage: starting with behavioral analytics using path analysis, to sentiment extraction run on data collected from customers who engage with the retailer, to modeling that behavior so that I can predict customer churn. Those are the three things that happen in the course of understanding why customers churn. Let’s take it one step further. All of these things can be done programmatically. What do I mean by that? Using native Vantage functions, as we talked about before. Now, a couple of things are important. Here is an example of some data preparation that needs to be done. Remember, one of the big things about analytics, and most analytics professionals will tell you this, is that 80 percent of the time is actually spent wrangling the data. So here is an example of a native Vantage data prep function called Sessionize, which allows you to deconstruct all of the data into specific sessions. Take for example this customer, 28199. She or he engages in product browsing, a neutral call, webchat, return policy, and so on: seven transactions in all. Sessionization as a data prep function simply compresses those seven transactions into three session IDs. Now guess what? You have a much smaller dataset to work with, making it much easier to do analytics, but, most importantly, without forsaking any of the native intelligence in your raw data. That is done through the Sessionize function simply by calling it; you as an analyst don’t have to worry about any of the logic. Okay. We’re still on the first part, which is NPath, path analysis. Now I use path analysis.
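To make the sessionization idea concrete, here is a minimal Python sketch of what a function like Sessionize does conceptually. This is illustrative stdlib code, not the Vantage implementation, and the 30-minute gap threshold and event data are assumptions for the example:

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a session ID to each (timestamp, action) event for one
    customer: a new session starts whenever the gap since the previous
    event exceeds the threshold."""
    events = sorted(events)  # order by timestamp
    session_id = 0
    out = []
    prev_ts = None
    for ts, action in events:
        if prev_ts is not None and ts - prev_ts > gap:
            session_id += 1  # too long since the last event: new session
        out.append((session_id, ts, action))
        prev_ts = ts
    return out

# Seven raw events for a customer collapse into three sessions.
raw = [
    (datetime(2020, 1, 1, 9, 0), "product browsing"),
    (datetime(2020, 1, 1, 9, 5), "product browsing"),
    (datetime(2020, 1, 1, 9, 10), "webchat"),
    (datetime(2020, 1, 1, 14, 0), "neutral call"),
    (datetime(2020, 1, 1, 14, 20), "webchat"),
    (datetime(2020, 1, 2, 10, 0), "return policy"),
    (datetime(2020, 1, 2, 10, 15), "product return"),
]
sessions = sessionize(raw)
n_sessions = len({sid for sid, _, _ in sessions})  # -> 3
```

The point of the native function is that this logic, plus the partitioning across every customer, runs inside the database; you never write it yourself.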
Remember, I said the first thing you need to do for the Sankey chart is to figure out the different paths that have been taken. So how did the Sankey come about? Through the application of the native function called NPath, for path analysis. Again, pretty straightforward: I just say select star from NPath. This is SQL code; in SQL, I am calling this function called NPath directly in Vantage. Look at cell number 8 and tell me: do you see any logic for NPath expressed here, meaning the underlying guts of NPath? The answer is no. All of that remains transparent to the data scientist or business analyst who executes this. All Vantage is looking for is for you to tell it what you want to do the path analysis on. Here is where you specify the event; I said membership cancellation is what I’m looking for. Specify the event of interest and Vantage does the rest of the magic: it invokes all the logic in NPath around that event and gives you all of the detail you see here. So, for instance, here you see online feedback, store visit, product return, and so on, leading up to membership cancellation. All of these paths. And I’m just showing you one customer; imagine all of these paths for all the customers. That is what is encapsulated in the Sankey diagram from the very first screen I showed you, with NPath as part of the behavioral analytics. So phase number one is complete. You have taken all of this data across multiple sources, put it in Vantage, and run path analytics on Vantage without writing any of the native logic for NPath. Okay. All of this I showed you using SQL. Now, a lot of the time when I present like this, data scientists ask me, “Look, SQL is great. A lot of us love SQL. But a lot of us are new data scientists.
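The NPath logic itself stays hidden inside Vantage, but the core idea of path-to-event analysis can be sketched in a few lines of plain Python. This is illustrative only; the real NPath uses a much richer pattern-matching syntax, and the customer IDs and window size here are made up for the example:

```python
def paths_to_event(events_by_customer, event_of_interest, window=5):
    """For each customer, return the sequence of up to `window` events
    that immediately precede the event of interest, if it occurs."""
    paths = {}
    for cust, events in events_by_customer.items():
        if event_of_interest in events:
            i = events.index(event_of_interest)
            paths[cust] = events[max(0, i - window):i + 1]
    return paths

data = {
    "28199": ["product browsing", "webchat", "online feedback",
              "store visit", "product return", "membership cancellation"],
    "80914": ["product browsing", "store visit"],  # never cancelled
}
paths = paths_to_event(data, "membership cancellation")
# Only customer "28199" has a path ending in membership cancellation.
```

Aggregating these per-customer paths across the whole customer base is what feeds a Sankey diagram like the one shown earlier.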
We’re not particularly big on SQL. Some of us like Python, some of us like R. So are you telling me that SQL is the only way to go?” Not at all. I will get to alternative programming languages through which you can run all of these analytic functions, but for now, bear with me; we’re going to stick with SQL for one more part. Remember, I said that after path analysis we need to look at sentiments, and I showed you the word cloud. You bring all of this data into the picture and run sentiment extraction on it. Is it positive? Is it negative? Is it neutral? Same thing as before: a native Vantage function called Sentiment Extractor. Again, I want you to look carefully at cell number 5. Is there any logic which tells you how to decide what is considered positive versus negative versus neutral? The answer is no. All of that is for Vantage to figure out. All you have to tell Vantage is which field you’re trying to run sentiment extraction on; customer comments is what I want it on. And whenever you do something like sentiment extraction, there’s actually an underlying model. In this case it’s a supervised model, meaning that the sentiment is extracted based on prior learnings, and those prior learnings come from a dictionary. A dictionary, as all of you know, consists of word meanings. This dictionary, in the case of Sentiment Extractor in Vantage, tells you how to designate a particular phrase as positive versus negative versus neutral. Now, you might come back and say, “Hey, I’m not interested in your dictionary. I’ve got my own dictionary.” Fair enough, no problem. Just substitute your dictionary for mine, put it in the model file, and Vantage will pick it up and run Sentiment Extractor. And guess what?
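To show what dictionary-based sentiment scoring means in practice, here is a tiny Python sketch. The dictionary below is a made-up toy; the real Vantage dictionary is far larger, and Sentiment Extractor does more than simple word counting:

```python
# Hypothetical toy dictionary mapping words to sentiment weights.
SENTIMENT_DICT = {
    "frustrated": -1, "expensive": -1, "tired": -1, "confused": -1,
    "great": 1, "helpful": 1, "love": 1,
}

def extract_sentiment(comment, dictionary=SENTIMENT_DICT):
    """Score a comment by summing the dictionary values of its words,
    then label it positive, negative, or neutral."""
    score = sum(dictionary.get(w, 0) for w in comment.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

label = extract_sentiment("I am frustrated and the membership is expensive")
# -> "negative"
```

Swapping in your own dictionary is just a matter of passing a different mapping, which is exactly the point the talk makes about substituting your dictionary for the default one.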
You are now able to get things like positive and negative and neutral sentiment. The point is, your ability to extract these sentiments is only as good as the dictionary you use. So whether it’s yours or whether it’s mine, if you’re confident in yours, just plug it in and go. But you don’t have to do any of the heavy lifting; Vantage does it for you through SQL. Now let’s get back to the SQL versus other languages question. Here I’m going to show you an example. We completed phase two: we did path analysis towards understanding customer churn, and we’ve done sentiment extraction towards understanding customer churn. The last one is modeling. How can I create a model, in this case a machine learning model, which tells me who is likely to churn and who is not? So here is an example of a machine learning model implemented within Vantage using Python. Python, to a lot of data scientists, is the holy grail; that’s what they want to program in. Fine, no problem. In addition to SQL and all these intuitive user interfaces, we have Python. So real quick, how does it work within Vantage? In Python there is a package called teradataml which, if you purchase Vantage, comes automatically as part of it. teradataml is a package which consists of what we call wrapper functions. The NPath function which I showed you, that same NPath function, is available in this teradataml package. The only difference is that you use Python vernacular to execute it. A lot of you in the Python community are familiar with this library called pandas, the data-frame vernacular that Python users work in. A pandas-style interface is available in our implementation, which means whatever idiom you use in the open source world is exactly what you use in Vantage. The better thing is that native functions are available.
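The wrapper-function idea is worth a quick sketch: a Python function that generates the SQL for a native function and pushes execution down to the database, so the data never leaves Vantage. This is purely conceptual; the actual teradataml API, the SQL it emits, and the connection object here are all assumptions for illustration:

```python
def npath_wrapper(table, event_column, pattern, connection):
    """Conceptual sketch of a push-down wrapper: build the SQL that
    invokes a native in-database function and hand it to a connection.
    `connection.execute` stands in for a real database connection."""
    sql = (
        f"SELECT * FROM NPath ("
        f"ON {table} PARTITION BY customer_id ORDER BY event_ts "
        f"USING Pattern('{pattern}') Symbols({event_column} <> '' AS A)"
        f") AS dt"
    )
    return connection.execute(sql)

class FakeConnection:
    """Stand-in so the sketch runs without a database."""
    def execute(self, sql):
        return sql  # a real connection would return result rows

sql_sent = npath_wrapper("events", "event_name", "A*", FakeConnection())
```

The user writes Python; the heavy lifting happens where the data lives. That is the essence of the wrapper functions in teradataml and, as discussed later, tdplyr.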
So instead of having to code all those different modeling functions, you just use them in Vantage. Okay. But before we get into the actual model, there are a few things that become important. One of the big things that typically goes into model building is data prep. Remember, all my data has events in plain English: things like webchat, online feedback, store visit, membership cancellation. To prepare the data, we have to translate all of these things into a language that machine learning models understand. In this case, the data has to be transformed into what we call binary indicators. I’m not going to go deep into it, but let’s say, for instance, I have done product browsing. Instead of the words “product browsing”, that event would be represented by the number 1; if I haven’t done it, it’s a 0. So it’s a binary encoding: all my data is going to be translated into 0s and 1s. That’s a transformation process, a data prep process, and it happens directly on Vantage, resulting in the dataset here. Everything, as you can see, has been shifted into 0s and 1s. Now, I showed that to you really quickly; in thirty seconds I went from one part of the screen to the other. But if you think about it, a lot of work goes into it. If you don’t have Vantage, you typically have to take the dataset to a different place, code all those transformations, pull it back into separate samples, and bring it together. It takes a lot of work. I’m able to skip all of that and do it directly in Vantage in one script. Next, whenever I do predictive modeling I need to figure out which variables could potentially be important. One of the things that’s important to me is the number of events you participate in. See, here’s the deal: Vantage is not about coding, it’s about using your mind as an analytics professional.
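The 0/1 transformation described above is simple enough to sketch in a few lines of Python. The event types listed are assumptions based on the examples in this talk; the real transformation runs in-database:

```python
# Assumed event vocabulary for this example.
EVENT_TYPES = ["product browsing", "webchat", "online feedback",
               "store visit", "product return"]

def to_indicators(customer_events, event_types=EVENT_TYPES):
    """Turn a customer's list of events into a 0/1 vector: 1 if the
    customer performed that event type at least once, else 0."""
    performed = set(customer_events)
    return [1 if e in performed else 0 for e in event_types]

row = to_indicators(["product browsing", "webchat", "webchat"])
# -> [1, 1, 0, 0, 0]
```

Each customer becomes one row of 0s and 1s, which is exactly the shape a classification model expects.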
Whether you’re a data scientist, a business analyst, or a line-of-business manager, your job is to start hypothesizing. So one hypothesis I have here is: does the number of events that I engage in make a difference? Meaning, if I engage in five different events versus ten different events, does it make a difference in terms of whether I churn? So I create a calculated field here, where I simply total the number of events, to understand how many events each customer engages in. Okay, that’s one hypothesis. Then I say, hey, does gender make a difference? Maybe men are more likely to be churners. Okay, maybe; I don’t know. But here’s what makes it easy: in Vantage, I just pull it up and say, “Okay, let’s see. I don’t have the answer, but this is a great hypothesis.” Next I ask, do income and age make a difference? Maybe they do; maybe they have no impact at all. I don’t know, so let’s check. So here I’ve brought gender and income into the picture. Last but not least, whenever you run any kind of statistical model or machine learning model, even AI models, you need to create a training dataset and a test dataset. The training dataset is what’s used to build the model, and the test dataset is what’s used to evaluate it. Typically, when you do this outside of Vantage, you have to go out and create separate datasets. Here, I use a native Vantage function called Sample to do exactly that, and I can change the proportions. Here I’ve got a 20/80 split, but I could change the code to 70/30, or 60/40, or 50/50; whatever I want.
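The train/test sampling step can be sketched in plain Python to show what a function like Sample accomplishes. Illustrative only; the dataset size below matches the rough numbers quoted in this talk, and the seed is arbitrary:

```python
import random

def sample_split(rows, train_fraction=0.2, seed=42):
    """Randomly split rows into training and test sets. With 0.2 you
    get the 20/80 split used here; pass 0.7 for a 70/30 split."""
    rows = list(rows)
    rng = random.Random(seed)
    rng.shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = sample_split(range(8100), train_fraction=0.2)
# Roughly 1,600 training rows and 6,500 test rows, as in the demo.
```

The native function does the same thing in-database, so no data has to be exported just to split it.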
But for now, let’s keep it at 20/80. My point is that you can change it to whatever you want. So the training dataset is roughly 1,600 records and the test dataset is about 6,500. Usually in modeling it’s 80/20, the other way around, but I decided to do it this way because I wanted to find out how much tweaking I’d have to do in the model. But anyway, long story short, now I’m ready. I’ve got all my data prepped, I’ve transformed all my data, and I’ve created my test and training data. And guess what? I’m applying this XGBoost model. Really quickly, at a high level, what’s XGBoost? In machine learning, think of XGBoost as a classification model. When you apply this model, it’s able to predict whether you’re going to be a churner or a no churner, in this case. And XGBoost can be applied across multiple industries. Bankers want to understand: is this a fraudulent transaction or a legitimate transaction? Manufacturing companies want to understand: is my machine going to fail, is it at the cusp of failure, or is it perfectly fine? Many different outcomes; machine learning provides you the classification. In this case, I’m using XGBoost to classify. Great. A couple of things are important in this piece of code. These are called parameters: how deep do you want the trees to go, how many times do you want to learn. This is what’s called parameter tuning. In Vantage, it’s as simple as going into the screen and changing your parameters. I don’t want max depth to be 10? Let me make it 20. You can change these things. The number of iterations means how many times this model runs on the data, learning each time. Maybe I don’t want it to be 10; let me reduce it to 5 and see how good it is. Look at how cool Vantage is. In cell number 25, first of all, can you see any logic for XGBoost? The right answer is no.
Vantage pulls up all the logic from underneath. Vantage allows you to change all the parameters, and it runs; it creates the model. It’s all here. But more importantly, I take that model and I operationalize it. So here, for instance, is all the prediction I do with the model. I not only create the model but operationalize it through a native Vantage predict function. So here I am predicting. But look: in addition to predicting, how do I know the model is any good? This is where the 20/80 split, the test and training data, comes into the picture. Is it going to be accurate 100 percent of the time? The answer is most likely not. Never, in all my years of doing this, have I seen a model which is 100 percent accurate; it’s practically impossible. But you want to be as accurate as you can possibly be. So in this case, let’s say I predicted customer 80914 to be a no churner, and sure enough, she’s a no churner. But check this out: 6140 I predicted to be a no churner, but they’re actually a churner. Does that mean my modeling skills are terrible, that this model is bad, that XGBoost is a bad function? Probably some combination, or maybe not; maybe that’s just what the data is. A confusion matrix goes one step further towards actually evaluating the accuracy of your model. Again, another native Vantage function comes to your aid: ConfusionMatrix. What it does is take your model, evaluate it, and say, “Hey, let me give you a matrix of false positives, false negatives, true positives, and true negatives.” So here is the outcome of that. As you can see, when actual churn has occurred, I’m predicting it correctly 2,854 times, but I’ve predicted it wrongly 861 times. Guess what? My model’s overall accuracy is 76 percent. Now, Vantage can tell you a lot of facts about your model and what you’ve done, but it can’t tell you whether 76 percent is good or bad.
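The confusion-matrix arithmetic itself is simple, and a small Python sketch makes it clear what the native ConfusionMatrix function is reporting. The labels and the five-row example here are made up for illustration:

```python
def confusion_matrix(actual, predicted, positive="churner"):
    """Count true/false positives and negatives, plus overall accuracy:
    (tp + tn) / total."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / len(actual)
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn, "accuracy": accuracy}

actual    = ["churner", "churner", "no churner", "no churner", "churner"]
predicted = ["churner", "no churner", "no churner", "churner", "churner"]
result = confusion_matrix(actual, predicted)
# result["accuracy"] -> 0.6 (3 of 5 predictions correct)
```

The 76 percent accuracy quoted above is exactly this calculation run over the full test set.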
Neither Vantage nor any other software solution can ever tell you that. But your job as a data scientist is made much easier because of Vantage, because now you can look at it and say, “Hey, for my business and my context, 76 percent is not all that bad. I can live with it. It’s great.” But sometimes you’re in a different industry, like health care, and you say, “Hey, 76 percent accuracy is terrible. It’s appalling. Let me rejigger the model: add more data, more variables, or change the scope and build a different model.” My point is, all of these things make you think. Vantage provides enough tools, and makes it easy enough for data scientists, business analysts, and other analytics professionals, to think through the hypothesis-building process and deliver insights in a manner that comports with the importance of the insights they’re getting out to the business. That’s a big deal. Now, oftentimes customers come and ask me, and some of you ask me as well, “Hey, a machine learning model is a machine learning model. Whether I do it in open source or in Vantage, how different can XGBoost be?” And my answer is: you’re absolutely right. XGBoost is XGBoost. Decision Forest is Decision Forest. How different can it be? But guess what? There is a big difference, not in the actual scope of the model, but in the fact that you’re able to do it on all your data in Vantage: all your data prep, all your sampling, all your test and training datasets, and applying the model with the predict functions to operationalize its results. All of that can be done in Vantage, at scale. I have tons of examples from the customers I meet. Let me give you two examples in particular.
A large retailer wanted to run propensity models on the entire population of the United States. We said, “Oh, yeah, that’s great.” But they came back and said, “Well, hold on a second, we’re not done. There are 320 million people in the United States, you know that, right?” And we said, “Yeah, we get it.” Then they said, “I want to run a model every three hours to see if there’s any change. And, by the way, I have different geographies and different demographics that I want to run these models on.” So now you’re talking enormous sizes, not only of data but also of the number of different models that need to be run. How are you going to do that outside? Oftentimes people come and tell me, “Hey, I can do it on my laptop.” Sure, if it’s a small sample of 300 records, or 1,000 records, or even a million records, use the laptop. But when you’re talking serious analytics at the scales we’re talking about, like the population of the United States, with different models across that population to understand propensity to buy, you’re talking serious enterprise scale. That’s the power of Vantage: you can do these things. Let me give you one more example, also in retail. A retailer has an enormous number of stores across the world: stores in Europe, Africa, Asia, Australia, you get the idea. And not only at the continent level; they’re in different cities. This retailer has 175,000 stores and 100,000 SKUs, and they want to run a model on each SKU at each store. Think about this: 175,000 stores times 100,000 SKUs is 17.5 billion models. Can you run them on your laptop? Absolutely not. This is why functions like XGBoost running at enterprise scale on Vantage can handle it, and here’s an example of that, with model evaluation. Let me go one step further. Sometimes folks say, “Hey, okay, you showed me Python. What about R?” I don’t want to forget the R folks.
So here is an example in R. The only difference between R and Python here is a library called tdplyr which, just like teradataml, contains all the wrapper functions, using the same dplyr vernacular that R programmers are used to. You get the story; I’m not going to go through all of the data prep, which again can be done using Vantage. The difference here is that I’m running a different kind of model: in this case, a Decision Forest model. At a high level, a Decision Forest is basically a decision tree, but lots of decision trees put together in a forest, because a bunch of trees make a forest, taking the most frequent outcome that comes out of them. So here is a Decision Forest; this is an application of the model in R. R programmers will recognize all of the R functions here. Again, parameter tuning: say I want to change the number of trees from 42 to 36. Sure, no problem; it’s all here, you can do it. Hyperparameter tuning. The model is created, and the model also tells you the importance of the variables. In this case, previous purchases apparently make a difference when it comes to customers churning: the more you purchase, the less likely you are to churn. Now, it seems obvious, right? But at least it’s borne out by the facts. Income makes a difference, apparently. Gender doesn’t make that much of a difference; it’s low on the totem pole. But it tells you the importance of all the variables brought into the picture. Again, using R you can also operationalize. There is a predict function, like the one I talked about before: not only do I create the model, but I’m able to predict with it. And last but not least, in R is the same ConfusionMatrix I showed before. Whatever you were able to do in Python, you’re also able to do in R in terms of evaluating your model.
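The majority-vote idea behind a Decision Forest can be sketched in a few lines. To keep this runnable without a database, the sketch below is in Python rather than R, and the three toy "trees" are hand-written stand-ins for trained trees:

```python
from collections import Counter

def forest_predict(trees, row):
    """A decision forest is many trees voting: each tree predicts a
    label for the row, and the most frequent label wins."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

# Three toy "trees" keyed on previous purchases and income
# (hypothetical split rules, for illustration only).
trees = [
    lambda r: "no churner" if r["previous_purchases"] > 5 else "churner",
    lambda r: "no churner" if r["income"] > 50_000 else "churner",
    lambda r: "churner" if r["previous_purchases"] == 0 else "no churner",
]
label = forest_predict(trees, {"previous_purchases": 8, "income": 30_000})
# Two of three trees vote "no churner", so that is the prediction.
```

A real forest has dozens of trees (42 in the demo above) learned from the training data, but the vote-counting step is exactly this.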
The same ConfusionMatrix is here, and the same accuracy metric is here. This model seems to indicate a slightly higher level of accuracy, at 80 percent. Is that good? There’s no way Vantage is ever going to tell you that. But what Vantage will allow you to do, enable you to do, empower you to do, is to think that through. In all these things I’ve shown you, Vantage Analyst with the different visualizations, as I went from path analysis to sentiment extraction to modeling behavior and the likelihood of those behaviors, and as I showed you the code, one of the questions that keeps coming up is, “Hey, Vantage is fantastic. I can understand the scale. I can understand the many different models you can run. But you know what? I’ve already made certain investments. I’ve purchased, for example, BI tools like Tableau. I’ve got an ecosystem in play. I have Jupyter Notebooks already in play.” This is again the beauty of Vantage. In addition to all those things I talked about, scalability, performance, many different kinds of models, enterprise-grade durability on one platform without sampling, last but not least is ecosystem compatibility. You can keep whatever you’ve already bought. Nobody from Teradata will ever come and tell you to get rid of your existing investments. If you have Tableau, we have Tableau compatibility. If you have Jupyter Notebooks, no problem; you saw that I used Jupyter to run these pieces of code, and you can use the same Jupyter to run these functions through a client interface. Ecosystem compatibility is big. If you have data lakes, no problem; we can connect to those data lakes to get all the data into Vantage. So that is the power of Vantage.
To recap: it’s about the size of the data, the performance, the scalability of the analytics, the number of models you can run, the different personas who can build, consume, and operationalize those models, and the compatibility with all your existing investments, all on one platform meant for many different personas with different skill sets in the organization. That is the power we want to evoke, and hopefully you’ve seen it. Thank you.