Chapter 05

Algorithms for Predictive Analytics

Predictive analytics algorithms are derived largely from two sources: traditional statistical methods and contemporary machine learning techniques. Comparative studies published in the literature generally find that while statistical methods are well established and theoretically sound, machine learning techniques often deliver more accurate, informative, and actionable results. The statistical methods that have had the biggest impact on the evolution of predictive analytics and data mining include discriminant analysis, linear regression, and logistic regression. The most popular machine learning techniques used in successful predictive analytics projects include decision trees, k-nearest neighbor, artificial neural networks, and support vector machines. All of these machine learning techniques can handle both classification-type and regression-type prediction problems, and they are often applied to complex problems where more traditional techniques cannot produce satisfactory results.

Because of the popularity of machine learning techniques in the predictive analytics literature, this chapter explains some of the most common ones in detail, without getting too technical or algorithmic. We also briefly explain the main statistical techniques, namely linear regression, logistic regression, and time-series analysis, to provide balanced coverage of the algorithmic spectrum of predictive analytics.

Naive Bayes

Naive Bayes is a simple probability-based classification method derived from the well-known Bayes’ theorem. The method requires the output variable to have nominal/categorical values. The input variables can be a mix of numeric and nominal types, but if the output variable is numeric, it needs to be discretized via some type of binning method before it can be used in a Naive Bayes classifier. The “Naive” part of the name Naive Bayes comes from the strong, somewhat unrealistic assumption of independence among the input variables. Simply put, a Naive Bayes classifier assumes that the input variables do not depend on each other, and the presence (or absence) of a particular variable among the predictors tells us nothing about the presence or absence of any other variable.

Naive Bayes classification models can be developed very efficiently (i.e., rather rapidly, with very little computational effort) and effectively (i.e., quite accurately) in a supervised machine learning setting. That is, using a set of training data (which need not be very large), the parameters of a Naive Bayes classification model can be obtained using the maximum likelihood method. In other words, because of the independence assumption, we can develop Naive Bayes models without strictly complying with all the rules and requirements of Bayes’ theorem. First, let us review Bayes’ theorem.

Bayes’ Theorem

In order to appreciate the Naive Bayes classification method, it is important to understand the basic idea of Bayes’ theorem and the exact Bayesian classifier (the one without the strong naive independence assumption). Bayes’ theorem (also called Bayes’ rule), named after the English minister and statistician Thomas Bayes (1701–1761), is a mathematical formula for determining conditional probabilities. In this formula, shown shortly, Y denotes the hypothesis and X denotes the data/evidence. This vastly popular theorem provides a way to revise and improve prediction probabilities by using additional evidence.

The following formulas show the relationship between the probabilities of two events Y and X. P(Y) is the prior probability of Y. It is “prior” in the sense that it does not take into account any information about X. P(Y|X) is the conditional probability of Y, given X. It is also called the posterior probability because it is derived from (or depends on) the specified value of X. P(X|Y) is the conditional probability of X given Y. It is also called the likelihood. P(X) is the prior probability of X, which is also called the evidence and acts as the normalizing constant.
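Putting these quantities together, Bayes’ theorem states:

P(Y|X) = P(X|Y) × P(Y) / P(X)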

To numerically illustrate these formulas, let us look at a simple example. Say that, based on the weather report, we know that there is a 40% chance of rain on Saturday. From the historical data, we also know that if it rains on Saturday, there is a 10% chance it will also rain on Sunday, and if it does not rain on Saturday, there is an 80% chance it will rain on Sunday. Let us say that “Raining on Saturday” is event Y, and “Raining on Sunday” is event X. Based on the description, we can write the following:

P(Y) = Probability of Raining on Saturday = 0.40

P(X|Y) = Probability of Raining on Sunday if It Rained on Saturday = 0.10

P(X) = Probability of Raining on Sunday = Sum of the Probability of “Raining on Saturday and Raining on Sunday” and “Not Raining on Saturday and Raining on Sunday” = 0.40 × 0.10 + 0.60 × 0.80 = 0.52

Now, if we want to calculate the probability that it rained on Saturday given that it rained on Sunday, we can use Bayes’ theorem, which allows us to calculate the probability of an earlier event, given the result of a later event:
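P(Y|X) = P(X|Y) × P(Y) / P(X) = (0.10 × 0.40) / 0.52 = 0.04 / 0.52 ≈ 0.0769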

Therefore, in this example, if it rained on Sunday, there’s a 7.69% chance it rained on Saturday.

Naive Bayes Classifier

The Bayes classifier uses Bayes’ theorem without the simplifying strong independence assumption. In a classification-type prediction problem, the Bayes classifier works as follows: Given a new sample to classify, it finds all other samples that are exactly like it (i.e., all predictor variables have the same values as the sample being classified), determines the class labels they belong to, and classifies the new sample into the most representative class. If no sample is an exact match for the new sample, then the classifier fails to assign the new sample to a class label (because it cannot find any strong evidence to do so). Here is a very simple example. Using the Bayes classifier, say that we want to decide whether to play golf (Yes or No) for the following situation: Outlook is Sunny, Temperature is Hot, Humidity is High, and Windy is No.

Process of Developing Naive Bayes Classifier

Similar to other machine learning methods, Naive Bayes employs a two-phase model development and scoring/deployment process. The first phase is training, where the model/parameters are estimated, and the second phase is testing, where the classification/prediction is performed on new cases. The process is described in the following sections.

Training Phase

The following steps are involved in the training phase:

Step 1. Obtain the data, clean the data, and organize it in a flat file format (i.e., columns as variables and rows as cases).

Step 2. Make sure all the variables are nominal. If any variable is numeric/continuous, transform it into a nominal variable by using a discretization method, such as binning.

Step 3. Calculate the prior probability of all class labels for the dependent variable.

Step 4. Calculate the likelihoods of all predictor variables and their possible values with respect to the dependent variable. In the case of mixed variable types (categorical and continuous), the likelihood (conditional probability) of each variable is estimated with the method appropriate for that variable type. Likelihoods for nominal and numeric predictor variables are calculated as follows:

  • For categorical variables, the likelihood (i.e., the conditional probability) is estimated as the fraction of the training samples within each class of the dependent variable that have the given variable value.

  • For numeric variables, the likelihood is calculated by (1) computing the mean and variance of each predictor variable within each class (i.e., each value of the dependent variable) and then (2) calculating the likelihood by using the following formula:
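P(x | y = c) = (1 / √(2πσc²)) × e^(−(x − μc)² / (2σc²))

where x is the value of the numeric predictor variable, and μc and σc² are the mean and variance of that predictor calculated from the training samples belonging to class c.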

Quite often, the continuous/numeric independent/input variables are discretized (using an appropriate binning method), and then a categorical variable estimation method is used to calculate the conditional probabilities (likelihood parameters). If performed properly, this method tends to produce better-predicting Naive Bayes models.

Testing Phase

Using the two sets of parameters produced in steps 3 and 4 of the training phase, any new sample can be classified into a class label by using the following formula:
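P(C | x1, x2, ..., xn) = [P(C) × P(x1 | C) × P(x2 | C) × ... × P(xn | C)] / P(x1, x2, ..., xn)

Here, C is a class label of the dependent variable, P(C) is its prior probability (from step 3), P(xi | C) are the likelihoods of the observed predictor values (from step 4), and P(x1, x2, ..., xn) is the evidence, which serves as the denominator.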

Because the denominator is constant (i.e., the same for all class labels), we can remove it from the formula, which leaves us with the following simpler formula, which is essentially the joint probability:
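Predicted class = the class label C that maximizes P(C) × P(x1 | C) × P(x2 | C) × ... × P(xn | C)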

Naive Bayes is not very commonly used in predictive analytics projects today due to its relatively poor prediction performance in a wide variety of application domains. However, one of its extensions, called Bayesian network, is gaining surprisingly rapid popularity among data scientists in the analytics world.
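To make the training and testing phases described above concrete, the following minimal Python sketch implements a categorical Naive Bayes classifier from scratch. The tiny data set (which borrows the variable names of the golf example), the function names, and the absence of any smoothing are illustrative assumptions rather than material from the chapter.

```python
from collections import defaultdict

def train_naive_bayes(rows, target):
    """Training phase: estimate the priors (step 3) and likelihoods (step 4)
    from a list of dicts holding nominal values only (steps 1-2 assumed done)."""
    class_counts = defaultdict(int)                        # counts per class label
    value_counts = defaultdict(lambda: defaultdict(int))   # counts per (variable, value) per class
    for row in rows:
        c = row[target]
        class_counts[c] += 1
        for var, value in row.items():
            if var != target:
                value_counts[(var, value)][c] += 1
    n = len(rows)
    priors = {c: cnt / n for c, cnt in class_counts.items()}
    likelihoods = {key: {c: cnt / class_counts[c] for c, cnt in counts.items()}
                   for key, counts in value_counts.items()}
    return priors, likelihoods

def classify(sample, priors, likelihoods):
    """Testing phase: pick the class that maximizes the joint probability
    P(class) * product of P(variable = value | class)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for var, value in sample.items():
            score *= likelihoods.get((var, value), {}).get(c, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up training data reusing the golf example's variable names.
data = [
    {"Outlook": "Sunny",    "Humidity": "High",   "Windy": "No",  "Play": "No"},
    {"Outlook": "Sunny",    "Humidity": "High",   "Windy": "Yes", "Play": "No"},
    {"Outlook": "Overcast", "Humidity": "High",   "Windy": "No",  "Play": "Yes"},
    {"Outlook": "Rainy",    "Humidity": "Normal", "Windy": "No",  "Play": "Yes"},
]
priors, likelihoods = train_naive_bayes(data, target="Play")
print(classify({"Outlook": "Sunny", "Humidity": "High", "Windy": "No"}, priors, likelihoods))
```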

Linear Regression

Regression, especially linear regression, is perhaps the most widely used analysis technique in statistics. Historically speaking, the roots of regression date back to the latter part of the nineteenth century, to the early work on the inherited characteristics of sweet peas by Sir Francis Galton and, subsequently, Karl Pearson. Since then, regression has become the statistical technique for characterizing relationships between explanatory (input) variables and response (output) variables.

As popular as it is, regression is essentially a relatively simple statistical technique for modeling the dependence of a variable (the response or output variable) on one or more explanatory (input) variables. Once identified, the relationship between the variables can be formally represented as a linear/additive function or equation. Like many other modeling techniques, regression aims to capture the functional relationships among characteristics of the real world and to describe them with a mathematical model, which may then be used to explore and explain those relationships or to forecast future occurrences.

Regression can be used for two different purposes: hypothesis testing (investigating potential relationships between different variables) and prediction/forecasting (estimating values of a response variable based on one or more explanatory variables). These two uses are not mutually exclusive; the explanatory power of regression is also the foundation of its predictive ability. In hypothesis testing (theory building), regression analysis can reveal the existence, strength, and direction of relationships between a number of explanatory variables (often represented as x1, x2, ..., xn) and the response variable (often represented as y). In prediction, regression uses an equation to capture the additive mathematical relationship between one or more explanatory variables and a response variable. Once determined, this equation can be used to forecast the values of the response variable for a given set of values of the explanatory variables.

Correlation Versus Regression

Because regression analysis originated in correlation studies, and because both methods attempt to describe the association between two (or more) variables, the terms regression and correlation are often confused, even by scientists. Correlation makes no a priori assumption about whether one variable depends on the other(s) and is not concerned with the direction of the relationship; instead, it gives an estimate of the degree of association between the variables. Regression, on the other hand, attempts to describe the dependence of a response variable on one or more explanatory variables; it implicitly assumes a one-way causal effect from the explanatory variables to the response variable, regardless of whether the path of effect is direct or indirect. Also, whereas correlation measures the pairwise association between two variables, regression is concerned with the relationship between all the explanatory variables and the response variable.

Simple Versus Multiple Regression

If a regression equation is built between one response variable and one explanatory variable, then it is called simple regression. For instance, the regression equation built to predict or explain the relationship between the height of a person (explanatory variable) and the weight of a person (response variable) is a good example of simple regression. Multiple regression is an extension of simple regression in which there are multiple explanatory variables. For instance, in the previous example, if we were to include not only the height of a person but also other personal characteristics (e.g., BMI, gender, ethnicity) to predict the weight of the person, then we would be performing multiple regression analysis. In both cases, the relationships between the response variables and the explanatory variables are linear and additive in nature. If the relationships are not linear, then we might want to use one of many other nonlinear regression methods to better capture the relationships between the input and output variables.

How to Develop a Linear Regression Model

To understand the relationship between two variables, the simplest thing to do is to draw a scatterplot, with the y-axis representing the values of the response variable and the x-axis representing the values of the explanatory variable (see Figure 5.7). Such a scatterplot would show the changes in the response variable as a function of the changes in the explanatory variable. In the example shown in Figure 5.7, there seems to be a positive relationship between the two: As the explanatory variable values increase, so does the response variable.

Figure 5.7 A Scatter Plot with a Simple Linear Regression Line

Simple regression analysis aims to find a mathematical representation of this relationship. In essence, it tries to find the equation of a straight line passing through the plotted dots (which represent the observed/historical data) in such a way that it minimizes the distance between the dots and the line (the predicted values on the theoretical regression line). Several methods/algorithms have been proposed to identify this regression line, and the one most commonly used is the ordinary least squares (OLS) method. The OLS method minimizes the sum of squared residuals (i.e., the squared vertical distances between the observations and the regression line) and leads to a mathematical expression for the estimated values of the regression coefficients (known as the β parameters). For simple linear regression, the relationship between the response variable (y) and the explanatory variable (x) can be written as a simple equation, as follows:

y = β0 + β1x

In this equation, β0 is called the intercept, and β1 is called the slope. Once OLS determines the values of these two coefficients, the simple equation can be used to forecast the values of y for given values of x. The sign and the value of β1 also reveal the direction and the strength of the relationship between the two variables.

If the model is a multiple linear regression, then more coefficients need to be determined, one for each additional explanatory variable. As the following formula shows, each additional explanatory variable is multiplied by its own βi coefficient, and the terms are summed to establish a linear additive representation of the response variable:

y = β0 + β1x1 + β2x2 + ... + βnxn
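As a brief illustration of how the β coefficients are obtained in practice, here is a minimal Python sketch that fits a multiple linear regression with ordinary least squares using NumPy. The synthetic data, the chosen “true” coefficients, and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic data: two explanatory variables (x1, x2) and a response y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # 100 cases, 2 explanatory variables
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Add a column of ones so the first coefficient plays the role of the intercept (beta_0).
X_design = np.column_stack([np.ones(len(X)), X])

# OLS: choose the betas that minimize the sum of squared residuals.
betas, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated beta_0, beta_1, beta_2:", betas.round(2))

# Forecast y for a new case, e.g., x1 = 1.0 and x2 = 2.0.
x_new = np.array([1.0, 1.0, 2.0])                  # leading 1.0 multiplies the intercept
print("Predicted y:", round(float(x_new @ betas), 2))
```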

How to Tell Whether a Model Is Good Enough

For a variety of reasons, sometimes models do not do a good job of representing reality. Regardless of the number of explanatory variables included, there is always a possibility of ending up with a poor model, and therefore a linear regression model needs to be assessed for its fit (i.e., the degree to which it represents the response variable). In the simplest sense, a well-fitting regression model produces predicted values that are close to the observed data values. For numeric assessment, three statistical measures are often used in evaluating the fit of a regression model: R2 (R-squared), the overall F-test, and the root mean squared error (RMSE). All three measures are based on sums of squared errors: how far the data is from the mean (the total sum of squares) and how far the data is from the model’s predicted values (the residual sum of squares). Different combinations of these two quantities provide different information about how the regression model compares to the mean model.

Of the three, R2 has the most useful and understandable meaning because of its intuitive scale. The value of R2 ranges from 0 to 1 (corresponding to the proportion of variability explained, expressed as a percentage), with 0 indicating that the proposed model has no explanatory or predictive power and 1 indicating a perfect fit that produces exact predictions (which is almost never the case). Good R2 values are usually close to 1, but what counts as “close” depends on the phenomenon being modeled; for example, while an R2 value of 0.3 for a linear regression model in social science may be considered good enough, an R2 value of 0.7 in engineering might not be considered a good fit. A regression model can sometimes be improved, and its R2 increased, by adding explanatory variables, removing some variables from the model, or using different data transformation techniques.
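The following short sketch shows how two of these fit measures, R-squared and RMSE, can be computed from observed and predicted values with NumPy; the numbers in y_actual and y_predicted are made up purely for illustration.

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares around the mean)."""
    ss_res = np.sum((y_actual - y_predicted) ** 2)
    ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_actual, y_predicted):
    """Root mean squared error: the typical size of a prediction error, in y's units."""
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))

# Illustrative values only.
y_actual = np.array([10.0, 12.0, 15.0, 18.0, 21.0])
y_predicted = np.array([9.5, 12.5, 14.0, 18.5, 21.5])
print("R-squared:", round(float(r_squared(y_actual, y_predicted)), 3))
print("RMSE:", round(float(rmse(y_actual, y_predicted)), 3))
```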

Figure 5.8 shows the process flow involved in developing regression models. As shown in the figure, the model development task is followed by model assessment. Model assessment involves assessing the fit of the model and also, because of restrictive assumptions with which linear models have to comply, it involves examining the validity of the model.

Figure 5.8 Process Flow for Developing Regression Models

The Most Important Assumptions in Linear Regression

Even though they are still the top choice of many data analysts (for both explanatory and predictive modeling purposes), linear regression models suffer from several highly restrictive assumptions. The validity of a linear model depends on its ability to comply with these assumptions. These are the most common assumptions:

  • Linearity. This assumption states that the relationship between the response variable and the explanatory variables is linear. That is, the expected value of the response variable is a straight-line function of each explanatory variable, with all other explanatory variables held fixed, and the slope of that line does not depend on the values of the other variables. This implies that the effects of the explanatory variables on the expected value of the response variable are additive in nature.

  • Independence (of errors). This assumption states that the errors of the response variable are not correlated with each other. This independence of the errors is weaker than actual statistical independence, which is a stronger condition and is often not needed for linear regression analysis.

  • Normality (of errors). This assumption states that the errors of the response variable are normally distributed. That is, they are supposed to be totally random and should not represent any nonrandom patterns.

  • Constant variance (of errors). This assumption, also called homoscedasticity, states that the errors of the response variable have the same variance regardless of the values of the explanatory variables. In practice, this assumption is often violated when the response variable varies over a wide range or scale.

  • Multicollinearity. This assumption states that the explanatory variables are not highly correlated with each other (i.e., each one provides a distinct piece of the information needed by the model rather than replicating another). Multicollinearity can be triggered by having two or more highly (or perfectly) correlated explanatory variables in the model (e.g., if the same explanatory variable is mistakenly included twice, perhaps with one being a slight transformation of the other). A correlation-based data assessment usually catches this error.

Statistical techniques have been developed to identify the violation of these assumptions, and several techniques have been created to mitigate them. A data modeler needs to be aware of the existence of these assumptions and put in place a way to assess models to make sure they are compliant with the assumptions they are built on.
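As one way to operationalize the correlation-based assessment mentioned in the multicollinearity bullet, the following NumPy sketch computes the correlation matrix of the explanatory variables along with variance inflation factors (VIF = 1 / (1 − R²), where R² comes from regressing each explanatory variable on the others). The synthetic data and the informal rule of thumb that VIF values above roughly 10 signal trouble are illustrative assumptions.

```python
import numpy as np

def variance_inflation_factors(X):
    """For each column of X, regress it on the remaining columns with OLS
    and convert the resulting R^2 into a VIF of 1 / (1 - R^2)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        betas, *_ = np.linalg.lstsq(others, target, rcond=None)
        fitted = others @ betas
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data: x3 is nearly a copy of x1, which should trip the checks.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2, x3])

print("Correlation matrix:\n", np.corrcoef(X, rowvar=False).round(2))
print("VIFs:", variance_inflation_factors(X).round(1))   # values above ~10 are suspect
```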

Logistic Regression

Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. It was developed in the 1940s as a complement to linear regression and linear discriminant analysis methods. It has been used extensively in numerous disciplines, including the medical and social science fields. Logistic regression is similar to linear regression in that it also aims to regress to a mathematical function that explains the relationship between the response variable and the explanatory variables, using a sample of past observations (training data). It differs from linear regression on one major point: Its output (response variable) is a class as opposed to a numeric variable. That is, whereas linear regression is used to estimate a continuous numeric variable, logistic regression is used to classify a categorical variable. Even though the original form of logistic regression was developed for a binary output variable (e.g., 1/0, yes/no, pass/fail, accept/reject), the present-day modified version is capable of predicting multiple-class output variables (i.e., multinomial logistic regression). If there is only one predictor variable and one predicted variable, the method is called simple logistic regression; similarly, simple linear regression is the term for a linear regression model with only one independent variable.

In predictive analytics, logistic regression models are used to develop probabilistic models between one or more explanatory or predictor variables (which may be a mix of both continuous and categorical variables) and a class or response variable (which may be binomial/binary or multinomial/multiple-class variables). Unlike ordinary linear regression, logistic regression is used for predicting categorical (often binary) outcomes of the response variable—treating the response variable as the outcome of a Bernoulli trial. Therefore, logistic regression takes the natural logarithm of the odds of the response variable to create a continuous criterion as a transformed version of the response variable. Thus the logit transformation is referred to as the link function in logistic regression; even though the response variable in logistic regression is categorical or binomial, the logit is the continuous criterion on which linear regression is conducted.

Figure 5.9 shows a logistic regression function where the log odds (i.e., a linear function of the independent variables, also called the logit) are represented on the x-axis, and the probabilistic outcome is shown on the y-axis (i.e., the response variable values change between 0 and 1).

Figure 5.9 The Logistic Function

The logistic function, which is f(y) in Figure 5.9, is the core of logistic regression and can only take values between 0 and 1. The following equation is a simple mathematical representation of this function:
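f(y) = 1 / (1 + e^−y), where y = β0 + β1x1 + β2x2 + ... + βnxn

Here, y is the linear combination of the explanatory variables (the logit shown on the x-axis of Figure 5.9), e is the base of the natural logarithm, and f(y) is the predicted probability, which always falls between 0 and 1.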

The logistic regression coefficients (the βs) are usually estimated using the maximum-likelihood estimation method. Unlike with linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximizes the likelihood function, and an iterative process must be used instead. This process begins with a tentative starting solution and then revises the parameters slightly to see if the solution can be improved; it repeats this iterative revision until no improvement can be achieved or the improvements are very minimal, at which point the process is said to have completed, or converged.
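To illustrate this iterative estimation idea, the following Python sketch maximizes the log-likelihood with plain gradient ascent rather than the Newton-type routines most statistical packages use. The synthetic data, the learning rate, and the stopping tolerance are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, learning_rate=0.5, tol=1e-8, max_iter=50_000):
    """Start from a tentative solution (all betas at zero) and revise it
    until the log-likelihood stops improving (convergence)."""
    X_design = np.column_stack([np.ones(len(X)), X])     # first beta is the intercept
    betas = np.zeros(X_design.shape[1])
    prev_ll = -np.inf
    for _ in range(max_iter):
        p = sigmoid(X_design @ betas)                    # predicted probabilities
        ll = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if ll - prev_ll < tol:                           # improvement is negligible: converged
            break
        prev_ll = ll
        betas += learning_rate * (X_design.T @ (y - p)) / len(y)   # gradient of the log-likelihood
    return betas

# Synthetic binary outcome driven by a single explanatory variable.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = (rng.uniform(size=500) < sigmoid(0.5 + 2.0 * x)).astype(float)

betas = fit_logistic_regression(x.reshape(-1, 1), y)
print("Estimated beta_0, beta_1:", betas.round(2))
print("Predicted P(y = 1) at x = 1.0:", round(float(sigmoid(betas[0] + betas[1] * 1.0)), 3))
```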

Time-Series Forecasting

Sometimes the variable of interest (i.e., the response variable) may not have distinctly identifiable explanatory variables, or there may be too many of them in a highly complex relationship. In such cases, if the data is available in the desired format, a prediction model called a time-series forecast can be developed. A time series is a sequence of data points of the variable of interest, measured and represented at successive points in time and spaced at uniform time intervals. Examples of time series include monthly rain volumes in a geographic area, the daily closing value of the stock market indices, and daily sales totals for a grocery store. Often, time series are visualized using a line chart. Figure 5.10 shows an example of a time series of sales volumes for the years 2015 through 2019, on a quarterly basis.

Figure 5.10 Sample Time-Series Data on Quarterly Sales Volumes

Time-series forecasting involves using a mathematical model to predict future values of the variable of interest, based on previously observed values. Time-series plots or charts look very similar to simple linear regression scatterplots in that there are two variables: the response variable and the time variable. Beyond this similarity, there is hardly any other commonality between the two. Whereas regression analysis is often used in testing theories to see if current values of one or more explanatory variables explain (and hence predict) the response variable, time-series models are focused on extrapolating the time-varying behavior to estimate future values.

Time-series forecasting assumes that all the explanatory variables are aggregated and consumed in the response variable’s time-variant behavior. Therefore, capturing time-variant behavior is a way to predict the future values of the response variable. To do that, the pattern is analyzed and decomposed into its main components: random variations, time trends, and seasonal cycles. The time-series example shown in Figure 5.10 illustrates all these distinct patterns.

The techniques used to develop time-series forecasts range from very simple (e.g., the naive forecast, which uses the most recent actual value as the next period’s forecast) to very complex (e.g., ARIMA, which combines autoregressive and moving average patterns in the data). The most popular techniques are perhaps the averaging methods, which include simple averages, moving averages, weighted moving averages, and exponential smoothing. Many of these techniques also have advanced versions in which seasonality and trend can be taken into account for better and more accurate forecasting. The accuracy of a method is usually assessed by computing its error (i.e., the deviation between actuals and forecasts for past observations) via the mean absolute error (MAE), mean squared error (MSE), or mean absolute percent error (MAPE). Even though they are built on the same underlying errors, these three measures emphasize different aspects of the error, with some penalizing large errors more heavily than others.
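As a small illustration of one of the averaging methods and the error measures just mentioned, the following Python sketch produces one-step-ahead forecasts with simple exponential smoothing and scores them with MAE, MSE, and MAPE. The quarterly sales figures and the smoothing constant alpha are made up for illustration and are not the data behind Figure 5.10.

```python
import numpy as np

def exponential_smoothing(series, alpha=0.3):
    """One-step-ahead simple exponential smoothing: each new forecast is
    alpha * (latest actual) + (1 - alpha) * (previous forecast)."""
    forecasts = [series[0]]                    # seed the first forecast with the first actual
    for actual in series[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return np.array(forecasts)

def mae(actual, forecast):
    return np.mean(np.abs(actual - forecast))

def mse(actual, forecast):
    return np.mean((actual - forecast) ** 2)

def mape(actual, forecast):
    return np.mean(np.abs((actual - forecast) / actual)) * 100

# Made-up quarterly sales volumes with a mild dip every fourth quarter.
sales = np.array([120.0, 135.0, 150.0, 110.0, 128.0, 142.0, 160.0, 118.0])
forecast = exponential_smoothing(sales, alpha=0.3)
next_quarter = 0.3 * sales[-1] + 0.7 * forecast[-1]

print("Forecast for the next quarter:", round(float(next_quarter), 1))
print("MAE: %.1f   MSE: %.1f   MAPE: %.1f%%"
      % (mae(sales, forecast), mse(sales, forecast), mape(sales, forecast)))
```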

Refer to the PDF for the application example: Data Mining for Complex Medical Procedures.
