Chapter 03

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Another standardized data mining process, arguably the most popular one, is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which was proposed in the mid- to late 1990s by a European consortium of companies.

Today, the CRISP-DM process is used for business analytics and data science projects, and it is sometimes called CRISP-BA (for business analytics), CRISP-DS (for data science), or just CRISP (which makes it a more generalized term by omitting any analytics-related specification).

The CRISP-DM Standardized Process for Data Analytics/Data Science

Even though these steps are shown in sequential order in the graphical representation, there is usually a great deal of backtracking.

A significant portion of data analytics/data science work relies on experience and best-practices-driven experimentation; that is, there is no single magical approach that works for every project and every data scientist.

Therefore, depending on the problem situation and the skills, knowledge, and experience of the analyst, the process can be rather iterative (i.e., it may require going back and forth through the steps quite a few times) and time-intensive. Because each step is built on the outcome of the immediate predecessor, it’s important to pay attention to the earlier steps in order to avoid putting an entire study on an incorrect path from the outset.

Step 1: Business Understanding

The key to success in any data mining project is to know what the study is for.

Specific goals are needed, such as "What are the typical profiles of our customers, and how much value does each of them provide to us?"

A project plan should be developed for collecting the data, analyzing the data, and reporting the findings.

A budget to support the study should also be established, at least in rough numbers.

Intimately knowing the business purpose is critical to achieving success.

Step 2: Data Understanding

The second step is to match the business problem with the data that will be used to address it.

It is important to identify the relevant data from many available data sources.

First and foremost, an analyst should be clear and concise in describing the data mining task so that the most relevant data can be identified.

For example, a retail data mining project might seek to identify spending behaviors of female shoppers who purchase seasonal clothes based on their demographics, credit card transactions, and socioeconomic attributes.

Furthermore, the analyst should gain an intimate understanding of the data sources.

For example, the analyst should know where the relevant data is stored and in what form, whether data collection is automated or happens manually, who collects the data, and how often the data are updated.

The analyst should also understand the variables by seeking answers to questions such as "What are the most relevant variables?", "Are there any synonymous and/or homonymous variables?", and "Are the variables independent of each other; that is, do they stand as a complete information source without overlapping or conflicting information?"

To better understand the data, an analyst often uses a variety of statistical and graphical techniques, such as simple statistical descriptors/summaries of each variable (for numeric variables, the average, minimum/maximum, median, and standard deviation; for categorical variables, the mode and frequency tables), correlation analysis, scatterplots, histograms, and box plots.
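
As a minimal sketch of these exploratory summaries, assuming pandas is available and using a small made-up customer table (the column names are illustrative, not prescribed by CRISP-DM):

```python
# Hypothetical customer table for illustration only.
import pandas as pd

customers = pd.DataFrame({
    "age": [34, 45, 29, 52, 41, 38],
    "annual_spend": [1200.0, 3400.0, 800.0, 5100.0, 2300.0, 1900.0],
    "segment": ["regular", "premium", "regular", "premium", "regular", "regular"],
})

# Numeric variables: average, minimum/maximum, median, standard deviation.
print(customers[["age", "annual_spend"]].describe())

# Categorical variables: mode and frequency table.
print(customers["segment"].mode())
print(customers["segment"].value_counts())

# Correlation analysis between the numeric variables.
print(customers[["age", "annual_spend"]].corr())

# Graphical checks (histograms, scatterplots, box plots) could follow with,
# e.g., customers.hist(), customers.plot.scatter(x="age", y="annual_spend"),
# and customers.boxplot().
```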

Careful identification and selection of data sources and the most relevant variables can make it easier for data mining algorithms to quickly discover useful knowledge patterns.

Data sources for data selection can vary.

Normally, data sources for business applications include:

demographic data (e.g., income, education, number of households, age), sociographic data (e.g., hobbies, club memberships, entertainment), and transactional data (e.g., sales record, credit card spending, issued checks), among other types.

Data can be categorized as quantitative and qualitative. Quantitative data is measured using numeric values. It can be discrete (e.g., integers) or continuous (e.g., real numbers). Qualitative data, also known as categorical data, contains both nominal and ordinal data.
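
A small sketch of how this distinction can be made explicit in practice, assuming pandas and using invented column values; the ordinal variable is given an explicit category order:

```python
import pandas as pd

df = pd.DataFrame({
    "num_households": [1, 2, 4, 3],                  # quantitative, discrete
    "income": [42000.5, 58000.0, 71000.2, 39000.7],  # quantitative, continuous
    "hobby": ["golf", "reading", "golf", "cycling"],  # qualitative, nominal
    "education": ["high school", "bachelor", "master", "bachelor"],  # qualitative, ordinal
})

# Mark the ordinal variable so its categories carry an explicit order.
df["education"] = pd.Categorical(
    df["education"],
    categories=["high school", "bachelor", "master"],
    ordered=True,
)
print(df.dtypes)
```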

Step 3: Data Preparation

This step is commonly called data preprocessing.

In CRISP-DM, data preprocessing consumes the most time and effort—roughly 80% of the total time spent on a data mining project.

Real-world data is generally incomplete, and it needs to be converted to a consistent and unified format.

Data cleaning means filtering, aggregating, and filling in missing values (a.k.a. imputation).

An analyst examines the selected variables for outliers and redundancies.

Outliers may occur for many reasons, such as human errors or technical errors, or they may naturally occur in a data set due to extreme events.

If the age of a credit card holder is recorded as 12, this is likely a data entry error—most likely made by a human. However, there might actually be an independently wealthy preteen with important purchasing habits. Arbitrarily deleting this outlier could dismiss valuable information.
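
One common way to flag (rather than automatically delete) such records is the interquartile-range rule; a hedged sketch, assuming pandas and a hypothetical age column:

```python
import pandas as pd

ages = pd.Series([34, 45, 29, 52, 41, 38, 12, 47])  # 12 is the suspicious record

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the whiskers for manual review; do not delete blindly.
flagged = ages[(ages < lower) | (ages > upper)]
print(flagged)
```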

Data may also be redundant, with the same information recorded in several different ways.

Aggregating data reduces the data's dimensions. Note that although an aggregated data set has a smaller volume, the essential information remains.

If a marketing promotion for furniture sales is being planned over the next three or four years, the available daily sales data can be aggregated into annual sales data; the size of the sales data is then dramatically reduced. Smoothing the data helps fill in missing values of the selected variables, with new, reasonable values added in their place; these added values are often the mean or the mode of the variable. Left untreated, a missing value often means no solution is found when a data mining algorithm is applied to discover the knowledge patterns.
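
A minimal sketch of both ideas, assuming pandas and NumPy and a synthetic daily sales series (the dates and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily sales.
daily_sales = pd.DataFrame(
    {"sales": np.random.default_rng(0).uniform(100, 500, size=730)},
    index=pd.date_range("2022-01-01", periods=730, freq="D"),
)

# Aggregation: 730 daily rows become 2 annual rows; far fewer records,
# but the information needed for a multi-year decision remains.
annual_sales = daily_sales.groupby(daily_sales.index.year).sum()
print(annual_sales)

# Imputation: fill a missing value with the variable's mean
# (the mode would be the analogous choice for a categorical variable).
daily_sales.iloc[10, 0] = np.nan
daily_sales["sales"] = daily_sales["sales"].fillna(daily_sales["sales"].mean())
```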

Step 4: Model Building

Various modeling techniques are selected and applied to an already prepared data set in order to address the specific business need.
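
As a hedged sketch of this step, the example below fits one candidate technique (a decision tree classifier from scikit-learn) to a synthetic, already prepared data set; neither the library nor the technique is prescribed by CRISP-DM:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an already prepared (cleaned, encoded) data set.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# The held-out portion is reserved for Step 5 (testing and evaluation).
print("training accuracy:", model.score(X_train, y_train))
```

In practice, several competing techniques would be built and compared in the same way before settling on the one that best addresses the business need.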

Step 5: Testing and Evaluation

Step 6: Deployment

SEMMA
