Advanced Analytics – Theory and Methods

Copyright © 2014 EMC Corporation. All Rights Reserved.

Module 4: Analytics Theory/Methods

Advanced Analytics – Theory and Methods

Upon completion of this module, you should be able to:

• Examine analytic needs and select an appropriate technique based on business objectives; initial hypotheses; and the data’s structure and volume
• Apply some of the more commonly used methods in Analytics solutions
• Explain the algorithms and the technical foundations for the commonly used methods
• Explain the environment (use case) in which each technique can provide the most value
• Use appropriate diagnostic methods to validate the models created
• Use R and in-database analytical functions to fit, score and evaluate models

The objectives of this module are listed. The analytical methods covered are:

Categorization (unsupervised):
1. K-means clustering
2. Association Rules

Regression:
3. Linear
4. Logistic

Classification (supervised):
5. Naïve Bayesian classifier
6. Decision Trees

7. Time Series Analysis
8. Text Analysis

Where “R” we?

• In Module 3 we reviewed R skills and basic statistics

• You can use R to:

Generate summary statistics to investigate a data set

Visualize Data

Perform statistical tests to analyze data and evaluate models

• Now that you have data, and you can see it, you need to plan the analytic model and determine the analytic method to be used

Module 4 focuses on the most commonly used analytic methods, detailing:

a) Prominent use cases for the method

b) Algorithms to implement the method

c) Diagnostics that are most commonly used to evaluate the effectiveness of the method

d) The Reasons to Choose (+) and Cautions (-) (where the method is most and least effective)

Applying the Data Analytics Lifecycle

[Figure: Data Analytics Lifecycle diagram with the phases Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize]

• In a typical Data Analytics problem, you would have gone through:
    • Phase 1 – Discovery – have the problem framed
    • Phase 2 – Data Preparation – have the data prepared
• Now you need to plan the model and determine the method to be used.

Here we recall the phases of the analytic lifecycle that we would have gone through before planning the analytic method to use with the data.

Phase 3 – Model Planning

[Figure: Data Analytics Lifecycle diagram with the phases Discovery, Data Prep, Model Planning, Model Building, Communicate Results, Operationalize. Callouts: “Do I have a good idea about the type of model to try? Can I refine the analytic plan?” and “Is the model robust enough? Have we failed for sure?”]

• How do people generally solve this problem with the kind of data and resources I have?
• Does that work well enough? Or do I have to come up with something new?
• What are related or analogous problems? How are they solved? Can I do that?


Model planning is the process of determining the appropriate analytic method based on the

problem. It also depends on the type of data and the computational resources available.

What Kind of Problem do I Need to Solve? How do I Solve it?

The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, Document representation (Bag of Words), TF-IDF

This table lists the typical business questions (column 1) addressed by each category of techniques or analytical methods (column 2).

Some typical business questions for the different categories of techniques are listed below:

Clustering: How do I group these documents by topic? How do I group these images by similarity? (More businesslike questions)
Association Rules: What do other people like this person tend to like/buy/watch?
Regression: I want to predict the lifetime value of this customer. I want to predict the probability that this loan will default.
Classification: Where in the catalog should I place this product? Is this email spam?
Time Series Analysis: What is the likely future price of this stock? What will my sales volume be next month?
Text Analysis: Is this a positive product review or a negative one?

As can be observed, these categories of techniques overlap in the types of problems they can be used to solve.

Questions such as “How do I group these documents?”, “Is this email spam?”, and “Is this a positive product review?” can all be answered with a “classification”. But these questions can also be considered text analysis problems, which we cover in this module. Text analysis is a term for the specific process of representing, manipulating, and predicting or learning over text. The tasks themselves can often be classified as clustering or classification.

Similarly, more than one method can be used to solve the same problem. For example, Time Series Analysis can be used to predict prices over time. Time series is used in cases where the past is observable to the participants, which is often true of stocks and real estate. Sometimes we can use regression methods as well. However, regression is most effective when assigning effects to complicated patterns of treatment.

Column 3 in the table above lists the specific analytical methods that are detailed in the subsequent lessons in this module.

Why These Example Techniques?

• Most popular, frequently used:
    Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand & comprehend
• Applicable to a broad range of problems in several verticals

In this module we present K-means clustering; the Apriori algorithm for Association Rules; linear and logistic regression; classification with the Naïve Bayesian method and Decision Trees; Time Series Analysis with Box-Jenkins ARIMA modeling; and Text Analysis with regular expressions, “bag of words” document representation, and key concepts such as TF-IDF. These techniques were chosen from among the many available to Data Scientists for solving analytic problems. The reasons for choosing them are listed on this slide.

Module 4: Advanced Analytics – Theory and Methods

Lesson 1: K-means Clustering

During this lesson the following topics are covered:

• Clustering – Unsupervised learning method
• K-means clustering:
    • Use cases
    • The algorithm
    • Determining the optimum value for K
    • Diagnostics to evaluate the effectiveness of the method
    • Reasons to Choose (+) and Cautions (-) of the method

This lesson covers K-means clustering with these topics.

Clustering

How do I group these documents by topic?

How do I group my customers by purchase patterns?

• Sort items into groups by similarity:
    Items in a cluster are more similar to each other than they are to items in other clusters.
    Need to detail the properties that characterize “similarity”
    Or of distance, the “inverse” of similarity
• Not a predictive method; finds similarities, relationships
• Our Example: K-means Clustering

In machine learning, “unsupervised” refers to the problem of finding hidden structure within unlabeled data. In this lesson and the following lesson we discuss two unsupervised learning methods: clustering and Association Rules.

Clustering is a popular method used to form homogeneous groups within a data set based on their internal structure. It is often used for exploratory analysis of the data. Clustering makes no “predictions” of any values; it only finds similarities between data points and groups them into clusters.

The notion of similarity can be explained with the following examples. Consider questions such as:

1. How do I group these documents by topic?
2. How do I perform customer segmentation to allow for targeted or special marketing programs?

The definition of “similarity” is specific to the problem domain. Here, similar data points are those with the same “topic” tag, or customers who can be profiled into the same “age group/income/gender” or “purchase pattern”.

If we have a vector of measurements of the attributes of the data, the data points grouped into a cluster will have measurement values closer to each other than to those of data points grouped in a different cluster. In other words, the distance (an inverse of similarity) between points within a cluster is generally lower than the distance between points in different clusters. Each cluster ends up as a tight (homogeneous) group of data points that are far apart from the data points in other clusters.

There are many clustering techniques. In this lesson we discuss one of the most popular clustering methods, known as “K-means clustering”.

K-Means Clustering – What is it?

• Used for clustering numerical data, usually a set of measurements about objects of interest.
• Input: numerical. There must be a distance metric defined over the variable space.
    Euclidean distance
• Output: The centers of each discovered cluster, and the assignment of each input datum to a cluster.
    Centroid

K-means clustering is used to cluster numerical data.

In K-means we define two measures of distance: the distance between two data points (records) and the distance between two clusters. Distance can be measured (calculated) in a number of ways, but four principles tend to hold true:

1. Distance is not negative (it is stated as an absolute value).
2. The distance from one record to itself is zero.
3. The distance from record I to record J is the same as the distance from record J to record I. Again, since the distance is stated as an absolute value, the start and end points can be reversed.
4. The distance between two records cannot be greater than the sum of the distances between each record and a third record.

Euclidean distance is the most popular method for calculating distance. It is the “ordinary” straight-line distance between two points that one could measure with a ruler. In a single dimension, the Euclidean distance is the absolute value of the difference between two points. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.

In N dimensions, the Euclidean distance between two points p and q is $\sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$, where $p_i$ (or $q_i$) is the coordinate of p (or q) in dimension i.
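The N-dimensional formula and the four distance principles can be checked with a few lines of code. The following is a minimal Python sketch; the function name `euclidean` and the three sample records are assumptions made for illustration and are not part of the course material.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Three hypothetical records with two measurements each.
a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

d_ab = euclidean(a, b)  # sides 3 and 4 give a distance of 5
d_ba = euclidean(b, a)
d_aa = euclidean(a, a)
d_ac = euclidean(a, c)
d_bc = euclidean(b, c)

print(d_ab)                 # 5.0
print(d_ab >= 0)            # principle 1: distance is not negative
print(d_aa == 0.0)          # principle 2: distance to itself is zero
print(d_ab == d_ba)         # principle 3: symmetric
print(d_ac <= d_ab + d_bc)  # principle 4: triangle inequality
```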

Though there are many other distance measures, Euclidean distance is the most commonly used, and many packages use it.

Euclidean distance has some limitations. First, it is influenced by the scale of the variables: changing the scale (for example, from feet to inches) can significantly influence the results. Second, the equation ignores the relationship between variables. Lastly, the clustering algorithm is sensitive to outliers. If the data has outliers and they cannot be removed, the results of the clustering can be substantially distorted.
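The effect of scale can be seen on a very small example. The sketch below uses hypothetical records (height and weight values invented for illustration) and shows that converting one variable from feet to inches changes which record is nearest to a query record, even though the underlying data is the same.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def nearest(query, candidates):
    """Return the name of the candidate record closest to the query."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))

def to_inches(record):
    """Convert the height from feet to inches; the weight is left unchanged."""
    height_ft, weight_kg = record
    return (height_ft * 12.0, weight_kg)

# Hypothetical records: (height in feet, weight in kilograms).
query_ft = (5.0, 60.0)
people_ft = {"r1": (5.5, 61.0), "r2": (5.0, 62.0)}

query_in = to_inches(query_ft)
people_in = {name: to_inches(rec) for name, rec in people_ft.items()}

nearest_in_feet = nearest(query_ft, people_ft)
nearest_in_inches = nearest(query_in, people_in)
print(nearest_in_feet, nearest_in_inches)  # r1 r2
```

With height in feet the weight difference dominates the distance. With height in inches the height difference dominates, so the nearest record changes.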

The centroid is the center of a discovered cluster. K-means clustering provides this as an output. When the number of clusters is fixed to k, K-means clustering can be formally defined as an optimization problem: find the k cluster centers and assign each object to the nearest cluster center, such that the sum of squared distances from each object to its cluster center is minimized.
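The optimization problem described above can be written compactly. The notation here is introduced only for illustration and is not from the slides: $c_1, \dots, c_k$ are the cluster centers and $S_i$ is the set of objects assigned to center $c_i$.

```latex
\min_{c_1,\dots,c_k,\; S_1,\dots,S_k} \;\; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - c_i \rVert^2
```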

Use Cases

• Often an exploratory technique:
    Discover structure in the data
    Summarize the properties of each cluster
• Sometimes a prelude to classification:
    “Discovering the classes”
• Examples
    The height, weight and average lifespan of animals
    Household income, yearly purchase amount in dollars, number of household members of customer households
    Patient record with measures of BMI, HBA1C, HDL

K-means clustering is often used as a lead-in to classification. It is primarily an exploratory technique to discover structure in the data that you might not have noticed before, and a prelude to more focused analysis or decision processes.

Some examples of the sets of measurements on which clustering can be performed are detailed in the slide.

For patient records with measures such as BMI, HBA1C, and HDL, we could cluster patients into groups that define varying degrees of risk of heart disease.

In classification the labels are known, whereas in clustering they are not. Hence clustering can be used to determine the structure in the data and summarize the properties of each cluster in terms of the measured centroid for the group. The clusters can define what the initial classes could be.

In low dimensions we can visualize the clusters. Visualization gets very hard as the number of dimensions increases.

There are many applications of K-means clustering; examples include pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision.

In principle, if you have several objects, each with several attributes, and you want to group the objects based on those attributes, you can apply this algorithm. For Data Scientists, K-means is an excellent tool for understanding the structure of data and validating some of the assumptions provided by domain experts about the data. We will look into a specific use case in the following slide.

Use-Case Example – On-line Retailer

LTV – Lifetime Customer Value

Here we present a fabricated example of an on-line retailer. The unique selling point of this retailer is that they make returns simple, on the assumption that this policy encourages use and that “frequent customers are more valuable”. Let us validate this assumption.

We took a sample set of customers and clustered them on purchase frequency, return rate, and lifetime customer value (LTV).

We define purchase frequency as the average number of visits per month in which a customer made a shopping cart transaction.

We can easily see that return rate has an important effect on customer value.

We clustered the customers into 4 groups and then plotted 3 graphs, each taking two of the attributes. The data points are represented in the graphs by a different color for each cluster, and the larger “dot” represents the centroid of each group.

The groups can be defined broadly as follows:

GP1: Visit less frequently, low return rate, moderate LTV (ranked 3rd)
GP2: Visit often, return a lot of their purchases. Lowest avg LTV (counterintuitive)
GP3: Visit often, return things moderately, high LTV (ranked 2nd) (happy medium)
GP4: Visit rarely, don’t return purchases. Highest avg LTV

It appears that GP3 is the ideal group: they visit often, return things moderately, and are high value. The next questions are:

– Why is it that GP3 is ideal?
– What are the people in these different groups buying?
– Is that affecting LTV?
– Can we raise the LTV of our frequent customers, perhaps by lowering the cost of returns, or by somehow discouraging customers who return goods too frequently?
– Can we encourage GP4 customers to visit more (without lowering their LTV)?
– Are more frequent customers more valuable?

You can see the range of questions that a Data Scientist can address with an initial analysis using K-means clustering.

The Algorithm

1. Choose K; then select K random “centroids”
    In our example, K=3
2. Assign records to the cluster with the closest centroid

Step 1 – K-means clustering begins with the data set segmented into K clusters.

Step 2 – Observations are moved from cluster to cluster to reduce the distance from each observation to its cluster centroid.

The Algorithm (Continued)

3. Recalculate the resulting centroids
    Centroid: the mean value of all the records in the cluster
4. Repeat steps 2 & 3 until record assignments no longer change

Model Output:
• The final cluster centers
• The final cluster assignments of the training data

Step 3 – When observations are moved to a new cluster, the centroid for the affected clusters

needs to be recalculated.

Step 4 – This movement and recalculation is repeated until movement no longer results in an

improvement.

The model output is the final cluster centers and the final cluster assignments for the data.
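The four steps can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the course labs: the function names, the choice of K existing records as the initial centroids, the `seed` and `max_iter` parameters, and the sample data are all assumptions made for this example.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def mean_point(points):
    """Centroid: the coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K distinct records as the initial "centroids".
    centroids = rng.sample(records, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid.
        new_assignments = [
            min(range(k), key=lambda i: euclidean(rec, centroids[i]))
            for rec in records
        ]
        # Step 4: stop once the record assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its records.
        for i in range(k):
            members = [r for r, a in zip(records, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean_point(members)
    return centroids, assignments

# Hypothetical two-dimensional data with two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(centers)  # the final cluster centers
print(labels)   # the final cluster assignment of each record
```

Because the starting centroids are random, different seeds can give different results on less clearly separated data. In practice the algorithm is usually run several times and the best result is kept.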

Selecting the appropriate number of clusters, K, can be done upfront if you have some knowledge of what the right number may be. Alternatively, you can try the exercise with different values of K and decide which clusters best suit your needs. Since the appropriate number of clusters in a dataset is rarely known, it is good practice to select a few values for K and compare the results.

The first partitioning should be done with the same knowledge used to select the appropriate value of K, for example domain knowledge about the market or industries. If K was selected without external knowledge, the partitioning can be done without any inputs.

Once all observations are assigned to their closest cluster, the clusters can be evaluated for their “in-cluster dispersion.” Clusters with the smallest average distance are the most homogeneous. We can also examine the distance between clusters and decide whether it makes sense to combine clusters that are located close together. The distance between clusters also indicates how successful the clustering exercise has been: ideally, the clusters should be well separated rather than located close together.
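Both checks are easy to compute once the clusters are known. The sketch below takes “in-cluster dispersion” to mean the average distance from each point to its cluster centroid, which is one reasonable reading of the text. The cluster names and data are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def dispersion(points):
    """Average distance from each point in a cluster to the cluster centroid."""
    c = centroid(points)
    return sum(euclidean(p, c) for p in points) / len(points)

# Hypothetical clusters produced by an earlier K-means run.
clusters = {
    "A": [(1.0, 1.0), (1.0, 3.0)],
    "B": [(10.0, 10.0), (10.0, 10.5), (10.5, 10.0), (10.5, 10.5)],
}

for name, pts in clusters.items():
    print(name, round(dispersion(pts), 3))

# The cluster with the smallest average distance is the most homogeneous.
tightest = min(clusters, key=lambda name: dispersion(clusters[name]))

# Distance between the two centroids: large values mean well-separated clusters.
separation = euclidean(centroid(clusters["A"]), centroid(clusters["B"]))
print(tightest, round(separation, 3))
```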

Picking K

Heuristic: find the “elbow” of the within-sum-of-squares (wss) plot as a function of K.

K: # of clusters
n_i: # of points in the i-th cluster
c_i: centroid of the i-th cluster
x_ij: j-th point of the i-th cluster

“Elbows” at k=2,4,6
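With the symbols defined above, the within-sum-of-squares is conventionally written as follows. This is a reconstruction from the symbol definitions, assuming squared Euclidean distance, rather than a copy of the slide's formula.

```latex
WSS = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2
```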

In practice, a value for K is picked based on domain knowledge and the centroids are computed. Then a different K is chosen and the modeling is repeated to observe whether it enhances the cohesiveness of the data points within each cluster. If there is no apparent structure in the data, we may have to try multiple values for K. It is an exploratory process.

We present here one of the heuristic approaches used for picking the optimal K for a given dataset. “Within Sum of Squares” (WSS) is a measure of how tight, on average, each cluster is. For k=1, WSS can be considered the overall dispersion of the data. WSS is primarily a measure of homogeneity. In general, more clusters result in tighter clusters, but having too many clusters is over-fitting. The formula that defines WSS is shown. The graph depicts the value of WSS on the Y-axis and the number of clusters on the X-axis. The graph was generated from the online retailer example data we reviewed earlier. We repeated the clustering for 12 different values of K. Going from one cluster to two produces a significant drop in WSS, since two clusters give more homogeneity. We look for the elbow of the curve, which provides the optimal number of clusters for the given data.

Visualizing the data helps confirm the optimal number of clusters. Reviewing the three pair-wise graphs we plotted earlier for the online retailer example, you can see that four groups sufficiently explain the data, and the graph above also shows an elbow of the curve at 4.
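The WSS heuristic can be tried on a small data set. The sketch below is an illustration under stated assumptions: it uses a compact K-means routine written for this example (not the course's R code), squared Euclidean distance, and invented sample data with three visible groups. It computes WSS for K = 1 to 5 so the values can be plotted or inspected for an elbow.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def kmeans(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(records, k)  # K records as starting centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: sq_dist(r, centroids[i]))
                      for r in records]
        if new_labels == labels:        # assignments stable: converged
            break
        labels = new_labels
        for i in range(k):
            members = [r for r, a in zip(records, labels) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def wss(records, centroids, labels):
    """Sum of squared distances from each record to its cluster centroid."""
    return sum(sq_dist(r, centroids[a]) for r, a in zip(records, labels))

# Hypothetical data with three visible groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.9, 5.3),
        (9.0, 1.0), (9.1, 1.2), (8.8, 0.9)]

wss_by_k = {}
for k in range(1, 6):
    centroids, labels = kmeans(data, k)
    wss_by_k[k] = wss(data, centroids, labels)

for k, value in wss_by_k.items():
    print(k, round(value, 3))
```

For K=1 the value is the overall dispersion of the data. The elbow is the value of K after which further increases reduce WSS only slightly.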

Diagnostics – Evaluating the Model

• Do the clusters look separated in at least some of the plots when you do pair-wise plots of the clusters?
    Pair-wise plots can be used when there are not many variables
• Do you have any clusters with few data points?
    Try decreasing the value of K
• Are there splits on variables that you would expect, but don’t see?
    Try increasing the value of K
• Do any of the centroids seem too close to each other?
    Try decreasing the value of K

How do we know that we have good clusters?

Pair-wise plots of the clusters provide a good visual confirmation that the clusters are homogeneous. When the number of dimensions in the data is not too large, this method helps in determining the optimal number of clusters. With these plots you should be able to determine whether the clusters look separated in at least some of the plots. They won’t be well separated in all of the plots. This can be seen even in the on-line retailer example we saw earlier: some of the clusters get mixed together in some dimensions.

If your clusters are too small, it indicates that K is too large and needs to be reduced (try a smaller K). Outliers in the data may be what tends to form clusters with few data points.

Alternatively, if there are splits that you expected but do not see in the clusters (for example, you expect two different income groups and don’t see them), you should try a bigger value for K.

If the centroids seem too close to each other, you should try decreasing the value of K.
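Two of these diagnostics, small clusters and centroids that sit close together, can be checked numerically as well as visually. The sketch below works on the hypothetical output of a K-means run. The labels, centroids, and the `min_size` threshold are invented for illustration; what counts as “too small” or “too close” depends on the data.

```python
import math
from collections import Counter
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Hypothetical output of a K-means run with K=4.
labels = [0, 0, 0, 0, 1, 1, 1, 2, 3, 3, 3]
centroids = {0: (1.0, 1.0), 1: (5.0, 5.0), 2: (20.0, 1.0), 3: (5.5, 5.2)}

# Clusters with few data points may be outliers; consider a smaller K.
sizes = Counter(labels)
min_size = 2  # illustrative threshold, an assumption
small_clusters = sorted(c for c, n in sizes.items() if n < min_size)

# Centroids that are very close may belong to one cluster; consider a smaller K.
pair_distance = {
    (a, b): euclidean(centroids[a], centroids[b])
    for a, b in combinations(sorted(centroids), 2)
}
closest_pair = min(pair_distance, key=pair_distance.get)

print(small_clusters)  # [2]
print(closest_pair, round(pair_distance[closest_pair], 3))
```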

K-Means Clustering – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
• Easy to implement
• Easy to assign new data to existing clusters
    Which is the nearest cluster center?
• Concise output
    Coordinates of the K cluster centers

Cautions (-)
• Doesn’t handle categorical variables
• Sensitive to initialization (first guess)
• Variables should all be measured on similar or compatible scales
    Not scale-invariant!
• K (the number of clusters) must be known or decided a priori
    Wrong guess: possibly poor results
• Tends to produce “round” equi-sized clusters.
    Not always desirable

K-means clustering is easy to implement and produces concise output. It is easy to assign new data to the existing clusters by determining which centroid the new data point is closest to.
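Assigning new data needs only the stored centroids. A minimal sketch follows, with hypothetical centroids and new points; the function name `assign_to_cluster` is an assumption made for this example.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def assign_to_cluster(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))

# Hypothetical centroids from an earlier K-means run (K=3).
centroids = [(1.0, 1.0), (8.0, 8.0), (1.0, 9.0)]

# New records that were not part of the training data.
new_points = [(2.0, 1.5), (7.0, 9.0), (0.5, 7.0)]
new_labels = [assign_to_cluster(p, centroids) for p in new_points]
print(new_labels)  # [0, 1, 2]
```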

However, K-means works only on numerical data and does not handle categorical variables. It is sensitive to the initial guess of the centroids. It is important that the variables all be measured on similar or compatible scales. If you measure the living space of a house in square feet and the cost of the house in thousands of dollars (that is, 1 unit is $1000), and then you change the cost of the house to dollars (so 1 unit is $1), the clusters may change. K should be decided ahead of the modeling process. Wrong guesses for K may lead to improper clustering.

K-means tends to produce rounded, equal-sized clusters. If you have clusters that are elongated or crescent shaped, K-means may not be able to find them appropriately. In this case the data may have to be transformed before modeling.

Check Your Knowledge

Your Thoughts?

1. Why do we consider K-means clustering an unsupervised machine learning algorithm?
2. How do you use “pair-wise” plots to evaluate the effectiveness of the clustering?
3. Detail the four steps in the K-means clustering algorithm.
4. How do we use WSS to pick the value of K?
5. What is the most common measure of distance used with K-means clustering algorithms?
6. The attributes of a data set are “purchase decision (Yes/No), Gender (M/F), income group (50K)”. Can you use K-means to cluster this data set?

Record your answers here.

Module 4: Advanced Analytics – Theory and Methods

Lesson 1: K-means Clustering – Summary

During this lesson the following topics were covered:

• Clustering – Unsupervised learning method

• What is K-means clustering

• Use cases with K-means clustering

• The K-means clustering algorithm

• Determining the optimum value for K

• Diagnostics to evaluate the effectiveness of K-means clustering

• Reasons to Choose (+) and Cautions (-) of K-means clustering

The key topics presented in this lesson are listed. Take a moment to review them.

Copyright © 2014 EMC Corporation. All rights reserved.

Module 4: Analytics Theory/Methods

19

Purchase answer to see full

attachment

Copyright © 2014 EMC Corporation. All Rights Reserved.

Copyright © 2014 EMC Corporation. All rights reserved.

Module 4: Analytics Theory/Methods

1

Module 4: Analytics Theory/Methods

1

Advanced Analytics – Theory and Methods

Upon completion of this module, you should be able to:

• Examine analytic needs and select an appropriate technique based on

business objectives; initial hypotheses; and the data’s structure and volume

• Apply some of the more commonly used methods in Analytics solutions

• Explain the algorithms and the technical foundations for the commonly used

methods

• Explain the environment (use case) in which each technique can provide the

most value

• Use appropriate diagnostic methods to validate the models created

• Use R and in-database analytical functions to fit, score and evaluate models

Copyright © 2014 EMC Corporation. All Rights Reserved.

Module 4: Analytics Theory/Methods

2

The objectives of this module are listed. The Analytical methods covered are:

Categorization (un-supervised) :

1.K-means clustering

2. Association Rules

Regression

3. Linear

4. Logistic

Classification (supervised)

5.Naïve Bayesian classifier

6. Decision Trees

7. Time Series Analysis

8. Text Analysis

Copyright © 2014 EMC Corporation. All rights reserved.

Module 4: Analytics Theory/Methods

2

Where “R” we?

• In Module 3 we reviewed R skills and basic statistics

• You can use R to:

Generate summary statistics to investigate a data set

Visualize Data

Perform statistical tests to analyze data and evaluate models

• Now that you have data, and you can see it, you need to plan the

analytic model and determine the analytic method to be used

Copyright © 2014 EMC Corporation. All Rights Reserved.

Module 4: Analytics Theory/Methods

3

Module 4 focuses on the most commonly used analytic methods, detailing:

a) Prominent use cases for the method

b) Algorithms to implement the method

c) Diagnostics that are most commonly used to evaluate the effectiveness of the method

d) The Reasons to Choose (+) and Cautions (-) (where the method is most and least effective)

Copyright © 2014 EMC Corporation. All rights reserved.

Module 4: Analytics Theory/Methods

3

Applying the Data Analytics Lifecycle

Discovery

Operationalize

Data Prep

• In a typical Data Analytics Problem – you would have gone

•

through:

Communicate

Model

Results

Planning

• Phase

1 – Discovery – have the problem framed

• Phase 2 – Data Preparation – have the data prepared

Model and determine the method to

Now you need to plan the model

Building

be used.

Copyright © 2014 EMC Corporation. All Rights Reserved.

Module 4: Analytics Theory/Methods

4

Here we recall phases of analytic life cycle we would have gone through before we plan for the

analytic method we should be using with the data.

Copyright © 2014 EMC Corporation. All rights reserved.

Module 4: Analytics Theory/Methods

4

Phase 3 – Model Planning

Discovery

How Operationalize

do people generally solve this

problem with the kind of data and

resources I have?

Communicate

• Does

that work well enough? Or do I have

Results

to come

up with something new?

• What are related or analogous problems?

Model

How are they solved? Can I do that?

Building

Is the model robust

enough? Have we

failed for sure?

Copyright © 2014 EMC Corporation. All Rights Reserved.

Data Prep

Model

Planning

Do I have a good idea

about the type of model

to try? Can I refine the

analytic plan?

Module 4: Analytics Theory/Methods

5

Model planning is the process of determining the appropriate analytic method based on the

problem. It also depends on the type of data and the computational resources available.

Copyright © 2014 EMC Corporation. All rights reserved.

Module 4: Analytics Theory/Methods

5

What Kind of Problem do I Need to Solve? How do I Solve it?

The Problem to Solve | The Category of Techniques | Covered in this Course
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, Document representation (Bag of Words), TFIDF


This table lists the typical business questions (column 1) addressed by each category of techniques or analytical methods (column 2).

Some typical business questions for each category of techniques are listed below:

Clustering: How do I group these documents by topic? How do I group these images by similarity? (More businesslike questions)
Association Rules: What do other people like this person tend to like/buy/watch?
Regression: I want to predict the lifetime value of this customer. I want to predict the probability that this loan will default.
Classification: Where in the catalog should I place this product? Is this email spam?
Time Series Analysis: What is the likely future price of this stock? What will my sales volume be next month?
Text Analysis: Is this a positive product review or a negative one?

As can be observed, these categories of techniques overlap in the types of problems they can be used to solve.

Questions such as “How do I group these documents?”, “Is this email spam?”, and “Is this a positive product review?” can all be answered with a clustering or classification method. But they can also be considered text analysis problems, which we cover in this module. Text analysis is the term for the specific process of representing, manipulating, and predicting or learning over text; the underlying tasks can often be classified as clustering or classification.

Similarly, more than one method can be used to solve the same problem. For example, Time Series Analysis can be used to predict prices over time. Time series methods are used in cases where the past is observable to the participants, which is often true of stocks and real estate. Sometimes we can use regression methods as well. However, regression is most effective when assigning effects to complicated patterns of treatment.

Column 3 in the table above lists the specific analytical methods that are detailed in the subsequent

lessons in this module.


Why These Example Techniques?

• Most popular, frequently used:
    Provide the foundation for Data Science skills on which to build
• Relatively easy for new Data Scientists to understand & comprehend
• Applicable to a broad range of problems in several verticals


In this module we present K-means clustering, the Apriori algorithm for association rules, linear and logistic regression, classification with the Naïve Bayesian classifier and decision trees, time series analysis with Box-Jenkins ARIMA modeling, and text analysis concepts such as regular expressions, document representation with “bag of words”, and TF-IDF. These were chosen from among the many techniques available to Data Scientists for solving analytic problems. The reasons for choosing them are listed on this slide.


Module 4: Advanced Analytics – Theory and Methods

Lesson 1: K-means Clustering

During this lesson the following topics are covered:

• Clustering – Unsupervised learning method
• K-means clustering:
    • Use cases
    • The algorithm
    • Determining the optimum value for K
    • Diagnostics to evaluate the effectiveness of the method
    • Reasons to Choose (+) and Cautions (-) of the method


This lesson covers K-means clustering with these topics.


Clustering

How do I group these documents by topic?

How do I group my customers by purchase patterns?

• Sort items into groups by similarity:

Items in a cluster are more similar to each other than they are to

items in other clusters.

Need to detail the properties that characterize “similarity”

Or of distance, the “inverse” of similarity

• Not a predictive method; finds similarities, relationships

• Our Example: K-means Clustering


In machine learning, “unsupervised” refers to the problem of finding hidden structure within unlabeled data. In this lesson and the following lesson we discuss two unsupervised learning methods: clustering and Association Rules.

Clustering is a popular method used to form homogeneous groups within a data set based on their internal structure. It is often used for exploratory analysis of the data. Clustering does not “predict” any values; it only finds similarities among the data points and groups them into clusters.

The notion of similarity can be explained with the following examples. Consider questions such as:

1. How do I group these documents by topic?
2. How do I perform customer segmentation to allow for targeted or special marketing programs?

The definition of “similarity” is specific to the problem domain. Here, similar data points are documents with the same “topic” tag, or customers who can be profiled into the same “age group/income/gender” or “purchase pattern”.

If we have a vector of measurements for each data point, the data points grouped into a cluster will have measurement values closer to each other than to those of data points grouped in a different cluster. In other words, the distance (an inverse of similarity) between points within a cluster tends to be smaller than the distance to points in a different cluster. In a cluster we end up with a tight (homogeneous) group of data points that are far apart from the data points that end up in a different cluster.

There are many clustering techniques. In this lesson we discuss one of the most popular clustering methods, known as “K-means clustering”.


K-Means Clustering – What is it?

• Used for clustering numerical data, usually a set of measurements about objects of interest.
• Input: numerical. There must be a distance metric defined over the variable space.
    Euclidean distance
• Output: The centers of each discovered cluster, and the assignment of each input datum to a cluster.
    Centroid


K-means clustering is used to cluster numerical data.

In K-means we work with two kinds of distance: the distance between two data points (records) and the distance between two clusters. Distance can be measured (calculated) in a number of ways, but four principles tend to hold true:

1. Distance is not negative (it is stated as an absolute value).
2. The distance from one record to itself is zero.
3. The distance from record I to record J is the same as the distance from record J to record I; since the distance is stated as an absolute value, the start and end points can be reversed.
4. The distance between two records cannot be greater than the sum of the distances between each record and a third record.

Euclidean distance is the most popular method for calculating distance. It is the “ordinary” straight-line distance between two points that one could measure with a ruler. In a single dimension, the Euclidean distance is the absolute value of the difference between two points. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.

In N dimensions, the Euclidean distance between two points p and q is $\sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$, where $p_i$ (or $q_i$) is the coordinate of p (or q) in dimension i.
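The N-dimensional formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the course material: the function name `euclidean` and the sample points are assumptions chosen for the example. The assertions at the end check the four distance principles listed in these notes on three sample points.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Illustrative points (assumed values, not from the course data)
a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

print(euclidean(a, b))            # 5.0 (a 3-4-5 right triangle)
print(euclidean((2.0,), (7.5,)))  # 5.5 (one dimension: |2 - 7.5|)

# The four distance principles, checked on these points
assert euclidean(a, b) >= 0                                   # 1. non-negative
assert euclidean(a, a) == 0                                   # 2. zero to itself
assert euclidean(a, b) == euclidean(b, a)                     # 3. symmetric
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)   # 4. triangle rule
```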

Though there are many other distance measures, Euclidean distance is the most commonly used, and many packages use this measure.

Euclidean distance has some limitations. First, it is influenced by the scale of the variables: changing the scale (for example, from feet to inches) can significantly influence the results. Second, the equation ignores the relationship between variables. Lastly, the clustering algorithm is sensitive to outliers. If the data has outliers and removing them is not possible, the results of the clustering can be substantially distorted.

The centroid is the center of a discovered cluster. K-means clustering provides the centroids as an output. When the number of clusters is fixed to k, K-means clustering can be formally defined as an optimization problem: find the k cluster centers and assign each object to the nearest cluster center, such that the sum of the squared distances from the objects to their cluster centers is minimized.


Use Cases

• Often an exploratory technique:

Discover structure in the data

Summarize the properties of each cluster

• Sometimes a prelude to classification:

“Discovering the classes“

• Examples

The height, weight and average lifespan of animals

Household income, yearly purchase amount in dollars, number of

household members of customer households

Patient record with measures of BMI, HBA1C, HDL


K-means clustering is often used as a lead-in to classification. It is primarily an exploratory technique to discover structure in the data that you might not have noticed before, and a prelude to more focused analysis or decision processes.

Some examples of the set of measurements based on which clustering can be

performed are detailed in the slide.

In the patient record example, we have measures such as BMI, HBA1C, and HDL with which we could cluster patients into groups that define varying degrees of risk of heart disease.

In classification the labels are known, whereas in clustering the labels are not known. Hence clustering can be used to determine the structure in the data and summarize the properties of each cluster in terms of the measured centroids for the group. The clusters can define what the initial classes could be.

In low dimensions we can visualize the clusters. It gets very hard to visualize as the

dimensions increase.

There are many applications of K-means clustering; examples include pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision.

In principle, if you have several objects, each with several attributes, and you want to group the objects based on those attributes, then you can apply this algorithm. For Data Scientists, K-means is an excellent tool to understand the structure of data and validate some of the assumptions that are provided by the domain experts pertaining to the data. We will look into a specific use case in the following slide.


Use-Case Example – On-line Retailer

LTV – Lifetime Customer Value


Here we present a fabricated example of an on-line retailer. The unique selling point of this retailer is that they make “returns” simple, on the assumption that this policy encourages use and that “frequent customers are more valuable”. So let us validate this assumption.

We took a sample set of customers clustered on purchase frequency, return rate, and lifetime customer value

(LTV).

We define purchase frequency as the number of visits a customer made in a month on average that had a

shopping cart transaction.

We can easily see that return rate has an important effect on customer value.

We clustered the customers into 4 groups and then plotted 3 graphs, each pairing two of the attributes. The data points in each cluster are shown in a different color, and the larger “dot” represents the centroid of the group.

The groups can be defined broadly as follows:

GP1: Visit less frequently, low return rate, moderate LTV(ranked 3rd)

GP2: Visit often, return a lot of their purchases. Lowest avg LTV (counter intuitive)

GP3: Visit often, return things moderately, High LTV (ranked 2nd) (happy medium)

GP4: Visit rarely, don’t return purchases. Highest avg LTV

It appears that GP3 is the ideal group – they visit often, return things moderately, and are high value. The next

questions are

– Why is it that GP3 is ideal?

– What are the people in these different groups buying?

– Is that affecting LTV?

– Can we raise the LTV of our frequent customers, perhaps by lowering the cost of returns, or by somehow

discouraging customers who return goods too frequently?

– Can we encourage GP4 customers to visit more (without lowering their LTV?)

– Are more frequent customers more valuable?

You can see the range of questions that a Data Scientist can address with the initial analysis with k-means

clustering.


The Algorithm

1. Choose K; then select K random “centroids”
    In our example, K=3
2. Assign records to the cluster with the closest centroid


Step 1 – K-means clustering begins with the data set segmented into K clusters.

Step 2 – Observations are moved from cluster to cluster to reduce the distance from each observation to its cluster centroid.


The Algorithm (Continued)

3. Recalculate the resulting centroids
    Centroid: the mean value of all the records in the cluster
4. Repeat steps 2 & 3 until record assignments no longer change

Model Output:
• The final cluster centers
• The final cluster assignments of the training data


Step 3 – When observations are moved to a new cluster, the centroids of the affected clusters need to be recalculated.

Step 4 – This movement and recalculation is repeated until movement no longer results in an improvement.

The model output is the final cluster centers and the final cluster assignments for the data.
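The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the course labs: the names `kmeans`, `mean_point`, `seed`, and `max_iter` are hypothetical, the initial centroids are drawn at random from the records, and an empty cluster simply keeps its previous centroid.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def mean_point(points):
    """Centroid: the per-dimension mean of the given records."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: choose K records at random as the initial centroids
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the closest centroid
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once record assignments no longer change
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its records
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return centroids, assignments

# Illustrative two-dimensional records (assumed values)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
          (1.0, 9.0), (1.5, 8.5), (2.0, 9.5)]

centroids, assignments = kmeans(points, k=3)
print(centroids)     # the final cluster centers
print(assignments)   # the cluster index assigned to each record
```

Because the starting centroids are random, different seeds can converge to different clusterings, so it is common to run the procedure several times and compare the results.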

Selecting the appropriate number of clusters, K, can be done upfront if you possess some knowledge of what the right number may be. Alternatively, you can try the exercise with different values for K and decide which clusters best suit your needs. Since it is rare that the appropriate number of clusters in a dataset is known, it is good practice to select a few values for K and compare the results.

The first partitioning should be done with the same knowledge used to select the appropriate value of K, for example domain knowledge about the market or industries. If K was selected without external knowledge, the initial partitioning can be done without any inputs, for example at random.

Once all observations are assigned to their closest cluster, the clusters can be evaluated for their “in-cluster dispersion.” Clusters with the smallest average in-cluster distance are the most homogeneous. We can also examine the distance between clusters and decide whether it makes sense to combine clusters that are located close together. We can also use the distance between clusters to assess how successful the clustering exercise has been. Ideally, the clusters should be well separated rather than located close together.
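Both checks can be sketched in Python. This is a minimal illustration with assumptions stated plainly: in-cluster dispersion is taken here to mean the average distance from a cluster's records to its centroid, distance between clusters is taken as the distance between centroids, and the records, labels, and centroids are made-up values standing in for a finished clustering.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def in_cluster_dispersion(points, labels, centroids):
    """Average distance from each cluster's records to its centroid."""
    totals = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p, c in zip(points, labels):
        totals[c] += euclidean(p, centroids[c])
        counts[c] += 1
    return [t / n if n else float("nan") for t, n in zip(totals, counts)]

def centroid_separation(centroids):
    """Distance between every pair of centroids, keyed by (i, j)."""
    return {(i, j): euclidean(centroids[i], centroids[j])
            for i, j in combinations(range(len(centroids)), 2)}

# Hypothetical clustering result: records, cluster labels, centroids
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (9.0, 8.0)]

dispersion = in_cluster_dispersion(points, labels, centroids)
separation = centroid_separation(centroids)
print(dispersion)   # [1.0, 1.0]
print(separation)   # {(0, 1): 10.0}
```

In this made-up result the clusters are tight (average distance 1.0) relative to the distance between their centers (10.0), which is the well-separated pattern described above.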


Picking K

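Heuristic: find the “elbow” of the within-sum-of-squares (wss) plot as a function of K.

$$\mathrm{WSS} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \lVert x_{ij} - c_i \rVert^2$$

K: # of clusters
$n_i$: # points in ith cluster
$c_i$: centroid of ith cluster
$x_{ij}$: jth point of ith cluster

[Plot: WSS versus number of clusters K; “Elbows” at k=2,4,6]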


In practice, a value for K is picked based on domain knowledge and the centroids are computed. Then a different K is chosen and the clustering is repeated to see whether it improves the cohesiveness of the data points within each cluster. If there is no apparent structure in the data, we may have to try multiple values for K. It is an exploratory process.

We present here one of the heuristic approaches used for picking the optimal K for a given dataset. “Within Sum of Squares” (WSS) is a measure of how tight, on average, each cluster is. For k=1, WSS can be considered the overall dispersion of the data. WSS is primarily a measure of homogeneity. In general, more clusters result in tighter clusters, but having too many clusters is over-fitting. The formula that defines WSS is shown. The graph depicts the value of WSS on the Y-axis and the number of clusters on the X-axis. The graph shown here was generated from the online retailer example data we reviewed earlier. We repeated the clustering for 12 different values of K. When we went from one cluster to two there was a significant drop in the value of WSS, since with two clusters you get more homogeneity. We look for the elbow of the curve, which provides the optimal number of clusters for the given data.

Visualizing the data helps in confirming the optimal number of clusters. Reviewing the three pair-wise graphs we plotted for the online retailer example earlier, you can see that four groups sufficiently explain the data, and from the graph above we can also see an elbow of the curve at 4.
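The WSS curve can be computed with a short Python sketch. It is illustrative only: the data points are made up (the retailer data is not reproduced here), the compact `kmeans` helper and the `restarts` parameter are assumptions, and taking the lowest WSS over several random starts is one common way to reduce the effect of initialization rather than something the course prescribes.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean_point(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, seed, max_iter=100):
    """Basic K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, labels

def wss(points, centroids, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

def wss_curve(points, k_values, restarts=10):
    """Lowest WSS found for each K over several random initializations."""
    curve = {}
    for k in k_values:
        curve[k] = min(wss(points, *kmeans(points, k, seed))
                       for seed in range(restarts))
    return curve

# Illustrative records (assumed values): three loose groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
          (1.0, 9.0), (1.5, 8.5), (2.0, 9.5)]

curve = wss_curve(points, range(1, 7))
for k, value in curve.items():
    print(k, round(value, 2))   # look for the K where the drop levels off
```

Plotting these values with K on the X-axis and WSS on the Y-axis gives the kind of curve described in the notes; the elbow is the point after which adding clusters yields only small reductions in WSS.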


Diagnostics – Evaluating the Model

• Do the clusters look separated in at least some of the plots when

you do pair-wise plots of the clusters?

Pair-wise plots can be used when there are not many variables

• Do you have any clusters with few data points?

Try decreasing the value of K

• Are there splits on variables that you would expect, but don’t

see?

Try increasing the value of K

• Do any of the centroids seem too close to each other?

Try decreasing the value of K


How do we know that we have good clusters?

Pair-wise plots of the clusters provide a good visual confirmation that the clusters are homogeneous. When the data does not have many dimensions, this method helps in determining the optimal number of clusters. With these plots you should be able to determine whether the clusters look separated in at least some of the plots. They won’t be well separated in all of the plots. This can be seen even with the on-line retailer example we saw earlier: some of the clusters get mixed together in some dimensions.

If some of your clusters are too small, it indicates that K is too large and needs to be reduced (try a smaller K). It may be that outliers in the data tend to form clusters with few data points.

Alternatively if you see there are splits that you expected but are not seen in the clusters, for

example you expect two different income groups and you don’t see them, you should try a

bigger value for K.

If the centroids seem too close to each other then you should try decreasing the value of K.
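Two of these checks, clusters with few data points and centroids that sit close together, can be automated with a small Python sketch. The thresholds `min_size` and `min_gap` are hypothetical values chosen for this example; what counts as "too few" or "too close" depends on the data and is a judgment call.

```python
import math
from collections import Counter
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def diagnose(labels, centroids, min_size=2, min_gap=1.0):
    """Flag small clusters and pairs of centroids closer than min_gap."""
    sizes = Counter(labels)
    small = [c for c in range(len(centroids)) if sizes[c] < min_size]
    close = [(i, j) for i, j in combinations(range(len(centroids)), 2)
             if euclidean(centroids[i], centroids[j]) < min_gap]
    return small, close

# Hypothetical clustering result with K=3
labels = [0, 0, 0, 1, 1, 1, 2]
centroids = [(1.0, 1.0), (1.4, 1.2), (9.0, 9.0)]

small, close = diagnose(labels, centroids)
print(small)   # [2]: cluster 2 holds a single record
print(close)   # [(0, 1)]: centroids 0 and 1 nearly coincide
```

Either flag suggests trying a smaller K, in line with the guidance on the slide.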


K-Means Clustering – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+)
• Easy to implement
• Easy to assign new data to existing clusters
    Which is the nearest cluster center?
• Concise output
    Coordinates of the K cluster centers

Cautions (-)
• Doesn’t handle categorical variables
• Sensitive to initialization (first guess)
• Variables should all be measured on similar or compatible scales
    Not scale-invariant!
• K (the number of clusters) must be known or decided a priori
    Wrong guess: possibly poor results
• Tends to produce “round” equi-sized clusters.
    Not always desirable

K-means clustering is easy to implement and it produces concise output. It is easy to assign new data to the existing clusters by determining which centroid the new data point is closest to.
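Assigning a new record to an existing cluster amounts to a nearest-centroid lookup, sketched below in Python. The function name `nearest_cluster` and the centroid coordinates are assumptions made for the example.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def nearest_cluster(point, centroids):
    """Index of the centroid closest to the new data point."""
    return min(range(len(centroids)),
               key=lambda c: euclidean(point, centroids[c]))

# Hypothetical centroids from an earlier clustering run
centroids = [(1.0, 2.0), (9.0, 8.0), (1.5, 9.0)]

print(nearest_cluster((8.0, 7.0), centroids))   # 1
print(nearest_cluster((2.0, 8.0), centroids))   # 2
```

Only the K centroids are needed for this step, which is why the concise output of K-means is convenient for scoring new data.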

However, K-means works only on numerical data and does not handle categorical variables. It is sensitive to the initial guess of the centroids. It is important that all the variables are measured on similar or compatible scales. If you measure the living space of a house in square feet and the cost of the house in thousands of dollars (that is, 1 unit is $1000), and then you change the cost of the house to dollars (so one unit is $1), the clusters may change. K should be decided ahead of the modeling process. Wrong guesses for K may lead to improper clustering.
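The house example can be made concrete with a short Python sketch. The three houses are made-up values. The `standardize` helper (z-scores) is one common way to put variables on compatible scales; it is an assumption added here for illustration, not a step prescribed in these notes.

```python
import math
from statistics import mean, pstdev

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def nearest_other(i, rows):
    """Index of the row closest to rows[i], excluding rows[i] itself."""
    return min((j for j in range(len(rows)) if j != i),
               key=lambda j: euclidean(rows[i], rows[j]))

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]

# Hypothetical houses: (living space in sq ft, cost in thousands of dollars)
in_thousands = [(1500, 300), (1600, 450), (2500, 310)]
# The same houses with cost expressed in dollars
in_dollars = [(sqft, cost * 1000) for sqft, cost in in_thousands]

# Which house is closest to the first one depends on the unit chosen
print(nearest_other(0, in_thousands))   # 1 (living space dominates)
print(nearest_other(0, in_dollars))     # 2 (cost dominates)

# After standardizing, the choice of unit no longer matters
print(nearest_other(0, standardize(in_thousands)))
print(nearest_other(0, standardize(in_dollars)))
```

Since K-means assigns records by distance, a change of unit that reorders nearest neighbors like this can also change the resulting clusters.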

K-means tends to produce rounded and equal sized clusters. If you have clusters which are

elongated or crescent shaped, K-means may not be able to find these clusters appropriately.

The data in this case may have to be transformed before modeling.


Check Your Knowledge

1. Why do we consider K-means clustering an unsupervised machine learning algorithm?
2. How do you use “pair-wise” plots to evaluate the effectiveness of the clustering?
3. Detail the four steps in the K-means clustering algorithm.
4. How do we use WSS to pick the value of K?
5. What is the most common measure of distance used with K-means clustering algorithms?
6. The attributes of a data set are “purchase decision (Yes/No), Gender (M/F), income group (50K)”. Can you use K-means to cluster this data set?

Your Thoughts?


Record your answers here.


Module 4: Advanced Analytics – Theory and Methods

Lesson 1: K-means Clustering – Summary

During this lesson the following topics were covered:

• Clustering – Unsupervised learning method

• What is K-means clustering

• Use cases with K-means clustering

• The K-means clustering algorithm

• Determining the optimum value for K

• Diagnostics to evaluate the effectiveness of K-means clustering

• Reasons to Choose (+) and Cautions (-) of K-means clustering


The key topics presented in this lesson are listed. Take a moment to review them.
