Business Analytics With SAS: February 2016

Friday, February 19, 2016

Relavant Files for your reference.

Please find the relavant files at below location
https://drive.google.com/drive/folders/0B6ze94x59xHnMTZ5RXVlNmozdjA

Regression-Introduction to Simple Linear Regression

Statisticians are often called upon to develop methods to predict one variable from other variables. For example, one might want to predict college grade point average from high school grade point average. Or, one might want to predict income from the number of years of education.

Introduction to Linear Regression

Author(s)

David M. Lane

Prerequisites

Measures of Variability, Describing Bivariate Data

Learning Objectives

Define linear regression
Identify errors of prediction in a scatter plot with a regression line

In simple linear regression, we predict scores on one variable from the scores on a second variable. The variable we are predicting is called the criterion variable and is referred to as Y. The variable we are basing our predictions on is called the predictor variable and is referred to as X. When there is only one predictor variable, the prediction method is called simple regression. In simple linear regression, the topic of this section, the predictions of Y when plotted as a function of X form a straight line.

The example data in Table 1 are plotted in Figure 1. You can see that there is a positive relationship between X and Y. If you were going to predict Y from X, the higher the value of X, the higher your prediction of Y.

Table 1. Example data.

X	Y
1.00	1.00
2.00	2.00
3.00	1.30
4.00	3.75
5.00	2.25

Figure 1. A scatter plot of the example data.

Linear regression consists of finding the best-fitting straight line through the points. The best-fitting line is called a regression line. The black diagonal line in Figure 2 is the regression line and consists of the predicted score on Y for each possible value of X. The vertical lines from the points to the regression line represent the errors of prediction. As you can see, the red point is very near the regression line; its error of prediction is small. By contrast, the yellow point is much higher than the regression line and therefore its error of prediction is large.

Figure 2. A scatter plot of the example data. The black line consists of the predictions, the points are the actual data, and the vertical lines between the points and the black line represent errors of prediction.

The error of prediction for a point is the value of the point minus the predicted value (the value on the line). Table 2 shows the predicted values (Y') and the errors of prediction (Y-Y'). For example, the first point has a Y of 1.00 and a predicted Y (called Y') of 1.21. Therefore, its error of prediction is -0.21.

Table 2. Example data.

X	Y	Y'	Y-Y'	(Y-Y')²
1.00	1.00	1.210	-0.210	0.044
2.00	2.00	1.635	0.365	0.133
3.00	1.30	2.060	-0.760	0.578
4.00	3.75	2.485	1.265	1.600
5.00	2.25	2.910	-0.660	0.436

You may have noticed that we did not specify what is meant by "best-fitting line." By far, the most commonly-used criterion for the best-fitting line is the line that minimizes the sum of the squared errors of prediction. That is the criterion that was used to find the line in Figure 2. The last column in Table 2 shows the squared errors of prediction. The sum of the squared errors of prediction shown in Table 2 is lower than it would be for any other regression line.

The formula for a regression line is

Y' = bX + A

where Y' is the predicted score, b is the slope of the line, and A is the Y intercept. The equation for the line in Figure 2 is

Y' = 0.425X + 0.785

For X = 1,

Y' = (0.425)(1) + 0.785 = 1.21.

For X = 2,

Y' = (0.425)(2) + 0.785 = 1.64.

Computing the Regression Line

In the age of computers, the regression line is typically computed with statistical software. However, the calculations are relatively easy, and are given here for anyone who is interested. The calculations are based on the statistics shown in Table 3. M_X is the mean of X, M_Y is the mean of Y, s_X is the standard deviation of X, s_Y is the standard deviation of Y, and r is the correlation between X and Y.

Formula for standard deviation
Formula for correlation

Table 3. Statistics for computing the regression line.

M_X	M_Y	s_X	s_Y	r
3	2.06	1.581	1.072	0.627

The slope (b) can be calculated as follows:

b = r s_Y/s_X

and the intercept (A) can be calculated as

A = M_Y - bM_X.

For these data,

b = (0.627)(1.072)/1.581 = 0.425

A = 2.06 - (0.425)(3) = 0.785

Note that the calculations have all been shown in terms of sample statistics rather than population parameters. The formulas are the same; simply use the parameter values for means, standard deviations, and the correlation.

Standardized Variables

The regression equation is simpler if variables are standardized so that their means are equal to 0 and standard deviations are equal to 1, for then b = r and A = 0. This makes the regression line:

Z_Y' = (r)(Z_X)

where Z_Y' is the predicted standard score for Y, r is the correlation, and Z_X is the standardized score for X. Note that the slope of the regression equation for standardized variables is r.

A Real Example

The case study "SAT and College GPA" contains high school and university grades for 105 computer science majors at a local state school. We now consider how we could predict a student's university GPA if we knew his or her high school GPA.

Figure 3 shows a scatter plot of University GPA as a function of High School GPA. You can see from the figure that there is a strong positive relationship. The correlation is 0.78. The regression equation is

University GPA' = (0.675)(High School GPA) + 1.097

Therefore, a student with a high school GPA of 3 would be predicted to have a university GPA of

University GPA' = (0.675)(3) + 1.097 = 3.12.

Figure 3. University GPA as a function of High School GPA.

Assumptions

It may surprise you, but the calculations shown in this section are assumption-free. Of course, if the relationship between X and Y were not linear, a different shaped function could fit the data better. Inferential statistics in regression are based on several assumptions, and these assumptions are presented in a later section of this chapter.

Friday, February 12, 2016

Learn Analytics using a business case study : Part II

The sequel episode, of most trilogies, is often the most interesting. This is because the first episode builds a foundation, of the overall plot, and includes many mandatory but non-interesting events, but by the second episode, the director has to build sufficient interest to ensure that the viewer is hooked to the entire season. We assume that you have gone through the first part of this case study (http://bit.ly/CDcasePart1). Let’s make this article more interesting by creating a near-life simulation of the newly built business.

A quick Recap :
You have recently started a Video CD rental shop. After 2 months you realize that there is tough competition in the market and you need to create a more customer centric strategy to stand out. You have already built all the datasets and collected data for the first 2 months. You stand in the third month and want to maximize your profit, using data analytics.

Month 3 :
Here are some high level metrics indicating health of your business :

Your shop is just like all other shops and your product is no different either. Why will your customer see you differently? As of now, you have burnt around INR 32k and earned only 6.75k. This is what happens in almost all new businesses. You have no loyal customers yet; and customer acquisitions are always costly. But starting month 2, you need to stop following the old-school way of doing business.
Let’s think of all possible changes you might want to bring to your business. Following are some of them which are on top of my head :

Customise the theme basis what customers may like
Customise the banners basis the localities you are going to place them in.
Identify the most popular genre and stock more CDs pertaining to the same.
Map and profile localities which get you more business & profits. Identify more such localities for more customer acquisitions.

Now, you might have noticed that all the solutions above general in nature, and do not require any predictive modeling. This is because the business just 3 months old, and probably doesn’t have much data to build predictive models. But there is still a lot you can do to build the business further.

1. Make theme more customer-centric :
The Current Situation: As of now you have put up a banner of Hindi Comedy movies and have stacked Hindi Action movies in the front counter. Is this the right display for your target audience?

Assumption : The profile spread of your current portfolio and the future acquisition will be similar.
Idea: Imagine a new customer looking for an English Action movie looks at two competing CD shops. He will make a perception that the probability of finding a CD of his choice is more in the shop with banner of English movies.

Analysis : Our target should be, to display more genres that people will want to rent. People, who would like to rent a CD, may not really be interested in Hindi Comedy and Hindi Action movies. Since the movies available in each genre are similar, and the choice of movie genre is not restricted by availability, the two categories, which stand out are Hindi Romantic and English Action. Hence, both, the banner and the display should correspond to only these two categories.
2. Make each advertising banner focused to local taste :
The second pointer is very similar to first analysis. But here you break the audience basis the locality or area they belong to, and decide on the most popular genre in the area. This thought process should govern your marketing banners strategy.
3. Genre wise Inventory/Stock optimization :

The third pointer is derived from the same analysis, as described in point 2. Our analysis showed that people prefer Hindi Romantic and English Action movies. Now, after revising the display and banners, the client base would expect a larger variety corresponding to these two genres, even more. Hence, stocking the same number of CDs, in each genre, may not be required and wouldn’t be justified.
4. Target most profitable societies for acquisition :
The fourth pointer however needs a slightly different analysis. This pointer would need us to identify areas with the highest profitability. The tricky part is, that you would need to find out the profit at an individual customer level. In the table below, the cost incurred on a society is the cost of installing the customised banners. The revenue generated, is calculated by multiplying INR 13, with the CDs rented in total by the society. In the data below, we have assumed that the same amount of money was spent in marketing, in 5 different societies.

If we had not done this calculation, we may have assumed that targeting a high standard society, would be most profitable because the number of CDs bought by each customer was higher. But from this analysis it is clear that the conversion rate in such a society is also less. Hence, it is best to target the Medium standard society.
Note that this rank order might change with time. Given the trends, with time the number of repeat customers in high standard society might raise significantly higher, but the same cannot be concluded as of now. Hence you will want to market more and acquire more customers from medium class societies.

Here is a summary of strategies we recommend for the month 3 :
1. Change the banner and front desk of movie genre to English Action and Hindi Romantic
2. Put customized banners in different kinds of societies, according to the society’s taste
3. Stock more titles in the English Action and Hindi Romantic genre.
4. Market the business, more in Medium standard societies similar to “HEWO society”.

Month 5 :
Here are some of the high level metrics for your business :

You see a good improvement in the total sales and the net loss you make month on month. If the trend goes, you would start making a profit over the month on month, running costs in the next 1 or 2 months. This itself is an insight, and the first step towards prediction. Since you are ambitious, you would want to introduce some cutting edge marketing strategy. Here is what you can do:

Identify most valuable customers, send movie reviews & other value added services to these customers
Innovate, and come up with new services like home delivery, weekly packages, promotional packages etc. to targeted customers
Create targeted marketing campaigns for each type of new product.
Create a website to promote your shop and collect information of trailers viewed by the customers.
Create an optimized inventory management system to make sure you are never stocked out
Create proper retention/win-back processes

As you might have observed, all the pointers are customer level marketing tools. Now we are not talking at the portfolio level, but have drilled down to the customer level. This is because now you have about 750 customers’ data to play with. The thumb rule here is, that customer analytics should be done on a statistically significant number, which in most of the industries is taken to be at least 500. We will talk about customer level analysis in the next article and take each of the pointers one by one to find some key insights and frame strategies from the 5^th month onwards.
End Notes :
In this part of the case study, we looked at framing initial portfolio level strategies, and how data guides us to take some key initial decisions in a new business. Next, we will look at some interesting customer level strategies which can be derived using data sources mentioned in part I of the case study.
The reason we have segregated customer level analytics from porfolio level analytics is that the two analysis are done with very different mind sets. In the portfolio level, we focused more on mass media and themes. Whereas, in the customer level, you will focus more on targeted marketing, engagement and other key strategies.

Did you find the article useful? Share with us how you would have approached making strategies mentioned in the article. Do let us know your thoughts about this article in the box below.

Learn Analytics using a business case study : Part I

Best way to learn analytics is through experience and solving case studies. Here, I will present you a complete business model and take you through a step by step process of how analytics is set up in a new business, how is it used in daily processes and some of the advanced analytics techniques which a business can use to make meaningful segmentation and prediction to optimize its marketing & sales campaign.

Background :

You have recently started a Video CD rent shop. After 2 months you realize that there is tough competition in the market and you need to make a more customer centric strategy to stand out in the market. Hence, you want to collect the most granular details of your customer behavior and build strategy accordingly.

Business case layout :

This business case have been broken down into 3 articles. Following is a plot of the articles and each article will be strongly dependent on findings in the previous articles:

1. How do you collect data so as to capture all the important information?

2. Deep dive into customer behavior and using basic data analysis with business knowledge to optimize daily operations

3. How do you use data with advanced analytics to make your marketing/sales startegies more targeted?

Case study part I :

Did you ever wonder why do you deal with so many datamarts in your company. Let’s try to understand as the owner of the busienss what all data sources do you need.

1. Transactions Table :

You rent out Video CDs and the most important data for you will be transactional data. Transactional data is by far the richest data throughout all industries. Each row in transactional data corresponds to one transaction made. This transaction mostly are monetary transaction. To identify each transaction, you need a distinct transaction code associated with each transaction. What other fields can you think of to be captured along with each transaction. Following is a small list of such variables :

1. Transaction ID

2. Customer ID : Identifying the customer to whom you have rented out the CD

3. Rent due : How much does the customer need to pay as rent

4. Issue date : When was the movie rented out

5. Recieved date : When was the movie recieved. Blank if CD is still due

6. Movie ID : Identifying the movie

Following is a sample transactional data set :

2. Product Table :

If you have transaction table, you basically have the linkage between the customers and the products. But why does transaction table not have the discription of products? The simplest reason for the same is that total number of products are limited in any industry, and the same product is repeated throughout the transactions table. If we add description in every single line, it adds enormously to the overall size of transaction table, which anyway is huge. Hence, we keep the products table seperate and merge it with required transactions for specific analysis.

Product table is unique on product id, which maps to transactions table. What other parameters can you think of that make sense for you to include? Following is a list of possible variables :

1. Movie ID : Unique ID of movie

2. Origin yearmonth : When was the CD bought?

3. Genre of movie

4. Language of movie

5. Star Cast of movie

6. Movie name

Note : Product ID generally can be decoded to know product details. For example, here H denotes “Hindi” and E denotes “English”. This coding makes the analysis simpler.

3. Customer Table :

The other hand of transaction table is the customer table. Using the above two tables, you almost have everything except the details of the customer. While making any kind of customer centric strategy, its very essential to consider the customer profile.This table helps you find the customer profile. This table is unique on customer id. What other parameters can you think of that make sense for you to include? Following is a list of possible variables :

1. Customer ID : Unique ID of customer

2. Age

3. Gender

4. Area : Area where the customer lives

5. Package : Package customer has enrolled to

6. Enrol_date : When did the customer make his first transaction

7. Name

Note : Similar to Product ID, Customer ID also generally can be decoded to know customer details.

4. Engagement Tracker :

All the three tables together can be used to create any kind of analysis to build marketing and sales strategy. What they do not cover is the engagement you had with your customers till date. Say, I called Kunal 1 week back to tell him about a movie X. Now, it might not be the best idea to call Kunal again this week to tell about the same movie. Hence, we need to keep a track on all kinds of engagement we have with our customer on daily basis. This is similar to transactions table but this include all the non-monetary interactions we have with out customers till date. These interactions can be inbound or outbound. This table is unique on engagement_id. What other parameters can you think of that make sense for you to include? Following is a list of possible variables :

1. Engagement id : Unique engagement identification

2. Inbound flag : 1 if inbound, 0 if outbound

3. Type of engagement : Code for the type of engagement

4. Product ID : ID of product in question

5. Derived tables :

Because the data sizes become huge with time, it is always recommended to keep some monthly snapshots handy. One of such table can be transactions data rolled up at customer level. Following is a list of such possible variables :

1. Customer ID

2. # lifetime transactions : No. of transactions made by customer till date

3. Enrol date

4. # English Movies : No. of English movies taken by customer till date

5. Last engement date

6. Last transaction date

Such derived tables come very handy to make quick analysis. Say, you have acquired 10 new english movies and want to market them. You might want to market these movies to customers who watch english movies, who responded to recent engagements and who have done recent transactions. For such a targeting list, imagine the process you might need to follow. Following is a possible way to achieve the same :

Imagine how easy this analysis gets if you have the derived monthly snapshot handy.

Graph schemas:

The article till now focuses on use of traditional relational databases. Graph based databases (e.g. Neo4j) are a strong alternate to these traditional databases. They add a lot of flexibility to your database, where you can change the schema very easily.

This kind of flexibility is required in case your data formats can change and you can not have much control on it. Also, you can add new structures and relationships very quickly. Before we go in these details, a typical graph schema in this case would look something like:

Blue nodes represent customers, Red represent movies and Green represents various package available. Every edge is a relationship in between nodes. For example, if a customer rents out a movie, we can draw an edge between the 2.

Now by calculating things like number of edges from a node, you can look at things like most active customer, most rented and least rented movies. You can also start looking at what kind of customers are renting what kind of movies.

P.S. Like all data model designs, there are various alternates to this design and you should choose the best depending on your usage.

End Notes :

We discussed relational database and graph database for representing a typical business problem. The data tables we discussed in this article is almost parallel to datamarts in any industry. We will look at some interesting strategies which can be derived using these data sources for the CD rental business case. Some of these strategies which are very basic in nature and needs more of business sense than modelling will be discussed in the next article. This will make you understand how effective strategies can be built if you mix business knowledge with simple data analysis.Knowledge of data is very essential regardless of the industry you work for.

Did you find the article useful? Share with us any other problem statements you can think of. Do let us know your thoughts about this article in the box below.

Data analysis

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

Data integration is a precursor to data analysis, and data analysis is closely linked to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.

Contents

1 The process of data analysis

6 Other topics

7 Practitioner notes

7.1 Initial data analysis

7.2 Main data analysis

11 Further reading

The process of data analysis

Data science process flowchart

Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users. Data is collected and analyzed to answer questions, test hypotheses or disprove theories.[1]

Statistician John Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."[2]

There are several phases that can be distinguished, described below. The phases are iterative, in that feedback from later phases may result in additional work in earlier phases.[3]

Data requirements

The data necessary as inputs to the analysis are specified based upon the requirements of those directing the analysis or customers who will use the finished product of the analysis. The general type of entity upon which the data will be collected is referred to as an experimental unit (e.g., a person or population of people). Specific variables regarding a population (e.g., age and income) may be specified and obtained. Data may be numerical or categorical (i.e., a text label for numbers).[3]

Data collection

Data is collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as information technology personnel within an organization. The data may also be collected from sensors in the environment, such as traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews, downloads from online sources, or reading documentation.[3]

Data processing

The phases of the intelligence cycle used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.

Data initially obtained must be processed or organized for analysis. For instance, this may involve placing data into rows and columns in a table format for further analysis, such as within a spreadsheet or statistical software.[3]

Data cleaning

Once processed and organized, the data may be incomplete, contain duplicates, or contain errors. The need for data cleaning will arise from problems in the way that data is entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, deduplication, and column segmentation.[4] Such data problems can also be identified through a variety of analytical techniques. For example, with financial information, the totals for particular variables may be compared against separately published numbers believed to be reliable.[5] Unusual amounts above or below pre-determined thresholds may also be reviewed. There are several types of data cleaning that depend on the type of data. Quantitative data methods for outlier detection can be used to get rid of likely incorrectly entered data. Textual data spellcheckers can be used to lessen the amount of mistyped words, but it is harder to tell if the words themselves are correct.[6]

Exploratory data analysis

Once the data is cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data.[7][8] The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. Descriptive statistics such as the average or median may be generated to help understand the data. Data visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data.[3]

Modeling and algorithms

Mathematical formulas or models called algorithms may be applied to the data to identify relationships among the variables, such as correlation or causation. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).[1]

Inferential statistics includes techniques to measure relationships between particular variables. For example, regression analysis may be used to model whether a change in advertising (independent variable X) explains the variation in sales (dependent variable Y). In mathematical terms, Y (sales) is a function of X (advertising). It may be described as Y = aX + b + error, where the model is designed such that a and b minimize the error when the model predicts Y for a given range of values of X. Analysts may attempt to build models that are descriptive of the data to simplify analysis and communicate results.[1]

Data product

A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm. An example is an application that analyzes data about customer purchasing history and recommends other purchases the customer might enjoy.[3]

Communication

Data visualization to understand the results of a data analysis.[9]

Main article: Data visualization

Once the data is analyzed, it may be reported in many formats to the users of the analysis to support their requirements. The users may have feedback, which results in additional analysis. As such, much of the analytical cycle is iterative.[3]

When determining how to communicate the results, the analyst may consider data visualization techniques to help clearly and efficiently communicate the message to the audience. Data visualization uses information displays such as tables and charts to help communicate key messages contained in the data. Tables are helpful to a user who might lookup specific numbers, while charts (e.g., bar charts or line charts) may help explain the quantitative messages contained in the data.

Quantitative messages

Main article: Data visualization

A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.

A scatterplot illustrating correlation between two variables (inflation and unemployment) measured at points in time.

Author Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message. Customers specifying requirements and analysts performing the data analysis may consider these messages during the course of the process.

Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show comparison of the actual versus the reference amount.
Frequency distribution: Shows the number of observations of a particular variable for given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis.
Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.[10][11]

Techniques for analyzing quantitative data

Other topics

Analytics and business intelligence

Main article: Analytics

Analytics is the "extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions." It is a subset of business intelligence, which is a set of technologies and processes that use data to understand and analyze business performance.[18]

Education

Analytic activities of data visualization users

In education, most educators have access to a data system for the purpose of analyzing student data.[19] These data systems present data to educators in an over-the-counter data format (embedding labels, supplemental documentation, and a help system and making key package/display and content decisions) to improve the accuracy of educators’ data analyses.[20]

Practitioner notes

This section contains rather technical explanations that may assist practitioners but are beyond the typical scope of a Wikipedia article.

Initial data analysis

The most important distinction between the initial data analysis phase and the main analysis phase, is that during initial data analysis one refrains from any analysis that is aimed at answering the original research question. The initial data analysis phase is guided by the following four questions:[21]

Quality of data

The quality of the data should be checked as early as possible. Data quality can be assessed in several ways, using different types of analysis: frequency counts, descriptive statistics (mean, standard deviation, median), normality (skewness, kurtosis, frequency histograms, n: variables are compared with coding schemes of variables external to the data set, and possibly corrected if coding schemes are not comparable.

Test for common-method variance.

The choice of analyses to assess the data quality during the initial data analysis phase depends on the analyses that will be conducted in the main analysis phase.[22]

Quality of measurements

The quality of the measurement instruments should only be checked during the initial data analysis phase when this is not the focus or research question of the study. One should check whether structure of measurement instruments corresponds to structure reported in the literature.

There are two ways to assess measurement

Analysis of homogeneity (internal consistency), which gives an indication of the reliability of a measurement instrument. During this analysis, one inspects the variances of the items and the scales, the Cronbach's α of the scales, and the change in the Cronbach's alpha when an item would be deleted from a scale.[23]

Initial transformations

After assessing the quality of the data and of the measurements, one might decide to impute missing data, or to perform initial transformations of one or more variables, although this can also be done during the main analysis phase.[24]

Possible transformations of variables are:[25]

Square root transformation (if the distribution differs moderately from normal)
Log-transformation (if the distribution differs substantially from normal)
Inverse transformation (if the distribution differs severely from normal)
Make categorical (ordinal / dichotomous) (if the distribution differs severely from normal, and no transformations help)

Did the implementation of the study fulfill the intentions of the research design?

One should check the success of the randomization procedure, for instance by checking whether background and substantive variables are equally distributed within and across groups.

If the study did not need or use a randomization procedure, one should check the success of the non-random sampling, for instance by checking whether all subgroups of the population of interest are represented in sample.

Other possible data distortions that should be checked are:

dropout (this should be identified during the initial data analysis phase)
Item nonresponse (whether this is random or not should be assessed during the initial data analysis phase)
Treatment quality (using manipulation checks).[26]

Characteristics of data sample

In any report or article, the structure of the sample must be accurately described. It is especially important to exactly determine the structure of the sample (and specifically the size of the subgroups) when subgroup analyses will be performed during the main analysis phase.

The characteristics of the data sample can be assessed by looking at:

Basic statistics of important variables
Scatter plots
Correlations and associations
Cross-tabulations[27]

Final stage of the initial data analysis

During the final stage, the findings of the initial data analysis are documented, and necessary, preferable, and possible corrective actions are taken.

Also, the original plan for the main data analyses can and should be specified in more detail or rewritten.

In order to do this, several decisions about the main data analyses can and should be made:

In the case of non-normals: should one transform variables; make variables categorical (ordinal/dichotomous); adapt the analysis method?
In the case of missing data: should one neglect or impute the missing data; which imputation technique should be used?
In the case of outliers: should one use robust analysis techniques?
In case items do not fit the scale: should one adapt the measurement instrument by omitting items, or rather ensure comparability with other (uses of the) measurement instrument(s)?
In the case of (too) small subgroups: should one drop the hypothesis about inter-group differences, or use small sample techniques, like exact tests or bootstrapping?
In case the randomization procedure seems to be defective: can and should one calculate propensity scores and include them as covariates in the main analyses?[28]

Analysis

Several analyses can be used during the initial data analysis phase:[29]

Univariate statistics (single variable)
Bivariate associations (correlations)
Graphical techniques (scatter plots)

It is important to take the measurement levels of the variables into account for the analyses, as special statistical techniques are available for each level:[30]

Nominal and ordinal variables

Frequency counts (numbers and percentages)
Associations

circumambulations (crosstabulations)
hierarchical loglinear analysis (restricted to a maximum of 8 variables)
loglinear analysis (to identify relevant/important variables and possible confounders)

Exact tests or bootstrapping (in case subgroups are small)
Computation of new variables

Continuous variables

Distribution

Statistics (M, SD, variance, skewness, kurtosis)
Stem-and-leaf displays
Box plots

Nonlinear analysis

Nonlinear analysis will be necessary when the data is recorded from a nonlinear system. Nonlinear systems can exhibit complex dynamic effects including bifurcations, chaos, harmonics and subharmonics that cannot be analyzed using simple linear methods. Nonlinear data analysis is closely related to nonlinear system identification.[31]

Main data analysis

In the main analysis phase analyses aimed at answering the research question are performed as well as any other relevant analysis needed to write the first draft of the research report.[32]

Exploratory and confirmatory approaches

In the main analysis phase either an exploratory or confirmatory approach can be adopted. Usually the approach is decided before data is collected. In an exploratory analysis no clear hypothesis is stated before analysing the data, and the data is searched for models that describe the data well. In a confirmatory analysis clear hypotheses about the data are tested.

Exploratory data analysis should be interpreted carefully. When testing multiple models at once there is a high chance on finding at least one of them to be significant, but this can be due to a type 1 error. It is important to always adjust the significance level when testing multiple models with, for example, a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found exploratory in a dataset, then following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same type 1 error that resulted in the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis.[33]

Stability of results

It is important to obtain some indication about how generalizable the results are.[34] While this is hard to check, one can look at the stability of the results. Are the results reliable and reproducible? There are two main ways of doing this:

Cross-validation: By splitting the data in multiple parts we can check if an analysis (like a fitted model) based on one part of the data generalizes to another part of the data as well.
Sensitivity analysis: A procedure to study the behavior of a system or model when global parameters are (systematically) varied. One way to do this is with bootstrapping.

Statistical methods

Many statistical methods have been used for statistical analyses. A very brief list of four of the more popular methods is:

General linear model: A widely used model on which various methods are based (e.g. t test, ANOVA, ANCOVA, MANOVA). Usable for assessing the effect of several predictors on one or more continuous dependent variables.
Generalized linear model: An extension of the general linear model for discrete dependent variables.
Structural equation modelling: Usable for assessing latent structures from measured manifest variables.
Item response theory: Models for (mostly) assessing one latent variable from several binary measured variables (e.g. an exam).

Free software for data analysis

Data Applied - an online data mining and data visualization solution.
DataMelt - a multiplatform (Java-based) data analysis framework from the jWork.ORG community of developers led by Dr. S.Chekanov
DevInfo - a database system endorsed by the United Nations Development Group for monitoring and analyzing human development.
ELKI - data mining framework in Java with data mining oriented visualization functions.
KNIME - the Konstanz Information Miner, a user friendly and comprehensive data analytics framework.
MEPX - cross platform tool for regression and classification problems.
PAW - FORTRAN/C data analysis framework developed at CERN
Orange - A visual programming tool featuring interactive data visualization and methods for statistical data analysis, data mining, and machine learning.
R - a programming language and software environment for statistical computing and graphics.
ROOT - C++ data analysis framework developed at CERN
dotplot - cloud based visual designer to create analytic models[35]
SciPy - A set of Python tools for data analysis http://scipy.org/stackspec.html
Statsmodels - a Python module that allows users to explore data, estimate statistical models, and perform statistical tests http://statsmodels.sourceforge.net/
Pandas - A software library written for the Python programming language for data manipulation and analysis.
myInvenio [36]- a cloud based solution to automatically discover processes from event logs.

#	Task	General Description	Pro Forma Abstract	Examples
1	Retrieve Value	Given a set of specific cases, find attributes of those cases.	What are the values of attributes {X, Y, Z, ...} in the data cases {A, B, C, ...}?	- What is the mileage per gallon of the Audi TT? - How long is the movie Gone with the Wind?
2	Filter	Given some concrete conditions on attribute values, find data cases satisfying those conditions.	Which data cases satisfy conditions {A, B, C...}?	- What Kellogg's cereals have high fiber? - What comedies have won awards? - Which funds underperformed the SP-500?
3	Compute Derived Value	Given a set of data cases, compute an aggregate numeric representation of those data cases.	What is the value of aggregation function F over a given set S of data cases?	- What is the average calorie content of Post cereals? - What is the gross income of all stores combined? - How many manufacturers of cars are there?
4	Find Extremum	Find data cases possessing an extreme value of an attribute over its range within the data set.	What are the top/bottom N data cases with respect to attribute A?	- What is the car with the highest MPG? - What director/film has won the most awards? - What Robin Williams film has the most recent release date?
5	Sort	Given a set of data cases, rank them according to some ordinal metric.	What is the sorted order of a set S of data cases according to their value of attribute A?	- Order the cars by weight. - Rank the cereals by calories.
6	Determine Range	Given a set of data cases and an attribute of interest, find the span of values within the set.	What is the range of values of attribute A in a set S of data cases?	- What is the range of film lengths? - What is the range of car horsepowers? - What actresses are in the data set?
7	Characterize Distribution	Given a set of data cases and a quantitative attribute of interest, characterize the distribution of that attribute’s values over the set.	What is the distribution of values of attribute A in a set S of data cases?	- What is the distribution of carbohydrates in cereals? - What is the age distribution of shoppers?
8	Find Anomalies	Identify any anomalies within a given set of data cases with respect to a given relationship or expectation, e.g. statistical outliers.	Which data cases in a set S of data cases have unexpected/exceptional values?	- Are there exceptions to the relationship between horsepower and acceleration? - Are there any outliers in protein?
9	Cluster	Given a set of data cases, find clusters of similar attribute values.	Which data cases in a set S of data cases are similar in value for attributes {X, Y, Z, ...}?	- Are there groups of cereals w/ similar fat/calories/sugar? - Is there a cluster of typical film lengths?
10	Correlate	Given a set of data cases and two attributes, determine useful relationships between the values of those attributes.	What is the correlation between attributes X and Y over a given set S of data cases?	- Is there a correlation between carbohydrates and fat? - Is there a correlation between country of origin and MPG? - Do different genders have a preferred payment method? - Is there a trend of increasing film length over the years?