EXPLORATORY DATA ANALYSIS

UNI-VARIATE, BI-VARIATE AND MULTI-VARIATE ANALYSIS TECHNIQUES

Saurabh Nagar
7 min read · Dec 3, 2021

The process of cleaning, transforming, interpreting, analyzing, and visualizing data to extract useful information and gain valuable insights is called data analysis. Data analysis can be organized into four categories: 1. Exploratory Data Analysis 2. Descriptive Analysis 3. Inferential Analysis 4. Predictive Analysis.

While working on my last machine learning case study (regression), I did not focus much on exploratory data analysis and moved straight to cleaning, training, and testing. Then, while evaluating the model, I ran into overfitting. I thought: okay, not a problem, I know how to handle overfitting. I did some feature engineering, trained the model again, and guess what! This time the overfitting was gone, and I immediately clapped for myself. But when I checked again, the model was now underfitting! I got frustrated and did not know where exactly the problem was. Then my mentor asked me, “Did you do the EDA part thoroughly?” I said, “No.” He replied, “Hmmm!” At that point I knew exactly what to do. So never, ever skip the exploratory data analysis part, because this is the part of the pipeline that gives us the best understanding of our data.

In this article we will discuss EDA in depth.

EXPLORATORY DATA ANALYSIS

EDA is the preliminary analysis of data to discover relationships between measures and to gain insight into trends, patterns, and the relationships of features with the target and among themselves, with the help of statistics and visualization tools. It can be graphical or non-graphical. We can perform univariate, bivariate, and multivariate analysis, depending on the type of data and our convenience.

1. UNIVARIATE ANALYSIS

As the name indicates, univariate analysis explores only one variable at a time. Its objective is to describe, summarize, and analyze the patterns in the data. It is useful for both numerical and categorical variables. Some properties that can be easily analyzed this way are: 1. Measures of central tendency (mean, median, and mode) 2. Measures of dispersion (range, variance, standard deviation) 3. Quartiles (interquartile range)
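As a rough illustration of these summary measures, here is a minimal sketch using only Python's standard `statistics` module. The `ages` list is made-up sample data, not taken from any real case study.

```python
import statistics

# Hypothetical sample of one numerical variable (values are made up).
ages = [23, 25, 25, 29, 31, 35, 35, 35, 40, 52]

# 1. Central tendency
mean = statistics.mean(ages)
median = statistics.median(ages)
mode = statistics.mode(ages)

# 2. Dispersion
data_range = max(ages) - min(ages)
variance = statistics.variance(ages)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)       # square root of the sample variance

# 3. Quartiles and interquartile range
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance:.2f}, std={std_dev:.2f}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

Note that different tools use slightly different quartile conventions, so Q1 and Q3 from another library may differ a little from these values.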

1.1 FREQUENCY DISTRIBUTION TABLES

It shows how often each value occurs in the data set, giving a brief summary of the data and making it easy to find patterns. Example: list = [dog, cat, dog, dog, dog, cat, dog, cat, dog, cat, dog, dog, cat]

Frequency table = [dog: 8, cat: 5]

FREQUENCY TABLE EXAMPLE
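The same dog/cat example can be reproduced in a few lines. This is a minimal sketch using `collections.Counter` from the standard library; the variable names are my own.

```python
from collections import Counter

# The example list from the text above.
pets = ["dog", "cat", "dog", "dog", "dog", "cat", "dog",
        "cat", "dog", "cat", "dog", "dog", "cat"]

freq = Counter(pets)          # maps each value to its count
total = sum(freq.values())

# Print the table, most frequent value first, with relative frequency.
for label, count in freq.most_common():
    print(f"{label}: {count} ({count / total:.1%})")
```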

1.2 BAR CHARTS

Bar charts are a very convenient way to compare categories or groups of data. A bar graph is a graphical representation of categorical data using rectangular bars, where the length of each bar is proportional to the value it represents. You can create bar charts to understand categorical variables.

1.3 HISTOGRAMS

A histogram is a graphical representation of numerical (quantitative) data in which the data is grouped into contiguous number ranges (bins) and each range corresponds to a vertical rectangular bar whose height shows how many values fall in that bin. In most instances, the numerical data in a histogram will be continuous. Histograms look similar to bar charts; the difference is the type of data they represent. Use histograms to understand numerical data.
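To make the binning step concrete, here is a small sketch that groups values into equal-width bins and prints a text histogram. The data and the choice of four bins are assumptions for illustration; a plotting library would normally do this for you.

```python
# Hypothetical numerical data (values are made up).
values = [12, 25, 27, 31, 33, 38, 40, 44, 59, 75]
n_bins = 4

lo, hi = min(values), max(values)
width = (hi - lo) / n_bins            # equal-width bins

counts = [0] * n_bins
for v in values:
    # Clamp the index so the maximum value falls into the last bin.
    idx = min(int((v - lo) / width), n_bins - 1)
    counts[idx] += 1

# One row per bin; the last bin also includes its right edge.
for i, c in enumerate(counts):
    left = lo + i * width
    print(f"[{left:6.2f}, {left + width:6.2f}) {'#' * c}")
```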

1.4 PIE CHARTS

Pie charts show percentages of a whole at a set point in time. Unlike bar graphs and line graphs, pie charts do not show changes over time. The whole chart represents 100%, and each slice represents the size of a category relative to the whole.

1.5 DISTRIBUTION PLOT

Distribution plots are used to check the distribution of continuous data. Common shapes include symmetric, right-skewed, and left-skewed distributions. You can use this plot to inspect a continuous feature or target variable.

1.6 BOX PLOT

Box plots are used to identify outliers in a variable. By the usual box plot convention, any data point below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier, where Q1 is the 25th percentile, Q3 is the 75th percentile, and IQR = Q3 - Q1.
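The 1.5 × IQR rule can be sketched without drawing the plot. The values below are made up, and the quartiles come from `statistics.quantiles` with its default method, so the fences may differ slightly from what another library reports.

```python
import statistics

# Hypothetical data with one suspiciously large value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

q1, _, q3 = statistics.quantiles(values, n=4)   # 25th, 50th, 75th percentiles
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = [v for v in values if v < lower_fence or v > upper_fence]
print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```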

2. BIVARIATE ANALYSIS

This analysis is performed to understand the relationship between two variables. It shows how the variables are associated, which by itself does not prove a cause-effect relationship. Understanding this relationship helps us in feature selection. There are three types of bivariate analysis, depending on the types of the two variables:

2.1 BIVARIATE ANALYSIS OF TWO NUMERICAL VARIABLES

2.1.1 SCATTER PLOT

A scatter plot represents data points as small dots and shows how one variable varies with respect to the other. The resulting pattern indicates the type of relationship (linear or non-linear) and its strength. It can also be used to identify outliers.

2.1.2 CORRELATION

Correlation is a measure of the degree of association between two numerical variables. It quantifies the direction and strength of the relationship between two numeric variables, X and Y, and always lies between -1.0 and 1.0.
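As an illustration, the Pearson correlation coefficient (one common correlation measure) can be computed by hand. This is a minimal sketch; `pearson_r` is a helper name of my own, and the study-hours data is made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 70, 72]

r = pearson_r(hours, score)
print(f"r = {r:.3f}")   # close to +1: strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one. Pearson's r only captures linear association, so it is worth checking the scatter plot as well.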

2.2 BIVARIATE ANALYSIS OF TWO CATEGORICAL VARIABLES

2.2.1 CHI-SQUARE TEST

The chi-square test is a hypothesis test for a statistically significant relationship between two categorical (nominal or ordinal) variables organized in a bivariate table. In other words, it tells us whether two categorical variables are independent of one another.

It is calculated from the differences between the observed frequencies and the frequencies expected under independence in each cell of the frequency table. A p-value close to zero is strong evidence that the two categorical variables are dependent, while a large p-value means the data give no evidence against independence.
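Here is a minimal sketch of the calculation for a 2×2 table, using only the standard library. The observed counts are made up, no continuity correction is applied, and the p-value shortcut with `math.erfc` is valid only for one degree of freedom. For larger tables a statistics library (for example SciPy's `chi2_contingency`) is the practical choice.

```python
import math

# Hypothetical 2x2 contingency table of observed counts
# (rows: categories of variable A, columns: categories of variable B).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if the two variables were independent.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

dof = (len(observed) - 1) * (len(observed[0]) - 1)   # 1 for a 2x2 table

# For dof == 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.6f}")
```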

2.2.2 STACKED COLUMN CHARTS

A stacked column chart is a useful graph for visualizing the relationship between two categorical variables. It compares the percentage that each category of one variable contributes to the total within each category of the second variable.
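The percentages behind such a chart can be sketched as follows. The (outlook, play) records are made-up illustration data; each column of the chart would be one `outlook` value, split by the share of each `play` value.

```python
from collections import Counter

# Hypothetical observations of two categorical variables: (outlook, play).
records = [("sunny", "no"), ("sunny", "no"), ("sunny", "yes"),
           ("overcast", "yes"), ("overcast", "yes"),
           ("rainy", "yes"), ("rainy", "no"), ("rainy", "yes")]

counts = Counter(records)                       # count of each pair
outlooks = sorted({o for o, _ in records})
plays = sorted({p for _, p in records})

# Share of each 'play' category within every 'outlook' category.
shares = {}
for o in outlooks:
    total = sum(counts[(o, p)] for p in plays)
    shares[o] = {p: counts[(o, p)] / total for p in plays}

for o, dist in shares.items():
    print(o, {p: f"{s:.0%}" for p, s in dist.items()})
```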

2.2.3 COMBINATION CHART

A combination chart uses two or more chart types to emphasize that the chart contains different kinds of information. Here, we use a bar chart to show the distribution of one categorical variable and a line chart to show the percentage of the selected category from the second categorical variable. The combination chart is an effective way to show the predictive power of a predictor (X-axis) against a target (Y-axis).

COMBINATION PLOT FOR CATEGORICAL-CATEGORICAL COMPARISON

2.3 BIVARIATE ANALYSIS OF ONE NUMERICAL AND ONE CATEGORICAL VARIABLE

2.3.1 COMBINATION CHART

A combination chart uses two or more chart types to emphasize that the chart contains different kinds of information. Here, we use a bar chart to show the distribution of a binned numerical variable and a line chart to show the percentage of the selected category from the categorical variable. The combination chart is an effective way to show the predictive power of a predictor (X-axis) against a target (Y-axis).

2.3.2 T-TEST AND Z-TEST

The z-test and the t-test are closely related. Both assess whether the averages of two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable across the two categories of a categorical variable. The smaller the probability (p-value) associated with Z, the more significant the difference between the two averages.

Z-TEST

The t-test is similar to the z-test, but it is typically used when the sample size is small (fewer than 30) and the population standard deviation is unknown.

T-TEST

Example : Is there a significant difference between the means (averages) of the numerical variable (Temperature) in two different categories of the categorical variable (O-Ring Failure)?

The low probability (0.0156) means that the difference between the average temperature for failed O-rings and the average temperature for intact O-rings is significant.
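To show what goes into such a test, here is a minimal sketch of a two-sample t statistic with pooled variance (Student's t-test, which assumes equal variances in both groups). The two groups are made-up numbers, not the O-ring data, and `two_sample_t` is a helper name of my own. Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; a statistics library such as SciPy (`scipy.stats.ttest_ind`) handles that.

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # sample variances
    # Weighted average of the two sample variances.
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical numerical values split by a two-category variable.
group_a = [58, 61, 63, 65, 70]
group_b = [68, 70, 72, 75, 76, 79]

t_stat, dof = two_sample_t(group_a, group_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

A t statistic far from zero, relative to the degrees of freedom, corresponds to a small p-value.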

2.3.3 ANALYSIS OF VARIANCE (ANOVA)

The ANOVA test assesses whether the averages of more than two groups are statistically different from each other. This analysis is appropriate for comparing the averages of a numerical variable for more than two categories of a categorical variable.

Example : Is there a significant difference between the averages of the numerical variable (Humidity) in the three categories of the categorical variable (Outlook)?

There is no significant difference between the averages of Humidity in the three categories of Outlook.
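The F statistic behind a one-way ANOVA can be sketched by hand. The three groups below are made-up humidity values, not the data behind the result above, and `one_way_f` is a helper name of my own. As with the t-test, converting F into a p-value needs the F distribution, which a library such as SciPy (`scipy.stats.f_oneway`) provides.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic with its two degrees of freedom."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical humidity readings for three categories of a variable.
groups = [[80, 85, 90], [78, 82, 86], [75, 85, 95]]

f_stat, df_b, df_w = one_way_f(groups)
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

An F value well below 1, as in this made-up example, means the group averages differ less than the spread within the groups would lead us to expect, so there is no evidence of a difference.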

That’s all for this article. If you liked it, please hit the clap button. If any changes are required, please feel free to comment. Soon I will be publishing “A complete EDA implementation using python”!
