The Python libraries SciPy and Pandas already offer off-the-shelf tools for descriptive statistics, but behind the scenes they call NumPy. This post is a chance to learn more about NumPy and its great arsenal for dealing with data.

# Descriptive Statistics:

Descriptive statistics summarize a dataset. They include the following:

1. Mean: the arithmetic average, which is the sum of all values divided by their count.
2. Median: the middle value after sorting the numbers.
3. Mode: the most common value in the dataset.
4. Quartiles: the median of the whole dataset (Q2), the median of the first half (Q1), and the median of the second half (Q3).
5. Measures of the spread or distribution of the data, which include:
• Range
• Variance
• Standard deviation
• Skewness
• Kurtosis
• Interquartile range
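All of the measures above can be computed directly with NumPy; here is a minimal sketch on a small made-up array (the values are only for illustration):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean = data.mean()                      # arithmetic mean
median = np.median(data)                # middle value after sorting
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]        # most frequent value
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # interquartile range

print(mean, median, mode, iqr)  # 5.0 4.5 4 1.5
```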

# Descriptive Statistics with Numpy

`numpy` provides the basics of descriptive statistics.
Although `pandas` has statistical functions of its own, they are built on `numpy`.
In this tutorial we will work mainly with `numpy`.

# Datasets

We are going to work on two datasets.

1. Iris dataset.
2. KNMI dataset (Weather data from Netherlands).

The Iris dataset describes flowers and their attributes, and it is used a lot in data science and machine learning. We can get it from one of these two sources:

1. from scikit-learn.
2. from a CSV file.

The KNMI dataset is a daily record of weather measurements from stations across the Netherlands.

We are going to work from the CSV file.

``````import pandas as pd
# assumed local path to the Iris CSV (the file has no header row)
iris = pd.read_csv('./data/iris.csv',
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
``````
``````iris.head()
``````
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
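As the first of the two sources, the same dataset can also be loaded from scikit-learn. A quick sketch (note that scikit-learn's copy returns NumPy arrays and numeric targets instead of species strings):

```python
from sklearn.datasets import load_iris
import pandas as pd

bunch = load_iris()
iris_sk = pd.DataFrame(bunch.data, columns=bunch.feature_names)
print(iris_sk.shape)  # (150, 4)
```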

## Get Numpy arrays from Pandas’s dataframe:

``````# if pandas version >= 0.24 you can use .to_numpy()
sepal_widths = iris['sepal_width'].values
petal_widths = iris['petal_width'].values
petal_lengths = iris['petal_length'].values
sepal_lengths = iris['sepal_length'].values
``````

## Basic Descriptive Statistics:

``````import numpy as np
print ('The max of sepal length: ', sepal_lengths.max(), ' -- ', np.max(sepal_lengths))
print ('The min of sepal length: ', sepal_lengths.min(),  ' -- ', np.min(sepal_lengths))
print ('The mean of sepal length: ', sepal_lengths.mean(), ' -- ',  np.mean(sepal_lengths))
print ('The STD of sepal length: ', sepal_lengths.std(),  ' -- ', np.std(sepal_lengths))
print ('The variance of sepal length', sepal_lengths.var(),  ' -- ', np.var(sepal_lengths))
``````
``````The max of sepal length:  7.9  --  7.9
The min of sepal length:  4.3  --  4.3
The mean of sepal length:  5.843333333333334  --  5.843333333333334
The STD of sepal length:  0.8253012917851409  --  0.8253012917851409
The variance of sepal length 0.6811222222222223  --  0.6811222222222223
``````

### Median and Percentile:

``````from scipy.stats import scoreatpercentile
print ('The median of sepal length: ', np.median(sepal_lengths))
print ('Percentile or quartile 50%: ', scoreatpercentile(sepal_lengths, 50))
print ('Percentile or quartile 75%: ', scoreatpercentile(sepal_lengths, 75))
``````
``````The median of sepal length:  5.8
Percentile or quartile 50%:  5.8
Percentile or quartile 75%:  6.4
``````
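Plain NumPy can do the same without SciPy via `np.percentile`; a small sketch on a few made-up values:

```python
import numpy as np

data = np.array([4.3, 5.1, 5.8, 6.4, 7.9])  # made-up sample values
p50 = np.percentile(data, 50)  # same as np.median(data)
p75 = np.percentile(data, 75)
print(p50, p75)  # 5.8 6.4
```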

## Pandas statistics:

`pandas` builds on top of `numpy` and provides a more convenient approach.

### Dataframe’s describe:

``````iris.describe()
``````
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

### More statistical functions:

| Method | Description |
| --- | --- |
| `describe` | a general descriptive statistics |
| `count` | the number of non-NaN values |
| `mad` | the mean absolute deviation (similar to STD) |
| `median` | the median (50th percentile) |
| `min`, `max`, `mode`, `std`, `var` | …. |
| `skew` | skewness, a measure of the asymmetry of the probability distribution |
| `kurt` | kurtosis (distribution shape) |
``````iris.skew()
``````
``````sepal_length    0.314911
sepal_width     0.334053
petal_length   -0.274464
petal_width    -0.104997
dtype: float64
``````
``````iris.kurt()
``````
``````sepal_length   -0.552064
sepal_width     0.290781
petal_length   -1.401921
petal_width    -1.339754
dtype: float64
``````
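Skewness can also be computed directly with NumPy from its definition, the third standardized moment. This is the biased population estimate, so it differs slightly from pandas' `skew`, which applies a sample-size correction; the sample below is made up:

```python
import numpy as np

# made-up sample with a longer tail toward small values
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0])
skew = np.mean((x - x.mean())**3) / np.std(x)**3
print(skew)  # negative: the distribution leans left
```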

And there is much more: for an even richer set of statistical functions, see the `scipy.stats` module.

### Dataframe’s plot:

``````#ipython magic function
%matplotlib inline
iris.boxplot(return_type='axes')
``````
``````<matplotlib.axes._subplots.AxesSubplot at 0x2257ea92898>
``````

``````iris.quantile([0.1, 0.9])
``````
sepal_length sepal_width petal_length petal_width
0.1 4.8 2.50 1.4 0.2
0.9 6.9 3.61 5.8 2.2

# Dealing with empty / invalid data:

Often the data we get is not ideal, and we may find empty or invalid values.
We will explore the Netherlands weather data as an example.

Let us read the weather data and replace empty values with `nan`:

``````to_float = lambda x: float(x.strip() or np.nan)
min_temp = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(12), unpack=True,
                      converters={12: to_float},
                      skiprows=1) * .1

print ('min including nan: ', min_temp.min())
print ('max including nan: ', min_temp.max())
``````
``````min including nan:  nan
max including nan:  nan
``````

### exclude NaN in statistics:

• nanmin
• nanmax
• nanmean
• nanmedian
``````print ('excluding nan: ')
print ('min', np.nanmin(min_temp))
print ('max', np.nanmax(min_temp))
print ('mean', np.nanmean(min_temp))
print ('median', np.nanmedian(min_temp))
``````
``````excluding nan:
min -19.700000000000003
max 23.6
mean 6.627724838066767
median 6.800000000000001
``````
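A minimal standalone illustration of the `nan*` family on a made-up array:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])
print(a.mean())       # nan - NaN propagates through plain reductions
print(np.nanmean(a))  # 2.0 - NaN entries are ignored
```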

## More robust approach to exclude invalid data: Masked Arrays:

`nanmin` and its siblings are just a simpler, shorter shortcut for a more robust way to deal with invalid data: `Masked Arrays`, provided by the `numpy.ma` sub-module.

``````import numpy.ma as ma
inp = np.array([1, 2, 3, -100, 5])
negative = lambda x: x < 0
# mask the negative entries so they are excluded from statistics
mask_inp = ma.masked_array(inp, mask=negative(inp))
print ('negative of the input array : ', negative(inp))
print ('mean of positive numbers: ', mask_inp.mean())
``````
``````negative of the input array :  [False False False  True False]
mean of positive numbers:  2.75
``````

### More features of masked arrays:

1. By using masked arrays we no longer need `nanmin`, ….
We can use the plain `min`, `max`, `mean`, …etc.
``````import numpy.ma as ma

# mask the NaN entries of the temperatures loaded earlier
masked_min = ma.array(min_temp, mask=np.isnan(min_temp))
print ('min without using nanmin', masked_min.min())
``````
``````min without using nanmin -19.700000000000003
``````
2. We can get only the valid values:
``````print ('length all : ', len(masked_min))
print ('length of valid: ', len(masked_min.compressed()))
``````
``````length all :  93272
length of valid:  64224
``````

Another way is to use `masked_invalid`, which masks both `NaN` and infinite values:

``````masked = ma.masked_invalid(min_temp)
masked.min()
``````
``````
``````-19.700000000000003
``````

### Pandas features to deal with invalid data:

Pandas is `NaN`-friendly:
all of its statistical functions ignore `NaN` by default.

``````weather = pd.read_csv('data/KNMI.txt', header=0, usecols=(12, 14), names=["min", "max"],
                      converters={12: to_float, 14: to_float})
weather.head()
``````
min max
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
``````weather['min'].min()
``````
``````-197.0
``````

Note that this value is in tenths of °C, since we did not apply the 0.1 scaling this time.

Pandas has more features to deal with invalid data. One robust function is `fillna`:

``````test_weather = weather.fillna(0)
test_weather.head()
``````
min max
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
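Besides a constant, `fillna` supports other strategies; a small self-contained sketch (the data here is made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'temp': [1.0, np.nan, 3.0, np.nan]})

print(df['temp'].fillna(df['temp'].mean()).tolist())  # [1.0, 2.0, 3.0, 2.0] - fill with column mean
print(df['temp'].ffill().tolist())                    # [1.0, 1.0, 3.0, 3.0] - propagate last valid value
print(df['temp'].dropna().tolist())                   # [1.0, 3.0] - or drop missing rows entirely
```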

### Exercise: Check for global warming:

``````avg_temp, min_temp, max_temp = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(11, 12, 14),
                                           unpack=True,
                                           converters={11: to_float, 12: to_float, 14: to_float},
                                           encoding='latin1', skiprows=1) * .1
``````
``````avg_temp = ma.array(avg_temp, mask=np.isnan(avg_temp))
min_temp = ma.array(min_temp, mask=np.isnan(min_temp))
max_temp = ma.array(max_temp, mask=np.isnan(max_temp))
``````
``````from datetime import datetime as dt
to_year = lambda x: dt.strptime(x, "%Y%m%d").year

years = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(1), unpack=True,
                   converters={1: to_year}, encoding='latin1', skiprows=1)

print ('first year: ', years.min(), ' and last year: ', years.max())
``````
``````first year:  1951.0  and last year:  2019.0
``````

And we will draw a graph of the mean temperature by year.

``````# +1 so the last year is included in the range
year_range = range(int(years.min()), int(years.max()) + 1)

print ('range of years: ', year_range)
avg_of_avg_temp_by_year = [avg_temp[np.where(years == year)].mean() for year in year_range]
``````
``````range of years:  range(1951, 2020)
``````

#### Alternative way to write the above list comprehension:

``````avg_of_avg_temp_by_year = []
for year in year_range:
    indices = np.where(years == year)
    avg_of_avg_temp_by_year.append(avg_temp[indices].mean())
``````
``````import matplotlib.pyplot as plt
plt.plot(year_range, avg_of_avg_temp_by_year, 'r-', label='Yearly Averages')
plt.plot(year_range, np.ones(len(avg_of_avg_temp_by_year)) * np.mean(avg_of_avg_temp_by_year))
plt.legend(prop={'size': 'x-small'})
plt.show()
``````

# EDA: Beyond Basic Statistics

Descriptive statistics are the basis of deeper, more comprehensive analysis. They are the basic tool of `EDA`, Exploratory Data Analysis. EDA is needed to understand a dataset better, check its features, and get a preliminary idea about the data.

## Correlation Coefficient

`Numpy` has functionality to study the `Correlation coefficient` between two variables:

``````sun_radiation = np.loadtxt('./data/KNMI.txt', delimiter=',', usecols=(20), unpack=True,
                           converters={20: to_float},
                           encoding='latin1', skiprows=1)
# mask invalid values, then correlate sunshine with the average temperature
sun_radiation = ma.masked_invalid(sun_radiation)
corr = ma.corrcoef(sun_radiation, avg_temp)[0, 1]
print ('Correlation Coefficient: ', corr)
``````
``````Correlation Coefficient:  0.6171336988259747
``````
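As a standalone illustration, `np.corrcoef` returns the full correlation matrix; on made-up, perfectly linearly related arrays the off-diagonal entry is 1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1          # perfectly linearly related to x
r = np.corrcoef(x, y)  # 2x2 correlation matrix
print(r[0, 1])         # 1.0, the maximum positive correlation
```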

## The Covariance Matrix:

The covariance matrix captures the relationship between every pair of features. Examining it (or its normalized form, the correlation matrix) is usually an early step in data analysis: concentrate on the correlated features, and eliminate the unrelated ones.

This is part of what is called `Dimensionality Reduction`.

An easy way to get the correlation matrix (the covariance matrix normalized to the range [-1, 1]):

``````iris.corr()
``````
sepal_length sepal_width petal_length petal_width
sepal_length 1.000000 -0.109369 0.871754 0.817954
sepal_width -0.109369 1.000000 -0.420516 -0.356544
petal_length 0.871754 -0.420516 1.000000 0.962757
petal_width 0.817954 -0.356544 0.962757 1.000000
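For the raw (unnormalized) covariance matrix itself, NumPy's `np.cov` can be used directly; a small self-contained sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

cov = np.cov(x, y)        # 2x2 sample covariance matrix (ddof=1)
corr = np.corrcoef(x, y)  # the same relationship normalized to [-1, 1]
print(cov[0, 1], corr[0, 1])
```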
``````corrdata = iris.corr()
img = plt.matshow(corrdata, cmap=plt.cm.rainbow)
plt.colorbar(img, ticks=[-1, 0, 1], fraction=0.045)
corrvalues = corrdata.values
# annotate each cell with its correlation value
for x in range(corrvalues.shape[0]):
    for y in range(corrvalues.shape[1]):
        plt.text(x, y, "{0:.2f}".format(corrvalues[x, y]), size=12, color='black', ha='center', va='center')
``````

### Calculating Correlation between features based on custom factors

``````pd.crosstab(iris['petal_length'] > 3.758667, iris['petal_width'] > 1.198667)
``````
petal_width   False  True
petal_length
False            56     1
True              4    89

### Categorical data:

``````iris['species'].unique()
``````
``````array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
``````
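Counts per category can be obtained with pandas' `value_counts` or with NumPy's `np.unique`; a small self-contained sketch on a made-up array of labels:

```python
import numpy as np

species = np.array(['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica'])
labels, counts = np.unique(species, return_counts=True)
print(labels.tolist(), counts.tolist())  # ['setosa', 'versicolor', 'virginica'] [2, 1, 3]
```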

## All in one: pandas-profiling:

The `pandas-profiling` package performs high-level statistical profiling of the data and generates a profile report.
For each column it computes: quantiles, descriptive statistics, the most frequent values, a histogram, missing values, correlations…

Examples: