Gradient Descent is one of the basic iterative optimization algorithms used in Machine Learning, and it is deeply rooted in linear algebra and calculus.
It is the first algorithm explained in Andrew Ng's Machine Learning course.
If you, like me, got lost in the math when Andrew explained it, you will find this post useful.

Introduction

I am trying here to explain what Andrew explained, but in a simpler way.
Besides that, we are going to cover Gradient Descent and build the algorithm from scratch using Python.

Prerequisites:

You need the following prerequisite skills:

• Beginner skills in Python, especially NumPy and Pandas.
• Some basic high school math.

Dataset and Code:

The dataset represents house prices in Boston.
You can get the fields from here.

You can find the code and the data in the repository:
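As a quick sketch of loading the data with Pandas: the snippet below writes a tiny stand-in sample instead of reading the repository's file, and the target column name `MEDV` is an assumption; check the repository for the actual filename and columns.

```python
import numpy as np
import pandas as pd

# A tiny stand-in sample for the real file. CRIM, DIS, and RM are the
# fields used later in this post; MEDV (median house value) is assumed
# to be the price column.
with open("housing_sample.csv", "w") as f:
    f.write(
        "CRIM,DIS,RM,MEDV\n"
        "0.00632,4.0900,6.575,24.0\n"
        "0.02731,4.9671,6.421,21.6\n"
        "0.02729,4.9671,7.185,34.7\n"
    )

df = pd.read_csv("housing_sample.csv")

# input features and the target (house price)
X = df[["CRIM", "DIS", "RM"]].to_numpy()
y = df["MEDV"].to_numpy()

# prepend a column of ones so theta_0 acts as the intercept
X = np.column_stack([np.ones(len(X)), X])
```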

The math behind machine learning:

Let’s start from the data we have. From the data set, the price of a house could be affected by many inputs, e.g. number of rooms (RM), crime rate (CRIM), and distance to the working centers (DIS).

The machine learning model will be a mathematical equation like this:

$price = \theta_0 + \theta_1 * crim + \theta_2 * dis + \theta_3 * rm$

We can generalize the above statement to build a function that takes the input values and gives the output:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$

Linear Algebra to help:

The above statement is a linear equation, and to solve it we can use Linear Algebra.

From our data set we can get many equations as follows:

$price_1 = \theta_0 + \theta_1 * crim_1 + \theta_2 * dis_1 + \theta_3 * rm_1$
$price_2 = \theta_0 + \theta_1 * crim_2 + \theta_2 * dis_2 + \theta_3 * rm_2$
$price_3 = \theta_0 + \theta_1 * crim_3 + \theta_2 * dis_3 + \theta_3 * rm_3$

There are two ways to solve the above equations and find the values of $\theta$:

1. Normal Equation using Matrices.
2. Gradient Descent.

We are going to cover both of these methods.

1. Using Matrices (Normal Equation):

To solve these equations, we can use matrices in linear algebra.
We call this method Normal Equation.
We know we can represent the above as:

$\begin{bmatrix}1 & crim_1 & dis_1 & rm_1 \newline 1 & crim_2 & dis_2 & rm_2 \newline 1 & crim_3 & dis_3 & rm_3 \newline & & \vdots & \newline 1 & crim_n & dis_n & rm_n \end{bmatrix}\begin{bmatrix}\theta_0 \newline \theta_1 \newline \theta_2 \newline \theta_3\end{bmatrix} = \begin{bmatrix}price_1 \newline price_2 \newline price_3 \newline \vdots \newline price_n\end{bmatrix}$

From linear algebra, when we have matrices $X$ and $y$ such that $X\theta = y$, the solution for theta ($\theta$) is:

$\theta = (X^T X)^{-1} X^T y$
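A minimal sketch of the Normal Equation in NumPy, using a toy $X$ and $y$ built from known $\theta$ values (the data here is illustrative, not the Boston dataset):

```python
import numpy as np

# toy design matrix: a leading column of ones for theta_0 plus two features
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0],
    [1.0, 4.0, 3.0],
])
# targets generated from known parameters theta = [5, 2, -1]
y = X @ np.array([5.0, 2.0, -1.0])

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # recovers [5, 2, -1] up to floating-point error
```

In practice, `np.linalg.lstsq` or `np.linalg.pinv` are numerically safer than computing an explicit inverse.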

2. Gradient Descent:

Gradient Descent is based on these high-level steps:

1. Selecting arbitrary initial values of $\theta$.
2. Then measuring the mean error between the real values and the values calculated from those initial $\theta$. To calculate the error we use an equation called the Squared Error, or Mean Squared Error. We can use other error equations, but this is one of the most efficient choices.
\begin{align}J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2\end{align}.
We call the above the Cost Function.
3. Trying to find the $\theta$ that minimizes that function.

Step 1: The Cost Function

The next step in gradient descent is to enhance the model by minimizing the cost function, making it as close as possible to zero. To do that, and to make the model fit the reality, we will use calculus.
Without going into the details, by taking the partial derivative of the cost function with respect to each $\theta$, calculus gives us these equations:

$\dfrac{\partial J(\theta)}{\partial \theta_0} = \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)$

$\dfrac{\partial J(\theta)}{\partial \theta_1} = \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right) x_{i,1}$

We can generalize them as:

$\dfrac{\partial J(\theta)}{\partial \theta_j} = \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right) x_{i,j}$

Using matrices, we can calculate the cost function for all rows at once:

$J(\theta) = \dfrac{1}{2m} (X\theta - y)^T (X\theta - y)$

Using Python:

# hypothesis: predicted values for every row at once
predictions = X @ theta.T
# errors between the predictions and the real values
deltas = predictions - y
delta_power = np.power(deltas, 2)
# cost J(theta): sum of squared errors divided by (2 * m)
J = np.sum(delta_power) / (2 * len(X))


Step 2: Minimize the theta parameters

As we said, we want to repeatedly apply this update equation to every $\theta$ simultaneously:

$\theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right) x_{i,j}$

where $\alpha$ is the learning rate. Using matrices:

$\theta := \theta - \dfrac{\alpha}{m} X^T (X\theta - y)$

Using Python:

predictions = X @ theta.T
deltas = predictions - y
# update every theta simultaneously; alpha is the learning rate
theta = theta - (alpha / len(X)) * (X.T @ deltas)
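Putting both steps together, a minimal gradient-descent loop could look like this. The synthetic data, `alpha`, and the iteration count here are illustrative choices, not values from the repository:

```python
import numpy as np

def gradient_descent(X, y, alpha, iterations):
    """Fit theta by repeating the cost and update steps above."""
    m = len(X)
    theta = np.zeros(X.shape[1])  # step 1: arbitrary initial thetas
    costs = []
    for _ in range(iterations):
        predictions = X @ theta                       # hypothesis h_theta(x)
        deltas = predictions - y                      # errors vs. real values
        costs.append(np.sum(deltas ** 2) / (2 * m))   # cost J(theta)
        theta = theta - (alpha / m) * (X.T @ deltas)  # simultaneous update
    return theta, costs

# synthetic data generated from known parameters [5, 2, -1]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50), rng.uniform(0, 1, 50)])
y = X @ np.array([5.0, 2.0, -1.0])

theta, costs = gradient_descent(X, y, alpha=0.5, iterations=5000)
print(theta)      # close to [5, 2, -1]
print(costs[-1])  # near zero
```

Watching `costs` shrink toward zero is a handy sanity check: if it grows instead, `alpha` is too large.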