In machine learning, we often perform what we call parameter estimation: finding the values of a model's parameters, such as the weights assigned to each feature of the input data.
For example, in a simple linear model, we use the equation y = mx + c, where m and c are the parameters to be estimated. Different parameter values give different models, and each model produces different estimates of the data.
Maximum likelihood estimation (MLE) is a technique for estimating parameter values.
MLE Parameter Estimation
Whenever we create a model with certain parameters, the outputs of the model (or the prediction) can be plotted as a probability distribution as well.
What MLE does is try to make the distribution of the model as close as possible to the distribution of the observed data, by choosing the parameter values under which the observed data is most probable. Intuitively, this makes the model more accurate, as it becomes more representative of the actual data.
For example, given the following training data points:
We want to find out which of the graphs below has the highest probability of producing those points. Each graph has different parameter values, so each one sits in a different position on the plot.
Just by visual inspection, we can see that the blue line is the graph whose parameters best produce those data points. But of course a machine cannot inspect anything visually; it can only do the maths.
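To make the "only maths" part concrete, here is a minimal sketch of how a program could pick the best line without looking at a plot. It assumes Gaussian noise with a fixed standard deviation of 1 around each line. The data points, the three candidate (m, c) pairs, and the function name gaussian_log_likelihood are all made up for illustration; only the "blue" label echoes the example above.

```python
import numpy as np

def gaussian_log_likelihood(x, y, m, c, sigma=1.0):
    """Log of the joint probability of y, assuming y = m*x + c plus Gaussian noise."""
    residuals = y - (m * x + c)
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma**2)
            - np.sum(residuals**2) / (2 * sigma**2))

# Made-up training points that lie close to y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Hypothetical candidate lines, one (m, c) pair per graph
candidates = {"blue": (2.0, 1.0), "red": (1.0, 3.0), "green": (3.0, -2.0)}

# Score every candidate and keep the one with the highest log-likelihood
scores = {name: gaussian_log_likelihood(x, y, m, c)
          for name, (m, c) in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The log of the probability is used instead of the probability itself because it is easier to compute and has its maximum at the same parameter values.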
Calculating the MLE
We want to calculate the total probability of observing all the generated data, that is, the joint probability of all the data points.
For a single data point under an assumed Gaussian distribution, we have the following equation:
For 3 data points, we have the following joint probability:
This can be extended to n data points.
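Written out under the Gaussian assumption, these probabilities take the standard form below. The symbols are assumptions made here for illustration: μ is the mean, σ is the standard deviation, and x_1, ..., x_n are the data points. The joint probability is a simple product only if the data points are assumed to be independent of each other.

```latex
% One data point x
P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
    \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

% Three independent data points
P(x_1, x_2, x_3; \mu, \sigma) = \prod_{i=1}^{3} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

% n independent data points
P(x_1, \dots, x_n; \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}
    \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)
```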
To calculate the MLE of the parameters, we need to find the parameter values that give the maximum value of this joint probability. To find the maximum, we differentiate the equation with respect to each parameter, set the derivative to 0, and solve for the parameters.
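For a Gaussian, this calculation has a well-known closed-form result. Working with the log of the joint probability (which peaks at the same parameter values), setting the derivatives to 0 gives the sample mean as the estimate of the mean, and the square root of the average squared deviation (dividing by n, not n - 1) as the estimate of the standard deviation. The sketch below checks this numerically on three made-up data points. The names neg_log_likelihood, mu_hat and sigma_hat are illustrative choices.

```python
import numpy as np

def neg_log_likelihood(data, mu, sigma):
    """Negative log of the joint Gaussian probability of independent data points."""
    n = len(data)
    return (0.5 * n * np.log(2 * np.pi * sigma**2)
            + np.sum((data - mu)**2) / (2 * sigma**2))

# Made-up observations
data = np.array([9.0, 9.5, 11.0])

# Closed-form solutions from setting the derivatives to zero
mu_hat = data.mean()
sigma_hat = np.sqrt(np.mean((data - mu_hat)**2))  # divides by n, not n - 1

# Sanity check: nudging either parameter away from the estimate
# should make the negative log-likelihood larger (i.e. the fit worse)
eps = 1e-3
base = neg_log_likelihood(data, mu_hat, sigma_hat)
for d_mu, d_sigma in [(eps, 0.0), (-eps, 0.0), (0.0, eps), (0.0, -eps)]:
    assert neg_log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma) > base

print(mu_hat, sigma_hat)
```

Minimizing the negative log-likelihood is the same as maximizing the probability; it is written this way only because most numerical tools are built around minimization.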
Extending the MLE to the least squares method
When the distribution is Gaussian, finding the MLE leads to the same solution as the least squares method.
For least squares estimation, we want to find the line that minimizes the total squared distance between the data points and the regression line.
When the data distribution is assumed to be Gaussian, the maximum probability is found when the data points are as close as possible to the mean value.
Since the Gaussian distribution is symmetric and its density falls off with the squared distance from the mean, this is equivalent to minimizing the squared distance between the data points and the mean value.
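This equivalence can be checked numerically. The sketch below fits a line to made-up data in two ways: with ordinary least squares via np.polyfit, and by a crude grid search for the (m, c) pair with the highest Gaussian log-likelihood. The noise standard deviation is assumed fixed at 1 and the grid ranges are arbitrary choices for this example. Up to the grid resolution, both approaches should land on the same parameters.

```python
import numpy as np

# Made-up training points that lie close to y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# 1) Least squares fit of a straight line (returns slope first, then intercept)
m_ls, c_ls = np.polyfit(x, y, deg=1)

# 2) Maximum likelihood by grid search, assuming Gaussian noise with sigma = 1
sigma = 1.0
m_grid = np.linspace(1.5, 2.5, 201)  # step of 0.005
c_grid = np.linspace(0.5, 1.5, 201)
M, C = np.meshgrid(m_grid, c_grid, indexing="ij")

# Residuals for every (m, c) pair on the grid: shape (201, 201, 5)
residuals = y - (M[..., None] * x + C[..., None])
log_lik = (-0.5 * len(y) * np.log(2 * np.pi * sigma**2)
           - np.sum(residuals**2, axis=-1) / (2 * sigma**2))

# Pick the grid point with the highest log-likelihood
i, j = np.unravel_index(np.argmax(log_lik), log_lik.shape)
m_mle, c_mle = m_grid[i], c_grid[j]

print("least squares:", m_ls, c_ls)
print("grid MLE:     ", m_mle, c_mle)
```

The only term in the log-likelihood that depends on m and c is the sum of squared residuals with a minus sign in front, so maximizing one is the same as minimizing the other.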