Probability plays a central role in artificial intelligence and deep learning. What is probability? It is the science of quantifying uncertain things. Deep learning systems use data to learn patterns in that data, and whenever data enters a system, uncertainty enters with it. Wherever there is uncertainty, probability becomes relevant.

Introducing probability gives an AI system a kind of common sense; without it, the system would be of little use. Real-world data is chaotic, and because artificial intelligence and deep learning systems are built on real-world data, they need a tool to handle that chaos.

The probability and statistics presented here are simplified. Both are vast research subjects in their own right, but the concepts written here are kept simple and are sufficient for a deep learning aspirant.

Foundations of probability:

When you begin deep learning, the first example an educator usually gives you is the MNIST handwritten digit recognition task: identifying handwritten digits and labeling them.

Sample space: The sample space is the set of all possible outcomes of an experiment.

Random variable: A variable that can take on different values of the sample space at random.

Probability distribution: A probability distribution is a description of how likely the random variable is to take on each value of the sample space.

Normalization: The values of a probability distribution must lie between 0 and 1 and must add up to 1.0; this property of summing to 1.0 is called normalization. An impossible event has probability 0 and a certain event has probability 1.
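As a quick illustration, here is a minimal Python sketch (the fair six-sided die is a made-up example) that checks the normalization property:

```python
# Probability distribution for a fair six-sided die (hypothetical example).
die_probs = [1 / 6] * 6

# Every probability must lie between 0 and 1 ...
assert all(0.0 <= p <= 1.0 for p in die_probs)

# ... and together they must sum (normalize) to 1.0.
assert abs(sum(die_probs) - 1.0) < 1e-9
print("Valid probability distribution")
```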

Three important definitions:

At the start of any probability course, you will learn these three basic and important concepts: conditional probability, marginal probability, and joint probability.

Joint probability: The probability of two events occurring simultaneously, that is, at the same time. It is denoted by p(y and x) or P(y = y, x = x).

Example: The probability of seeing the moon and the sun at the same time is very low.

Conditional probability: The probability of some event 'Y' occurring given that another event 'X' has already happened. It is denoted as P(y = y | x = x) = P(y = y, x = x) / P(x = x). Since the conditioning event x has occurred, its probability cannot be zero.

Example: The probability of drinking water given that you have just eaten food is high.

Marginal probability: The probability of a subset of random variables from a superset of them, obtained by summing over the values of the remaining variables.

Example: The probability of a person being short is the sum of the probability of a man being short and the probability of a woman being short. (Here the height variable is kept constant while the gender variable is summed over.)
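To tie the three definitions together, here is a minimal sketch (the gender/height counts are invented for illustration) that computes a joint, a marginal, and a conditional probability from a small table:

```python
# Hypothetical counts of people by gender and height.
counts = {
    ("man", "short"): 10, ("man", "tall"): 40,
    ("woman", "short"): 25, ("woman", "tall"): 25,
}
total = sum(counts.values())  # 100 people in total

# Joint probability: P(gender = woman, height = short)
p_woman_short = counts[("woman", "short")] / total

# Marginal probability: P(height = short), summing the joint
# probabilities over every value of the gender variable.
p_short = sum(c for (g, h), c in counts.items() if h == "short") / total

# Conditional probability: P(height = short | gender = woman)
p_woman = sum(c for (g, h), c in counts.items() if g == "woman") / total
p_short_given_woman = p_woman_short / p_woman

print(p_woman_short, p_short, p_short_given_woman)  # 0.25 0.35 0.5
```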

Bayes' theorem: Bayes' theorem describes the probability of an event based on prior knowledge of conditions related to that event: P(y = y | x = x) = P(x = x | y = y) P(y = y) / P(x = x).

Bayes' theorem gives a precise rule for updating beliefs in probability. It also underlies one of the simplest machine learning algorithms, the naive Bayes algorithm.
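As a sketch of how the theorem is applied (the disease-test numbers below are invented for illustration), Bayes' rule turns P(positive | disease) into P(disease | positive):

```python
# Hypothetical numbers for a medical-test example.
p_disease = 0.01            # prior: P(disease)
p_pos_given_disease = 0.99  # likelihood: P(positive | disease)
p_pos_given_healthy = 0.05  # false-positive rate: P(positive | no disease)

# Evidence: P(positive), marginalizing over both possibilities.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.167
```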

Measures of central tendency and variation:

Mean: The arithmetic average of the data is called the mean.

Median: The middle value of the data is known as the median.

Mode: The mode is the most frequently occurring value in the data.
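All three measures are available in Python's standard library; here is a minimal sketch with a made-up sample:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical sample

print(statistics.mean(data))    # arithmetic average: 5.0
print(statistics.median(data))  # middle value: 4.0
print(statistics.mode(data))    # most frequent value: 3
```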

Expected value: The expected value of a random variable X with respect to a distribution P(X = x) is the mean value of X when x is drawn from P: E[X] = Σ x · P(X = x), summed over all values x. The statistical mean of a dataset is equal to this expectation.

Variance: Variance is the measure of variability in the data from the mean value.

For a random variable X, the variance is given by: Var(X) = E[(X − E[X])²].

Standard deviation: The standard deviation is the square root of the variance.

Covariance: Covariance shows how two variables are linearly related to each other: Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
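A minimal NumPy sketch of these three measures (the arrays are made-up samples):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])  # roughly increases with x

print(np.var(x))     # variance: mean squared deviation from the mean
print(np.std(x))     # standard deviation: square root of the variance
print(np.cov(x, y))  # covariance matrix; the off-diagonal entry is Cov(x, y)
```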

Binomial distribution: When each trial can have only two outcomes (success and failure), the number of successes in a fixed number of independent trials is said to follow a binomial distribution. The binomial distribution is for discrete random variables.
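For example, the number of heads in 10 fair coin flips follows a binomial distribution; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Number of successes (heads) in n=10 trials with success probability p=0.5,
# sampled 100,000 times.
samples = rng.binomial(n=10, p=0.5, size=100_000)

print(samples.mean())  # close to the theoretical mean n * p = 5.0
```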

Continuous distributions: For continuous random variables, we describe the distribution using a probability density function, denoted by p(x).

Its integral over the whole sample space equals 1: ∫ p(x) dx = 1.

Uniform distribution: The uniform distribution is the simplest continuous distribution, with every element of the sample space being equally likely.

Normal distribution: The normal distribution, also known as the Gaussian distribution, is the most important of all distributions. In the absence of prior knowledge about what form a distribution over the real numbers should take, it is a good default choice, because it has the highest entropy among distributions with a given mean and variance.

If a normal distribution has mean 0 and standard deviation 1, it is said to be the standard normal distribution.

[Figure: the famous bell curve of the normal distribution]
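A minimal NumPy sketch drawing samples from the standard normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 100,000 samples from the standard normal distribution
# (mean 0, standard deviation 1).
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

print(samples.mean())  # close to 0
print(samples.std())   # close to 1
```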

In machine learning, you often come across the words 'normalization' and 'standardization'. The process described above for obtaining a standard normal distribution (subtracting the mean and dividing by the standard deviation) is known as standardization, whereas the process of restricting the range of dataset values to between 0.0 and 1.0 is called normalization. However, these terms are often used interchangeably.
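A minimal NumPy sketch of both transformations (the feature values are made up):

```python
import numpy as np

x = np.array([3.0, 6.0, 9.0, 12.0])  # hypothetical feature values

# Standardization: shift to mean 0 and scale to standard deviation 1.
standardized = (x - x.mean()) / x.std()

# Normalization (min-max scaling): squash values into the range [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized)  # mean 0, std 1
print(normalized)    # values between 0.0 and 1.0
```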

Model accuracy measurement tools:

To measure the performance of a deep learning model, we use several concepts known as metrics. Knowing these concepts is essential.

You can understand accuracy intuitively: it is the proportion of correct results among the total obtained results.

Accuracy: It is a very simple measurement, and it can sometimes be misleading. In some cases, higher accuracy doesn't mean that our model is working correctly. To understand this, go through the following definitions,

·         True Positives (TP): Number of positive examples which are labeled as such.

·         False Positives (FP): Number of negative examples which are labeled as positive.

·         True Negatives (TN): Number of negative examples which are labeled as such.

·         False Negatives (FN): Number of positive examples which are labeled as negative.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Confusion matrix: It is a matrix that contains the TP, FN, TN, and FP values, with actual classes along one axis and predicted classes along the other.
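A minimal sketch that derives these counts, the confusion matrix, and the accuracy from hypothetical labels:

```python
# Hypothetical true labels and model predictions for a binary task.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion = [[tn, fp],
             [fn, tp]]

accuracy = (tp + tn) / (tp + fp + tn + fn)
print(confusion, accuracy)  # [[3, 1], [1, 3]] 0.75
```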

Precision and recall: To get a fuller picture, we can go for two more metrics: precision and recall.

Precision tells you how many of the selected objects were correct: Precision = TP / (TP + FP).

Recall tells you how many of the correct objects were selected: Recall = TP / (TP + FN).

When precision and recall both come out as 0.0, it indicates that the model is performing very poorly, even if its raw accuracy looks acceptable.

F1 score: It is the harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall).

Remember that an F1 score of 0 is the worst and 1 is the best. The misleading behavior of the accuracy metric can be resolved by using the F1 score instead.
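Continuing with the counts from the confusion-matrix sketch above, precision, recall, and F1 can be computed directly:

```python
tp, fp, fn = 3, 1, 1  # counts from the confusion-matrix sketch above

precision = tp / (tp + fp)  # fraction of selected items that are correct
recall = tp / (tp + fn)     # fraction of correct items that were selected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)  # 0.75 0.75 0.75
```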

Mean absolute error: The mean absolute error is the average of the absolute differences between the predicted values and the original values.

Mean squared error: The mean squared error is the average of the squared differences between the predicted values and the original values. It is often preferred because it makes the gradients easier to compute.
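A minimal NumPy sketch of both error metrics (the targets and predictions are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # hypothetical targets
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

mae = np.mean(np.abs(y_pred - y_true))  # mean absolute error
mse = np.mean((y_pred - y_true) ** 2)   # mean squared error

print(mae, mse)  # 0.875 1.3125
```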

Receiver operating characteristic (ROC) curve: It is a graph that shows the performance of classification models, like our digit recognizer example. It has two parameters, as follows:

1.       True positive rate (TPR), which is the same as recall.

2.      False positive rate (FPR), which is 1 − specificity.

These two are plotted against each other to obtain the ROC curve (several points on the curve are obtained by changing the classification threshold and predicting the results again repeatedly). The area under this ROC curve is a measure of accuracy.

Interpretation of the area under the curve (AUC): When the AUC is equal to 1.0, the model is perfect; when the AUC is equal to 0.5, the model has no discriminative ability, no better than random guessing. If AUC = 0.0, the model is reciprocating the results (classifying 1's as 0's and 0's as 1's).
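Assuming scikit-learn is installed, here is a minimal sketch of computing the ROC points and the AUC (the labels and scores are invented):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]

# FPR and TPR at every classification threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 1.0 is perfect, 0.5 is random guessing.
print(roc_auc_score(y_true, y_score))
```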

Random process:

A collection of random variables indexed by some set of values is called a random process. A stochastic process is a mathematical model for a phenomenon that does not proceed in a manner predictable to the observer.

Here, the outcome of the next event does not depend on the outcome of the current event. For example, a series of coin tosses.
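A minimal NumPy sketch of such a process:

```python
import numpy as np

rng = np.random.default_rng(0)

# A series of 10 fair coin tosses: each outcome is independent of
# the previous one (1 = heads, 0 = tails).
tosses = rng.integers(0, 2, size=10)
print(tosses)
```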

Probabilistic programming:

A new paradigm of programming has evolved, referred to as probabilistic programming. These languages and libraries help to model Bayesian-style machine learning. It is an exciting research field supported by both the AI community and the software engineering community. These languages readily support probabilistic functions and models such as Gaussian models, Markov models, etc.
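As a hedged sketch of what this looks like (assuming the PyMC library is installed; the coin-toss data is invented), a Bayesian model of a coin's bias can be written in a few lines:

```python
import pymc as pm

# Hypothetical observations: 1 = heads, 0 = tails.
data = [1, 0, 1, 1, 1, 0, 1, 1]

with pm.Model():
    # Prior belief about the coin's bias.
    p = pm.Beta("p", alpha=1.0, beta=1.0)
    # Likelihood of the observed tosses.
    pm.Bernoulli("obs", p=p, observed=data)
    # Draw posterior samples of p.
    trace = pm.sample(1000)
```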

Probability and statistics have a strong significance in artificial intelligence. It is a vast topic, and we hope you found this simplified overview useful.

Thanks for reading!