Some commonly asked interview questions in Statistics / Machine Learning / Artificial Intelligence
What is the difference between Bayesian and Frequentist approach?
The frequentist approach treats the parameter (e.g., a population mean or proportion) as an unknown but fixed quantity. It does not have a probability distribution because, in this view, the parameter is not random; only the data are random, due to sampling variability. Statistical inference is made by evaluating how likely the observed data are, given different hypothetical values of the parameter.
The Bayesian approach treats the parameter as a random variable. It assigns a prior distribution to represent beliefs or knowledge about the parameter before observing data. After seeing the data, these beliefs are updated using Bayes’ theorem, resulting in a posterior distribution. This posterior captures our updated uncertainty about the parameter, and we can compute probabilities (e.g., that the parameter lies in a given range).
What is the difference between a confidence interval and a credible interval?
A confidence interval reflects the frequentist interpretation of uncertainty about a parameter’s value. It provides a range of values calculated from sample data, using a method designed to capture the true parameter value a certain proportion of the time over repeated sampling. For example, if we calculate a 95% confidence interval, we are saying that if we repeated the procedure 100 times on fresh samples, the resulting intervals would capture the true value of the parameter about 95 times. The confidence is in the method, not in the specific interval once it is calculated. For any given interval from one sample, the true parameter either is or is not inside it; from a frequentist standpoint there is no probability involved once the interval has been observed.
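To make the repeated-sampling interpretation concrete, here is a minimal simulation sketch (using made-up normal data and SciPy's t-interval, not anything from the original question) that checks how often a 95% interval covers the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, sigma, n, n_repeats = 5.0, 2.0, 30, 10_000

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, sigma, size=n)
    # 95% t-interval for the mean computed from this one sample
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    covered += (lo <= true_mean <= hi)

print(f"empirical coverage over {n_repeats} repeats: {covered / n_repeats:.3f}")
```

The printed coverage should land close to 0.95, which is exactly the sense in which the method, rather than any single interval, carries the 95% guarantee.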
A credible interval, in contrast, reflects the Bayesian interpretation of uncertainty about a parameter’s value. It gives a range within which the parameter lies with a certain probability, given the observed data and prior beliefs. For example, a 95% credible interval means that, based on the posterior distribution (the reconciliation of the prior with the likelihood from the data), there is a 95% probability that the parameter lies within the interval. This is fundamentally different from a confidence interval in the frequentist framework, where the parameter is fixed and the interval is random. In Bayesian statistics, the parameter is treated as a random variable, and the credible interval expresses actual uncertainty about its value after seeing the data.
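As a small illustrative sketch (the counts and the uniform Beta(1, 1) prior are hypothetical choices), a 95% credible interval for a proportion can be read directly off the Beta posterior:

```python
from scipy import stats

# Hypothetical data: 18 successes in 50 trials, with a uniform Beta(1, 1) prior
successes, trials = 18, 50
posterior = stats.beta(1 + successes, 1 + trials - successes)

# Equal-tailed 95% credible interval: the central 95% of the posterior
lo, hi = posterior.ppf([0.025, 0.975])
print(f"95% credible interval for the proportion: ({lo:.3f}, {hi:.3f})")
```

Here the probability statement is about the parameter itself, conditional on the data and the prior, which is exactly what the frequentist interval cannot claim.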
What is a p-value?
The p-value (probability value) is the probability of observing a test statistic as extreme as, or more extreme than, the one we actually observed, assuming the null hypothesis is true. The p-value is always between 0 and 1. If the p-value is close to 0 (very low), we say the data do not seem to agree with the null hypothesis (strong evidence against the null hypothesis), and if the p-value is high, we say the data seem to agree with the null hypothesis (no evidence against the null hypothesis).
A very low p-value (close to 0) does not mean the null hypothesis is false; it simply means the data do not agree with the null hypothesis. Similarly, a very high p-value (close to 1) does not mean the null hypothesis is true; it simply means the data agree with it. It is always possible that we have drawn an unrepresentative sample, in which case our conclusion will be incorrect.
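A quick sketch of the mechanics, using a one-sample t-test on simulated data (the sample and the null value here are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical sample; null hypothesis H0: the population mean is 0
sample = rng.normal(loc=0.5, scale=1.0, size=40)

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
# A small p-value says data this extreme would be unlikely if H0 were true;
# it does not by itself prove H0 false (the sample could be unrepresentative).
```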
What is multicollinearity and how can we overcome it?
Multicollinearity is a situation in which two or more explanatory variables (predictors) are highly linearly correlated with each other. The correlated variables provide similar information about the response, which suggests it may be better to include just one of them in the model: adding another (and making the model more complex) provides little additional information.
One way to identify multicollinearity is through the variance inflation factor (VIF). \(VIF = 1\) indicates no multicollinearity, \(1 < VIF < 5\) indicates moderate multicollinearity which is usually acceptable, \(VIF > 5\) is a cause for concern, and \(VIF > 10\) indicates a serious multicollinearity problem. Another way is to look at the linear correlation between pairs of variables: highly correlated pairs indicate potential multicollinearity.
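As a rough sketch of how VIFs might be computed in practice (using statsmodels on made-up data in which two predictors are nearly collinear):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
# Made-up predictors: x2 is almost a copy of x1, so both should have high VIFs
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(X.values, i), 2))
```

On data like this, x1 and x2 should show very large VIFs while x3 stays near 1.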
There are multiple ways to tackle multicollinearity. The simplest, as noted above, is to drop one of the highly correlated variables; another popular approach is to compute the principal components of the predictors and use the leading components as predictors in place of the original variables.
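A minimal sketch of the principal-component idea (often called principal component regression), assuming scikit-learn and the same kind of synthetic collinear predictors as above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)          # highly correlated with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + x3 + rng.normal(scale=0.5, size=200)

# Regress y on the leading (mutually uncorrelated) principal components
# instead of the original correlated predictors.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("training R^2:", round(pcr.score(X, y), 3))
```

Because the principal components are uncorrelated by construction, the collinearity problem disappears, at the cost of coefficients that are harder to interpret.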
What is the difference between parametric and non-parametric methods?
The basic difference between parametric and non-parametric methods lies in the distributional assumptions about the data. Parametric methods make strong assumptions about the data, whereas in non-parametric methods the distributional assumptions are minimal. A parametric model will not give reliable results if its assumptions are not satisfied.
As an example, linear regression (a parametric method) between a predictor variable (X) and a response (Y) assumes a linear relationship between the two variables and normality of the residuals. The model will not perform well if these assumptions are not satisfied. In contrast, a decision tree or random forest (non-parametric methods) makes no such distributional assumptions.
However, the parameters of a parametric model can be interpreted in the context of the data, whereas the hyper-parameters of a non-parametric model cannot.
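A small illustrative sketch of this contrast, using scikit-learn on simulated data (the coefficients and hyper-parameter values below are arbitrary choices, not recommendations):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Parametric: each coefficient is interpretable as the expected change in y
# for a one-unit change in the corresponding predictor.
lin = LinearRegression().fit(X, y)
print("linear regression coefficients:", np.round(lin.coef_, 2))

# Non-parametric: no distributional assumptions, but hyper-parameters such as
# n_estimators describe the algorithm, not the data.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("random forest training R^2:", round(rf.score(X, y), 3))
```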
What is the difference between probability and likelihood?
Probability is the chance of observing an event (or data) given a certain model (or parameters), whereas likelihood measures how plausible different models (or parameter values) are, given that the event (or data) has already been observed.
Suppose we denote the parameters by \(\theta\) and the data by \(D\). Then probability is \(p(D|\theta)\), whereas likelihood is \(L(\theta|D)\). In probability, the parameter \(\theta\) is fixed and the data \(D\) are random, whereas in likelihood, the data \(D\) are fixed and \(\theta\) is varied over candidate values.
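As a toy illustration (a hypothetical coin-flip example, not part of the original question), the same binomial formula can be read either way:

```python
import numpy as np
from scipy import stats

heads, flips = 7, 10   # hypothetical observed data D

# Probability: theta is fixed (a fair coin), the data are random
print("P(7 heads | theta = 0.5) =", stats.binom.pmf(heads, flips, 0.5))

# Likelihood: the data are fixed, and we evaluate L(theta | D) across thetas
thetas = np.linspace(0.01, 0.99, 99)
likelihood = stats.binom.pmf(heads, flips, thetas)
print("theta maximising the likelihood:", thetas[np.argmax(likelihood)])  # ~0.7
```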
What is the difference between correlation and covariance?
Correlation and covariance both measure the linear relationship between two variables.
Covariance gives only the direction of the linear relationship between two variables. Its units are the product of the units of the two variables. Covariance can take any value from \(-\infty\) to \(\infty\).
Correlation gives both the strength and the direction of the linear relationship between two variables. It can be thought of as a scaled version of covariance and is unitless. Correlation can take any value from \(-1\) to \(1\).
Because of this difference, covariances from two datasets cannot be compared directly, since the variables may be measured in different units, but correlations can be compared.
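A short sketch of this point with NumPy, using made-up height/weight data: rescaling a variable changes the covariance but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(5)
height_cm = rng.normal(170, 10, size=500)
weight_kg = 0.5 * height_cm + rng.normal(0, 5, size=500)

# Covariance carries units (cm * kg here); correlation is unitless
print("covariance :", round(np.cov(height_cm, weight_kg)[0, 1], 2))
print("correlation:", round(np.corrcoef(height_cm, weight_kg)[0, 1], 3))

# Rescaling a variable changes the covariance but not the correlation
height_m = height_cm / 100
print("covariance  (m):", round(np.cov(height_m, weight_kg)[0, 1], 2))
print("correlation (m):", round(np.corrcoef(height_m, weight_kg)[0, 1], 3))
```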