Top 70 Statistics in Data Science Interview Questions and Answers-(2024)

1 .

Statistics in data science refers to the branch of mathematics that deals with the collection, analysis, interpretation, and presentation of data. In the context of data science, statistics plays a crucial role in extracting meaningful insights and making informed decisions from large and complex datasets.

Here's how statistics is applied in data science :

Data Collection : Statistics provides methods for collecting data through various sampling techniques, surveys, experiments, or observational studies. It helps in ensuring that the collected data is representative of the population of interest.

Data Exploration and Descriptive Statistics : Statistics allows data scientists to explore and summarize the characteristics of a dataset using descriptive statistics such as mean, median, mode, variance, standard deviation, and percentiles. These measures help in understanding the distribution, central tendency, and variability of the data.

Inferential Statistics : In data science, inferential statistics is used to make predictions or draw conclusions about a population based on a sample of data. Techniques such as hypothesis testing, confidence intervals, and regression analysis are commonly employed to infer relationships and patterns in the data.

Probability Theory : Probability theory is fundamental to statistical analysis in data science. It provides a framework for quantifying uncertainty and making probabilistic predictions about future events. Probability distributions, such as the normal distribution, binomial distribution, and Poisson distribution, are often used to model random phenomena in data science applications.

Statistical Modeling : Data scientists use statistical models to represent relationships between variables in a dataset and make predictions or infer causal relationships. Common statistical models include linear regression, logistic regression, time series models, and Bayesian networks.

Experimental Design : Statistics helps in designing experiments and studies to test hypotheses and evaluate the effectiveness of interventions or treatments. It guides the selection of appropriate sample sizes, experimental designs, and statistical tests to ensure the validity and reliability of the results.

Data Visualization : Statistics is closely integrated with data visualization techniques to communicate insights and findings effectively. Graphical representations such as histograms, scatter plots, box plots, and heatmaps are used to visualize patterns, trends, and relationships in the data.

2 .

How is the statistical significance of an insight assessed?

Hypothesis testing is used to find out the statistical significance of the insight. To elaborate, the null hypothesis and the alternate hypothesis are stated, and the p-value is calculated.

After calculating the p-value, the null hypothesis is assumed true, and the values are determined. To fine-tune the result, the alpha value, which denotes the significance, is tweaked. If the p-value turns out to be less than the alpha, then the null hypothesis is rejected. This ensures that the result obtained is statistically significant.

3 .

What is the difference between statistics and machine learning?

Statistics and machine learning are closely related fields that share many concepts and techniques, but they differ in their objectives, methodologies, and applications. Here are some key differences between statistics and machine learning:

Objectives :

* Statistics : The primary objective of statistics is to analyze data, make inferences about populations or processes, and draw conclusions based on probabilistic models. It focuses on understanding the underlying patterns and relationships in data, testing hypotheses, and making predictions with uncertainty quantification.

* Machine Learning : Machine learning aims to develop algorithms and models that can automatically learn from data, identify patterns, and make predictions or decisions without being explicitly programmed. Its focus is on building predictive models and optimizing performance metrics through algorithmic techniques.

Emphasis :

* Statistics : Statistics places a strong emphasis on inferential analysis, hypothesis testing, uncertainty quantification, and interpretation of results within a probabilistic framework. It is often used to gain insights into the underlying processes generating the data and to make decisions based on statistical evidence.

* Machine Learning : Machine learning emphasizes predictive modeling, pattern recognition, optimization, and automation of decision-making processes. It focuses on building predictive models that generalize well to unseen data and optimize performance metrics such as accuracy, precision, recall, or F1-score.

Data Handling :

* Statistics : Statistics typically deals with structured data and relies on statistical models and techniques to analyze relationships between variables. It often involves assumptions about the data distribution and requires careful consideration of sampling methods, data preprocessing, and model selection.

* Machine Learning : Machine learning is more flexible in handling various types of data, including structured, unstructured, and semi-structured data. It can handle large-scale datasets and is capable of learning complex patterns and dependencies in the data without relying on explicit statistical assumptions.

Approach :

* Statistics : Statistics often follows a deductive approach, where hypotheses are formulated based on theoretical considerations or prior knowledge, and statistical tests are conducted to evaluate these hypotheses using observed data.

* Machine Learning : Machine learning typically follows an inductive approach, where algorithms learn patterns and relationships directly from data without prior assumptions or explicit hypotheses. It focuses on algorithmic optimization and generalization to new data.

Interpretability :

* Statistics : Statistics often prioritizes model interpretability and the ability to explain the underlying relationships in the data. It emphasizes understanding the significance of variables, parameter estimates, and confidence intervals.

* Machine Learning : Machine learning may prioritize model performance and predictive accuracy over interpretability, especially in complex models such as deep neural networks. It may sacrifice interpretability for improved predictive power, especially in applications where accurate predictions are more critical than understanding the underlying mechanisms.

4 .

Where are long-tailed distributions used?

A long-tailed distribution is a type of distribution where the tail drops off gradually toward the end of the curve.

The Pareto principle and the product sales distribution are good examples to denote the use of long-tailed distributions. Also, it is widely used in classification and regression problems.

5 .

What is the central limit theorem?

The central limit theorem is the foundation of statistics. It states that if a sample is drawn from a population with large sample size, the distribution of the sample's mean will be distributed normally. In other words, the original population distribution will not be affected.

The central limit theorem is extremely useful in estimating confidence intervals and testing hypotheses. For instance, let's say I want to estimate the worldwide average height. I would take a sample of people from the general population and calculate the mean. Because it is difficult or impossible to collect data on every person's height, the mean of my sample will serve as my estimate.

To create a normal curve, we can plot the mean value and the frequency on a graph and then multiply them several times. The resulting curve will be similar to the original data set, but it will be slightly shifted to the left.

6 .

What is a hypothesis test? How is the statistical significance of an insight determined?

The statistical significance of an experiment's insights can be assessed using hypothesis testing. Hypothesis testing examines the probability of a given experiment's results occurring by chance. The null hypothesis is defined first, and then p-values are computed. If the null hypothesis is true, other values are determined as well. As its name suggests, the alpha value indicates the degree of significance.

In a two-tailed test, the p-value is less than alpha if the null hypothesis is rejected but is greater than alpha if the null hypothesis is accepted. In a one-tailed test, the p-value is less than alpha if the null hypothesis is accepted but is greater than alpha if the null hypothesis is rejected. The rejection of the null hypothesis indicates that the results obtained are statistically significant.

7 .

What are observational and experimental data in statistics?

Observational data is derived from the observation of certain variables from observational studies. The variables are observed to determine any correlation between them.

Experimental data is derived from those experimental studies where certain variables are kept constant to determine any discrepancy or causality.

8 .

What is an outlier?

Outliers can be defined as the data points within a data set that varies largely in comparison to other observations. Depending on its cause, an outlier can decrease the accuracy as well as the efficiency of a model. Therefore, it is crucial to remove them from the data set.

9 .

How to screen for outliers in a data set?

There are many ways to screen and identify potential outliers in a data set. Two key methods are described below –

Standard deviation/z-score : Z-score or standard score can be obtained in a normal distribution by calculating the size of one standard deviation and multiplying it by 3. The data points outside the range are then identified. The Z-score is measured from the mean. If the z-score is positive, it means the data point is above average.
If the z-score is negative, the data point is below average.

If the z-score is close to zero, the data point is close to average.

If the z-score is above or below 3, it is an outlier and the data point is considered unusual.

The formula for calculating a z-score is –

z= data point−mean/standard deviation OR z=x−μ/ σ?

Interquartile range (IQR) : IQR, also called midspread, is a method to identify outliers and can be described as the range of values that occur throughout the length of the middle of 50% of a data set. It is simply the difference between two extreme data points within the observation.

IQR=Q3 – Q1?

Other methods to screen outliers include Isolation Forests, Robust Random Cut Forests, and DBScan clustering.

10 .

What is the meaning of an inlier?

An Inliner is a data point within a data set that lies at the same level as the others. It is usually an error and is removed to improve the model accuracy. Unlike outliers, inlier is hard to find and often requires external data for accurate identification.

11 .

What is the meaning of six sigma in statistics?

Six sigma in statistics is a quality control method to produce an error or defect-free data set. Standard deviation is known as Sigma or σ. The more the standard deviation, the less likely that process performs with accuracy and causes a defect. If a process outcome is 99.99966% error-free, it is considered six sigma. A six sigma model works better than 1σ, 2σ, 3σ, 4σ, 5σ processes and is reliable enough to produce defect-free work.

12 .

What is meant by mean imputation for missing data? Why is it bad?

Mean imputation is a rarely used practice where null values in a dataset are replaced directly with the corresponding mean of the data.

It is considered a bad practice as it completely removes the accountability for feature correlation. This also means that the data will have low variance and increased bias, adding to the dip in the accuracy of the model, alongside narrower confidence intervals.

13 .

How is missing data handled in statistics?

There are many ways to handle missing data in Statistics :

* Prediction of the missing values
* Assignment of individual (unique) values
* Deletion of rows, which have the missing data
* Mean imputation or median imputation
* Using random forests, which support the missing values

14 .

What is exploratory data analysis?

Exploratory data analysis is the process of performing investigations on data to understand the data better.

In this, initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and also check if the assumptions are right.

15 .

What is the meaning of selection bias?

Selection bias is a phenomenon that involves the selection of individual or grouped data in a way that is not considered to be random. Randomization plays a key role in performing analysis and understanding model functionality better.

If correct randomization is not achieved, then the resulting sample will not accurately represent the population.

16 .

What is the meaning of KPI in statistics?

KPI is an acronym for a key performance indicator. It can be defined as a quantifiable measure to understand whether the goal is being achieved or not. KPI is a reliable metric to measure the performance level of an organization or individual with respect to the objectives. An example of KPI in an organization is the expense ratio.

17 .

What is the Pareto principle?

Also known as the 80/20 rule, the Pareto principle states that 80% of the effects or results in an experiment are obtained from 20% of the causes. A simple example is – 20% of sales come from 80% of customers.

18 .

What is the Law of Large Numbers in statistics?

According to the law of large numbers, an increase in the number of trials in an experiment will result in a positive and proportional increase in the results coming closer to the expected value. As an example, let us check the probability of rolling a six-sided dice three times. The expected value obtained is far from the average value. And if we roll a dice a large number of times, we will obtain the average result closer to the expected value (which is 3.5 in this case).

19 .

What is the meaning of an inlier?

An inlier is a data point that lies at the same level as the rest of the dataset. Finding an inlier in the dataset is difficult when compared to an outlier as it requires external data to do so. Inliers, similar to outliers reduce model accuracy. Hence, even they are removed when they’re found in the data. This is done mainly to maintain model accuracy at all times.

20 .

What is the probability of getting a sum of 5 or 8 when 2 dice are rolled once?

When 2 dice are rolled,

Total outcomes = 36 (i.e. 6*6)

Possible outcomes of getting 5 = 4

Possible outcomes of getting a sum 8 = 5

Total = 9

Probability =9/36 = 1/4 = 0.25

21 .

What do you think of the phrase 'p-value'?

It is a number that helps determine the probability of a random occurrence when evaluating a hypothesis. In statistics, the p-value indicates how likely it is that a particular dataset occurred by chance. If the p-value is less than alpha, we can conclude that there is a probability of 5% that the experiment results occurred by chance or 5% of the time, we would see these results.

22 .

What are cherry-picking, P-hacking, and significance chasing?

* Cherry-picking is the act of exclusively taking the bits of information that support a particular conclusion and ignoring all the bits of information that contradict it.

* P-hacking, also known as data collection or analysis manipulation, is a technique that produces significant patterns even though they have no underlying effect.

* Reporting insignificant results as if they are almost significant, is known as Significance Chasing. Data Dredging, Data Fishing, and Data Snooping are all names for this behaviour.

23 .

What is the difference between an error of type I and an error of type II?

* When the null hypothesis is rejected even though it is correct, a type 1 error occurs. False positives are also known as type 1 errors.

* When the null hypothesis is not rejected despite being incorrect, a type 2 error occurs. This is also known as a false negative.

24 .

What are some examples of data sets with non-Gaussian distributions?

When data follows a non-normal distribution, it is frequently non-Gaussian. A non-Gaussian distribution is often seen in many statistics processes. This occurs when data is naturally clustered on one side or the other on a graph. For instance, bacterial growth follows an exponential or non-Gaussian distribution, which is non-normal.

25 .

How does linear regression work?

When utilised in statistics, linear regression is a technique that models the relationship between one or more predictor variables and one outcome variable. For example, linear regression may be used to study the connection between various predictors, such as age, gender, heredity, diet, and height

26 .

What are the necessary conditions for a Binomial Distribution?

The three most important characteristics of a Binomial Distribution are listed below.

* The number of observations must be prearranged. In other words, one can only determine the probability of an event happening a specific number of times if a fixed number of trials are performed.
* It is important that each trial is independent of the others. This means that the probability of each subsequent trial should not be affected by previous trials.
* The chance of getting the job remains the same no matter how many times you try.

27 .

Can you give an example of root cause analysis?

Root cause analysis, as the name suggests, is a method used to solve problems by first identifying the root cause of the problem.

Example : If the higher crime rate in a city is directly associated with the higher sales in a red-colored shirt, it means that they are having a positive correlation. However, this does not mean that one causes the other.

Causation can always be tested using A/B testing or hypothesis testing.

28 .

How can you calculate the p-value using MS Excel?

The formula used in MS Excel to calculate p-value is :

 =tdist(x,deg_freedom,tails)?

The p-value is expressed in decimals in Excel. Here are the steps to calculate it :

* Find the Data tab

* On the Analysis tab, click on the data analysis icon

* Select Descriptive Statistics and then click OK

* Select the relevant column

* Input the confidence level and other variables

29 .

What are the types of biases that you can encounter while sampling?

Sampling bias occurs when you lack the fair representation of data samples during an investigation or a survey. The six main types of biases that one can encounter while sampling are :

* Undercoverage bias
* Observer Bias
* Survivorship bias
* Self-Selection/Voluntary Response Bias
* Recall Bias
* Exclusion Bias

30 .

What is cherry-picking, P-hacking, and significance chasing?

Cherry-picking can be defined as the practice in statistics where only that information is selected which supports a certain claim and ignores any other claim that refutes the desired conclusion.

P-hacking refers to a technique in which data collection or analysis is manipulated until significant patterns can be found who have no underlying effect whatsoever.

Significance chasing is also known by the names of Data Dredging, Data Fishing, or Data Snooping. It refers to the reporting of insignificant results as if they are almost significant.

31 .

What is the difference between type I vs type II errors?

A type 1 error occurs when the null hypothesis is rejected even if it is true. It is also known as false positive.

A type 2 error occurs when the null hypothesis fails to get rejected, even if it is false. It is also known as a false negative.

32 .

What is the Binomial Distribution Formula?

The binomial distribution formula is:

b(x; n, P) = nCx * Px * (1 – P)n – x?

Where :

b = binomial probability

x = total number of “successes” (pass or fail, heads or tails, etc.)

P = probability of success on an individual trial

n = number of trials

33 .

What is DOE?

DOE is an acronym for the Design of Experiments in statistics. It is considered as the design of a task that describes the information and the change of the same based on the changes to the independent input variables.

34 .

What type of data does not have a log-normal distribution or a Gaussian distribution?

Exponential distributions do not have a log-normal distribution or a Gaussian distribution. In fact, any type of data that is categorical will not have these distributions as well.

Example : Duration of a phone car, time until the next earthquake, etc.

35 .

What is the Pareto principle?

The Pareto principle is also called the 80/20 rule, which means that 80 percent of the results are obtained from 20 percent of the causes in an experiment.

A simple example of the Pareto principle is the observation that 80 percent of peas come from 20 percent of pea plants on a farm.

36 .

What is the meaning of the five-number summary in Statistics?

The five-number summary is a measure of five entities that cover the entire range of data as shown below :

* Low extreme (Min)
* First quartile (Q1)
* Median
* Upper quartile (Q3)
* High extreme (Max)

37 .

What are population and sample in Inferential Statistics, and how are they different?

A population is a large volume of observations (data). The sample is a small portion of that population. Because of the large volume of data in the population, it raises the computational cost. The availability of all data points in the population is also an issue.

In short :

* We calculate the statistics using the sample.
* Using these sample statistics, we make conclusions about the population.

38 .

What is kurtosis?

Kurtosis is used to describe the extreme values present in one tail of distribution versus the other. It is actually the measure of outliers present in the distribution. A high value of kurtosis represents large amounts of outliers being present in data. To overcome this, we have to either add more data into the dataset or remove the outliers.

39 .

What is correlation?

Correlation is used to test relationships between quantitative variables and categorical variables. Unlike covariance, correlation tells us how strong the relationship is between two variables. The value of correlation between two variables ranges from -1 to +1.

The -1 value represents a high negative correlation, i.e., if the value in one variable increases, then the value in the other variable will drastically decrease. Similarly, +1 means a positive correlation, and here, an increase in one variable will lead to an increase in the other. Whereas, 0 means there is no correlation.

If two variables are strongly correlated, then they may have a negative impact on the statistical model, and one of them must be dropped.

40 .

What is the difference between a sample and a population?

The subset of the population from which numbers are obtained is known as the sample. The numbers obtained from the population are known as parameters, while the numbers obtained from the sample are known as statistics. It is through sample data that conclusions may be made about the population.

Population	Sample
A parameter is an observable quality that can be measured.	Statistics is an observable quality that can be measured.
Every element of the population is a unique individual.	A subset of the population is used to explore some aspects of the population.
An opinion report is a true representation of what happened.	The reported values have a confidence level and an error margin.
All members of a group are included in the list.	A particular portion of the population is represented by that subset.

41 .

What is the difference between Descriptive and Inferential Statistics?

Descriptive	Inferential
Describe the data in terms of its key characteristics.	To conclude the population, it is used.
Data can be organised, analysed, and presented in a meaningful way thanks to charts.	The purpose of data analysis is to compare data and make predictions through hypotheses.
Using charts, tables, and graphs to present information.	Probability was responsible for achieving this goal.

42 .

What are the types of sampling in Statistics?

There are four main types of data sampling as shown below :

* Simple random : Pure random division
* Cluster : Population divided into clusters
* Stratified : Data divided into unique groups
* Systematical : Picks up every ‘n’ member in the data

43 .

What is the meaning of covariance?

Covariance is the measure of indication when two items vary together in a cycle. The systematic relation is determined between a pair of random variables to see if the change in one will affect the other variable in the pair or not.

44 .

What is Bessel's correction?

Bessel's correction is a factor that is used to estimate a populations’ standard deviation from its sample. It causes the standard deviation to be less biased, thereby, providing more accurate results.

45 .

What are some of the low and high-bias Machine Learning algorithms?

Some of the widely used low and high-bias Machine Learning algorithms are :

Low bias : Decision trees, Support Vector Machines, k-Nearest Neighbors, etc.

High bias : Linear Regression, Logistic Regression, Linear Discriminant Analysis, etc.

46 .

When should you use a t-test vs a z-test?

The z-test is used for hypothesis testing in statistics with a normal distribution. It is used to determine population variance in the case where a sample is large.

The t-test is used with a t-distribution and used to determine population variance when you have a small sample size.

In case the sample size is large or n>30, a z-test is used. T-tests are helpful when the sample size is small or n<30.

47 .

What is the empirical rule?

In statistics, the empirical rule states that every piece of data in a normal distribution lies within three standard deviations of the mean. It is also known as the 68–95–99.7 rule. According to the empirical rule, the percentage of values that lie in a normal distribution follow the 68%, 95%, and 99.7% rule. In other words, 68% of values will fall within one standard deviation of the mean, 95% will fall within two standard deviations, and 99.75 will fall within three standard deviations of the mean.

48 .

How are confidence tests and hypothesis tests similar? How are they different?

Confidence tests and hypothesis tests both form the foundation of statistics.

The confidence interval holds importance in research to offer a strong base for research estimations, especially in medical research. The confidence interval provides a range of values that helps in capturing the unknown parameter.

Hypothesis testing is used to test an experiment or observation and determine if the results did not occur purely by chance or luck using the below formula where ‘p’ is some parameter.

Confidence and hypothesis testing are inferential techniques used to either estimate a parameter or test the validity of a hypothesis using a sample of data from that data set. While confidence interval provides a range of values for an accurate estimation of the precision of that parameter, hypothesis testing tells us how confident we are inaccurately drawing conclusions about a parameter from a sample. Both can be used to infer population parameters in tandem.

In case we include 0 in the confidence interval, it indicates that the sample and population have no difference. If we get a p-value that is higher than alpha from hypothesis testing, it means that we will fail to reject the bull hypothesis.

49 .

What general conditions must be satisfied for the central limit theorem to hold?

Here are the conditions that must be satisfied for the central limit theorem to hold :

* The data must follow the randomization condition which means that it must be sampled randomly.
* The Independence Assumptions dictate that the sample values must be independent of each other.
* Sample sizes must be large. They must be equal to or greater than 30 to be able to hold CLT. Large sample size is required to hold the accuracy of CLT to be true.

50 .

What is Random Sampling? Give some examples of some random sampling techniques.

Random sampling is a sampling method in which each sample has an equal probability of being chosen as a sample. It is also known as probability sampling.

Let us check four main types of random sampling techniques :

* Simple Random Sampling technique – In this technique, a sample is chosen randomly using randomly generated numbers. A sampling frame with the list of members of a population is required, which is denoted by ‘n’. Using Excel, one can randomly generate a number for each element that is required.

* Systematic Random Sampling technique - This technique is very common and easy to use in statistics. In this technique, every k’th element is sampled. For instance, one element is taken from the sample and then the next while skipping the pre-defined amount or ‘n’.

In a sampling frame, divide the size of the frame N by the sample size (n) to get ‘k’, the index number. Then pick every k’th element to create your sample.

* Cluster Random Sampling technique - In this technique, the population is divided into clusters or groups in such a way that each cluster represents the population. After that, you can randomly select clusters to sample.

* Stratified Random Sampling technique – In this technique, the population is divided into groups that have similar characteristics. Then a random sample can be taken from each group to ensure that different segments are represented equally within a population.

51 .

In a scatter diagram, what is the line that is drawn above or below the regression line called?

The line that is drawn above or below the regression line in a scatter diagram is called the residual or also the prediction error.

52 .

What are the examples of symmetric distribution?

Symmetric distribution means that the data on the left side of the median is the same as the one present on the right side of the median.

There are many examples of symmetric distribution, but the following three are the most widely used ones :

* Uniform distribution
* Binomial distribution
* Normal distribution

53 .

What is the relationship between mean and median in a normal distribution?

In a normal distribution, the mean is equal to the median. To know if the distribution of a dataset is normal, we can just check the dataset’s mean and median.

54 .

What is the difference between the Ist quartile, the IInd quartile, and the IIIrd quartile?

Quartiles are used to describe the distribution of data by splitting data into three equal portions, and the boundary or edge of these portions are called quartiles.

That is,

The lower quartile (Q1) is the 25th percentile.
The middle quartile (Q2), also called the median, is the 50th percentile.
The upper quartile (Q3) is the 75th percentile.

55 .

What are the scenarios where outliers are kept in the data?

There are not many scenarios where outliers are kept in the data, but there are some important situations when they are kept. They are kept in the data for analysis if :

* Results are critical
* Outliers add meaning to the data
* The data is highly skewed

56 .

Briefly explain the procedure to measure the length of all sharks in the world.

Following steps can be used to determine the length of sharks :

* Define the confidence level (usually around 95%)
* Use sample sharks to measure
* Calculate the mean and standard deviation of the lengths
* Determine t-statistics values
* Determine the confidence interval in which the mean length lies

57 .

What is the meaning of degrees of freedom (DF) in statistics?

Degrees of freedom or DF is used to define the number of options at hand when performing an analysis. It is mostly used with t-distribution and not with the z-distribution.

If there is an increase in DF, the t-distribution will reach closer to the normal distribution. If DF > 30, this means that the t-distribution at hand is having all of the characteristics of a normal distribution.

58 .

What are quantitative data and qualitative data?

Qualitative data is used to describe the characteristics of data and is also known as Categorical data. For example, how many types. Quantitative data is a measure of numerical values or counts. For example, how much or how often. It is also known as Numeric data.

59 .

What is the left-skewed distribution and the right-skewed distribution?

In the left-skewed distribution, the left tail is longer than the right side.

Mean < median < mode

In the right-skewed distribution, the right tail is longer. It is also known as positive-skew distribution.

Mode < median < mean

60 .

How to convert normal distribution to standard normal distribution?

Any point (x) from the normal distribution can be converted into standard normal distribution (Z) using this formula :

Z(standardized) = (x-µ) / σ?

Here, Z for any particular x value indicates how many standard deviations x is away from the mean of all values of x.

61 .

How to detect outliers?

The best way to detect outliers is through graphical means. Apart from that, outliers can also be detected through the use of statistical methods using tools such as Excel, Python, SAS, among others. The most popular graphical ways to detect outliers include box plot and scatter plot.

62 .

What is skewness?

Skewness provides the measure of the symmetry of a distribution. If a distribution is not normal or asymmetrical, it is skewed. A distribution can exhibit positive skewness or negative skewness if the tail on the right is longer and the tail on the left side is longer, respectively.

63 .

What is a confounding variable?

A confounding variable in statistics is an ‘extra’ or ‘third’ variable that is associated with both the dependent variable and the independent variable, and it can give a wrong estimate that provides useless results.

For example, if we are studying the effect of weight gain, then lack of workout will be the independent variable, and weight gain will be the dependent variable. In this case, the amount of food consumption can be the confounding variable as it will mask or distort the effect of other variables in the study. The effect of weather can be another confounding variable that may later the experiment design.

64 .

What does autocorrelation mean?

Autocorrelation is a representation of the degree of correlation between the two variables in a given time series. It means that the data is correlated in a way that future outcomes are linked to past outcomes. Autocorrelation makes a model less accurate because even errors follow a sequential pattern.

65 .

If there is a 30 percent probability that you will see a supercar in any 20-minute time interval, what is the probability that you see at least one supercar in the period of an hour (60 minutes)?

The probability of not seeing a supercar in 20 minutes is:

= 1 − P(Seeing one supercar)
= 1 − 0.3
= 0.7?

Probability of not seeing any supercar in the period of 60 minutes is:

= (0.7) ^ 3 = 0.343?

Hence, the probability of seeing at least one supercar in 60 minutes is:

= 1 − P(Not seeing any supercar)
= 1 − 0.343 = 0.657?

66 .

What is the meaning of TF/IDF vectorization?

TF-IDF is an acronym for Term Frequency – Inverse Document Frequency. It is used as a numerical measure to denote the importance of a word in a document. This document is usually called the collection or the corpus.

The TF-IDF value is directly proportional to the number of times a word is repeated in a document. TF-IDF is vital in the field of Natural Language Processing (NLP) as it is mostly used in the domain of text mining and information retrieval.

67 .

Can you give an example to denote the working of the central limit theorem?

Let’s consider the population of men who have normally distributed weights, with a mean of 60 kg and a standard deviation of 10 kg, and the probability needs to be found out.

If one single man is selected, the weight is greater than 65 kg, but if 40 men are selected, then the mean weight is far more than 65 kg.

The solution to this can be as shown below :

Z = (x − µ) / ? = (65 − 60) / 10 = 0.5
 
For a normal distribution P(Z &gt; 0.5) = 0.409
Z = (65 − 60) / 5 = 1
P(Z &gt; 1) = 0.090?

68 .

What is the purpose of Hash tables in statistics?

When key-value pairs are stored in a hash table, the information regarding keys and associated values are stored in a hierarchical fashion using hash tables. The hashing function is used to provide an index that contains all of the information regarding keys and their associated values.

69 .

What is a bell-curve distribution?

A bell-curve distribution is represented by the shape of a bell and indicates normal distribution. It occurs naturally in many situations especially while analyzing financial data. The top of the curve shows the mode, mean and median of the data and is perfectly symmetrical.

The key characteristics of a bell-shaped curve are :

* The empirical rule says that approximately 68% of data lies within one standard deviation of the mean in either of the directions.
* Around 95% of data falls within two standard deviations and
* Around 99.7% of data fall within three standard deviations in either direction.

70 .

What is the difference between parametric and non-parametric statistical tests?

Parametric and non-parametric statistical tests are two broad categories of statistical methods used for hypothesis testing and data analysis. Here's a description of the key differences between them:

Parametric Statistical Tests :

* Parametric tests assume that the data follows a specific probability distribution, usually a normal distribution. Common parametric tests include t-tests, analysis of variance (ANOVA), linear regression, and chi-square tests.

* Parametric tests typically make assumptions about the population parameters, such as the mean, variance, or shape of the distribution.

* These tests often require that the data meet certain assumptions, such as normality and homogeneity of variances.

* Parametric tests are generally more powerful (i.e., have higher statistical power) than non-parametric tests when the assumptions are met.

* Examples of parametric tests include:

* Student's t-test for comparing means of two groups.

* One-way ANOVA for comparing means of more than two groups.

* Pearson correlation coefficient for assessing linear relationships between variables.

* Linear regression for modeling the relationship between a dependent variable and one or more independent variables.

Non-parametric Statistical Tests :

* Non-parametric tests do not assume any specific probability distribution for the data. Instead, they are based on fewer or weaker assumptions about the underlying population.

* Non-parametric tests are often used when the data do not meet the assumptions required for parametric tests, such as when the data are skewed, have outliers, or come from non-normal distributions.

* Non-parametric tests are generally less powerful than parametric tests, especially with larger sample sizes, but they are more robust to violations of assumptions.

* These tests are sometimes referred to as distribution-free tests because they do not rely on distributional assumptions.

* Examples of non-parametric tests include:

* Mann-Whitney U test (Wilcoxon rank-sum test) for comparing medians of two independent groups.

* Kruskal-Wallis test for comparing medians of more than two independent groups.

* Spearman's rank correlation coefficient for assessing monotonic relationships between variables.

* Chi-square test for independence for comparing categorical variables.