Table of Contents
- Statistics for Machine Learning (7-Day Mini-Course) (Blog)
- Digital textbook on probability and statistics
- Statisticians say the darndest things (Blog)
- Intro to Descriptive Statistics (Blog)
Applications of Statistics in Machine Learning
Statistics in Data Preparation:
Statistical methods are required in the preparation of training and test data for your machine learning model. Tasks:
- Outlier detection
- Missing Value Imputation
- Data Sampling
- Data Scaling
- Variable Encoding
A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.
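As a sketch of how those descriptive statistics drive these tasks, the snippet below flags outliers and standardizes a feature using the sample mean and standard deviation (synthetic data, assuming NumPy; the 3-standard-deviation cutoff is one common convention, not a fixed rule):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=50, scale=5, size=200), [120.0])  # synthetic feature with one planted outlier

# Outlier detection: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print("outliers:", x[np.abs(z) > 3])

# Data scaling (standardization) reuses the same descriptive statistics
x_scaled = (x - x.mean()) / x.std()
print("scaled mean ~0, std ~1:", x_scaled.mean().round(3), x_scaled.std().round(3))
```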
Statistics in Model Evaluation:
Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training. Tasks:
- Data Sampling
- Data Re-Sampling
- Experimental Design
Re-Sampling techniques include k-fold cross-validation.
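A minimal k-fold cross-validation sketch (assuming scikit-learn and a synthetic classification dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)  # synthetic stand-in data
cv = KFold(n_splits=10, shuffle=True, random_state=0)      # 10-fold resampling
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```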
Statistics in Model Selection:
Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem. Tasks:
- Checking for a significant difference between results
- Quantifying the size of the difference between results
Techniques include statistical hypothesis tests.
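For example, a paired Student's t-test can compare the per-fold scores of two models evaluated on the same folds (a sketch with made-up scores; note that scores from overlapping training folds are not fully independent, so treat the p-value as indicative):

```python
from scipy.stats import ttest_rel

# Hypothetical per-fold accuracies for two models evaluated on the same 10 folds
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.75, 0.79, 0.82, 0.77, 0.78]

stat, p = ttest_rel(model_a, model_b)
print(f"t={stat:.3f}, p={p:.4f}")  # a small p suggests the difference is not due to chance
```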
Statistics in Model Presentation:
Statistical methods are required when presenting the skill of a final model to stakeholders. Tasks:
- Summarizing the expected skill of the model on average
- Quantifying the expected variability of the skill of the model in practice
Techniques include estimation statistics such as confidence intervals.
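A sketch of both tasks at once, summarizing hypothetical cross-validation scores as a mean with a t-based 95% confidence interval (assuming SciPy):

```python
import numpy as np
from scipy import stats

scores = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81])  # hypothetical
mean = scores.mean()
lo, hi = stats.t.interval(0.95, df=len(scores) - 1, loc=mean, scale=stats.sem(scores))
print(f"expected accuracy {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```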
Statistics in Prediction:
Statistical methods are required when making a prediction with a finalized model on new data. Tasks:
- Quantifying the expected variability for the prediction.
Techniques include estimation statistics such as prediction intervals.
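As an illustration, the classic closed-form prediction interval for simple linear regression (a sketch on synthetic data; `prediction_interval` is a hypothetical helper, and other model families need different machinery, e.g. bootstrapping):

```python
import numpy as np
from scipy import stats

def prediction_interval(x, y, x_new, alpha=0.05):
    """Prediction interval for a single new observation under simple linear regression."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    res = stats.linregress(x, y)
    resid = y - (res.intercept + res.slope * x)
    se = np.sqrt(np.sum(resid**2) / (n - 2))              # residual standard error
    sxx = np.sum((x - x.mean()) ** 2)
    se_new = se * np.sqrt(1 + 1 / n + (x_new - x.mean()) ** 2 / sxx)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    center = res.intercept + res.slope * x_new
    return center - t_crit * se_new, center + t_crit * se_new

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.5, size=50)            # synthetic data
print(prediction_interval(x, y, x_new=5.0))               # bounds on a single new y at x=5
```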
Summary:
- Data Preparation:
- Outlier detection
- Missing Value Imputation
- Data Sampling
- Data Scaling
- Variable Encoding
- Model Evaluation:
- Data Sampling
- Data Re-Sampling
- Experimental Design
- Model Selection:
- Checking for a significant difference between results
- Quantifying the size of the difference between results
- Model Presentation:
- Summarizing the expected skill of the model on average
- Quantifying the expected variability of the skill of the model in practice
- Prediction:
- Quantifying the expected variability for the prediction.
Introduction to Statistics
Statistics:
Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.
Statistical Tools:
Statistical tools can be divided into two large groups of methods:
- Descriptive Statistics: methods for summarizing raw observations into information that we can understand and share.
- Inferential Statistics: methods for quantifying properties of the population from a smaller set of obtained observations called a sample.
Descriptive Statistics:
Descriptive Statistics is the process of using and analyzing summary statistics that quantitatively describe or summarize features from raw observations. Descriptive statistics techniques break down into:
- Measures of Central Tendency: mean, mode, median
- Measures of Variability/Dispersion: variance, standard deviation, minimum, maximum, kurtosis, skewness
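A quick sketch computing the measures above on a synthetic sample (assuming NumPy and SciPy):

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(loc=50, scale=5, size=1000)  # synthetic sample

# Central tendency
print("mean:", np.mean(data), "median:", np.median(data))
# Variability / dispersion
print("variance:", np.var(data, ddof=1), "std:", np.std(data, ddof=1))
print("min:", data.min(), "max:", data.max())
print("skewness:", stats.skew(data), "kurtosis:", stats.kurtosis(data))
```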
Contrast with Inferential Statistics:
- Descriptive Statistics aims to summarize a sample. It is solely concerned with properties of the observed data and does not rest on the assumption that the data come from a larger population.
- Inferential Statistics uses the sample to learn about the population. It assumes that the observed data set is sampled from a larger population.
Inferential Statistics:
Inferential Statistics is the process of using data analysis to deduce properties of an underlying distribution of probability by analyzing a smaller set of observations, drawn from the population, called a sample.
Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. Techniques:
- AUC
- Kappa Statistic
- Confusion Matrix
- F1 Score
Statistical Hypothesis Tests
Statistical Hypothesis Tests:
Statistical hypothesis tests can be used to indicate whether the difference between two samples is due to random chance, but they cannot comment on the size of the difference.
Statistical Hypothesis Tests Types:
Statistical Hypothesis Tests in Python (Blog)
- Normality Tests:
- Shapiro-Wilk Test
- D’Agostino’s K^2 Test
- Anderson-Darling Test
- Correlation Tests:
- Pearson’s Correlation Coefficient
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation
- Chi-Squared Test
- Stationarity Tests:
- Augmented Dickey-Fuller
- Kwiatkowski-Phillips-Schmidt-Shin
- Parametric Statistical Hypothesis Tests:
- Student’s t-test
- Paired Student’s t-test
- Analysis of Variance Test (ANOVA)
- Repeated Measures ANOVA Test
- Nonparametric Statistical Hypothesis Tests:
- Mann-Whitney U Test
- Wilcoxon Signed-Rank Test
- Kruskal-Wallis H Test
- Friedman Test
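Most of these are one-liners in SciPy; a sketch exercising a few of them on synthetic data:

```python
import numpy as np
from scipy.stats import shapiro, normaltest, pearsonr, ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)

print("Shapiro-Wilk:", shapiro(x))            # normality test
print("D'Agostino K^2:", normaltest(x))       # normality test
print("Pearson:", pearsonr(x, y))             # correlation test
print("Student's t:", ttest_ind(x, y))        # parametric difference of means
print("Mann-Whitney U:", mannwhitneyu(x, y))  # nonparametric counterpart
```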
Estimation Statistics
Estimation:
- Effect Size: Methods for quantifying the size of an effect given a treatment or intervention.
- Interval Estimation: Methods for quantifying the amount of uncertainty in a value.
- Meta-Analysis: Methods for quantifying the findings across multiple similar studies.
The most useful methods in applied Machine Learning are Interval Estimation methods.
Types of Intervals:
- Tolerance Interval: The bounds or coverage of a proportion of a distribution with a specific level of confidence.
- Confidence Interval: The bounds on the estimate of a population parameter.
- Prediction Interval: The bounds on a single observation.
Confidence Intervals in ML:
A simple way to calculate a confidence interval for a classification algorithm is to calculate the binomial proportion confidence interval, which can provide an interval around a model’s estimated accuracy or error.
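A minimal sketch of that interval using the normal approximation (`accuracy_interval` is a hypothetical helper; the accuracy and test-set size are made up):

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation binomial proportion confidence interval (95% for z=1.96)."""
    radius = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - radius, accuracy + radius

# e.g. 88% accuracy measured on a hypothetical held-out set of 500 examples
lo, hi = accuracy_interval(0.88, 500)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```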
Hypotheses are about population parameters. The Null Hypothesis always includes the equality case (\(=\), \(\leq\), or \(\geq\)); the Alternative Hypothesis includes a strict inequality (\(\neq\), \(>\), or \(<\)).
Significance Level (\(\alpha\)): the probability that we reject the Null Hypothesis when in reality it is true.
Example claim: People will buy more chocolate if we give away a free gift with the chocolate.
- Population: All days we sell chocolate
- Sample: the days in the next month (not so random - oh well..)
Set-up: on each day, randomly give out a gift with the chocolate or not (toss a coin)
- Treatments:
- Offering a gift
- Not offering a gift
- Hypotheses:
  - \(H_0\): There is no difference in mean sales (for the population) between the two treatments.
    - Math: \(H_0: \mu_{\text{free sticker}} = \mu_{\text{no sticker}} \iff \mu_{\text{free sticker}} - \mu_{\text{no sticker}} = 0\)
  - \(H_1\): There is a difference in mean sales (for the population) between the two treatments.
    - Math: \(H_1: \mu_{\text{free sticker}} \neq \mu_{\text{no sticker}} \iff \mu_{\text{free sticker}} - \mu_{\text{no sticker}} \neq 0\)
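A sketch of how the experiment could be analyzed with an independent-samples t-test (the daily sales figures below are invented for illustration):

```python
from scipy.stats import ttest_ind

# Hypothetical daily chocolate sales under each treatment
free_sticker = [102, 98, 110, 95, 105, 99, 108, 101, 97, 104]
no_sticker = [96, 93, 100, 91, 98, 95, 99, 94, 92, 97]

stat, p = ttest_ind(free_sticker, no_sticker)
print(f"t={stat:.2f}, p={p:.4f}")  # if p < alpha (e.g. 0.05), reject H0: mean sales differ
```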
One-Tailed: the hypotheses have a non-strict inequality (\(\leq\) or \(\geq\)) in \(H_0\) and a strict inequality (\(>\) or \(<\)) in \(H_1\).
Two-Tailed: the hypotheses have an equality (\(=\)) in \(H_0\) and a non-equality (\(\neq\)) in \(H_1\).
- P-Value: a probability. Precisely, it is the probability of getting a sample result at least as extreme as the one observed by chance, IF there is NO effect in the population.
- How likely is it to get the results that you observe, IF the Null Hypothesis is true?
  - If very likely: we have no evidence against the Null Hypothesis (we fail to reject it).
  - If very unlikely: the Null Hypothesis is probably False (we reject it).
- Understanding where the p-value comes from (Vid)
A small p-value indicates a significant result. The smaller the p-value the more confident we are that the Null Hypothesis is wrong.
Statistical Significance: We have evidence that the result we see in the sample also exists in the population (as opposed to arising from chance or sampling error).
Thus, when you get a p-value less than the chosen significance level (\(\alpha\)), you reject the Null Hypothesis and have a statistically significant result.
For a given effect size, the larger the sample, the more likely the results will be statistically significant; the smaller the sample, the less likely.
The Null Hypothesis for Regression Analysis reads: the slope coefficient of variable-2 is 0, i.e., variable-2 does not influence variable-1.
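For instance, scipy.stats.linregress reports the p-value for exactly this null hypothesis (data invented for illustration):

```python
from scipy.stats import linregress

# Hypothetical data: does variable-2 (x) influence variable-1 (y)?
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8]

result = linregress(x, y)
print(f"slope={result.slope:.3f}, p={result.pvalue:.5f}")  # p tests H0: slope = 0
```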
Types of Data:
- Nominal: AKA Categorical, Qualitative. E.g. color, gender, preferred chocolate
- Summary Statistics: Use Frequency, Percentages. Can’t calculate Mean, etc.
- Graphs: Pie Chart, Bar Chart, Stacked Bar Chart.
- Ordinal: E.g. Rank, Satisfaction, Agreement
- Summary Statistics: Use Frequency, Proportions. Shouldn't use the mean in general, though it is sometimes used for ordinal scales such as user-emotion ratings.
- Graphs: Bar Chart, Stacked Bar Chart, Histogram
- Interval/Ratio: AKA Scale, Quantitative, Parametric. Types: Discrete, Continuous. E.g. height, weight, age.
- Summary Statistics: Use Mean, Median, StD.
- Graphs: Bar Chart, Stacked Bar Chart, Histogram; Box Plots; Scatter Plots.
- Column: Variable/Feature
- Row: Observation
Statistical Tests
Choosing a Statistical Test depends on three factors:
- Data (level of measurement?):
- Nominal/Categorical:
- Test for Proportion
- Test for Difference of Two Proportions
- Chi-Squared Test for Independence
- Interval/Ratio:
- Test for the Mean
- Test for Difference of Two Means (independent samples)
- Test for Difference of Two Means (paired)
- Test for Regression Analysis
- Ordinal:
Ordinal data can be treated as one of the other two types, depending on the context.
- Samples (how many?):
- One Sample: If we wish to compare a proportion or a mean against a given value, this will involve One Sample.
- Test for the Mean
- Test for the Proportion
- Two Samples: If we are comparing two different lots of things (e.g. men and women, people from different departments).
- Test for Difference of Two Proportions
- Test for Difference of Two Means (independent samples)
- One Sample, Two Measurements: If we have two sets of information on the same people/things, we have one sample with two variables.
- Chi-Squared Test for Independence
- Test for Regression Analysis
- Test for Difference of Two Means (paired)
- Purpose (of analysis?):
- Testing Against a Hypothesized Value:
- Test for Proportion
- Test for the Mean
- Test for Difference of Two Means (paired)
- Comparing Two Statistics:
- Test for Difference of Two Proportions
- Test for Difference of Two Means (independent samples)
- Looking for a Relationship between Two Variables:
- Chi-Squared Test for Independence
- Test for Regression Analysis
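As one example from the decision guide above, a chi-squared test for independence on a small invented contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: gender (rows) x preferred chocolate (columns)
table = np.array([[30, 10, 20],
                  [25, 15, 20]])

stat, p, dof, expected = chi2_contingency(table)
print(f"chi2={stat:.2f}, dof={dof}, p={p:.4f}")  # small p suggests the variables are related
```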
Other Statistical Tests:
- ANOVA: Analysis of variance tests the null hypothesis that the means of several groups are equal.
- Spearman: The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed.
- Kruskal-Wallis: The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. It is a nonparametric version of ANOVA.
- Mann-Whitney: The Mann-Whitney U test tests the null hypothesis that two independent samples are drawn from the same distribution; it is the nonparametric counterpart of the independent-samples t-test.
- Anderson-Darling: The Anderson-Darling test tests the null hypothesis that a sample is drawn from a population that follows a particular distribution. It works for the normal, exponential, logistic, and Gumbel distributions.
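Each of these is available in SciPy; a short sketch on invented data:

```python
from scipy.stats import spearmanr, kruskal, anderson

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

print("Spearman:", spearmanr(x, y))                   # monotonic relationship, no normality assumed
print("Kruskal-Wallis:", kruskal(x, y))               # nonparametric ANOVA analogue
print("Anderson-Darling:", anderson(x, dist="norm"))  # fit of x to a normal distribution
```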
Considerations:
- Residuals
- Bias
- Independence
- Post-Hoc
- Normality