Applications of Statistics in Machine Learning

  1. Statistics in Data Preparation:
    Statistical methods are required in the preparation of training and test data for your machine learning model.

    Tasks:

    • Outlier detection
    • Missing Value Imputation
    • Data Sampling
    • Data Scaling
    • Variable Encoding

    A basic understanding of data distributions, descriptive statistics, and data visualization is required to help you identify the methods to choose when performing these tasks.
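
    A minimal sketch of two of these tasks, median imputation and z-score outlier detection, plus standardization, using NumPy (the sample values and the 2-sigma threshold are illustrative assumptions):

      import numpy as np

      # Illustrative feature with a missing value and an outlier.
      x = np.array([5.1, 4.9, 5.3, np.nan, 5.0, 25.0])

      # Missing value imputation: replace NaNs with the median of the observed values.
      x = np.where(np.isnan(x), np.nanmedian(x), x)

      # Outlier detection: flag points more than 2 standard deviations from the mean.
      z = (x - x.mean()) / x.std()
      print("outliers:", np.abs(z) > 2)

      # Data scaling: z is also the feature standardized to zero mean and unit variance.
      print("scaled:", z)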

  2. Statistics in Model Evaluation:
    Statistical methods are required when evaluating the skill of a machine learning model on data not seen during training.

    Tasks:

    • Data Sampling
    • Data Re-Sampling
    • Experimental Design

    Re-Sampling Techniques include k-fold cross-validation and the bootstrap.
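
    A minimal sketch of k-fold cross-validation with scikit-learn (the synthetic dataset and the choice of logistic regression are illustrative assumptions):

      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import KFold, cross_val_score

      # Illustrative synthetic classification dataset.
      X, y = make_classification(n_samples=200, n_features=5, random_state=1)

      # 10-fold cross-validation: every observation is used for testing exactly once.
      cv = KFold(n_splits=10, shuffle=True, random_state=1)
      scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="accuracy")
      print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))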

  3. Statistics in Model Selection:
    Statistical methods are required when selecting a final model or model configuration to use for a predictive modeling problem.

    Tasks:

    • Checking for a significant difference between results
    • Quantifying the size of the difference between results

    Techniques include statistical hypothesis tests.
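
    For example, a paired Student's t-test can check whether the per-fold scores of two models differ significantly. A minimal sketch with SciPy (the scores are illustrative; note that scores from overlapping training folds are not strictly independent, which this simple test ignores):

      from scipy.stats import ttest_rel

      # Illustrative per-fold accuracies of two models evaluated on the same 10 folds.
      scores_a = [0.90, 0.88, 0.91, 0.89, 0.92, 0.90, 0.87, 0.91, 0.90, 0.89]
      scores_b = [0.86, 0.87, 0.88, 0.85, 0.89, 0.86, 0.84, 0.88, 0.87, 0.85]

      # Paired t-test: the samples are paired by fold, so test the pairwise differences.
      stat, p = ttest_rel(scores_a, scores_b)
      print("t=%.3f, p=%.4f" % (stat, p))
      if p < 0.05:
          print("significant difference between the two sets of results")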

  4. Statistics in Model Presentation:
    Statistical methods are required when presenting the skill of a final model to stakeholders.

    Tasks:

    • Summarizing the expected skill of the model on average
    • Quantifying the expected variability of the skill of the model in practice

    Techniques include estimation statistics such as confidence intervals.
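
    A minimal sketch of both tasks at once: the mean of repeated evaluation scores summarizes expected skill, and a normal-approximation confidence interval quantifies its variability (the scores and the 95% level are illustrative assumptions):

      import numpy as np

      # Illustrative skill scores from repeated evaluation (e.g., 10 cross-validation folds).
      scores = np.array([0.90, 0.88, 0.91, 0.89, 0.92, 0.90, 0.87, 0.91, 0.90, 0.89])

      # 95% confidence interval for the mean skill, via the normal approximation.
      mean = scores.mean()
      se = scores.std(ddof=1) / np.sqrt(len(scores))
      print("expected skill: %.3f, 95%% CI [%.3f, %.3f]" % (mean, mean - 1.96 * se, mean + 1.96 * se))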

  5. Statistics in Prediction:
    Statistical methods are required when making a prediction with a finalized model on new data.

    Tasks:

    • Quantifying the expected variability for the prediction.

    Techniques include estimation statistics such as prediction intervals.
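
    A minimal sketch of a prediction interval for simple linear regression, using the standard deviation of the residuals and a normal approximation (the data and the 95% level are illustrative assumptions):

      import numpy as np

      # Illustrative linear data with Gaussian noise.
      rng = np.random.default_rng(1)
      x = rng.uniform(0, 10, 100)
      y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 100)

      # Fit y = slope * x + intercept by least squares.
      slope, intercept = np.polyfit(x, y, 1)
      residuals = y - (slope * x + intercept)

      # 95% prediction interval for a single new observation at x_new.
      x_new = 5.0
      y_hat = slope * x_new + intercept
      radius = 1.96 * residuals.std(ddof=2)
      print("prediction: %.2f, 95%% PI [%.2f, %.2f]" % (y_hat, y_hat - radius, y_hat + radius))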

  6. Summary:

    • Data Preparation:
      • Outlier detection
      • Missing Value Imputation
      • Data Sampling
      • Data Scaling
      • Variable Encoding
    • Model Evaluation:
      • Data Sampling
      • Data Re-Sampling
      • Experimental Design
    • Model Selection:
      • Checking for a significant difference between results
      • Quantifying the size of the difference between results
    • Model Presentation:
      • Summarizing the expected skill of the model on average
      • Quantifying the expected variability of the skill of the model in practice
    • Prediction:
      • Quantifying the expected variability for the prediction.

Introduction to Statistics

  1. Statistics:
    Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and using data to answer questions.

  2. Statistical Tools:
    Statistical Tools can be divided into two large groups of methods:
    • Descriptive Statistics: methods for summarizing raw observations into information that we can understand and share.
    • Inferential Statistics: methods for quantifying properties of the population from a smaller set of obtained observations called a sample.
  3. Descriptive Statistics:
    Descriptive Statistics is the process of using and analyzing summary statistics that quantitatively describe or summarize features from raw observations.

    Descriptive Statistics are broken down into two groups of techniques (computed in the sketch at the end of this item):

    • Measures of Central Tendency: mean, mode, median
    • Measures of Variability/Dispersion: variance, standard deviation, minimum, maximum, kurtosis, skewness

    Contrast with Inferential Statistics:

    • Descriptive Statistics aims to summarize a sample.
      Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.
    • Inferential Statistics uses the sample to learn about the population.
      It is assumed that the observed data set is sampled from a larger population.
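
    A minimal sketch computing the measures above with NumPy and SciPy (the observations are illustrative assumptions):

      import numpy as np
      from scipy import stats

      # Illustrative raw observations.
      data = np.array([2.1, 2.5, 2.5, 3.0, 3.2, 3.8, 4.1, 9.5])

      # Measures of central tendency.
      print("mean:    ", np.mean(data))
      print("median:  ", np.median(data))
      print("mode:    ", stats.mode(data, keepdims=False).mode)  # keepdims needs SciPy >= 1.9

      # Measures of variability/dispersion (and shape).
      print("variance:", np.var(data, ddof=1))
      print("std:     ", np.std(data, ddof=1))
      print("min/max: ", data.min(), data.max())
      print("skewness:", stats.skew(data))
      print("kurtosis:", stats.kurtosis(data))
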
  4. Inferential Statistics:
    Inferential Statistics is the process of using data analysis to deduce properties of an underlying distribution of probability by analyzing a smaller set of observations, drawn from the population, called a sample.
    Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.

    Techniques include hypothesis testing and interval estimation. In machine learning, sample statistics such as the following are computed on held-out data and then used to infer model performance on the population:

    • AUC
    • Kappa statistic
    • Confusion matrix
    • F1 score


Statistical Hypothesis Tests

  1. Statistical Hypothesis Tests:
    Statistical hypothesis tests can be used to indicate whether the difference between two samples is likely due to random chance, but they cannot comment on the size of the difference; quantifying the size of a difference is the role of estimation statistics, covered below.

  2. Statistical Hypothesis Tests Types:
    Source: the blog post “Statistical Hypothesis Tests in Python”. A sketch running two of these tests follows the list.

    • Normality Tests:
      1. Shapiro-Wilk Test
      2. D’Agostino’s K^2 Test
      3. Anderson-Darling Test
    • Correlation Tests:
      1. Pearson’s Correlation Coefficient
      2. Spearman’s Rank Correlation
      3. Kendall’s Rank Correlation
      4. Chi-Squared Test
    • Stationarity Tests:
      1. Augmented Dickey-Fuller
      2. Kwiatkowski-Phillips-Schmidt-Shin
    • Parametric Statistical Hypothesis Tests:
      1. Student’s t-test
      2. Paired Student’s t-test
      3. Analysis of Variance Test (ANOVA)
      4. Repeated Measures ANOVA Test
    • Nonparametric Statistical Hypothesis Tests:
      1. Mann-Whitney U Test
      2. Wilcoxon Signed-Rank Test
      3. Kruskal-Wallis H Test
      4. Friedman Test
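
    A minimal sketch running two of the tests above with SciPy, a Shapiro-Wilk normality test and Pearson's correlation test (the data are illustrative assumptions):

      import numpy as np
      from scipy.stats import pearsonr, shapiro

      # Illustrative samples: x is Gaussian, y is a noisy copy of x.
      rng = np.random.default_rng(1)
      x = rng.normal(50, 5, 100)
      y = x + rng.normal(0, 5, 100)

      # Shapiro-Wilk: H0 = the sample was drawn from a Gaussian distribution.
      stat, p = shapiro(x)
      print("Shapiro-Wilk: stat=%.3f, p=%.3f" % (stat, p))

      # Pearson's correlation: H0 = the two samples are uncorrelated.
      r, p = pearsonr(x, y)
      print("Pearson: r=%.3f, p=%.3f" % (r, p))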

Estimation Statistics

  1. Estimation:

    • Effect Size: Methods for quantifying the size of an effect given a treatment or intervention.
    • Interval Estimation: Methods for quantifying the amount of uncertainty in a value.
    • Meta-Analysis: Methods for quantifying the findings across multiple similar studies.

    The most useful methods in applied Machine Learning are Interval Estimation methods.

    Types of Intervals:

    • Tolerance Interval: The bounds or coverage of a proportion of a distribution with a specific level of confidence.
    • Confidence Interval: The bounds on the estimate of a population parameter.
    • Prediction Interval: The bounds on a single observation.

    Confidence Intervals in ML:
    A simple way to calculate a confidence interval for a classification algorithm is to calculate the binomial proportion confidence interval, which can provide an interval around a model’s estimated accuracy or error.
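
    A minimal sketch of the normal-approximation (Wald) interval around a classifier's accuracy (the accuracy and test-set size are illustrative assumptions):

      from math import sqrt

      # Illustrative result: 88% accuracy measured on 100 held-out examples.
      accuracy, n = 0.88, 100

      # Wald interval: accuracy +/- z * sqrt(accuracy * (1 - accuracy) / n), z = 1.96 for 95%.
      radius = 1.96 * sqrt(accuracy * (1 - accuracy) / n)
      print("95%% CI for accuracy: [%.3f, %.3f]" % (accuracy - radius, accuracy + radius))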

Hypotheses are statements about population parameters. The Null Hypothesis always contains an equality (\(=\), \(\leq\), or \(\geq\)); the Alternative Hypothesis contains the corresponding strict inequality or non-equality (\(\neq\), \(<\), or \(>\)).

Significance Level (\(\alpha\)): the probability of rejecting the Null Hypothesis when in reality it is correct (a Type I error).

Example of an Alternative Hypothesis: people will buy more chocolate if we give away a free gift with the chocolate.

One-Tailed: the Null Hypothesis has a non-strict inequality (\(\leq\) or \(\geq\)) and the Alternative Hypothesis has a strict inequality (\(>\) or \(<\)).
Two-Tailed: the Null Hypothesis has an equality (\(=\)) and the Alternative Hypothesis has a non-equality (\(\neq\)).

A small p-value indicates a significant result: the smaller the p-value, the stronger the evidence that the Null Hypothesis is wrong.

Statistical Significance: we have evidence that the result we see in the sample also exists in the population (as opposed to arising from chance or sampling error).
Thus, when you get a p-value below a chosen significance level (\(\alpha\)) and reject the Null Hypothesis, you have a statistically significant result.
The larger the sample, the more likely the results will be statistically significant; the smaller the sample, the less likely.

The Null Hypothesis for a slope in regression analysis reads: the slope coefficient of variable-2 is 0; that is, variable-2 does not influence variable-1.
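
A minimal sketch with statsmodels: fit an ordinary least squares regression and read the p-value on the slope to test this null hypothesis (the data are illustrative assumptions):

  import numpy as np
  import statsmodels.api as sm

  # Illustrative data: variable_1 (response) depends linearly on variable_2 (predictor).
  rng = np.random.default_rng(1)
  variable_2 = rng.uniform(0, 10, 50)
  variable_1 = 3.0 * variable_2 + rng.normal(0, 2, 50)

  # OLS: variable_1 ~ intercept + slope * variable_2.
  model = sm.OLS(variable_1, sm.add_constant(variable_2)).fit()

  # p-value for H0: the slope coefficient of variable_2 is 0.
  print("slope=%.3f, p-value=%.4f" % (model.params[1], model.pvalues[1]))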

Types of Data:

• Categorical (qualitative): nominal, ordinal
• Numerical (quantitative): discrete, continuous

Statistical Tests

Choosing a Statistical Test depends on three factors:

Other Statistical Tests:

Considerations: