<aside> πŸ’‘ **Linear Algebra

Eigenvector** β†’ for some $n\times n$ matrix $A$, $x$ is an eigenvector of $A$ if: $Ax=\lambda x$, where $\lambda$ is a scalar called eigenvalue. Conceptually, if a matrix, $A$, represents a linear transformation, if applied to its eigenvector, $x$, it will only scale it in the same direction of $x$ multiplied by the scaling factor $\lambda$. Eigendecomposition β†’ decomposition of a square matrix into its eigenvectors. Singular Value Decomposition (SVD) β†’ used for non-square matrices β†’ $A=U\Sigma V^T$.

Applications β†’ PCA (eigendecomposition of a covariance matrix)


<aside> πŸ’‘ PAC Learning

Probably Approximately Correct (PAC) learning is a framework in computational learning theory introduced by Leslie Valiant in 1984. It provides a mathematical foundation to understand the feasibility of learning a target function from a class of functions based on given training data. The goal of PAC learning is to formalize and quantify the learning process.

Key Concepts of PAC Learning

  1. Hypothesis Space ($\mathcal{H}$)

    1. The set of all possible hypotheses that the learning algorithm can choose from.
    2. Suppose we have a set of points in a 2D plane, and we aim to learn a linear separator. The hypothesis space $\mathcal{H}$ consists of all possible linear separators.
  2. Target Function ($f$)

    1. The actual function that we aim to learn, which maps the input space to the output space.
    2. The target function $f$ is an unknown linear function that separates the positive and negative examples.
  3. Training Examples ($S$)

    1. A sample of labeled examples drawn from the distribution $D$. Each example is of the form $(x,f(x))$
    2. A set of labeled examples $S=\{(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)\}$
  4. Error of a Hypothesis ($h$)

    1. The error of a hypothesis $h$ with respect to the target function $f$ and distribution $D$ is defined as:

    $$ \text{error}(h) = P_{x \sim D}[h(x) \neq f(x)] $$

  5. Accuracy ($\epsilon$)

    1. A hypothesis $h$ is said to be approximately correct if its error is less than $\epsilon$. This means the hypothesis correctly classifies at least $1-\epsilon$ fraction of the examples.
  6. Confidence ($\delta$)

    1. The probability that the learning algorithm outputs an approximately correct hypothesis is at least $1-\delta$.

Definition of PAC Learning

A concept class $\mathcal{c}$ is PAC learnable if there exists a learning algorithm $\mathcal{A}$ such that for any target concept $f \in \mathcal{C}$, for any distribution $D$ over the input space, and for any $\epsilon$ and $\delta$ (where $0 < \epsilon, \delta < 1$), the algorithm $\mathcal{A}$ outputs a hypothesis $h \in \mathcal{H}$ such that:

$$ P[\text{error}(h) \leq \epsilon] \geq 1 - \delta $$


<aside> πŸ’‘ Unbiased Estimator

Because $f_X$ is usually unknown, but we have sample $S_X=\{x_i\}_{i=1}^N$, we often content ourselves not with the true values of statistics of the probability distribution, such as expectation but with their unbiased estimators.

We say that $\hat{\theta}(S_X)$ is an unbiased estimator of some statistic $\theta$ calculated using a sample $S_X$ drawn from an unknown probability distribution if $\hat{\theta}(S_X)$ has the following property:

$$ \mathbb{E}[\hat{\theta}(S_X)]=\theta $$

where $\hat{\theta}$ is a sample statistic., obtained using a sample $S_X$ and not the real statistic $\theta$ that can be obtained only knowing $f_X$; the expectation is taken over all possible samples drawn from $X$. Intuitively, this means that if you have have an unlimited number of such samples as $S_X$, and you compute some unbiased estimator, such as $\hat{\mu}$, using each sample, then the average of all these $\hat{\mu}$ equals the real statistic $\mu$ that you would get computed on $X$.

It can be shown that an unbiased estimator of an unknown $\mathbb{E}[X]$ is given by $\frac{1}{N}\sum_{i=1}^N x_i$ β†’ called in statistics the sample mean.

</aside>

<aside> πŸ’‘ Numeric Data Transformation

Normalizing β†’ The goal of normalization is to transform features to be on a similar scale β†’ This improves the performance and training stability of the model β†’ especially important when numeric features covering distinctly different ranges (for example, age and income)

Why normalize numeric features? β†’ without it, gradient descent can "bounce" and slow down convergence

Four common normalization techniques: (1) Scaling to a range (2) Clipping (3) Log scaling (4) Z-score

Scaling to a range Min-max scaling β†’ $x^{\prime}=\frac{x-x_{min}}{x_{max}-x_{min}}$
It’s a good choice when: (1) you know the approximate upper and lower bound on your data (2) data is approximately uniformly distributed across the range Good Example: age Bad Example: income β†’ because only a few people have very high incomes. The upper bound of the linear scale for income would be very high, and most people would be squeezed into a small part of the scale

Feature clipping β†’ If your data set contains extreme outliers β†’ You may apply feature clipping before or after other normalizations

Log scaling β†’ helpful when a handful of your values have many points, while most other values have few points β†’ This data distribution is known as the power law distribution Good Example: movie rating

Untitled

Z-score β†’ a variation of scaling that represents the number of standard deviations away from the mean β†’ You would use z-score to ensure your feature distributions have mean = 0 and std = 1 β†’ It’s useful when there are a few outliers, but not so extreme that you need clipping. $x^{\prime}=\frac{x-\mu}{\sigma}$

Bucketing (binning) β†’ transforming numeric (usually continuous) data to categorical data Quantile bucketing β†’ The problem is that equally spaced buckets don’t capture this distribution well β†’ Solution: creating buckets that each have the same number of points, that’s called quantile bucketing.

On the difference between normalization and standardization:

Normalization (Min-Max Scaling) β†’ This technique rescales your data to a range of [0,1] (or any other specific range).

Standardization (Z-score Scaling) β†’ This centers your data around 0 with a standard deviation of 1.


<aside> πŸ’‘ Model Interpretability & Explainability

SHAP (SHapley Additive exPlanation) β†’ LIME (Local Interpretable Model-agnostic Explanation) β†’

</aside>


<aside> πŸ’‘ Hyperparameter Tuning

Grid search β†’ sequentially trying all the parameter combinations β†’ yields the best result but takes long time to run

Random search β†’ we define a distribution for each parameter and randomly sample from the joint distribution over all parameters β†’ much faster but not necessarily guaranteed to achieve an optimal result

Bayesian hyperparameter tuning β†’ used more in the context of neural networks, random forests, XGBoost β†’ it works by building a probabilistic model of the objective function and using it to select the most promising hyperparameters to evaluate next, balancing exploration of new regions of the parameter space with exploitation of known good regions, thereby efficiently searching for optimal hyperparameters.

</aside>


<aside> πŸ’‘ Classification

Two types of classification models: (1) Generative models β†’ deal with the joint distribution of $X$ and $y$ β†’ $P(X,y)=P(y|X)P(X)$ β†’ Maximizing a posterior probability distribution produces decision boundaries between classes where the resulting posterior probability is equivalent. (2) Discriminative models β†’ directly determines a decision boundary by choosing the class that maximizes the probability β†’ $\hat{y}=\arg\max_{k}P(y=k|X)$

Both methods choose a predicted class that maximizes the posterior probability distribution; the difference is simply the approach.

</aside>


<aside> πŸ’‘ Evaluating Classifiers

Accuracy paradox β†’ when accuracy is not a good measure of performance and is not enough; e.g., (imbalanced dataset) when predicting all 0’s for detecting cancer, you’ll get 99.9% accuracy but that’s not very helpful. In imbalanced data we often care about model performance on the minority class

Confusion Matrix

Untitled

NOTE: In production, assign (business-related) cost to each cell so you know the cost each TP or FP, etc.

There’s a natural tradeoff between optimizing for precision and recall.

False Positive (FP) β†’ Type I Error False Negative (FN) β†’ Type II Error

Sensitivity (Recall, True Positive Rate): Focuses on correctly identifying positive cases. It is critical in applications where missing a positive case is costly or dangerous. Specificity (True Negative Rate): Focuses on correctly identifying negative cases. It is crucial in applications where falsely identifying a negative case as positive should be minimized.

Balancing sensitivity and specificity is often necessary, depending on the context and consequences of false positives and false negatives.

ROC Curve and AUC ROC (Receiver Operator Characteristic) β†’ plots the true positive rate (sensitivity) versus the false positive rate (1 - specificity) for various thresholds.

$45^\degree$line in ROC β†’ shows that for every positive example that we correctly classify, we also incorrectly classify a negative example β†’ the goal is to be always above this line β†’ To obtain a good balance specificity and sensitivity, we ought to pick a threshold that maximizes the distance away from the line.

AUC (Area Under the Curve) β†’ measures how well the classifier separates classes. It’s between 0 and 1, where higher number means better separation. It’s useful for comparing different models.

AIC AIC stands for Akaike information criterion. AIC is an estimate of the out-of-sample prediction error, penalizing by degrees of freedom. AIC will determine the relative information value of the model. $K$ is the number of independent variables and $L$ is the log likelihood of seeing that data.

Backtesting

Used primarily in finance, and particularly for time series forecasting models. The goal is to evaluate the performance of a predictive model by testing it on historical data.

For more in-depth on confusion matrix, see here.

</aside>


<aside> πŸ’‘ SVM

The goal of SVM β†’ form a hyperplane that linearly separates the training data β†’ it aims to maximize the margin β†’ i.e., the minimum distance from the decision boundary to any training point.

Support vectors β†’ the points closest to the hyperplane

Hyperplane is defined by β†’ $\mathbf{w}^T\mathbf{x} - \mathbf{b} = 0$

Support vectors lines are defined with these equations: $\mathbf{w}^T\mathbf{x} - \mathbf{b} = 1$ $\mathbf{w}^T\mathbf{x} - \mathbf{b} = -1$

The margin is β†’ $\frac{2}{||\mathbf{w}||}$ β†’ Maximizing the distance between the decision boundary and support vectors means minimizing the denominator of the margin

Hard-margin SVM

$\min||w||$

subject to

$\mathbf{w}^T\mathbf{x} - \mathbf{b} \ge 1$ when $y_i = 1$ $\mathbf{w}^T\mathbf{x} - \mathbf{b} \le -1$ when $y_i = -1$

These two conditions can be summarized into: $y_i(\mathbf{w}^T\mathbf{x} - \mathbf{b}) \ge 1$

This is constrained optimization problem β†’ using linear programming could be unstable in some cases β†’ we instead minimize $||w||^2$ and use quadratic programming that guarantees a unique solution

Soft-margin SVM β†’ it allows some datapoints (outliers) inside the margin. It does that using slacks:

$\min||w||^2+C\sum_N e_i$

subject to

$y_i(\mathbf{w}^T\mathbf{x} - \mathbf{b}) \ge 1 - e_i$

where $e_i$ β†’ is the distance between the new support vector and where the old support vector:

$C$ β†’ regularization parameter

Kernels β†’ SVM uses kernels to handle nonlinear decision boundaries β†’ it transforms data into a higher dimensional space. Mathematically, the kernel generalizes the dot product to a higher dimension

$k(x,y)=\phi(x)^T\phi(y)$

Most common kernels are: Radial Basis Function (RBF) and Gaussian

SVM works well in:

SVM doesn’t work well:

Used over LR or Naive Bayes when decision boundary is nonlinear or there’s a small amount of data

Kernel Trick β†’ The kernel trick is a powerful technique in Support Vector Machines (SVM) that allows them to efficiently handle non-linear decision boundaries by transforming the input data into a higher-dimensional space. This transformation enables the SVM to find a linear separator in this higher-dimensional space, which corresponds to a non-linear separator in the original space. What problem it solves? The kernel trick addresses the problem of non-linearly separable data. In the original feature space, the data might not be linearly separable, meaning no straight line (or hyperplane in higher dimensions) can perfectly separate the different classes. By using the kernel trick, SVMs can implicitly transform the data into a higher-dimensional space where a linear separator can be found. Instead of explicitly computing the coordinates of the data in a higher-dimensional space, the kernel trick allows us to compute the inner product of the data points directly in the higher-dimensional space using a kernel function. Given a feature mapping function $\phi(x)$, the kernel function $K(x_i,x_j)$ computes the dot product in the transformed space:

$K(x_i,x_j)=\langle \phi(x_i),\phi(x_j) \rangle$

A kernel function is a function that corresponds to an inner product in some higher-dimensional feature space. Common kernel functions include:

NOTE: Kernel Trick allows calculating the dot product without performing feature transformation. The only caveat is that we can’t do kernel trick in its primal mode β†’ it needs to be done on the dual form of SVM.

</aside>


<aside> πŸ’‘ Dimensionality Reduction

Curse of dimensionality β†’ Imagine you have a data with 1M rows and 2M features β†’ intuitively, you could guess that it’d be hard to tease out which features are important for a prediction task β†’ in geometric terms, this demonstrate a sparse data spread over many dimensions β†’ meaning each data point is relatively far away from other data points β†’ causes problem when trying to extract a pattern using ML as the idea of similarity (or closeness) of data often matters a great deal β†’ if a particular data point has nothing close to it, how can an algorithm make sense of it?

Dimensionality reduction reduces the complexity of the problem with minimal loss of important information

</aside>


<aside> πŸ’‘ Principal Component Analysis

PCA β†’ combines highly correlated variables into a new, smaller set of constructs called principal components which captures most of the variance in the data

$y_i=w_i^Tx=\sum_{j=1}^p w_{ij}x_j$

subject to

$y_i$ is uncorrelated with $y_j$, and $var(y_i)$ is maximized

Basic idea β†’ PCA finds a linear and orthogonal projection of $\mathbf{x} \rightarrow \mathbf{z}$ such that $\mathbf{z}$ is a good approximation of $\mathbf{x}$ β†’ orthogonality of eigenvectors is a result of applying eigendecomposition to a symmetric matrix, i.e., $\mathbf{\Sigma}$.

PCA Algorithm β†’ it proceeds to first finding the component having maximal variance β†’ the second component found is uncorrelated with the the first and has the second-highest variance β†’ and so on … It does by performing an eigendecomposition of the covariance matrix of $X$ where the first principal component is the eigenvector with the highest eigenvalue NOTE: It’s better to use correlation matrix β†’ meaning that using the standardized data

PCA Objective Function β†’ in PCA, we minimize the reconstruction error:

Untitled

PCA caveats β†’ (1) variables need to have a linear relationship, (2) PCA can’t handle outliers, (3) PCA is sensitive to the units of measurement for input features β†’ data needs to be standardized


<aside> πŸ’‘ Recommender Systems

</aside>


<aside> πŸ’‘ Association Rule Mining

</aside>