PART 1: DATA TERMS
Dataset A dataset is simply a collection of data organized in rows and columns — like an Excel sheet. Every ML project starts with a dataset. Without data, there is nothing to learn from.
Sample / Instance / Observation / Data Point These four words all mean the same thing — one single row in your dataset. If your dataset has 1000 rows, it has 1000 samples.
Feature A feature is one input column in your dataset. It is the information you give to the model so it can make predictions. For example, in a student dataset, "hours studied," "sleep hours," and "attendance" are all features. Features are also called attributes, predictors, or independent variables.
Label / Target The label is the output column — the thing you want the model to predict. In the student dataset, "exam score" is the label. Also called the dependent variable or response variable.
Feature Vector One single row of feature values for one sample. For example, for one student: hours=5, sleep=7, attendance=90% — that entire row together is called a feature vector.
Dimensionality Simply means how many features your dataset has. If you have 10 columns (excluding the label), your data has 10 dimensions. More dimensions = more complex data.
Structured Data Data that fits neatly in rows and columns — like CSV files, spreadsheets, or databases. Most beginner ML projects use structured data.
Unstructured Data Data that does NOT fit in a table — like images, audio files, videos, and raw text. Deep learning is usually needed for this type.
Raw Data Data exactly as it was collected — messy, unprocessed, possibly with errors, missing values, and inconsistencies.
Population vs Sample Population is the complete set of all possible data. Sample is the smaller portion you actually collect and work with. In ML, your dataset is a sample of the real-world population.
Imbalanced Dataset When one class (category) has far more examples than another. For example, 950 "not fraud" and only 50 "fraud" cases. This is a big problem because the model gets lazy and always predicts the majority class.
Synthetic Data Artificially generated data that mimics real data. Used when real data is too small or too sensitive to use.
PART 2: DATA PREPROCESSING TERMS
Data Preprocessing The entire process of cleaning, transforming, and organizing raw data before feeding it to any model. This is the most time-consuming step in real ML work — usually takes 70-80% of total project time.
Data Cleaning Fixing or removing errors, duplicates, wrong entries, and inconsistencies in your dataset.
Missing Values / Null Values When some cells in your dataset are empty — data was not collected or got lost. You must handle this before training.
Imputation Filling in missing values with something meaningful instead of just deleting the row. Common strategies are filling with the mean (average), median (middle value), or mode (most common value) of that column.
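The usual strategies are simple to implement by hand. A minimal sketch of mean imputation in plain Python, with `None` standing in for a missing cell (the function name is just illustrative):

```python
# Mean imputation: replace missing values (None) with the column average.
def impute_mean(column):
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [25, None, 31, 40, None]
print(impute_mean(ages))  # missing entries become 32.0, the mean of 25, 31, 40
```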
Outlier A data point that is extremely different from all others. For example, if everyone's salary is between 30,000 and 80,000 but one entry shows 5,000,000 — that is an outlier. Outliers can badly mislead the model.
Noise Random, meaningless variation in data caused by measurement errors or random chance. It does not represent any real pattern and can confuse the model.
Feature Scaling The process of bringing all feature values to a similar range so that no single feature dominates just because it has bigger numbers. For example, age (20–60) and salary (30,000–100,000) are on very different scales — scaling fixes this.
Standardization (Z-score Scaling) A type of feature scaling that transforms values so the column has a mean of 0 and a standard deviation of 1. Less distorted by outliers than min-max scaling, though extreme outliers still affect it.
Normalization (Min-Max Scaling) A type of feature scaling that squishes all values to be between 0 and 1. Works well when you need a strict range.
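Both scalings are one-line formulas. A plain-Python sketch of each (function names are illustrative; libraries provide ready-made scalers):

```python
# Z-score standardization: (x - mean) / std, giving mean 0 and std 1.
def standardize(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

# Min-max normalization: squish values into the range [0, 1].
def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

salaries = [30_000, 50_000, 100_000]
print(min_max(salaries))        # smallest value becomes 0.0, largest 1.0
print(standardize(salaries))    # values now center on 0
```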
Encoding Machine learning models only understand numbers. Encoding is the process of converting text categories into numbers.
Label Encoding Converts categories into simple numbers like 0, 1, 2. For example: Red=0, Blue=1, Green=2. Problem: the model might think Green (2) is "greater than" Red (0), which is wrong.
One-Hot Encoding Creates a separate 0/1 column for each category. Avoids the false ordering problem of label encoding. Best for categories with no natural order like color, city, or gender.
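The idea fits in a few lines of plain Python (in practice you would use a library encoder, but the logic is just this):

```python
# One-hot encode a categorical column: one 0/1 indicator per category.
def one_hot(values):
    categories = sorted(set(values))  # fix a stable column order
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["Red", "Blue", "Red", "Green"]
# Columns are alphabetical: Blue, Green, Red
print(one_hot(colors))  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```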
Feature Engineering Manually creating new features or transforming existing ones to give the model more useful information. For example, from "date of birth" you create "age." From "total salary" and "working hours" you create "hourly rate." Good feature engineering can dramatically improve results.
Feature Selection Choosing only the most useful features and removing the irrelevant or redundant ones. Too many useless features can confuse the model and slow down training.
Dimensionality Reduction A technique to reduce the number of features while preserving as much important information as possible. Useful when you have hundreds or thousands of features.
EDA (Exploratory Data Analysis) Deeply exploring and understanding your data before building any model. You look at statistics, distributions, correlations, and patterns using graphs and numbers. EDA helps you understand what you are working with.
Correlation A number that tells you how strongly two features are related to each other. Ranges from -1 to +1. Close to +1 means they increase together. Close to -1 means one goes up when the other goes down. Close to 0 means no relationship.
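The Pearson correlation coefficient can be computed directly from its definition. A plain-Python sketch (the sample numbers are made up for illustration):

```python
# Pearson correlation between two equal-length columns.
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 68, 74]   # rises with hours studied
print(round(correlation(hours, score), 3))  # close to +1
```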
Data Distribution How values in a column are spread out. Is data mostly in the center? Skewed to one side? Understanding distribution helps you choose the right preprocessing and algorithm.
Skewness When data is not symmetrical — it leans heavily to one side. For example, most people earn average salaries but very few earn extremely high — that creates a skewed distribution.
Variance (in data) How spread out the values in a column are. Low variance means values are close together. High variance means they are very spread out.
✂️ PART 3: DATA SPLITTING TERMS
Training Set The portion of data the model actually learns from. Typically 70–80% of total data. Model sees this data during training.
Test Set Data the model has never seen during training. Used at the very end to evaluate real performance. Typically 20–30% of total data.
Validation Set A third portion kept separate from both training and test sets. Used during training to check how well the model is learning and to tune it. Prevents overfitting by giving early feedback.
Train-Test Split The act of dividing your full dataset into training and test portions before any model training begins.
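A minimal hand-rolled sketch of the idea (libraries like scikit-learn provide a ready-made `train_test_split`; the fixed seed is what makes the split repeatable):

```python
import random

# Shuffle the rows, then hold out the last 20% as the test set.
def train_test_split(rows, test_ratio=0.2, seed=42):
    random.seed(seed)              # fixed seed -> identical split every run
    shuffled = rows[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```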
Data Leakage A very serious and common mistake where information from the test set accidentally gets mixed into the training process. This gives you deceptively good results that completely fail in real life. Always split your data first, and fit any preprocessing (scaling, encoding, imputation) on the training set only.
Cross-Validation Instead of one fixed train-test split, you divide data into K equal parts and repeat training K times — each time using a different part as the test set. Gives a much more reliable measure of performance.
K-Fold Cross-Validation The most common form of cross-validation. "K" is the number of folds (splits). K=5 means you do 5 rounds of training and testing. Final score is the average of all 5 rounds.
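A sketch of how the K folds are generated, assuming for simplicity that the sample count divides evenly by K (library implementations handle the uneven case):

```python
# Generate K-fold splits: each fold is the held-out set exactly once.
def k_fold_indices(n_samples, k):
    fold_size = n_samples // k
    indices = list(range(n_samples))
    folds = []
    for i in range(k):
        test_idx = indices[i * fold_size:(i + 1) * fold_size]
        train_idx = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        folds.append((train_idx, test_idx))
    return folds

for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx)   # each round holds out a different pair of samples
```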
Stratified Split A smarter way to split data that ensures each split has the same proportion of each class. Very important for imbalanced datasets.
Hold-Out Method Simply the standard train-test split — you "hold out" a portion of data for testing. The simplest evaluation strategy.
random_state / Random Seed A number you set to make random operations give the same result every time. Without this, your train-test split would be different every time you run the code, making results inconsistent.
PART 4: MODEL TERMS
Model A model is a mathematical function that learns the relationship between features (X) and labels (y). After training, you give it new inputs and it gives you predictions.
Algorithm The step-by-step procedure used to build and train a model. Different algorithms learn in different ways. The algorithm is the method — the model is the result of applying that method.
Parameters The internal values a model learns automatically during training. You never set these manually — the algorithm figures them out. For example, in linear regression, the slope and intercept are parameters.
Hyperparameters Settings that YOU control before training begins. The algorithm does not learn these — you choose them. For example, how many trees in a random forest, or how deep a decision tree should be.
Hyperparameter Tuning The process of searching for the best hyperparameter values to get the best model performance. Done by trying many combinations and seeing which works best.
Training / Fitting The process where the model looks at your training data and adjusts its internal parameters to learn the patterns. "Training a model" and "fitting a model" mean the same thing.
Inference / Prediction Using a trained model to make predictions on new, unseen data. This is what happens in real-world deployment.
Estimator In scikit-learn, any object that can learn from data is called an estimator. Every model is an estimator.
Transformer An object that takes data and converts it into a different form — like a scaler or encoder. It transforms data but does not make predictions.
Pipeline A way to chain multiple steps together — like preprocessing followed by a model — into one single unit. Keeps your workflow clean and prevents mistakes like data leakage.
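A toy sketch of the concept, not scikit-learn's actual `Pipeline` class, just the chaining idea, with made-up step functions:

```python
# Toy pipeline: chain preprocessing steps and a final model into one object.
class SimplePipeline:
    def __init__(self, steps, model):
        self.steps = steps      # list of functions applied in order
        self.model = model      # final callable that makes a prediction

    def predict(self, x):
        for step in self.steps:
            x = step(x)         # each step transforms the data
        return self.model(x)

scale = lambda x: x / 100                              # pretend preprocessing
threshold = lambda x: "pass" if x > 0.5 else "fail"    # pretend model
pipe = SimplePipeline([scale], threshold)
print(pipe.predict(72))  # "pass"
```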
Baseline Model The simplest possible model you build first, before trying anything complex. It sets a minimum performance level. If your advanced model can't beat the baseline, something is wrong.
Benchmark A reference point to compare model performance against. Could be a baseline model, human performance, or a previously published result.
PART 5: TYPES OF MACHINE LEARNING
Supervised Learning The model learns from labeled data — both inputs (X) and correct outputs (y) are provided. The model learns to map inputs to outputs. Most common type of ML.
Unsupervised Learning The model only receives inputs (X) — no labels. It finds hidden patterns, structures, or groupings on its own without being told what to look for.
Semi-Supervised Learning A mix of both — a small amount of labeled data combined with a large amount of unlabeled data. Useful when labeling data is expensive or time-consuming.
Reinforcement Learning The model (called an agent) learns by interacting with an environment. It takes actions, receives rewards for good actions and penalties for bad ones, and gradually learns the best strategy.
Self-Supervised Learning The model generates its own labels from the data itself. Used heavily in modern NLP and image models. No human labeling needed.
Transfer Learning Taking a model already trained on one task and reusing it for a different but related task. Saves massive amounts of time and data. Very popular in deep learning.
PART 6: TASK TYPES
Classification Predicting which category something belongs to. Output is a class label, not a number.
Binary Classification Only two possible outputs — yes/no, spam/not spam, fraud/not fraud.
Multiclass Classification More than two possible outputs — cat/dog/bird, or classifying handwritten digits 0–9.
Multilabel Classification One sample can belong to multiple classes at the same time. For example, a news article can be both "politics" and "economy."
Regression Predicting a continuous numerical value — like price, temperature, or score. Output is any number, not a fixed category.
Clustering Grouping similar data points together without any labels. The model decides the groups on its own.
Anomaly Detection Finding unusual data points that don't follow normal patterns. Used in fraud detection, network security, and quality control.
Ranking Ordering items by relevance or importance. Used in search engines and recommendation systems.
Recommendation Predicting what a user might like based on past behavior. Used in Netflix, YouTube, and Amazon.
PART 7: MODEL EVALUATION METRICS
For Classification:
Accuracy The percentage of predictions that were correct out of all predictions. Simple but can be very misleading with imbalanced data.
Confusion Matrix A table that breaks down all predictions into four categories to give a detailed picture of model performance:
- True Positive (TP) — Model said YES, actual was YES. Correct.
- True Negative (TN) — Model said NO, actual was NO. Correct.
- False Positive (FP) — Model said YES, actual was NO. Wrong. Also called Type I Error.
- False Negative (FN) — Model said NO, actual was YES. Wrong. Also called Type II Error.
Precision Out of everything the model predicted as positive, how many were actually positive? Precision matters when false alarms are costly. Example: spam filter — you don't want real emails going to spam.
Recall (Sensitivity) Out of all actual positives, how many did the model correctly find? Recall matters when missing a positive is dangerous. Example: cancer detection — you don't want to miss a real case.
F1 Score The harmonic mean of Precision and Recall — one number that balances both. Use this when both false positives and false negatives matter. Range is 0 to 1 — higher is better.
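All three metrics fall straight out of the confusion-matrix counts. A plain-Python sketch with made-up counts:

```python
# Precision, recall, and F1 computed directly from TP / FP / FN counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # of all predicted positives, how many were right
    recall = tp / (tp + fn)      # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# 40 true positives, 10 false alarms, 10 missed positives
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
print(p, r, f1)  # precision = 0.8, recall = 0.8, F1 ≈ 0.8
```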
ROC Curve A graph that shows model performance at different decision thresholds. Shows the tradeoff between catching true positives and avoiding false positives.
AUC (Area Under the Curve) A single number summarizing the ROC curve. 1.0 = perfect model. 0.5 = model is just randomly guessing.
Log Loss Measures the quality of probability predictions — not just whether the answer is right or wrong, but how confident and correct the model is. Lower is better.
For Regression:
MAE (Mean Absolute Error) Average of how far off predictions are from actual values. Easy to understand since it is in the same unit as the target.
MSE (Mean Squared Error) Average of squared differences between predictions and actual values. Penalizes big errors much more than small ones.
RMSE (Root Mean Squared Error) Square root of MSE. Brings the error back to the same unit as the target. Most commonly used regression metric.
R² Score (R-Squared) Tells you what percentage of the variation in the target is explained by the model. 1.0 = perfect. 0.0 = model learned nothing useful. Can even be negative if model is worse than just predicting the average.
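All four regression metrics are short formulas. A plain-Python sketch computing them from scratch on a tiny made-up example:

```python
# MAE, MSE, RMSE, and R² computed from scratch.
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # average absolute error
    mse = sum(e ** 2 for e in errors) / n          # squares punish big errors
    rmse = mse ** 0.5                              # back to the target's unit
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - sum(e ** 2 for e in errors) / ss_tot  # share of variance explained
    return mae, mse, rmse, r2

mae, mse, rmse, r2 = regression_metrics([3, 5, 7], [2, 5, 9])
print(mae, r2)  # 1.0 0.375
```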
⚠️ PART 8: MODEL PROBLEMS & CONCEPTS
Overfitting The model learned the training data too well — including its noise and random quirks. It performs great on training data but poorly on new data. The model memorized instead of learning.
Underfitting The model is too simple and failed to even learn the training data properly. It performs poorly on both training and new data.
Bias Error caused by oversimplified assumptions in the model. High bias = the model is ignoring important patterns = underfitting.
Variance Error caused by the model being too sensitive to small changes in training data. High variance = the model is capturing noise = overfitting.
Bias-Variance Tradeoff You cannot minimize both bias and variance at the same time. Reducing one usually increases the other. The goal is to find the right balance.
Generalization A model's ability to perform well on new, unseen data. A model that generalizes well has truly learned the underlying patterns — not just memorized the training data.
Robustness A model's ability to maintain good performance even when input data is slightly noisy or different from training data.
Regularization A technique that prevents overfitting by penalizing overly complex models. Forces the model to stay simpler.
L1 Regularization (Lasso) A type of regularization that can shrink some feature weights all the way to zero — effectively removing useless features, so it performs feature selection automatically.
L2 Regularization (Ridge) A type of regularization that shrinks all weights towards zero but rarely to exactly zero. Handles correlated features well.
Dropout A regularization technique used in neural networks. Randomly "turns off" some neurons during training to prevent the network from relying too heavily on any single neuron.
Class Imbalance When one class in your dataset has far more examples than others. The model becomes biased towards the majority class and ignores the minority class.
Underfitting vs Overfitting Balance Think of it like studying for an exam. Underfitting = didn't study enough. Overfitting = memorized past papers exactly but can't handle new questions. The goal is to understand the concepts deeply — that is good generalization.
⚙️ PART 9: TRAINING PROCESS TERMS
Loss Function / Cost Function A mathematical measure of how wrong the model's predictions are. The model's entire goal during training is to minimize this number. Loss function and cost function mean essentially the same thing.
Gradient Descent The optimization method at the heart of many ML models, from linear and logistic regression to deep neural networks. It works by calculating which direction to adjust the model's parameters to reduce the loss — like walking downhill to find the lowest point in a valley.
Learning Rate Controls how big each step is in gradient descent. Too high = model overshoots and never finds the minimum. Too low = training takes forever. One of the most important hyperparameters.
Epoch One complete pass through the entire training dataset. If you train for 100 epochs, the model sees all training data 100 times.
Batch A small chunk of training data processed at one time. Instead of processing all data at once or one sample at a time, you process it in batches.
Batch Size How many samples are in each batch. Common values are 32, 64, 128. Smaller batches = noisier but faster updates. Larger batches = smoother but requires more memory.
Mini-Batch Gradient Descent The most common training approach — processes data in small batches rather than the full dataset or one sample at a time.
Iteration One update of model weights using one batch. If you have 1000 samples and batch size 100, you do 10 iterations per epoch.
Convergence When the loss stops decreasing and the model has finished learning. Training is complete when the model converges.
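The terms above (loss, gradient, learning rate, epoch, convergence) fit together in a few lines. A toy sketch minimizing the loss (w - 3)^2, whose minimum is at w = 3:

```python
# Minimal gradient descent on loss(w) = (w - 3)**2.
def gradient_descent(lr=0.1, epochs=100):
    w = 0.0                       # start from an arbitrary weight
    for _ in range(epochs):
        grad = 2 * (w - 3)        # derivative of (w - 3)**2
        w -= lr * grad            # step downhill, scaled by the learning rate
    return w

print(round(gradient_descent(), 4))  # converges to 3.0
```

Try `lr=1.1` to see overshooting, or `lr=0.0001` to see painfully slow convergence, the two failure modes described under Learning Rate.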
Weights The learnable parameters inside a model, especially in neural networks. The model adjusts these during training to reduce loss.
Bias (in neural networks) An extra learnable parameter added to each layer of a neural network. Different from the "bias" in bias-variance — this one helps the model fit data more flexibly.
Optimizer The algorithm that performs gradient descent and updates weights. Common optimizers include SGD (Stochastic Gradient Descent), Adam, and RMSProp.
Early Stopping Stopping training before it completes all planned epochs — when the validation performance stops improving. Prevents overfitting.
Momentum A concept in optimization where the model remembers the direction of previous updates and uses that to move faster in the right direction — like a ball rolling downhill gaining speed.
PART 10: ALGORITHM-SPECIFIC TERMS
Linear & Logistic Regression:
- Intercept — the value of the output when all inputs are zero (the starting point of the line).
- Coefficient / Weight — how much each feature influences the prediction.
- Decision Boundary — the line (or curve) that separates different classes in classification.
- Sigmoid Function — a mathematical function that converts any number to a value between 0 and 1. Used in logistic regression to output probabilities.
- Probability Threshold — in classification, the cutoff point (usually 0.5) above which you predict the positive class.
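The sigmoid and the probability threshold are simple to write out. A plain-Python sketch:

```python
import math

# Sigmoid squashes any number into (0, 1); threshold at 0.5 to pick a class.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_class(z, threshold=0.5):
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))           # 0.5, exactly on the decision boundary
print(predict_class(2.0))   # 1
print(predict_class(-2.0))  # 0
```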
Decision Trees:
- Node — a decision point where the tree asks a question about a feature.
- Root Node — the very first question at the top of the tree.
- Leaf Node — a final answer at the bottom of the tree — no more questions after this.
- Branch — a path between nodes depending on the answer to a question.
- Depth — how many levels the tree has. Deeper = more complex.
- Pruning — cutting off branches of the tree to make it simpler and reduce overfitting.
- Gini Impurity — a measure used to decide which feature to split on. Lower Gini = purer split.
- Information Gain — how much knowing a feature reduces uncertainty. Higher is better for splitting.
- Entropy — a measure of disorder or impurity in a group of samples.
Ensemble Methods:
- Ensemble — combining multiple models to get better results than any single model alone.
- Bagging (Bootstrap Aggregating) — training multiple models on different random subsets of data and averaging their results. Reduces variance.
- Random Forest — an ensemble of many decision trees using bagging. Each tree sees a random subset of features and data.
- Boosting — training models sequentially, where each new model focuses on correcting errors made by the previous one. Reduces bias.
- XGBoost / LightGBM / CatBoost — powerful, popular boosting algorithms widely used in competitions and real projects.
- Voting — combining predictions from multiple models by majority vote (classification) or average (regression).
- Stacking — using predictions of multiple models as inputs to a final model that makes the ultimate prediction.
SVM (Support Vector Machine):
- Hyperplane — the decision boundary that separates classes.
- Support Vectors — the data points closest to the hyperplane. They are the most important points that define the boundary.
- Margin — the distance between the hyperplane and the nearest support vectors. SVM tries to maximize this.
- Kernel — a function that transforms data into a higher dimension to make it linearly separable. Common kernels: linear, polynomial, RBF.
KNN (K-Nearest Neighbors):
- K — the number of nearest neighbors to look at when making a prediction.
- Distance Metric — how you measure "closeness" between points. Common choices: Euclidean distance, Manhattan distance.
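The whole KNN algorithm fits in a few lines. A toy sketch using Euclidean distance and majority voting (the training points are made up):

```python
from collections import Counter

# Tiny KNN classifier: Euclidean distance, majority vote of the K nearest.
def knn_predict(train, query, k=3):
    # train is a list of (features, label) pairs
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
          ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(points, (2, 2)))  # "A" (all 3 nearest neighbors are A)
```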
Clustering:
- Centroid — the center point of a cluster. In K-Means, each cluster has one centroid.
- K (in clustering) — the number of clusters you want the algorithm to find.
- Inertia — total distance of all points from their cluster centroid. Lower = tighter, better clusters.
- Elbow Method — a technique to find the best value of K by plotting inertia and looking for where improvement slows down (the "elbow" shape).
- DBSCAN — a clustering algorithm that finds clusters of any shape and automatically labels outliers as noise.
Neural Networks:
- Neuron / Node — the basic unit of a neural network. Receives inputs, applies a function, passes output forward.
- Layer — a group of neurons. Networks have an input layer, hidden layers, and an output layer.
- Input Layer — receives raw feature data.
- Hidden Layer — processes information between input and output. More hidden layers = deeper network.
- Output Layer — produces the final prediction.
- Activation Function — a mathematical function applied to each neuron's output to introduce non-linearity. Common ones: ReLU, Sigmoid, Tanh, Softmax.
- ReLU (Rectified Linear Unit) — the most common activation function. Outputs the input if positive, otherwise outputs 0. Simple and very effective.
- Softmax — activation function used in the output layer for multiclass classification. Converts raw numbers into probabilities that sum to 1.
- Deep Learning — machine learning using neural networks with many hidden layers. Handles complex tasks like images, speech, and language.
- Backpropagation — the algorithm used to train neural networks. It calculates how much each weight contributed to the error and adjusts them accordingly.
- Feedforward — data moves in one direction: input → hidden layers → output. No loops.
- CNN (Convolutional Neural Network) — a type of neural network designed for images. Detects patterns like edges, shapes, and textures.
- RNN (Recurrent Neural Network) — a type of neural network designed for sequences — text, time series, audio. Has memory of previous inputs.
- LSTM (Long Short-Term Memory) — an improved RNN that can remember information over long sequences without forgetting.
PART 11: ADVANCED CONCEPTS
PCA (Principal Component Analysis) A dimensionality reduction technique that combines correlated features into fewer, uncorrelated components while keeping as much information as possible.
Feature Importance A score that tells you how much each feature contributed to the model's predictions. Helps you understand what the model is actually using.
Cross-Entropy Loss The standard loss function for classification problems. Measures the difference between predicted probabilities and actual labels.
Softmax Converts a list of raw numbers into probabilities that all add up to 1. Used in final layer of multiclass classification.
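Softmax is a short formula: exponentiate each number, then divide by the total. A plain-Python sketch (subtracting the max first is a standard numerical-stability trick):

```python
import math

# Softmax: exponentiate, then divide by the total so outputs sum to 1.
def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # largest logit gets the largest probability
print(round(sum(probs), 6))          # 1.0
```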
One vs Rest (OvR) A strategy for multiclass classification using binary classifiers. Train one classifier per class: "is this class A or not?" Repeat for every class.
Probability Calibration Making sure a model's predicted probabilities actually reflect real-world probabilities. A model that says 70% confident should be correct about 70% of the time.
Ensemble Diversity For ensembles to work well, the individual models must make different kinds of errors. More diverse models = better ensemble.
Learning Curve A graph showing training and validation performance as training data size increases. Helps diagnose if you need more data or a better model.
Confusion Matrix Normalization Showing the confusion matrix as percentages instead of raw counts — easier to compare across different dataset sizes.
GridSearchCV A method of hyperparameter tuning that exhaustively tries every possible combination of hyperparameter values you specify.
RandomizedSearchCV A faster version of GridSearchCV that tries random combinations instead of all possible ones. Good when you have many hyperparameters.
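The difference between the two search strategies is easy to see in code. A sketch of the search spaces only, not the actual model fitting (the hyperparameter names here are illustrative, not a real API):

```python
import itertools
import random

# Grid search tries every combination; randomized search samples only a few.
grid = {"max_depth": [3, 5, 7], "n_trees": [50, 100]}   # illustrative names

def all_combinations(grid):
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*grid.values())]

combos = all_combinations(grid)
print(len(combos))                    # 6 combinations: 3 depths x 2 tree counts

random.seed(0)
sampled = random.sample(combos, 2)    # randomized search: try just 2 of them
print(sampled)
```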
NLP (Natural Language Processing) A field of ML focused on understanding and generating human language — text classification, translation, summarization, chatbots.
Computer Vision A field of ML focused on understanding images and videos — object detection, face recognition, medical imaging.
Time Series Data collected over time in sequence — stock prices, weather, sensor data. Requires special ML techniques that respect the time order.
Deployment Taking a trained model and making it available to be used in a real application. The model goes "live" and starts making predictions for real users.
Model Serialization / Saving a Model Saving a trained model to a file so you can load and use it later without retraining from scratch.
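In Python this is commonly done with the standard-library pickle module. A minimal sketch using a dictionary as a stand-in for a trained model object:

```python
import pickle

# A stand-in for a trained model: any Python object can be pickled.
model = {"slope": 2.5, "intercept": 1.0}

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)          # save once, right after training

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)      # load later, no retraining needed

print(restored == model)  # True
```

For scikit-learn models specifically, the joblib library is a popular alternative for the same purpose.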
API (Application Programming Interface) A way to expose your trained model so other applications can send data to it and receive predictions back.
PART 12: COMPLETE QUICK REFERENCE
| Term | What It Means in One Line |
|---|---|
| Dataset | Collection of data in rows and columns |
| Sample | One single row of data |
| Feature | Input column used for prediction |
| Label / Target | Output column to be predicted |
| Feature Vector | All features for one single sample |
| Dimensionality | Number of features in the dataset |
| Imputation | Filling missing values intelligently |
| Outlier | Extreme value far from all others |
| Noise | Random meaningless variation in data |
| Standardization | Scale data to mean=0, std=1 |
| Normalization | Scale data to range 0 to 1 |
| Label Encoding | Convert categories to 0, 1, 2... |
| One-Hot Encoding | Create separate 0/1 column per category |
| Feature Engineering | Create new useful features manually |
| Feature Selection | Remove useless features |
| EDA | Explore data before modeling |
| Correlation | How strongly two features are related |
| Training Set | Data the model learns from |
| Test Set | Unseen data to evaluate model |
| Validation Set | Data to tune model during training |
| Data Leakage | Test info accidentally used in training |
| Cross-Validation | Evaluate by rotating test sets K times |
| random_state | Seed to make random operations repeatable |
| Model | Learned function: inputs → predictions |
| Algorithm | Method used to build the model |
| Parameters | Values model learns automatically |
| Hyperparameters | Settings you control before training |
| Overfitting | Memorized training data, fails on new |
| Underfitting | Too simple, fails everywhere |
| Bias | Error from oversimplification |
| Variance | Error from oversensitivity to training data |
| Generalization | Performance on new unseen data |
| Regularization | Penalty to keep model simple |
| Loss Function | Measures how wrong predictions are |
| Gradient Descent | Walk downhill to minimize loss |
| Learning Rate | Step size in gradient descent |
| Epoch | One full pass through training data |
| Batch Size | Samples processed in one update |
| Convergence | When model stops improving |
| Accuracy | % of correct predictions |
| Precision | Correct positives out of all predicted positives |
| Recall | Correct positives out of all actual positives |
| F1 Score | Balance of Precision and Recall |
| AUC-ROC | Model's ability to separate classes |
| MAE | Average prediction error (regression) |
| RMSE | Root mean squared error (regression) |
| R² Score | How much variance model explains |
| Confusion Matrix | Table of TP, TN, FP, FN |
| Ensemble | Combining multiple models |
| Bagging | Train on random subsets, average results |
| Boosting | Models correct each other sequentially |
| Random Forest | Many decision trees combined (bagging) |
| XGBoost | Powerful boosting algorithm |
| Pruning | Simplify decision tree by cutting branches |
| Kernel (SVM) | Transforms data for non-linear problems |
| Centroid | Center point of a cluster |
| Activation Function | Introduces non-linearity in neural networks |
| Backpropagation | How neural networks learn from errors |
| Dropout | Randomly disable neurons to prevent overfitting |
| Deep Learning | Neural networks with many layers |
| Transfer Learning | Reuse model trained on another task |
| PCA | Reduce features while keeping information |
| NLP | ML for text and language |
| Deployment | Making model available in real applications |