Overview
Simple story: Data → Insights → AI Model → Decisions
Data Science = Understand & predict using data. AI = Learn & act automatically. ML = the overlap (learning from data).
Teaching hook: "How does Netflix know what you'll like next?"
Definitions
Data Science = Collect, clean, analyze data → find patterns → make predictions → communicate insights.
AI = Systems that learn from data and make decisions (e.g., language, vision, recommendations).
ML = A subset of AI often used inside Data Science to learn patterns from data.
Relationship
AI needs good data. Data Science prepares the data so AI models can learn reliably.
One-liner: Data Science prepares the fuel; AI is the engine.
How they relate (Venn Diagram)
[Venn diagram: the Data Science circle (Stats · EDA · Viz) overlaps the AI circle (NLP · Vision · Robotics); the overlap is ML (learns from data), which contains Deep Learning (neural nets)]
🔵 Data Science overlaps with AI through Machine Learning
🟡 ML is a subset of AI – it learns patterns from data
🔴 Deep Learning is a subset of ML – it uses neural networks for complex tasks
🗣️ How to explain each circle
🍕 The Zomato Story – One example, all four concepts
Set the scene: You open Zomato to order dinner. Behind that simple tap, all four fields are working together. Let's follow your order…
📊 Chapter 1 – Data Science: "What do people eat?"
Zomato's data team collects millions of orders โ timestamps, locations, ratings, cuisine type, weather, festivals, and more.
A data scientist cleans this data (removing duplicates, fixing missing pincodes) and runs an analysis:
💡 "Biryani orders spike 40% on Sundays in Hyderabad, and 60% during rain."
They build dashboards showing trends city-by-city, cuisine-by-cuisine. The operations team uses this to plan restaurant partnerships and delivery fleet allocation.
Tools used: Python, Pandas, SQL, Matplotlib, Power BI
Key takeaway: Data Science answers "What happened?" and "Why?"
📈 Chapter 2 – Machine Learning: "What will they order next?"
Now Zomato wants to predict, not just report. The ML team takes the cleaned data and trains a model:
💡 "Users who order biryani on Sunday also order gulab jamun 70% of the time – show gulab jamun as a combo suggestion."
The model learns patterns without being explicitly programmed – it figures out rules on its own from thousands of order histories.
🏷️ Features vs Labels
Before training a model, you split your data into two parts:
Features (Input → X)
The information the model uses to make a prediction
Zomato example:
- Distance to restaurant
- Time of day
- Day of week
- Weather
- Restaurant prep time
Label (Output → Y)
The answer the model is trying to predict
Zomato example:
- Delivery time (e.g., 35 minutes)
💡 In supervised learning the label is known during training. In unsupervised learning there is no label – the model discovers patterns on its own.
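The Features/Label split above can be sketched in a few lines of pandas. This is a toy illustration – the column names and numbers are invented, not Zomato's real schema:

```python
import pandas as pd

# Toy delivery records -- column names and values are invented for illustration
df = pd.DataFrame({
    "distance_km":   [2.5, 6.0, 1.2, 4.8],   # feature
    "hour_of_day":   [13, 20, 12, 21],        # feature
    "is_raining":    [0, 1, 0, 1],            # feature
    "delivery_mins": [22, 48, 18, 41],        # label (what we predict)
})

X = df[["distance_km", "hour_of_day", "is_raining"]]  # Features -> X
y = df["delivery_mins"]                               # Label -> Y
print(X.shape, y.shape)  # (4, 3) (4,)
```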
⚖️ Training Data vs Testing Data
You never test a model on the same data it was trained on – that's like giving a student the exact same exam they practiced with. Instead, you split:
Training Set (~70-80%)
The model learns from this data. It sees both features and labels, and adjusts itself to find patterns.
📖 Like studying from a textbook
Testing Set (~20-30%)
The model is evaluated on this data. It only sees features and must predict the label โ we compare its predictions to the real answers.
📝 Like taking the final exam
💡 Zomato example: Out of 1 lakh past delivery records, 80,000 are used to train the model (it learns that rain + long distance = slower delivery). The remaining 20,000 are used to test – did the model predict delivery time accurately on orders it never saw before?
Full dataset split
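A minimal sketch of that 80/20 split with scikit-learn – here 1,000 random synthetic records stand in for the 1 lakh real deliveries:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 synthetic records stand in for the 1 lakh real deliveries
X = np.random.rand(1000, 3)  # features: e.g. distance, hour, rain flag
y = np.random.rand(1000)     # label: delivery time

# 80% to learn from, 20% held back as the "final exam"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 800 200
```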
Types at play:
- Supervised ML: Predict delivery time (labeled data: past delivery times)
- Unsupervised ML: Group users into segments (budget eaters, health-conscious, party orderers)
- Reinforcement: Optimize delivery routes – the system tries different paths and learns which are fastest
Tools used: Scikit-learn, XGBoost, feature engineering on order data
Key takeaway: ML answers "What will happen?" – it learns patterns and makes predictions.
🧠 Chapter 3 – Deep Learning: "Understand photos, reviews & speech"
Some problems are too complex for traditional ML. Zomato needs to:
- 📸 Analyze food photos users upload – is it biryani or pulao? Is the presentation good? (Computer Vision using CNNs)
- 💬 Understand reviews – "The butter chicken was to die for but the naan was stale" – extract sentiment per dish, not just per restaurant (NLP using Transformers)
- 🎙️ Voice ordering – "Order my usual from Paradise Biryani" – understand spoken Hindi/English and map it to an order (Speech Recognition using RNNs)
💡 "A neural network with millions of parameters reads 10 lakh reviews and learns that 'fire' means great when talking about food, but bad when talking about delivery."
Tools used: PyTorch, TensorFlow, Hugging Face Transformers, CNNs, BERT
Key takeaway: Deep Learning handles unstructured data (images, text, audio) that traditional ML can't.
🤖 Chapter 4 – AI: "The whole system, acting smart"
Now put it all together. When you open Zomato at 8 PM on a rainy Sunday in Hyderabad:
- 📊 Data Science already found that biryani + rain + Sunday = peak demand
- 📈 ML predicts you'll order biryani (based on your past orders) and suggests a gulab jamun combo
- 🧠 Deep Learning shows you restaurants with the best food photos and filters out places with negative review sentiment
- 🤖 AI orchestrates everything: personalizes your home screen, estimates delivery in 35 min, assigns the nearest rider, adjusts pricing based on demand, and sends you a push notification: "Craving biryani? Paradise has 20% off tonight!"
💡 AI is the umbrella – it's the system that makes intelligent decisions by combining Data Science insights, ML predictions, and Deep Learning understanding.
Key takeaway: AI is not just one technique – it's the intelligent system that ties DS + ML + DL together to act automatically.
📊 The Zomato comparison table
🎯 Classroom recap – one sentence each
📊 Data Science = Zomato's analyst finds that biryani sells most on rainy Sundays.
📈 ML = The app learns your taste and predicts you'll want gulab jamun with it.
🧠 Deep Learning = It reads your review "butter chicken was fire 🔥" and knows that's a 5-star compliment.
🤖 AI = It puts it all together – personalizes your screen, estimates delivery, and nudges you with a discount at the perfect moment.
📊 Side-by-side comparison
🎯 Quick classroom activity
Ask learners to pick any other app (Swiggy, Amazon, Instagram, Spotify) and map the same four chapters:
1️⃣ What data is collected? → Data Science
2️⃣ What prediction is made? → ML
3️⃣ What unstructured data is understood? → Deep Learning
4️⃣ What smart decision is automated? → AI
Data Science contributes
- Data pipelines & quality
- EDA and insights
- Feature engineering
- Evaluation and interpretation
AI contributes
- Learning patterns from data
- Automation and decisions
- NLP / Vision capabilities
- Continuous improvement via feedback
🧠 Types of Machine Learning
Machine Learning is divided into three main types based on how the model learns from data.
Think of it this way:
📖 Supervised = Teacher gives you questions and answers → you learn the pattern
🔍 Unsupervised = No answers given → you find hidden groups on your own
🎮 Reinforcement = You learn by trial and error → rewards for good moves, penalties for bad ones
📖 Supervised Learning – "Learning with a teacher"
What is it?
The model is trained on labeled data – every input comes with the correct output (the "answer"). The model learns the mapping from input → output, then predicts answers for new, unseen data.
💡 Like a student studying past exam papers with answer keys – they learn the pattern and can answer new questions.
Two sub-types
Classification
Predict a category
- Is this email spam or not?
- Is this tumor malignant or benign?
- Will the customer churn? (Yes/No)
Regression
Predict a number
- What will the house price be?
- How many units will sell next month?
- What's the expected delivery time?
Real-world examples
Common algorithms
Linear Regression · Logistic Regression · Decision Trees · Random Forest · SVM · KNN · XGBoost · Neural Networks
🔍 Unsupervised Learning – "Learning without answers"
What is it?
The model is given data without labels – no correct answers. It must find hidden patterns, groupings, or structure on its own.
💡 Like dumping a pile of mixed Lego bricks on a table and asking someone to group them – nobody told them the categories, they figure it out by shape, color, and size.
Key techniques
Clustering
Group similar items together
- Customer segmentation (budget vs premium shoppers)
- Grouping news articles by topic
- Identifying patient groups with similar symptoms
Dimensionality Reduction
Simplify data while keeping meaning
- Compress 100 features down to 5 key ones
- Visualize high-dimensional data in 2D
- Speed up other ML models
Association Rules
Find items that occur together
- "People who buy bread also buy butter" (Market Basket)
- Recommend products on Amazon
Anomaly Detection
Spot unusual data points
- Credit card fraud detection
- Network intrusion detection
Real-world examples
Common algorithms
K-Means · DBSCAN · Hierarchical Clustering · PCA · t-SNE · Apriori · Isolation Forest
📚 Easy examples to explain in class
🍎 Example 1 – Sorting a fruit basket (Clustering)
Imagine you dump 100 mixed fruits on a table – apples, bananas, oranges, grapes. Nobody tells you the names.
You'd naturally group them by color, shape, and size:
- 🔴 Round + red → one pile
- 🟡 Long + yellow → another pile
- 🟠 Round + orange → another pile
- 🟣 Tiny + purple → another pile
💡 That's K-Means Clustering – the algorithm does exactly this with data points instead of fruits. You tell it "make 4 groups" and it figures out the best grouping.
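The fruit basket can be run as a tiny scikit-learn sketch. The "fruits" below are invented two-number descriptions ([size_cm, colour_hue]); K-Means is only told to make 4 groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented "fruits": [size_cm, colour_hue] -- four obvious natural groups
fruits = np.array([
    [7, 0],   [8, 1],   [7, 1],     # round + red (apples)
    [18, 60], [20, 58], [19, 61],   # long + yellow (bananas)
    [8, 30],  [9, 32],  [8, 31],    # round + orange (oranges)
    [2, 270], [2, 272], [3, 271],   # tiny + purple (grapes)
])

# "Make 4 groups" -- no names, no labels given
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(fruits)
print(km.labels_)  # each fruit gets a group number 0-3
```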
👗 Example 2 – Organizing your wardrobe (Clustering)
You have 200 clothes thrown in a pile. No labels. You start grouping:
- Formal shirts → one section
- Casual t-shirts → another
- Jeans → another
- Winter jackets → another
Nobody told you these categories – you discovered them by looking at fabric, style, and season.
💡 This is exactly what unsupervised learning does with customer data, documents, or images – it finds natural groups.
🛒 Example 3 – Supermarket shopping patterns (Association Rules)
A supermarket notices from millions of receipts:
- 🍞 People who buy bread usually also buy butter (82% of the time)
- 🍼 People who buy diapers on Friday nights also buy beer (famous real case!)
- ☕ People who buy coffee also buy sugar + milk
💡 Nobody programmed these rules – the Apriori algorithm discovered them from transaction data. That's why supermarkets place bread near butter!
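The bread-and-butter rule boils down to two numbers, support and confidence. Here is a hand-rolled sketch on five invented receipts – real systems run Apriori or FP-Growth over millions of transactions:

```python
# Five invented receipts -- each is the set of items on one bill
receipts = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "eggs"},
    {"milk", "eggs"},
]

with_bread = [r for r in receipts if "bread" in r]
with_both  = [r for r in with_bread if "butter" in r]

support    = len(with_both) / len(receipts)    # how often {bread, butter} appears at all
confidence = len(with_both) / len(with_bread)  # given bread, how often butter too
print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.60, confidence=0.75
```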
🏦 Example 4 – Catching a thief (Anomaly Detection)
Your bank knows your normal pattern:
- 📍 You usually shop in Hyderabad
- 💰 You spend ₹500–₹5,000 per transaction
- 🕐 You shop between 10 AM–10 PM
Suddenly: ₹95,000 spent at 3 AM in Romania 🚨
💡 The model was never told what "fraud" looks like – it just knows this transaction is very different from your normal behavior. That's Anomaly Detection using Isolation Forest.
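A quick scikit-learn sketch with made-up numbers: 200 normal spends between ₹500 and ₹5,000, plus the one ₹95,000 transaction. Isolation Forest scores every point without ever being told what fraud is:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.uniform(500, 5000, size=(200, 1))   # usual spending pattern
spends = np.vstack([normal, [[95000.0]]])        # the 3 AM Romania transaction

clf = IsolationForest(random_state=0).fit(spends)
scores = clf.decision_function(spends)  # lower score = more anomalous
print(scores.argmin())  # 200 -> the last row (the 95,000 spend) stands out most
```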
📱 Example 5 – Instagram "Explore" page (Clustering + Association)
Instagram doesn't ask you "Do you like travel photos?" – it watches your behavior:
- You liked 50 sunset photos, 30 food reels, 20 cricket highlights
- It groups you with users who have similar likes → Clustering
- It notices that people in your group also like mountain trek videos → Association
- Result: Your Explore page fills with sunsets, food, cricket, and treks 🎯
💡 No labels were needed – Instagram discovered your interests purely from behavior patterns.
🎓 Example 6 – Classroom analogy (for students)
Imagine a teacher has 40 students and wants to form study groups, but doesn't know anyone yet:
- She collects data: marks in Math, Science, English, attendance, participation
- She runs clustering and finds 4 natural groups:
  - 🟢 All-rounders – high marks everywhere
  - 🔵 Science stars – great at science, average elsewhere
  - 🟡 Creative writers – excel in English and projects
  - 🔴 Need support – low marks, low attendance
- She didn't pre-define these groups – the algorithm found them
💡 Ask students: "If I gave you the class data in a spreadsheet, could you find these groups without any labels? That's what unsupervised learning does – automatically."
🤔 When to use Supervised vs Unsupervised?
📋 Decision Table – Ask yourself these questions
The choice depends on one simple question: Do you have labeled data (answers)?
🗺️ Decision Flowchart
Step 1: Do you have labeled data?
├─ Yes → Do you want to predict a category or a number?
│   ├─ Category → Classification (e.g., spam filter, disease detection)
│   └─ Number → Regression (e.g., house price, delivery time)
├─ No → Do you want to group things or find oddities?
│   ├─ Group similar items → Clustering (e.g., customer segments)
│   ├─ Find items that go together → Association (e.g., bread + butter)
│   └─ Spot unusual behavior → Anomaly Detection (e.g., fraud)
└─ Agent learns by doing? → Reinforcement Learning (e.g., game AI, self-driving)
📝 Real Scenarios – Which would you pick?
🎯 Classroom tip: Give students a list of 5 problems and ask them to decide: Supervised or Unsupervised? It's a great discussion starter!
🎮 Reinforcement Learning – "Learning by trial and error"
What is it?
An agent interacts with an environment, takes actions, and receives rewards (positive) or penalties (negative). Over time, it learns a strategy (policy) that maximizes total reward.
💡 Like training a dog – it tries something, gets a treat (reward) or a "no" (penalty), and gradually learns what to do.
Key concepts
Components
- Agent – the learner (e.g., a robot, game player)
- Environment – the world it interacts with
- State – current situation
- Action – what the agent does
- Reward – feedback signal (+/−)
Exploration vs Exploitation
The agent must balance:
- Explore – try new actions to discover better rewards
- Exploit – use what it already knows works well
Like choosing between your favorite restaurant (exploit) vs trying a new one (explore).
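The restaurant analogy maps directly onto epsilon-greedy action selection, the simplest explore/exploit strategy. A sketch with two made-up "restaurants" and invented enjoyment probabilities:

```python
import random
random.seed(42)

true_reward = {"favourite": 0.8, "new_place": 0.6}  # hidden from the agent
estimates   = {"favourite": 0.0, "new_place": 0.0}  # what the agent believes
counts      = {"favourite": 0,   "new_place": 0}
epsilon = 0.1  # explore 10% of the time, exploit 90%

for _ in range(1000):
    if random.random() < epsilon:
        choice = random.choice(list(true_reward))   # explore: try anything
    else:
        choice = max(estimates, key=estimates.get)  # exploit: current best guess
    reward = 1 if random.random() < true_reward[choice] else 0
    counts[choice] += 1
    # incremental running average of observed rewards
    estimates[choice] += (reward - estimates[choice]) / counts[choice]

print(counts)  # the agent typically ends up visiting its true favourite far more
```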
Real-world examples
Common algorithms
Q-Learning · Deep Q-Network (DQN) · Policy Gradient · PPO · Actor-Critic · SARSA
📊 Side-by-side comparison
🎯 Quick classroom recap
📖 Supervised = "I'll teach you with examples and answers" → Predict spam, prices, diseases
🔍 Unsupervised = "Here's raw data, find the patterns" → Group customers, detect fraud
🎮 Reinforcement = "Try things, I'll reward good moves" → Play games, drive cars, optimize routes
📈 Linear Regression – In Depth
The simplest and most important supervised learning algorithm. If you understand this, you understand the foundation of all ML.
🎯 One-liner: Linear Regression finds the best straight line through your data points to predict a number.
📖 What is Linear Regression?
Given some input (X), predict a continuous number (Y) by fitting a straight line through the data.
The Formula
Y = mX + b
- Y = predicted value (what we want)
- X = input feature (what we know)
- m = slope (how much Y changes when X changes)
- b = intercept (value of Y when X = 0)
In Plain English
🏠 House price example:
"For every extra 100 sq ft of area, the price goes up by ₹5 lakh."
- X = area in sq ft (input)
- Y = price in ₹ (prediction)
- m = ₹5 lakh per 100 sq ft (slope)
- b = base price (intercept)
🧠 Intuition – The "Best Fit Line"
Imagine plotting student study hours (X) vs exam marks (Y) on a graph:
```
90 │                    •
80 │                •  •
70 │            •  /
60 │         •  /
50 │      •  /
40 │   • /
30 │  /•
   └───────────────────────
     1  2  3  4  5  6  7
         Study Hours
```
• = actual data points   / = best fit line
The best fit line is the one that is closest to all the points on average. Some points are above the line, some below โ but the line minimizes the total error.
💡 Now if a new student says "I studied 5 hours", you follow the line up and predict ~72 marks. That's linear regression!
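You can let NumPy find the best-fit line for you. The marks below are made up to mimic the chart, so the predicted value comes out slightly different from the ~72 in the text:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7])
marks = np.array([35, 42, 50, 58, 65, 73, 80])  # invented, roughly linear

m, b = np.polyfit(hours, marks, deg=1)  # least-squares fit of Y = mX + b
print(f"marks ~= {m:.1f} * hours + {b:.1f}")
print(round(m * 5 + b))  # prediction for a student who studied 5 hours -> 65
```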
⚙️ How does it learn? (Training process)
The model doesn't guess the line randomly. It uses a process called Gradient Descent:
- Start with a random line (random m and b)
- Predict Y for every X in training data
- Measure error – how far are predictions from actual values?
- Adjust m and b slightly to reduce the error
- Repeat steps 2–4 hundreds/thousands of times until the line fits best
Error measurement (Cost Function)
Mean Squared Error (MSE)
MSE = (1/n) × Σ (actual − predicted)²
Square the errors so negatives don't cancel positives. Smaller MSE = better line.
💡 Think of it like tuning a guitar – you pluck a string, hear it's off, tighten slightly, pluck again, repeat until it sounds right. That's gradient descent adjusting m and b.
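The five training steps fit in a dozen lines of Python. This sketch invents a tiny dataset that exactly follows y = 2x + 1, so we can watch gradient descent recover m ≈ 2 and b ≈ 1:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * X + 1               # true line the data follows

m, b = 0.0, 0.0             # step 1: start with a (bad) line
lr = 0.05                   # learning rate: size of each adjustment

for _ in range(2000):       # steps 2-4, repeated
    error = (m * X + b) - y                      # predict and measure error
    m -= lr * (2 / len(X)) * (error * X).sum()   # nudge m downhill on MSE
    b -= lr * (2 / len(X)) * error.sum()         # nudge b downhill on MSE

print(round(m, 2), round(b, 2))  # -> very close to 2.0 and 1.0
```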
🔢 Simple vs Multiple Linear Regression
Simple Linear Regression
1 feature → 1 prediction
Y = mX + b
Example: Predict house price from area alone
- X = area (sq ft)
- Y = price
Multiple Linear Regression
Many features → 1 prediction
Y = m₁X₁ + m₂X₂ + m₃X₃ + b
Example: Predict house price from area + bedrooms + location + age
- X₁ = area, X₂ = bedrooms, X₃ = location score
- Y = price
💡 In reality, predictions depend on many factors. Multiple regression is far more common in real projects.
🌍 Real-World Examples
⚠️ Assumptions (When does it work well?)
Linear regression works best when these conditions are roughly true:
1. Linearity
The relationship between X and Y is approximately a straight line, not a curve.
✅ Study hours vs marks (linear)  ❌ Age vs energy (curved)
2. No Outliers
Extreme values can pull the line away from the true pattern.
❌ One house sold for ₹100 crore in a ₹50 lakh neighborhood
3. Independence
Each data point should be independent – one observation shouldn't influence another.
4. No Multicollinearity
Input features shouldn't be too correlated with each other (for multiple regression).
❌ Using both "area in sq ft" AND "area in sq m" (they're the same thing)
💡 Don't worry about memorizing these – just remember: linear regression works when "more X → proportionally more/less Y" holds roughly true.
📈 How to evaluate? (Is the line good?)
After training, you need to measure how good the model is:
💡 Analogy: R² = 0.85 means your line explains 85% of why prices vary. The remaining 15% is due to things your model doesn't know (like renovation quality, neighbor noise, etc.).
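R² is easy to compute by hand and to cross-check against scikit-learn. The four price pairs below are invented:

```python
import numpy as np
from sklearn.metrics import r2_score

actual    = np.array([50, 60, 70, 80])   # invented true prices
predicted = np.array([52, 58, 71, 79])   # model's predictions

ss_res = ((actual - predicted) ** 2).sum()      # variation the model missed
ss_tot = ((actual - actual.mean()) ** 2).sum()  # total variation in the data
r2 = 1 - ss_res / ss_tot
print(round(r2, 2), round(r2_score(actual, predicted), 2))  # 0.98 0.98
```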
🐍 Python Code (Step by Step)
```python
# Step 1: Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Step 2: Load data
df = pd.read_csv("houses.csv")  # Columns: area, bedrooms, age, price

# Step 3: Define Features (X) and Label (Y)
X = df[["area", "bedrooms", "age"]]  # inputs
y = df["price"]                      # what we predict

# Step 4: Split into Training and Testing (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Create and Train the model
model = LinearRegression()
model.fit(X_train, y_train)  # <- this is where learning happens!

# Step 6: Predict on test data
y_pred = model.predict(X_test)

# Step 7: Evaluate
print("R² Score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))

# Step 8: See the learned formula
print("Coefficients (m):", model.coef_)
print("Intercept (b):", model.intercept_)
```
💡 What each step maps to:
Steps 1–2 = Data Science (collect & load)
Step 3 = Feature engineering (Features vs Labels)
Step 4 = Train/Test split
Step 5 = ML training (gradient descent finds best m and b)
Steps 6–7 = Evaluation (how good is our model?)
Step 8 = Interpretability (what did the model learn?)
✅ When to use / ❌ When NOT to use
✅ Use Linear Regression when
- You want to predict a number (not a category)
- The relationship looks roughly straight-line
- You need a simple, interpretable model
- You're starting out and need a baseline model
- You have limited data (it works with small datasets too)
❌ Don't use when
- The output is a category (use Logistic Regression instead)
- The relationship is curved (use polynomial or tree-based models)
- You have lots of outliers (they'll skew the line)
- Features are heavily correlated with each other
- The data is non-numeric (images, text → use deep learning)
🚨 Common mistakes beginners make
💼 Full Walkthrough – Predicting Salary from Experience (Y = mX + c)
🎯 Goal: A company wants to predict what salary to offer a new hire based on their years of experience.
Step 1: Collect the data
HR pulls salary records of 10 current employees:
X = Feature (what we know) | Y = Label (what we want to predict)
Step 2: Plot it โ do we see a line?
```
 20 │                          •
 18 │                       • /
 16 │                    • /
 14 │                 • /
 13 │              • /
 11 │           • /
  8 │        • /
6.5 │      • /
  5 │    • /
  3 │ • /
    └──────────────────────────────
      1  2  3  4  5  6  7  8  9  10
           Years of Experience
```
• = actual data   / = best fit line
✅ The data points roughly follow a straight line going upward – perfect for linear regression!
Step 3: Understand Y = mX + c
Salary = m × Experience + c
What is m? (Slope)
How much salary increases for each extra year of experience.
In our data: m ≈ 1.9
→ Every extra year adds roughly ₹1.9 lakhs to salary
What is c? (Intercept)
The starting salary when experience = 0 (a fresher straight out of college).
In our data: c ≈ 1.2
→ A fresher would start at roughly ₹1.2 lakhs
📌 Our learned formula:
Salary = 1.9 × Experience + 1.2
Step 4: Make predictions!
Now a candidate walks in with 4 years of experience. What salary should we offer?
Salary = 1.9 × 4 + 1.2
Salary = 7.6 + 1.2
Salary = ₹8.8 lakhs
Let's try a few more:
Step 5: Check – is the prediction good?
Compare predicted vs actual for data we already have:
💡 Errors are tiny (₹0.2–0.3 lakh off) – our line is a great fit! In real-world terms, the model is off by just ₹20,000–₹30,000.
Step 6: What does the slope and intercept MEAN?
Slope (m = 1.9)
"For every 1 extra year of experience, salary increases by ₹1.9 lakhs."
This is the rate of change – steeper slope = salary grows faster per year
Intercept (c = 1.2)
"A fresher with 0 years would earn ₹1.2 lakhs."
This is the starting point – where the line crosses the Y-axis
Step 7: What changes when m or c changes?
💡 Classroom question: "If Company A offers m=3, c=5 vs Company B offers m=1.5, c=10 – which is better for a fresher? Which is better after 15 years?"
🅰️ Company A: Salary = 3×0 + 5 = ₹5L (fresher) → 3×15 + 5 = ₹50L (15 yrs)
🅱️ Company B: Salary = 1.5×0 + 10 = ₹10L (fresher) → 1.5×15 + 10 = ₹32.5L (15 yrs)
📊 B pays more initially, but A overtakes after ~3.3 years and pays much more long-term!
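The classroom question is just two lines evaluated at different points; the crossover falls out of setting the equations equal (3x + 5 = 1.5x + 10 → x = 5/1.5 ≈ 3.33). A quick check in Python:

```python
def salary_a(years):  # Company A: steep slope, low start
    return 3.0 * years + 5.0

def salary_b(years):  # Company B: gentle slope, high start
    return 1.5 * years + 10.0

print(salary_a(0), salary_b(0))    # 5.0 10.0  -> B wins for a fresher
print(salary_a(15), salary_b(15))  # 50.0 32.5 -> A wins long-term
print(round(5 / 1.5, 2))           # 3.33 years -> where A overtakes B
```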
📋 Summary
📌 Formula: Salary = 1.9 × Experience + 1.2
📌 m (slope) = 1.9 → salary rise per year
📌 c (intercept) = 1.2 → fresher starting salary
📌 Prediction: 4 yrs → ₹8.8L, 8 yrs → ₹16.4L, 12 yrs → ₹24L
📌 How it learned: found the m and c that minimize prediction errors across all 10 employees
🎯 Classroom recap – Linear Regression in 30 seconds:
1️⃣ You have data with inputs (features) and a number to predict (label)
2️⃣ The algorithm finds the best straight line through the data
3️⃣ It uses that line to predict the number for new inputs
4️⃣ You measure how good it is with R² and MAE
5️⃣ It's the simplest ML algorithm – start here, then try fancier models
🧪 Linear Regression Lab – Step by Step
Follow each step in Google Colab or Jupyter Notebook. Every command is explained line-by-line so you know what it does and why.
Step 1 Import Libraries
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```
| Command | What It Does |
|---|---|
| import numpy as np | Loads NumPy – gives us fast math on arrays (mean, sum, matrix ops) |
| import pandas as pd | Loads Pandas – lets us work with data in tables (DataFrames) |
| import matplotlib.pyplot as plt | Loads Matplotlib – draws graphs and charts |
| from sklearn…train_test_split | Function to split data into training & testing sets |
| from sklearn…LinearRegression | The Linear Regression model class from scikit-learn |
| from sklearn…mean_squared_error, r2_score | Functions to measure how good our predictions are |
Step 2 Outlier Detection using IQR
📊 What is IQR (Interquartile Range)?
Imagine you lined up all employee salaries from lowest to highest. Now split them into 4 equal groups:
| Position | Lowest | Q1 (25%) | Median (50%) | Q3 (75%) | Highest |
|---|---|---|---|---|---|
| Salary | ₹12,000 | ₹25,000 | ₹40,000 | ₹60,000 | ₹95,000 |
| Percentile | 0% | 25% | 50% | 75% | 100% |
| ◄──── IQR = Q3 − Q1 = ₹60,000 − ₹25,000 = ₹35,000 ────► | | | | | |
| Term | Meaning | Example (Salaries in ₹) |
|---|---|---|
| Q1 (25th percentile) | 25% of data falls below this value | ₹25,000 |
| Q3 (75th percentile) | 75% of data falls below this value | ₹60,000 |
| IQR | The spread of the middle 50% of data (Q3 − Q1) | ₹60,000 − ₹25,000 = ₹35,000 |
| Lower Bound | Q1 − 1.5 × IQR → anything below is an outlier | ₹25,000 − ₹52,500 = −₹27,500 |
| Upper Bound | Q3 + 1.5 × IQR → anything above is an outlier | ₹60,000 + ₹52,500 = ₹1,12,500 |
🤔 Why do we need IQR?
Example: If 99 employees earn ₹20K–₹80K but one CEO earns ₹50,00,000 – that one value pulls the mean, distorts the model, and ruins predictions.
| Without Outlier Removal | With IQR Outlier Removal |
|---|---|
| Mean salary = ₹70,000 (inflated by CEO) | Mean salary = ₹42,000 (realistic) |
| Model tries to fit the extreme point | Model focuses on the real pattern |
| Best-fit line is tilted/wrong | Best-fit line is accurate |
| Poor predictions for normal employees | Good predictions for normal employees |
💻 The Code
```python
# Calculate Q1 (25th percentile) and Q3 (75th percentile) for 'monthly_salary'
Q1 = df['monthly_salary'].quantile(0.25)
Q3 = df['monthly_salary'].quantile(0.75)

# Calculate the Interquartile Range (IQR)
IQR = Q3 - Q1

# Define the lower and upper bounds for outlier detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1 (25th percentile): ₹{Q1:,.0f}")
print(f"Q3 (75th percentile): ₹{Q3:,.0f}")
print(f"IQR: ₹{IQR:,.0f}")
print(f"Lower Bound for Outliers: ₹{lower_bound:,.0f}")
print(f"Upper Bound for Outliers: ₹{upper_bound:,.0f}")
```
| Command | What It Does |
|---|---|
| df['monthly_salary'].quantile(0.25) | Finds Q1 – the salary at the 25th percentile (25% of employees earn less than this) |
| df['monthly_salary'].quantile(0.75) | Finds Q3 – the salary at the 75th percentile (75% of employees earn less than this) |
| IQR = Q3 - Q1 | Calculates the spread of the middle 50% of data |
| Q1 - 1.5 * IQR | Lower fence – salaries below this are unusually low (outliers) |
| Q3 + 1.5 * IQR | Upper fence – salaries above this are unusually high (outliers) |
| f"₹{Q1:,.0f}" | Formats the number with commas and the ₹ symbol, no decimals (e.g., ₹25,000) |
🧹 Remove Outliers
```python
# Count rows before outlier removal
rows_before = len(df)

# Filter out rows where 'monthly_salary' is outside the bounds
df = df[(df['monthly_salary'] >= lower_bound) & (df['monthly_salary'] <= upper_bound)]

# Count rows after outlier removal
rows_after = len(df)
rows_removed = rows_before - rows_after

# Print how many rows were removed and the thresholds
print(f"\nRemoved {rows_removed} rows with salary above ₹{upper_bound:,.0f} "
      f"or below ₹{lower_bound:,.0f}")
```
| Command | What It Does |
|---|---|
| len(df) | Counts total rows in the DataFrame – we save this before filtering |
| df[…] >= lower_bound | Creates a True/False mask – True for rows where salary is above the lower fence |
| df[…] <= upper_bound | Creates a True/False mask – True for rows where salary is below the upper fence |
| & (and operator) | Combines both conditions – keeps only rows that satisfy BOTH |
| df = df[…] | Overwrites the DataFrame with only the clean rows (outliers are dropped) |
| rows_before - rows_after | Tells you exactly how many outlier rows were removed |
📊 Visual Summary – Box Plot View
| Zone | Range | What Happens |
|---|---|---|
| Below Lower Bound | Below −₹27,500 | ❌ OUTLIER – removed |
| Lower Bound → Q1 | −₹27,500 → ₹25,000 | ✅ Normal (just low) |
| Q1 → Median | ₹25,000 → ₹40,000 | ✅ Middle 50% (IQR zone) |
| Median → Q3 | ₹40,000 → ₹60,000 | ✅ Middle 50% (IQR zone) |
| Q3 → Upper Bound | ₹60,000 → ₹1,12,500 | ✅ Normal (just high) |
| Above Upper Bound | Above ₹1,12,500 | ❌ OUTLIER – removed |
When to use IQR outlier removal:
✅ Before training any ML model (Linear Regression, etc.) – outliers distort the best-fit line
✅ During data cleaning / EDA phase
✅ When you see suspiciously high or low values in df.describe()
❌ Don't remove outliers blindly – sometimes they are real (e.g., a doctor's salary IS higher)
Step 3 One-Hot Encoding (Categorical → Numbers)
🤔 What is One-Hot Encoding?
Machine Learning models only understand numbers, not text. If your data has a column like department with values "HR", "IT", "Sales" – you must convert it to numbers before training.
One-Hot Encoding creates a new column for each category and fills it with 1 (yes, belongs) or 0 (no, doesn't belong):
❌ Before (Text – ML can't use this)
| Name | Department | Salary |
|---|---|---|
| Ravi | HR | ₹30,000 |
| Priya | IT | ₹50,000 |
| Amit | Sales | ₹35,000 |
| Neha | HR | ₹32,000 |
✅ After (Numbers – ML ready!)
| Name | Salary | dept_IT | dept_Sales |
|---|---|---|---|
| Ravi | ₹30,000 | 0 | 0 |
| Priya | ₹50,000 | 1 | 0 |
| Amit | ₹35,000 | 0 | 1 |
| Neha | ₹32,000 | 0 | 0 |
🗑️ Why drop_first=True? – The Multicollinearity Trap
Multicollinearity = when one column can be perfectly predicted from other columns. This confuses the model.
| Employee | dept_HR | dept_IT | dept_Sales | Sum | Problem? |
|---|---|---|---|---|---|
| Ravi | 1 | 0 | 0 | 1 | Sum is ALWAYS 1! dept_HR = 1 − dept_IT − dept_Sales → the HR column adds zero new info → this is multicollinearity |
| Priya | 0 | 1 | 0 | 1 | |
| Amit | 0 | 0 | 1 | 1 | |

| Scenario | What Happens |
|---|---|
| drop_first=False (keep all columns) | ❌ Multicollinearity – model coefficients become unstable & unreliable, small data changes cause wild swings in results |
| drop_first=True (drop one column) | ✅ No redundancy – each column provides unique information, model is stable |
💻 The Code
```python
# Apply one-hot encoding to the 'department' column
df = pd.get_dummies(df, columns=['department'], drop_first=True)

# Print the new column names to show the encoded 'department' columns
print("New column names after encoding 'department':")
print(df.columns.tolist())
```
| Command | What It Does |
|---|---|
| pd.get_dummies(df, …) | Pandas function that automatically converts categorical text columns into 0/1 numeric columns |
| columns=['department'] | Tells Pandas which column to encode. You can pass multiple: ['department', 'city'] |
| drop_first=True | Drops the first category (e.g., HR) to avoid multicollinearity. The dropped category becomes the baseline/reference |
| df.columns.tolist() | Returns all column names as a list – verify the new dummy columns appeared |
📋 Expected Output
```
New column names after encoding 'department':
['name', 'monthly_salary', 'department_IT', 'department_Sales']
```
'department_HR' is gone – it's the baseline (drop_first dropped it). When department_IT=0 AND department_Sales=0 → the employee is in HR.
✅ When a column has categories with no natural order (department, city, colour)
✅ Before training Linear Regression, Logistic Regression, or Neural Networks
❌ Don't use for ordered categories like "Low → Medium → High" – use Label Encoding (0, 1, 2) instead
❌ Don't use if a column has 100+ unique values (too many new columns!) – use target encoding or embeddings
Step 4 Split Data into Training & Testing
🤔 Why do we split data?
Imagine a student who memorises all the answers but can't solve a new question. That's what happens when a model trains AND tests on the same data – it just memorises, it doesn't learn.
📚 80%
Training Set
The model learns from this data โ finds patterns, builds the equation
๐งช 20%
Testing Set
The model is tested on this โ data it has never seen before
🎯 Features (X) vs Target (y)
| Term | Meaning | In Our Data |
|---|---|---|
| X (Features) | The input columns — what the model uses to make predictions | Experience, department_IT, department_Sales, etc. |
| y (Target) | The output column — what we want to predict | monthly_salary |
| df.drop('monthly_salary', axis=1) | Takes all columns EXCEPT salary — those become features | Everything except monthly_salary |
| df['monthly_salary'] | Picks only the salary column — this is the target | monthly_salary |
Full DataFrame (df):
| experience | department_IT | department_Sales | monthly_salary |
|---|---|---|---|
| 5 | 1 | 0 | ₹50,000 |
| 3 | 0 | 1 | ₹35,000 |
↓ df.drop('monthly_salary', axis=1) → X (Features)
| experience | department_IT | department_Sales |
|---|---|---|
| 5 | 1 | 0 |
| 3 | 0 | 1 |
↓ df['monthly_salary'] → y (Target)
| monthly_salary |
|---|
| ₹50,000 |
| ₹35,000 |
💻 The Code

```python
from sklearn.model_selection import train_test_split

# Define features (X) as all columns except 'monthly_salary'
X = df.drop('monthly_salary', axis=1)

# Define the target variable (y) as 'monthly_salary'
y = df['monthly_salary']

# Split the data into training and testing sets
# 80% for training, 20% for testing
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Print the shapes of the resulting datasets
print(f"Shape of X_train (training features): {X_train.shape}")
print(f"Shape of X_test (testing features): {X_test.shape}")
print(f"Shape of y_train (training target): {y_train.shape}")
print(f"Shape of y_test (testing target): {y_test.shape}")
```
| Command | What It Does |
|---|---|
| from sklearn…import train_test_split | Imports the splitting function from scikit-learn |
| df.drop('monthly_salary', axis=1) | Removes the salary column and keeps everything else as features. axis=1 means "drop a column" (axis=0 would drop a row) |
| df['monthly_salary'] | Selects only the salary column as the target variable |
| train_test_split(X, y, …) | Randomly shuffles the data and splits it into 4 parts: X_train, X_test, y_train, y_test |
| test_size=0.2 | 20% goes to testing, 80% goes to training. Common choices: 0.2, 0.25, 0.3 |
| random_state=42 | Fixes the random shuffle — so every student gets the same split. Without this, each run gives different results |
| X_train, X_test | Training features & Testing features (the input columns, split into two groups) |
| y_train, y_test | Training target & Testing target (the salary column, split into two groups) |
| .shape | Shows (rows, columns) — verify the split worked correctly |
📦 What do the 4 variables contain?
| Variable | Contains | Used For | Example Shape |
|---|---|---|---|
| X_train | 80% of feature rows | Model learns from these inputs | (800, 5) |
| y_train | 80% of salary values | Model learns the correct answers | (800,) |
| X_test | 20% of feature rows | Model predicts on these (unseen) | (200, 5) |
| y_test | 20% of salary values | We compare predictions vs these actual answers | (200,) |
| | Training (80%) | Testing (20%) |
|---|---|---|
| Features | X_train Input columns for LEARNING | X_test Input columns for TESTING |
| Target | y_train Correct salary answers for LEARNING | y_test Correct answers to CHECK predictions against |
❌ Forgetting random_state → each run gives different results, making debugging hard
❌ Using too small a test set (5%) → not enough data to reliably evaluate
❌ Using too large a test set (50%) → not enough data for the model to learn
✅ Sweet spot: test_size = 0.2 to 0.3 (20–30% testing)
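To see the split in action without the full salary dataset, here is a small self-contained sketch (the 10-row DataFrame is invented for illustration). With test_size=0.2, the 10 rows split into 8 training and 2 testing rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataset: 10 rows, 2 feature columns and a target column
df = pd.DataFrame({
    'experience': range(1, 11),
    'department_IT': [1, 0] * 5,
    'monthly_salary': [30000 + 4000 * e for e in range(1, 11)],
})

X = df.drop('monthly_salary', axis=1)   # features: everything except salary
y = df['monthly_salary']                # target: the salary column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```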
Step 5 Train the Model & Predict
🏋️ What does "Training" mean?
Training = the model looks at X_train (features) and y_train (correct salaries) and figures out the best equation that connects them.
model.fit(X_train, y_train) → What the model sees during training:
| X_train (Features) | | y_train (Correct Answer) |
|---|---|---|
| exp=5, IT=1, Sales=0 | → | ₹50,000 |
| exp=3, IT=0, Sales=1 | → | ₹35,000 |
| exp=8, IT=0, Sales=0 | → | ₹72,000 |
| ⬇️ Model learns the equation: Salary = m₁×Experience + m₂×IT + m₃×Sales + c | | |
Think of .fit() as a student studying. The model reads hundreds of examples (X_train + y_train) and figures out the pattern. After .fit() completes, the model has "learned" — it now knows the slope, intercept, and coefficients.
🔮 What does "Predict" mean?
model.predict(X_test) = give the model new inputs it has never seen, and it uses the learned equation to guess the salary.
y_pred = model.predict(X_test) → Model predicts on unseen data:
| X_test (Unseen Features) | | y_pred (Model's Guess) | y_test (Actual Answer) | Close? |
|---|---|---|---|---|
| exp=4, IT=0, Sales=0 | → | ₹38,500 | ₹40,000 | ✅ Off by ₹1,500 |
| exp=7, IT=0, Sales=1 | → | ₹61,200 | ₹58,000 | ⚠️ Off by ₹3,200 |
| ✅ Compare y_pred vs y_test to measure how good the model is! | | | | |
💻 The Code

```python
from sklearn.linear_model import LinearRegression

# Create an instance of the Linear Regression model
model = LinearRegression()

# Train the model using the training features (X_train) and target (y_train)
model.fit(X_train, y_train)

# Make predictions on the test set using the trained model
y_pred = model.predict(X_test)
print(y_pred)
```
| Command | What It Does |
|---|---|
| LinearRegression() | Creates a blank model object — it hasn't learned anything yet. Think of it as a new student on Day 1. |
| model.fit(X_train, y_train) | 🎯 THE TRAINING STEP! The model reads all training data and calculates the best coefficients (slopes) and intercept to minimise errors. |
| model.predict(X_test) | 🔮 THE PREDICTION STEP! Feeds new (unseen) feature values into the learned equation and gets predicted salaries back. |
| y_pred | An array of predicted salary values — one prediction for each row in X_test. |
| print(y_pred) | Displays the predictions. Example: [38500. 61200. 45800.] |
⚡ fit() vs predict() — Key Difference
| | model.fit() | model.predict() |
|---|---|---|
| When | Training phase | After training |
| Input | X_train + y_train (features AND answers) | X_test (features ONLY โ no answers) |
| What it does | Learns the equation (finds m and c) | Uses the equation to predict y values |
| Analogy | Student studying with textbook + answer key | Student writing the exam (no answer key!) |
| Runs | Once (on training data) | Many times (any new data) |
After .fit(), you can inspect what the model learned:
- model.coef_ → the slopes (one per feature)
- model.intercept_ → the starting value (c in Y = mX + c)
Example: coef_ = [1400, 5000, 3000] means each year of experience adds ₹1,400, being in IT adds ₹5,000, and being in Sales adds ₹3,000.
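A quick way to convince yourself: fit a model on toy data that follows an exact linear rule, then read the rule back from coef_ and intercept_. The numbers below are invented for illustration (₹1,400 per year of experience, ₹5,000 IT bonus, ₹20,000 base):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows an exact rule:
# salary = 1400*experience + 5000*is_IT + 20000
X = np.array([[1, 0], [2, 0], [3, 1], [4, 1], [5, 0]])  # [experience, is_IT]
y = 1400 * X[:, 0] + 5000 * X[:, 1] + 20000

model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # ≈ [1400. 5000.]  (the slopes, one per feature)
print(model.intercept_)  # ≈ 20000.0        (the c in Y = mX + c)
```

Because the toy data has no noise, the model recovers the rule almost exactly; on real data the coefficients are only a best-fit estimate.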
Step 6 Evaluate the Model (R², MAE, RMSE)
📏 Why do we evaluate?
The model made predictions (y_pred). But how good are they? We need numbers to measure accuracy — not just "looks okay".
| Employee | Actual (y_test) | Predicted (y_pred) | Error | Good? |
|---|---|---|---|---|
| Ravi | ₹40,000 | ₹38,500 | ₹1,500 | ✅ Small |
| Priya | ₹58,000 | ₹61,200 | ₹3,200 | ⚠️ Bigger |
| Amit | ₹32,000 | ₹33,100 | ₹1,100 | ✅ Small |
💻 The Code

```python
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Calculate R-squared score
R_squared = r2_score(y_test, y_pred)
print(f"R-squared (R²): {R_squared:.2f}")
print(f"Our model explains {R_squared*100:.2f}% of salary variation.\n")

# Calculate Mean Absolute Error (MAE)
MAE = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): ₹{MAE:,.2f}")
print(f"On average, our prediction is off by ₹{MAE:,.0f}.\n")

# Calculate Root Mean Squared Error (RMSE)
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): ₹{RMSE:,.2f}")
print(f"RMSE is ₹{RMSE:,.0f} — larger errors are penalised more.")
```
| Command | What It Does |
|---|---|
| r2_score(y_test, y_pred) | Calculates R² — how much of the salary variation the model can explain (0 to 1) |
| mean_absolute_error(y_test, y_pred) | Calculates MAE — average of \|actual − predicted\| (absolute differences) |
| mean_squared_error(y_test, y_pred) | Calculates MSE — average of (actual − predicted)² (squared differences) |
| np.sqrt(…) | Takes the square root of MSE to get RMSE — brings it back to ₹ units |
📈 R² (R-Squared) — "How much does the model explain?"
| R² = 0.85 | What it means |
|---|---|
| 85% | The model explains 85% of why salaries differ from person to person (experience, department, etc.) |
| 15% | The remaining 15% is due to factors the model doesn't know about (negotiation skills, company culture, luck) |
| R² Value | Rating | Meaning | Example |
|---|---|---|---|
| 0.90 – 1.00 | 🟢 Excellent | Model captures almost all patterns | Predicting house price from area + location + rooms |
| 0.70 – 0.89 | 🟡 Good | Model captures most patterns | Predicting salary from experience + department |
| 0.50 – 0.69 | 🟠 Moderate | Model captures some patterns — consider more features | Predicting marks from study hours alone |
| Below 0.50 | 🔴 Poor | Model is barely better than just guessing the average | Predicting stock price from weather |
📏 MAE (Mean Absolute Error) — "By how much am I off?"
MAE Calculation Example:
| Employee | Actual | Predicted | \|Actual − Predicted\| |
|---|---|---|---|
| Ravi | ₹40,000 | ₹38,500 | ₹1,500 |
| Priya | ₹58,000 | ₹61,200 | ₹3,200 |
| Amit | ₹32,000 | ₹33,100 | ₹1,100 |
| MAE = (1,500 + 3,200 + 1,100) ÷ 3 = | ₹1,933 | | |
| Feature | MAE |
|---|---|
| Formula | Average of \|actual − predicted\| for all test rows |
| Unit | Same as target (₹) — easy to interpret! |
| Treats errors | All errors equally — ₹100 off and ₹10,000 off are treated proportionally |
| Lower is | Better — ↓ means predictions are closer to reality |
| Best for | When you want a simple, intuitive measure of average mistake |
📐 RMSE (Root Mean Squared Error) — "How bad are the big mistakes?"
RMSE Calculation Example:
| Employee | Actual | Predicted | Error | (Error)ยฒ | Note |
|---|---|---|---|---|---|
| Ravi | ₹40,000 | ₹38,500 | ₹1,500 | 2,250,000 | |
| Priya | ₹58,000 | ₹61,200 | ₹3,200 | 10,240,000 | ← big error gets amplified! |
| Amit | ₹32,000 | ₹33,100 | ₹1,100 | 1,210,000 | |
| MSE = (2,250,000 + 10,240,000 + 1,210,000) ÷ 3 = | 4,566,667 | | | | |
| RMSE = √4,566,667 = | ₹2,137 | | | | |
| Feature | RMSE |
|---|---|
| Formula | √(average of (actual − predicted)²) |
| Unit | Same as target (₹) — interpretable |
| Treats errors | Big errors are penalised much more (squaring amplifies large mistakes) |
| Lower is | Better ↓ |
| Best for | When big mistakes are costly (e.g., medical, financial predictions) |
⚖️ MAE vs RMSE — When to use which?
| | MAE | RMSE |
|---|---|---|
| Treats all errors | Equally | Big errors punished more |
| Always | RMSE ≥ MAE | RMSE ≥ MAE |
| If MAE ≈ RMSE | All errors are similar in size (consistent model) ✅ | |
| If RMSE ≫ MAE | Some predictions have very large errors (investigate those!) ⚠️ | |
| Use MAE when | You want a simple average error — "how many ₹ off on average?" | |
| Use RMSE when | Big mistakes are expensive — medical dosage, loan amount, fraud detection | |
📋 Evaluation Cheat Sheet
| Metric | Question It Answers | Good Value | Direction |
|---|---|---|---|
| R² | "How much variance does the model explain?" | > 0.80 | Higher is better ⬆️ |
| MAE | "On average, how many ₹ am I wrong?" | As low as possible | Lower is better ⬇️ |
| RMSE | "How bad are my worst mistakes?" | Close to MAE | Lower is better ⬇️ |
1️⃣ Check R² first — is the model useful at all? (above 0.70? good!)
2️⃣ Check MAE — is the average error acceptable for your business? (₹2,000 off on a ₹50,000 salary = 4% — pretty good!)
3️⃣ Check RMSE vs MAE — if RMSE is much bigger than MAE, some predictions are wildly off → find and fix those rows
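You can verify the worked examples above by hand. The sketch below recomputes MAE and RMSE for the three test employees from the tables (Ravi, Priya, Amit), both manually with NumPy and via scikit-learn, and the numbers match: MAE ≈ ₹1,933, RMSE ≈ ₹2,137.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# The three test employees from the tables above (Ravi, Priya, Amit)
y_test = np.array([40000, 58000, 32000])  # actual salaries
y_pred = np.array([38500, 61200, 33100])  # model's guesses

# MAE: average of absolute errors = (1500 + 3200 + 1100) / 3
mae = np.mean(np.abs(y_test - y_pred))

# RMSE: square root of the average squared error
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

print(round(mae, 2))   # 1933.33
print(round(rmse, 2))  # 2136.98

# scikit-learn gives identical numbers
assert np.isclose(mae, mean_absolute_error(y_test, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_test, y_pred)))
```

Note that RMSE > MAE here because Priya's ₹3,200 error gets amplified by squaring, exactly as the RMSE table describes.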
Step 7 Visualise — Actual vs Predicted Chart
📊 What does this chart show?
This scatter plot answers one question: "How close are our predictions to reality?" Every dot is one employee from the test set.
| Element on the Chart | What It Represents | How to Read It |
|---|---|---|
| 🔵 Blue Dots | Each dot = one employee from the test set | X-axis = their actual salary; Y-axis = what the model predicted |
| 🟠 Orange Dashed Line | The Perfect Prediction Line (diagonal) | If the model were 100% correct, every dot would sit exactly on this line (actual = predicted) |
| Dot ON the line | Prediction = Actual | ✅ Perfect prediction! (rarely happens) |
| Dot ABOVE the line | Predicted > Actual | ⬆️ Model overestimated this employee's salary |
| Dot BELOW the line | Predicted < Actual | ⬇️ Model underestimated this employee's salary |
| Dots close together near the line | Consistent, accurate model | ✅ Low MAE/RMSE, high R² |
| Dots scattered far from the line | Inconsistent predictions | ❌ High errors → model needs more features or a different algorithm |
🎯 Reading the dots — Example
| Employee | Actual Salary (X-axis) | Predicted Salary (Y-axis) | Where is the dot? | Meaning |
|---|---|---|---|---|
| Ravi | ₹40,000 | ₹38,500 | Slightly below the line | Model underestimated by ₹1,500 |
| Priya | ₹58,000 | ₹61,200 | Slightly above the line | Model overestimated by ₹3,200 |
| Amit | ₹32,000 | ₹32,000 | On the line | Perfect prediction! ✅ |
| Sneha | ₹70,000 | ₹45,000 | Far below the line | ❌ Big miss — investigate why! |
💻 The Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 7))

# Create a scatter plot of actual vs predicted salaries
sns.scatterplot(x=y_test, y=y_pred, alpha=0.9)

# Plot a perfect prediction line (diagonal line from min to max)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='orange', linestyle='--', linewidth=2, label='Perfect Prediction')

# Set title and axis labels
plt.title('How close are our salary predictions?', fontsize=16)
plt.xlabel('Actual Salary (in ₹)', fontsize=12)
plt.ylabel('Predicted Salary (in ₹)', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()
```
| Command | What It Does |
|---|---|
| plt.figure(figsize=(10, 7)) | Creates a chart canvas — 10 inches wide, 7 inches tall |
| sns.scatterplot(x=y_test, y=y_pred) | Draws blue dots — X = actual salary, Y = predicted salary. Each dot = one employee. |
| alpha=0.9 | Dot transparency (0=invisible, 1=solid). 0.9 = slightly see-through so overlapping dots are visible |
| plt.plot([min, max], [min, max], …) | Draws the orange dashed diagonal line from the smallest to largest salary. This is the "perfect prediction" reference line. |
| color='orange', linestyle='--' | Makes the line orange & dashed (so it's easy to distinguish from dots) |
| plt.title(…) | Adds the chart title at the top |
| plt.xlabel / ylabel | Labels the X-axis (Actual) and Y-axis (Predicted) |
| plt.grid(True, linestyle='--') | Shows dashed grid lines for easier reading |
| plt.legend() | Shows the legend box identifying the orange line |
| plt.tight_layout() | Adjusts spacing so nothing gets cut off |
| plt.show() | Renders the chart on screen |
✅ How to judge your model from this chart
| What You See | What It Means | Rating |
|---|---|---|
| All dots hugging the orange line tightly | Predictions are very close to actual → excellent model | 🟢 Great |
| Most dots near the line, a few stray | Good overall, but some outlier predictions → investigate those employees | 🟡 Good |
| Dots scattered widely around the line | Predictions are inconsistent → model needs improvement | 🟠 Weak |
| Dots form a cloud with no pattern | Model has not learned the relationship → try more features or a different algorithm | 🔴 Poor |
Recap of all 7 steps:
1️⃣ Import Libraries → 2️⃣ IQR Outlier Removal → 3️⃣ One-Hot Encoding → 4️⃣ Train/Test Split → 5️⃣ Train & Predict → 6️⃣ Evaluate (R², MAE, RMSE) → 7️⃣ Visualise Results
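To tie the recap together, here is a condensed, self-contained sketch of Steps 1–6 on a small synthetic dataset (all numbers and the data-generating rule are invented for illustration; Step 7's chart is omitted since the plotting code appears in Step 7):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Steps 1-2: libraries + made-up data (in class, this would come from a CSV)
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    'experience': rng.integers(1, 15, n),
    'department': rng.choice(['HR', 'IT', 'Sales'], n),
})
dept_bonus = df['department'].map({'HR': 0, 'IT': 5000, 'Sales': 3000})
df['monthly_salary'] = (20000 + 1400 * df['experience'] + dept_bonus
                        + rng.normal(0, 1500, n))

# Step 2: IQR outlier removal on the target
Q1, Q3 = df['monthly_salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[df['monthly_salary'].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)]

# Step 3: one-hot encoding (HR becomes the baseline)
df = pd.get_dummies(df, columns=['department'], drop_first=True)

# Step 4: train/test split
X = df.drop('monthly_salary', axis=1)
y = df['monthly_salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: train & predict
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 6: evaluate
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²:   {r2:.2f}")
print(f"MAE:  ₹{mae:,.0f}")
print(f"RMSE: ₹{rmse:,.0f}")
```

Because the synthetic salaries follow a mostly linear rule with modest noise, R² comes out high; on messy real data, expect lower scores.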
Lifecycle
1) Problem definition
Define the question + success metric (e.g., predict churn).
2) Data collection
Collect from apps, logs, sensors, databases, surveys.
3) Data cleaning & prep
Fix missing values, duplicates, outliers, formats.
4) EDA
Explore patterns using statistics and visualizations.
5) Modeling (ML/AI)
Train classification/regression/clustering models.
6) Evaluation
Validate on unseen data using correct metrics.
7) Deployment
Expose model via API, batch jobs, or apps.
8) Monitoring
Watch drift, retrain, and improve iteratively.
Start with a hook (Ask students)
"How does Netflix know what movie you'll like next?"
Now explain
👉 Because it uses Data Science + Artificial Intelligence
Data Science
Understanding the past
AI
Acting on that understanding automatically
↓
Data Science tells us WHAT is happening
↓
AI decides WHAT TO DO about it
Examples (Data Science vs AI)
Quick classroom activity
Ask learners to pick any app (Swiggy / Amazon / Instagram) and answer:
1) What data is collected?
2) What insight is extracted (Data Science)?
3) What decision is automated (AI)?
Tip: You can also plug in your own examples from Data Science with AI.docx.