A Beginner’s Guide to Linear Regression in Machine Learning
Introduction
Have you ever judged someone’s age or net worth based on their physical characteristics or possessions? That intuition is essentially what a machine learning technique called linear regression does.
For example, if you assume someone who drives a Rolls-Royce is worth $400 million, while someone driving an old Toyota is worth only $50,000, you are making a prediction based on a single feature (the type of car they drive). Linear regression with a single feature is called simple linear regression.
Let’s dive in…
Machine Learning
In the field of artificial intelligence, machine learning allows machines to analyze patterns in data and make decisions based on those insights. This is achieved through algorithms and statistical models, which can be classified into three main categories: supervised learning, unsupervised learning, and reinforcement learning. All three approaches are driven by data.
Supervised learning: This involves training a machine using labelled data, where the desired output is already known. For example, a dataset containing images of cats and dogs could be labelled with the correct animal, allowing a machine to learn to identify new images as either a cat or a dog. There are two types of supervised learning: classification, where the goal is to predict a categorical label, and regression, where the goal is to predict a continuous value.
Unsupervised learning: This involves training a machine using unlabeled data, where the desired output is unknown. A machine learning model could be used to group a large dataset of Amazon products into categories, such as electronics, food and beverages, and fashion and accessories, using unsupervised learning. One type of unsupervised learning model is called clustering.
Linear Regression
Linear regression is a type of supervised machine learning algorithm used in the field of data science to understand and make predictions based on relationships between data. It is considered a simple model, and understanding how it works and how to apply it can be very beneficial for data scientists and machine learning enthusiasts, especially beginners.
Types of Linear Regression
Linear regression is a method of predicting a dependent variable (y) based on one or more independent variables (x). If there is only one independent variable, the linear regression is called Simple Linear Regression. If there are multiple independent variables, it is called Multiple Linear Regression.
Simple Linear Regression predicts a value based on a single independent variable, while Multiple Linear Regression predicts a value based on multiple independent variables.
A closer look… explaining the parameters of the equation y = mx + c shown in the image above.
y: This signifies the dependent variable. For instance, we will be predicting the salary of machine learning engineers based on their years of experience. The y signifies the salary of the machine learning engineers. It can also be called a target variable.
x: This signifies the independent variable or feature in a dataset. The dependent variable depends on it. In the above scenario, "years of experience" is the independent variable.
m: The "m" is the same one you will remember from your geometry class. It represents the gradient, also referred to as the "weight" in machine learning. The gradient is the slope of the line of best fit (more on that below).
c: This represents the intercept of the line of best fit on the y-axis.
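To make the parameters concrete, here is a tiny sketch of the equation y = mx + c in plain Python. The numbers are hypothetical, chosen only to illustrate how a gradient and intercept combine to produce a prediction:

```python
# Hypothetical parameters, not values learned from any real dataset.
m = 9000   # gradient: salary increase per extra year of experience
c = 25000  # intercept: predicted salary at zero years of experience

x = 5              # years of experience
y = m * x + c      # the linear regression equation
print(y)           # 70000
```

Once the model has learned good values for m and c from data, every prediction is just this one multiply-and-add.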
Line of Best Fit
The aim of a machine learning model is to produce the least possible error. The line of best fit is the line across our dataset with the least possible error.
The line of best fit is defined by its gradient and intercept (m and c). The aim of linear regression is to find the values of m and c that minimize the residuals, i.e. the differences between the actual values and the predicted values. The smaller the residuals, the better the fit of the line. If the residuals are large, it means the line of best fit is not accurate enough.
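A quick sketch of what "residual" means in practice, using made-up salary numbers (not the dataset from this article). Each residual is the gap between an actual value and what a candidate line predicts, and summing the squared residuals gives one common measure of how good the line is:

```python
import numpy as np

# Hypothetical actual salaries and predictions from a candidate line.
actual = np.array([40000, 50000, 65000])
predicted = np.array([42000, 49000, 66000])

residuals = actual - predicted          # gap between truth and prediction
sse = np.sum(residuals ** 2)            # sum of squared errors: lower is better
print(residuals)  # [-2000  1000 -1000]
print(sse)        # 6000000
```

Linear regression searches for the m and c that make this squared-error total as small as possible.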
Practical Approach to Linear Regression
In this section of the article, we’ll be applying everything we have learned to build a Simple Linear Regression model that predicts the salary of Machine Learning (ML) Engineers based on their years of experience.
Dataset
Data is the driving force of any machine learning model. Without data, there’s no model. We’ll be predicting based on two columns — YearsExperience and Salary. Our target variable is the Salary feature or column.
Model
I used the linear regression model in the Scikit-Learn library, which is written in the Python programming language. The Scikit-Learn library lets us bypass the intricacies of coding the linear model (y = mx + c) from scratch.
Software
The Jupyter notebook from the Anaconda distribution will be used for this purpose, as it supports the common data science and machine learning Python libraries and is widely used by data scientists and machine learning professionals for rapid modeling. Follow this link to learn how to install the Jupyter notebook.
Coding
1. Import the necessary libraries: NumPy handles numerical computations. The Pandas library helps with manipulating data. Matplotlib.pyplot helps to visualize the data. Sklearn.linear_model is the scikit-learn module that contains pre-made machine learning models, including linear regression.
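The import cell would look something like this (exact aliases are a common convention, not a requirement):

```python
import numpy as np                               # numerical computations
import pandas as pd                              # data manipulation
import matplotlib.pyplot as plt                  # data visualization
from sklearn.linear_model import LinearRegression  # the pre-made linear model
from sklearn.model_selection import train_test_split  # splitting data, used later
```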
2. Import the dataset: The dataset is a CSV (comma-separated value) file. It can be previewed on my GitHub repo.
3. Preview the data: It is best to check what your data looks like. Pandas provides a function called head() for that; it prints the first 5 rows of the data by default.
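Loading and previewing the data together might look like the sketch below. The inline rows here are a small hypothetical stand-in so the snippet runs on its own; in the notebook you would simply call pd.read_csv on the real CSV file from the repo:

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for the real CSV file;
# in the notebook this is just: df = pd.read_csv("Salary_Data.csv")
csv_text = io.StringIO(
    "YearsExperience,Salary\n"
    "1.1,39343\n"
    "2.0,43525\n"
    "3.2,60150\n"
)
df = pd.read_csv(csv_text)
print(df.head())  # first 5 rows by default (here, all 3 sample rows)
```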
4. Plot the data: Plotting the data helps us see the relationship between the independent and dependent variables. We can clearly see that as the years of experience increase, the salary of the ML Engineer also increases. The plt here comes from the Matplotlib library.
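A scatter plot of the two columns can be produced along these lines (the data points here are hypothetical samples, not the full dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small hypothetical sample of the dataset, for illustration only.
df = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, 3.2, 4.5, 5.9, 7.1],
    "Salary": [39343, 43525, 60150, 61111, 81363, 98273],
})

plt.scatter(df["YearsExperience"], df["Salary"])
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.title("Salary vs. Years of Experience")
plt.show()
```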
5. Modelling: This is where we apply the linear regression machine learning model from the Sklearn library. As we have been learning, x is the years of experience (the independent variable) and y is the salary (the dependent variable). I then used train_test_split to divide the x and y data into training and test groups to help prevent overfitting (a topic for another day). reg is the variable that holds the LinearRegression() model object.
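The modelling cell could be sketched like this. The arrays are hypothetical samples so the snippet is self-contained; in the notebook x and y come from the DataFrame columns, and test_size and random_state are common (but not mandatory) choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical sample data; the notebook uses the full CSV columns instead.
x = np.array([[1.1], [2.0], [3.2], [4.5], [5.9], [7.1], [8.7], [10.3]])
y = np.array([39343, 43525, 60150, 61111, 81363, 98273, 109431, 122391])

# Hold out 25% of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

reg = LinearRegression()
reg.fit(x_train, y_train)  # learn m and c from the training data
```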
Heads up …
In Scikit-learn, one thing you should note is that the fit function makes our model learn from the data, and the predict function then, as the name implies, outputs the target variable for new inputs.
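The fit/predict mechanics can be seen on a toy example. The points below lie exactly on the line y = 2x + 1 (made up for illustration), so the prediction is easy to verify by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 1, to show the fit/predict pattern.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(x, y)   # fit: learn m and c from the data
prediction = model.predict([[5.0]])    # predict: apply y = m*x + c
print(prediction)                      # ≈ [11.]
```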
6. Prediction: Remember, we said that linear regression prediction involves finding the line of best fit, which is defined by the values of the weights, m and c. So, let’s check the coefficient and intercept that make up the line.
From the image above, 93222.944 is the gradient, while 26095.89798 is the intercept. I’ll explain further with the diagram below.
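You can read m and c off any fitted model via coef_ and intercept_, and reconstruct a prediction by hand to confirm they really are the line's parameters. This sketch reuses the toy y = 2x + 1 data (hypothetical, not the salary dataset), so the expected values are known exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on the exact line y = 2x + 1 (hypothetical, not the salary data).
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
reg = LinearRegression().fit(x, y)

print(reg.coef_)       # gradient m, ≈ [2.]
print(reg.intercept_)  # intercept c, ≈ 1.0

# Rebuilding a prediction by hand: m*x + c matches reg.predict.
manual = reg.coef_[0] * 5.0 + reg.intercept_
print(manual)          # ≈ 11.0
```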
Final Prediction
To wrap things up, let’s visualize our line of best fit. The line of best fit is one of the most important ingredients in the linear regression model.
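Overlaying the fitted line on the scatter plot can be sketched as below. The sample points are hypothetical stand-ins; in the notebook you would plot the real dataset and the already-fitted reg:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample points; the notebook uses the full salary dataset.
x = np.array([[1.1], [2.0], [3.2], [4.5], [5.9], [7.1]])
y = np.array([39343, 43525, 60150, 61111, 81363, 98273])
reg = LinearRegression().fit(x, y)

plt.scatter(x, y, label="data")
plt.plot(x, reg.predict(x), color="red", label="line of best fit")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()
```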
Conclusion
Linear regression might not be the best tool for every real-life regression problem. However, it is one of the fundamentals you have to understand in machine learning, and it serves as a building block for many other machine learning models out there.
Click this link to download the notebook and the data used on my GitHub repository.
I hope this article was super helpful to you; glad you made it to the end. Endeavour to clap, follow, and share this article with others who need it. I hope to write more and better articles on Machine Learning, Data Analytics, Data Engineering, Python, and Personal Development.