PyCaret for Beginners: Building Your First Classification Model

Josiah Adesola
7 min read · Jan 9, 2023


Photo by Crew on Unsplash

Are you looking for a way to build a machine learning (ML) model without writing a lot of code? If you have a no-code or low-code background, the PyCaret library can make it easy for you to build ML models like a pro.

At the end of this article, you will:

  • Understand what PyCaret is and why it is useful.
  • Understand the basic machine learning workflow.
  • Build a breast cancer classification model using PyCaret.

What is PyCaret?

PyCaret is a low-code machine learning library for Python. It removes much of the repetitive work normally involved in building a model, so tasks that would take hours can be completed in minutes. It is amazing, right?

PyCaret is designed for people with little coding experience: it automates the machine learning workflow so you can build and deploy models with minimal code. It is also flexible and scalable, saving you a significant amount of coding time. Whether you are a beginner, an intermediate, or an experienced ML practitioner, PyCaret is an excellent choice.

Prerequisites

To follow along, you only need a Google Colab or Jupyter Notebook environment and basic familiarity with Python and Pandas.

Breast Cancer Classification Model With PyCaret

Breast cancer accounts for roughly a quarter of cancer cases in women, and hundreds of thousands of people die from it every year. It is caused by the abnormal growth of cells in the breast. This model will predict whether a patient’s breast cancer is malignant (deadly) or benign (non-deadly).

The dataset is the Breast Cancer Wisconsin (Diagnostic) dataset, which can be found on Kaggle.

  1. Install PyCaret: To use the PyCaret ML library, install it in your Jupyter notebook or Colab. This gives you access to the library’s functions and features within those environments.

The line of code below installs PyCaret along with the Python libraries it depends on.

Image by author
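
A minimal sketch of the install command, assuming you are working in a notebook cell (the PyCaret version you get may differ from the one used when this article was written):

```python
# Install PyCaret and its dependencies from inside the notebook
!pip install pycaret
```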

2. Import necessary libraries: For this project, you will need two Python libraries: Pandas and PyCaret. Pandas is already installed on your Google Colab or Jupyter notebook, so you only need to install PyCaret.
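
As a sketch, the imports for this project might look like the following (the classification module is the one used throughout this tutorial):

```python
import pandas as pd                    # data loading and preview
from pycaret.classification import *  # setup, compare_models, create_model, etc.
```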

3. Preview the dataset: To identify the target variable and the feature columns, you should take a look at the dataset first. Pandas can be used to preview the first five rows of the table.
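
A minimal sketch of loading and previewing the data, assuming the Kaggle CSV has been placed in the working directory as 'data.csv' (the file name is an assumption):

```python
# Load the breast cancer dataset; 'data.csv' is the assumed file name
df = pd.read_csv('data.csv')

# Preview the first five rows
df.head()
```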

The output produced from this code is shown below.

Image by author

The target variable is the `diagnosis` column, which represents whether the cancer is malignant (M) or benign (B). It is a categorical variable that can be preprocessed with PyCaret. The remaining variables are numerical and will be used to predict the target variable.


4. Data Preprocessing: This article does not focus on data preprocessing. However, it is important to note that machine learning models can only understand numerical data. Therefore, we will need to use the Scikit-learn library to convert the ‘diagnosis’ column, which contains the categorical data ‘M’ and ‘B’, to numerical values 1 and 0 respectively.

The diagnosis column now contains numbers instead of the letters M and B
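
A sketch of the encoding step with Scikit-learn; a simple mapping would work equally well. Note that LabelEncoder assigns codes alphabetically, so 'B' becomes 0 and 'M' becomes 1:

```python
from sklearn.preprocessing import LabelEncoder

# Convert the 'M'/'B' labels to numbers so the column is numeric
encoder = LabelEncoder()
df['diagnosis'] = encoder.fit_transform(df['diagnosis'])  # B -> 0, M -> 1

df.head()  # confirm the diagnosis column is now numeric
```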

5. Set up the PyCaret environment: To use PyCaret’s built-in functions, you must first set up a PyCaret environment. If you are using Colab, you also need to run the lines of code below to enable PyCaret in Google Colab; if you skip this step, you may encounter error messages.

This code enables PyCaret to function properly in Google Colab.
enable colab image by author
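
In PyCaret 2.x (the current release when this article was written), the Colab helper looks like this; newer releases may no longer need or provide it:

```python
# Needed only when running PyCaret 2.x inside Google Colab
from pycaret.utils import enable_colab
enable_colab()
```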

There are three important things to consider when setting up the PyCaret environment:

  1. Data: The dataset (df) should be mapped to the ‘data’ variable. Note that ‘df’ is just the variable name I chose for my dataset. You can use any variable name you prefer.
  2. Target: Set the target variable column using the ‘target’ parameter.
  3. Session ID: This helps to ensure the reproducibility of the code.
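
Putting those three pieces together, a minimal setup call might look like the sketch below (the session_id value 123 is arbitrary; any fixed integer gives reproducible results):

```python
from pycaret.classification import setup

# Initialize the PyCaret environment on the preprocessed dataframe
clf_setup = setup(
    data=df,              # the dataset loaded earlier
    target='diagnosis',   # the column we want to predict
    session_id=123        # fixes the random seed for reproducibility
)
```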

After running these lines of code, you get the output in the image below.

output from the above code by author

Here is a brief overview of the code output to help with understanding:

  • Target Type: This indicates the type of classification being performed. In this case, we are classifying breast cancer as either malignant or benign, so it is a binary classification.
  • Original Data: This shows the number of rows and columns in the dataset. The dataset in this example has 569 rows and 32 columns, including the target variable.
  • Transformed Train and Transformed Test Set: By default, PyCaret splits the original dataset into a training set and a test set (roughly a 70:30 split). The transformed sets have 30 feature columns instead of 31 because the ‘id’ column has been removed; it is not useful for prediction.
  • Missing value: The output indicates ‘false’, which means there are no missing values in the dataset.

There is additional information that can be obtained by running this operation in a Jupyter or Google Colab notebook.

6. Select the best model: This is a very simple task; it took only about 20 seconds to complete on Google Colab. The ‘compare_models’ function trains more than 15 models and ranks them from highest to lowest on the metric passed to the "sort" parameter, in this case the F1 score. You can try practising with other evaluation metrics such as accuracy and recall.

compare_model code output by author
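
A sketch of the comparison step; 'F1' is the metric name PyCaret expects for the sort parameter:

```python
# Train and cross-validate the available classifiers, ranked by F1 score
best_model = compare_models(sort='F1')
```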

7. Build a classifier model: The ‘create_model’ function can be used to build any of the classifier models using any of the acronyms listed in the image above.

light gbm classifier model by author
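
For example, the LightGBM classifier shown above can be created with its PyCaret model ID 'lightgbm':

```python
# Build and cross-validate a single LightGBM classifier
lightgbm = create_model('lightgbm')
```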

8. Hyperparameter Tuning: Hyperparameter tuning is a step in the ML workflow aimed at optimizing your model. The ‘tune_model’ function searches over a model’s hyperparameters and returns the best-performing configuration it finds.

Hyperparameter tuning with PyCaret by author
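
A sketch of the tuning step, again optimizing the F1 score (the metric choice is an assumption carried over from the comparison step):

```python
# Search for better hyperparameters for the LightGBM model
tuned_lightgbm = tune_model(lightgbm, optimize='F1')
```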

9. Feature Importance: Feature importance is a technique that assigns a value or score to each feature to indicate how significant it is in predicting the target variable. By removing redundant features and utilizing the most important ones, your model can make more accurate predictions.

PyCaret makes it easy to visualize the feature importance of your model with a graph.

Feature selection with PyCaret by author
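
With PyCaret, a feature importance chart like the one above can be drawn using ‘plot_model’ with the 'feature' plot type:

```python
# Bar chart of feature importances for the tuned model
plot_model(tuned_lightgbm, plot='feature')
```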

10. Model Evaluation: Using PyCaret, you can easily evaluate your model with just one line of code. In this example, two model evaluations were performed using the ROC-AUC curve and confusion matrix.

ROC-AUC Curve using PyCaret by author
Confusion matrix using PyCaret by author
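
The two evaluation plots mentioned above correspond to the 'auc' and 'confusion_matrix' plot types; alternatively, ‘evaluate_model’ opens an interactive widget with all of the available plots:

```python
# ROC-AUC curve for the tuned model
plot_model(tuned_lightgbm, plot='auc')

# Confusion matrix on the hold-out set
plot_model(tuned_lightgbm, plot='confusion_matrix')
```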

11. Prediction: You can easily make predictions on the hold-out test set and compare the results with the training scores, which helps you confirm that your evaluation metrics hold up on unseen data.

predict model by author
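
A sketch of the prediction step; called without a ‘data’ argument, ‘predict_model’ scores the hold-out test split created during setup:

```python
# Evaluate the tuned model on the hold-out test set;
# the trailing semicolon suppresses the returned dataframe in the notebook
predict_model(tuned_lightgbm);
```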

Remember to add a semicolon at the end of the code to produce the output above; in a notebook, the trailing semicolon suppresses the returned dataframe so that only the metrics table is displayed.

12. Save your model: Saving your model allows it to be reused in other applications or websites, and it can also be a way to deploy your machine learning solution. PyCaret makes this easy with just a few lines of code.

After running this line of code, your trained model will be saved in a pickle (pkl) file.

save model with PyCaret by author
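
A sketch of saving the pipeline; 'breast_cancer_model' is an assumed file name, and PyCaret appends the '.pkl' extension automatically:

```python
# Persist the trained pipeline to 'breast_cancer_model.pkl'
save_model(tuned_lightgbm, 'breast_cancer_model')

# Later, reload it with:
# model = load_model('breast_cancer_model')
```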

Glad you made it this far!

cheering gif from https://www.icegif.com/

Conclusion

In conclusion, PyCaret is an efficient tool for low-code machine learning practitioners and enthusiasts. It can be applied not only to binary classification problems, but also to multi-class classification, regression, clustering, data preprocessing, and even natural language processing problems.

I encourage you to practice along with this article and share your feedback for support and improvement. You can find the link to the full Jupyter notebook here.

See you later. Bye for now! Don’t forget to give a clap, and make necessary comments for feedback.


Written by Josiah Adesola

Writes about machine learning, Data Science, Python. Creative. Thinker. Engineering. Twitter: @_JosiahAdesola
