PyCaret for Beginners: Building Your First Classification Model
Are you looking for a way to build a machine learning (ML) model without writing a lot of code? If you have a no-code or low-code background, the PyCaret library can make it easy for you to build ML models like a pro.
By the end of this article, you will:
- Understand what PyCaret is.
- Understand the machine learning workflow.
- Build a breast cancer classification model using PyCaret.
What is PyCaret?
PyCaret is a low-code machine learning Python library. It eliminates much of the repetitive coding normally needed to create a model, so you can complete tasks that would otherwise take hours in minutes. It is amazing, right?
PyCaret is designed for individuals with little coding experience. It lets you easily build and deploy machine learning models by automating most of the ML workflow with minimal code. PyCaret is also flexible and scalable, saving you a significant amount of coding time. Whether you are a beginner, an intermediate learner, or an experienced ML practitioner, PyCaret is an excellent choice.
Prerequisites
- Fundamental knowledge of Python Programming.
- A Google Colab or Jupyter notebook to run the Python code.
- Basic understanding of binary classification.
- PyCaret version 2.0 or greater.
Breast Cancer Classification Model With PyCaret
Breast cancer accounts for roughly a quarter of cancer cases in women, and hundreds of thousands of people die from it every year. It is caused by the abnormal growth of cells in the breast. This model will predict whether a patient’s breast cancer is malignant (deadly) or benign (non-deadly).
The dataset can be found at this Kaggle link.
1. Download PyCaret: To use the PyCaret ML library, install it in your Jupyter notebook or Google Colab. This gives you access to the library’s functions and features within these environments.
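A single pip command is enough (run it in a terminal, or prefix it with `!` in a notebook cell):

```shell
pip install pycaret
```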
This line of code installs PyCaret together with the Python libraries it depends on.
2. Import necessary libraries: For this project, you will need two Python libraries: Pandas and PyCaret. Pandas is already installed on your Google Colab or Jupyter notebook, so you only need to install PyCaret.
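The imports for this project can be as simple as the sketch below (the wildcard import mirrors the style of most PyCaret 2.x tutorials; you can also import only the functions you need):

```python
import pandas as pd                   # data loading and preview
from pycaret.classification import *  # setup, compare_models, create_model, ...
```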
3. Preview the dataset: To help identify the target variable from the remaining variables, you should check the dataset. Pandas can be used to preview the first five rows of the table.
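Assuming you saved the Kaggle download as `data.csv` (the filename here is my own choice — use whatever name your file has), the preview looks like this:

```python
import pandas as pd

# Load the breast cancer dataset from the downloaded CSV
df = pd.read_csv("data.csv")

# Show the first five rows of the table
df.head()
```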
The target variable is the `diagnosis` column, which represents whether the cancer is malignant (M) or benign (B). It is a categorical variable that can be preprocessed with PyCaret. The remaining variables are numerical and will be used to predict the target variable.
4. Data Preprocessing: This article does not focus on data preprocessing. However, it is important to note that machine learning models can only understand numerical data. Therefore, we will need to use the Scikit-learn library to convert the ‘diagnosis’ column, which contains the categorical data ‘M’ and ‘B’, to numerical values 1 and 0 respectively.
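Here is a runnable sketch of that encoding step; the tiny four-row DataFrame stands in for the real dataset so the snippet is self-contained (in the notebook, `df` comes from `pd.read_csv`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the Kaggle DataFrame; in the notebook, df comes from pd.read_csv
df = pd.DataFrame({"diagnosis": ["M", "B", "B", "M"]})

# LabelEncoder sorts the classes alphabetically, so B -> 0 and M -> 1
encoder = LabelEncoder()
df["diagnosis"] = encoder.fit_transform(df["diagnosis"])

print(df["diagnosis"].tolist())  # -> [1, 0, 0, 1]
```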
5. Setup the PyCaret environment: To use PyCaret’s built-in functions, you must first set up a PyCaret environment. If you are using Colab, you will also need to run PyCaret’s Colab-enabling helper first; if you do not, you may encounter error messages.
There are three important things to consider when setting up the PyCaret environment:
- Data: The dataset (df) should be mapped to the ‘data’ variable. Note that ‘df’ is just the variable name I chose for my dataset. You can use any variable name you prefer.
- Target: Set the target variable column using the ‘target’ parameter.
- Session ID: This helps to ensure the reproducibility of the code.
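Putting those three pieces together, the setup call might look like this sketch (`session_id=123` is an arbitrary seed I chose; `enable_colab` is the PyCaret 2.x helper you run only on Colab):

```python
# Colab only (PyCaret 2.x): enable PyCaret's interactive displays
from pycaret.utils import enable_colab
enable_colab()

from pycaret.classification import setup

# df is the preprocessed DataFrame from the earlier steps
clf = setup(data=df, target="diagnosis", session_id=123)
```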
After running these lines of code, PyCaret prints a setup summary grid.
Here is a brief overview of the code output to help with understanding:
- Target Type: This indicates the type of classification being performed. In this case, we are classifying breast cancer as either malignant or benign, so it is a binary classification.
- Original Data: This shows the number of rows and columns in the dataset. The dataset in this example has 569 rows and 32 columns, including the target variable.
- Transformed Train and Transformed Test Set: By default, the original dataset is split into a 70:30 train/test ratio. There are 30 columns instead of 31 because the ‘id’ column has been removed, as it is not useful for prediction.
- Missing value: The output indicates ‘false’, which means there are no missing values in the dataset.
There is additional information that can be obtained by running this operation in a Jupyter or Google Colab notebook.
6. Select the best model: This is a very simple task that took only 20 seconds to complete on Google Colab. The code trained over 15 models and displayed them from highest to lowest F1-score using the ‘sort’ parameter. You can try practising with other evaluation metrics, such as accuracy and recall.
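The comparison itself is one line; sorting by F1-score looks like this (other metric names such as 'Accuracy' or 'Recall' can be passed the same way):

```python
# Train and cross-validate PyCaret's classifiers,
# ranked from highest to lowest F1-score
best_model = compare_models(sort="F1")
```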
7. Build a classifier model: The ‘create_model’ function can be used to build any classifier by its model acronym from the comparison table (for example, ‘rf’ for random forest or ‘lr’ for logistic regression).
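For example, building a random forest classifier from its acronym:

```python
# 'rf' is the acronym for random forest in PyCaret's model table
rf = create_model("rf")
```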
8. Hyperparameter Tuning: Hyperparameter tuning is a step in the ML workflow aimed at optimizing your model. The ‘tune_model’ function is used to optimize the hyperparameters of a model through this process.
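A sketch of the call, assuming `rf` is a model returned by `create_model`:

```python
# Search for better hyperparameters, optimizing for F1-score
tuned_rf = tune_model(rf, optimize="F1")
```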
9. Feature Importance: Feature importance is a technique that assigns a value or score to each feature to indicate how significant it is in predicting the target variable. By removing redundant features and utilizing the most important ones, your model can make more accurate predictions.
PyCaret makes it easy to visualize the feature importance of your model with a graph.
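Assuming `tuned_rf` is your trained model, the graph comes from `plot_model`:

```python
# Bar chart of feature importance scores
plot_model(tuned_rf, plot="feature")
```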
10. Model Evaluation: Using PyCaret, you can easily evaluate your model with just one line of code. In this example, two evaluations were performed: the ROC-AUC curve and the confusion matrix.
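Those two plots can be produced like this (again assuming `tuned_rf` is your trained model; `evaluate_model(tuned_rf)` opens an interactive widget with every available plot):

```python
# ROC-AUC curve
plot_model(tuned_rf, plot="auc")

# Confusion matrix
plot_model(tuned_rf, plot="confusion_matrix")
```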
11. Prediction: You can easily make predictions to compare the test dataset and the train dataset, which will help you determine the validity of your evaluation metrics.
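A sketch of the prediction step on the hold-out test set, assuming `tuned_rf` is your trained model:

```python
# Score the hold-out test set and display the prediction grid;
# the trailing semicolon keeps the notebook output tidy
predict_model(tuned_rf);
```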
Remember to add a semicolon at the end of the line of code so the notebook displays the prediction output cleanly.
12. Save your model: Saving your model allows it to be reused in various applications or websites, and it can also be a way to deploy your machine learning solution. PyCaret makes it easy to do this with just a few lines of code.
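A sketch of the call; the filename `breast_cancer_model` is my own choice:

```python
# Persist the trained pipeline to breast_cancer_model.pkl
save_model(tuned_rf, "breast_cancer_model")

# Later, in another script or app:
# model = load_model("breast_cancer_model")
```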
After running this line of code, your trained model will be saved in a pickle (pkl) file.
Glad you made it this far!
Conclusion
In conclusion, PyCaret is an efficient tool for low-code machine learning practitioners and enthusiasts. It can be applied not only to binary classification problems but also to multiclass classification, regression, clustering, data preprocessing, and even natural language processing problems.
I encourage you to practice along with this article and share your feedback for support and improvement. You can find the link to the full Jupyter notebook here.
See you later. Bye for now! Don’t forget to give a clap, and make necessary comments for feedback.