Python Power Tools for Data Science - Pycaret Anomaly Detection

In Python Power Tools for Data Science articles I look at Python tools that help automate or simplify common tasks a Data Scientist would need to perform. In this article I look at the Pycaret Anomaly Detection module and see how it can help automate anomaly detection.
python-power-tools
pycaret
Author

Pranath Fernando

Published

January 2, 2022

1 Python Power Tools for Data Science

In this series of articles, Python Power Tools for Data Science, I will be looking at Python tools that can make a significant improvement to common Data Science tasks. In particular, Python Power Tools are Python tools that can significantly automate or simplify common tasks a Data Scientist would need to perform.

Automation and simplification of common tasks can bring many benefits such as:

  • Less time needed to complete tasks
  • Reduction of mistakes due to less complex code
  • Improved readability and understanding of code
  • Increased consistency of approach to different problems
  • Easier reproducibility, verification, and comparison of results

2 Pycaret Anomaly Detection Module

Pycaret is a low-code Python library that aims to automate many tasks required for machine learning. Tasks that would usually take hundreds of lines of code can often be replaced with just a couple of lines. It was inspired by the Caret library in R.

In comparison with the other open-source machine learning libraries, PyCaret is an alternate low-code library that can be used to replace hundreds of lines of code with few words only. This makes experiments exponentially fast and efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and many more. (Pycaret Documentation)

Pycaret has different modules specialised for different machine learning use-cases. These include:

  • Classification
  • Regression
  • Clustering
  • Anomaly Detection
  • Natural Language Processing
  • Association Rule Mining
  • Time Series

See further articles about these other Pycaret modules and what they can offer.

In this article we will use the Anomaly Detection module of Pycaret, an unsupervised machine learning module that is used for identifying rare items, events, or observations. It provides a range of algorithms and plots to analyze the results, plus many other features.

3 Dataset - New York Taxi Passengers

The NYC Taxi & Limousine Commission (TLC) has released public datasets that contain data for taxi trips in NYC, including timestamps, pickup & drop-off locations, number of passengers, type of payment, and fare amount.

We will specifically use the data that contains the number of taxi passengers from July 2014 to January 2015 at half-hourly intervals, so this is a time series dataset.


# Imports used throughout this article
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Download taxi passenger data
data = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv')
data['timestamp'] = pd.to_datetime(data['timestamp'])
# Show first few rows
data.head()
timestamp value
0 2014-07-01 00:00:00 10844
1 2014-07-01 00:30:00 8127
2 2014-07-01 01:00:00 6210
3 2014-07-01 01:30:00 4656
4 2014-07-01 02:00:00 3820

# Show last few rows
data.tail()
timestamp value
10315 2015-01-31 21:30:00 24670
10316 2015-01-31 22:00:00 25721
10317 2015-01-31 22:30:00 27309
10318 2015-01-31 23:00:00 26591
10319 2015-01-31 23:30:00 26288

# Plot dataset 
plt.figure(figsize=(20,10))
sns.lineplot(x = "timestamp", y = "value", data=data)
plt.title('Number of NYC Taxi passengers by date July 2014 - January 2015')
plt.show()

We can't directly use timestamp data in anomaly detection models, so we need to convert it into other features such as day, week of year, and hour before we can use it. Let's do this now.


# Set timestamp to index
data.set_index('timestamp', drop=True, inplace=True)
# Resample timeseries from half-hourly to hourly totals
data = data.resample('H').sum()
# Create more features from the datetime index
data['day'] = [i.day for i in data.index]
data['day_name'] = [i.day_name() for i in data.index]
data['day_of_year'] = [i.dayofyear for i in data.index]
data['week_of_year'] = [i.weekofyear for i in data.index]  # isocalendar().week in newer pandas
data['hour'] = [i.hour for i in data.index]
data['is_weekday'] = [i.isoweekday() for i in data.index]  # ISO day number: 1 = Monday .. 7 = Sunday
data.head()
value day day_name day_of_year week_of_year hour is_weekday
timestamp
2014-07-01 00:00:00 18971 1 Tuesday 182 27 0 2
2014-07-01 01:00:00 10866 1 Tuesday 182 27 1 2
2014-07-01 02:00:00 6693 1 Tuesday 182 27 2 2
2014-07-01 03:00:00 4433 1 Tuesday 182 27 3 2
2014-07-01 04:00:00 4379 1 Tuesday 182 27 4 2

4 Pycaret workflow

4.1 Setup

The Pycaret setup() function is the first part of the workflow and always needs to be performed. It takes our data in the form of a pandas dataframe and performs a number of tasks to get it ready for the machine learning pipeline.


# Setup
from pycaret.anomaly import *
s = setup(data, session_id = 123)
Description Value
0 session_id 123
1 Original Data (5160, 7)
2 Missing Values False
3 Numeric Features 5
4 Categorical Features 2
5 Ordinal Features False
6 High Cardinality Features False
7 High Cardinality Method None
8 Transformed Data (5160, 19)
9 CPU Jobs -1
10 Use GPU False
11 Log Experiment False
12 Experiment Name anomaly-default-name
13 USI 5a80
14 Imputation Type simple
15 Iterative Imputation Iteration None
16 Numeric Imputer mean
17 Iterative Imputation Numeric Model None
18 Categorical Imputer mode
19 Iterative Imputation Categorical Model None
20 Unknown Categoricals Handling least_frequent
21 Normalize False
22 Normalize Method None
23 Transformation False
24 Transformation Method None
25 PCA False
26 PCA Method None
27 PCA Components None
28 Ignore Low Variance False
29 Combine Rare Levels False
30 Rare Level Threshold None
31 Numeric Binning False
32 Remove Outliers False
33 Outliers Threshold None
34 Remove Multicollinearity False
35 Multicollinearity Threshold None
36 Remove Perfect Collinearity False
37 Clustering False
38 Clustering Iteration None
39 Polynomial Features False
40 Polynomial Degree None
41 Trignometry Features False
42 Polynomial Threshold None
43 Group Features False
44 Feature Selection False
45 Feature Selection Method classic
46 Features Selection Threshold None
47 Feature Interaction False
48 Feature Ratio False
49 Interaction Threshold None

Calling the setup() function with one line of code does the following in the background:

  • Data types will be inferred for each column
  • A table of key information about the dataset and configuration settings is generated
  • Based on the types inferred and configuration chosen, the dataset will be transformed to be ready for the machine learning algorithms

Various configuration settings are available, but defaults are selected so none are required.

Some key configuration settings available include the following (a usage sketch follows this list):

  • Missing numeric values are imputed (default: mean); the iterative option uses a lightgbm model to estimate values
  • Missing categorical values are imputed (default: constant dummy value; alternatives include mode and iterative)
  • Encode categorical values as ordinal e.g. ‘low’, ‘medium’, ‘high’
  • High cardinality (default: false) options to compress to fewer levels or replace with a frequency or k-means clustering derived class
  • Define date fields explicitly
  • Normalise numeric fields (default: false) options include zscore, minmax, maxabs, robust
  • Power transforms (default: false) will transform to make data more Gaussian; options include yeo-johnson, quantile
  • PCA: Principal component analysis (default: false) reduces the dimensionality of the data down to a specified number of components
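
To give a flavour of how these options are used, here is a minimal sketch of a more customised setup() call. The parameter names match the settings table shown above, but the specific values chosen here are illustrative assumptions rather than recommendations.


# A more customised setup (illustrative values only)
s = setup(data,
          session_id = 123,                       # fix the random seed for reproducibility
          normalize = True,                       # normalise numeric fields
          normalize_method = 'zscore',
          transformation = True,                  # make numeric data more Gaussian
          transformation_method = 'yeo-johnson',
          pca = True,                             # reduce dimensionality
          pca_components = 5)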

4.2 Selecting and training a model

At the time of writing, there are 12 different anomaly detection models available within Pycaret, which we can display with the models() function.


# Check list of available models
models()
Name Reference
ID
abod Angle-base Outlier Detection pyod.models.abod.ABOD
cluster Clustering-Based Local Outlier pyod.models.cblof.CBLOF
cof Connectivity-Based Local Outlier pyod.models.cof.COF
iforest Isolation Forest pyod.models.iforest.IForest
histogram Histogram-based Outlier Detection pyod.models.hbos.HBOS
knn K-Nearest Neighbors Detector pyod.models.knn.KNN
lof Local Outlier Factor pyod.models.lof.LOF
svm One-class SVM detector pyod.models.ocsvm.OCSVM
pca Principal Component Analysis pyod.models.pca.PCA
mcd Minimum Covariance Determinant pyod.models.mcd.MCD
sod Subspace Outlier Detection pyod.models.sod.SOD
sos Stochastic Outlier Selection pyod.models.sos.SOS

We will choose to use the Isolation Forest model. Isolation Forest is similar to Random Forest in that it is an algorithm based on multiple decision trees; however, rather than aiming to model normal data points, Isolation Forest explicitly tries to isolate anomalous data points. The intuition is that anomalies are few and different, so they can be separated from the rest of the data in fewer random splits, giving them shorter average path lengths across the trees.
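
To make the isolation idea concrete, here is a small standalone sketch, separate from our Pycaret pipeline, using scikit-learn's IsolationForest (the estimator that the pyod IForest used by Pycaret builds on) with a toy dataset containing one obvious outlier.


# Standalone illustration of Isolation Forest (not part of the Pycaret pipeline)
import numpy as np
from sklearn.ensemble import IsolationForest

# 100 points clustered around the origin, plus one obvious outlier
rng = np.random.RandomState(123)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])

clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=123)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = normal

print(labels[-1])  # the isolated point should be flagged as -1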

There are many configuration hyperparameters for this model, which we can see by creating the model and printing its details, as below.


# Create model and print configuration hyper-parameters
iforest = create_model('iforest')
print(iforest)
IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)

One of the key configuration options is contamination, which is the proportion of outliers we assume to be in the data set. This is used when fitting the model to define the threshold on the anomaly scores of the samples. By default it is set to 5%, i.e. 0.05.
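
If we believe the proportion of anomalies is different from 5%, we can change it when creating the model. In Pycaret's anomaly module this is exposed as the fraction parameter of create_model(); the value below is just an illustrative assumption.


# Assume only 2.5% of points are anomalous (illustrative value)
iforest_strict = create_model('iforest', fraction = 0.025)
print(iforest_strict)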

We will now train the model and assign its anomaly labels to the dataset.


# Train and assign model to dataset
iforest_results = assign_model(iforest)
iforest_results.head()
value day day_name day_of_year week_of_year hour is_weekday Anomaly Anomaly_Score
timestamp
2014-07-01 00:00:00 18971 1 Tuesday 182 27 0 2 0 -0.015450
2014-07-01 01:00:00 10866 1 Tuesday 182 27 1 2 0 -0.006367
2014-07-01 02:00:00 6693 1 Tuesday 182 27 2 2 0 -0.010988
2014-07-01 03:00:00 4433 1 Tuesday 182 27 3 2 0 -0.017091
2014-07-01 04:00:00 4379 1 Tuesday 182 27 4 2 0 -0.017006

This adds 2 new columns to the dataset: an Anomaly column, which gives a binary value indicating whether a datapoint is considered an anomaly, and an Anomaly_Score column, which has a float value as a measure of how anomalous a datapoint is.
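
As a quick sanity check, we can confirm that the share of points flagged as anomalies matches the 5% contamination default; this is plain pandas rather than Pycaret functionality.


# Check the proportion of points labelled as anomalies (~5% by default)
print(iforest_results['Anomaly'].value_counts(normalize = True))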

4.3 Model Evaluation

So let's now evaluate our model by examining the datapoints it has labelled as anomalies.


# Show dates for first few anomalies
iforest_results[iforest_results['Anomaly'] == 1].head()
value day day_name day_of_year week_of_year hour is_weekday Anomaly Anomaly_Score
timestamp
2014-07-13 50825 13 Sunday 194 28 0 7 1 0.002663
2014-07-27 50407 27 Sunday 208 30 0 7 1 0.009264
2014-08-03 48081 3 Sunday 215 31 0 7 1 0.003045
2014-09-28 53589 28 Sunday 271 39 0 7 1 0.004440
2014-10-05 48472 5 Sunday 278 40 0 7 1 0.000325

# Plot data with anomalies highlighted in red
fig, ax = plt.subplots(figsize=(20,10))

# Create list of outlier_dates
outliers = iforest_results[iforest_results['Anomaly'] == 1]

p1 = sns.scatterplot(data=outliers, x = outliers.index, y = "value", ax=ax, color='r')
p2 = sns.lineplot(x = iforest_results.index, y = "value", data=iforest_results, color='b', ax=ax)

plt.title('Number of NYC Taxi passengers by date July 2014 - January 2015: Anomalies highlighted')
plt.show()

So we can see the model has labelled a few isolated points as anomalies between July 2014 and the end of 2014. However, near the end of 2014 and the start of 2015 we can see a huge number of anomalies, in particular for all of January 2015.

Let’s focus in on the period from January 2015.


# Plot data with anomalies highlighted in red
fig, ax = plt.subplots(figsize=(20,10))

# Focus on dates after Jan 2015
focus = iforest_results[iforest_results.index > '2015-01-01']

# Create list of outlier_dates
outliers = focus[focus['Anomaly'] == 1]

p1 = sns.scatterplot(data=outliers, x = outliers.index, y = "value", ax=ax, color='r')
p2 = sns.lineplot(x = focus.index, y = "value", data=focus, color='b', ax=ax)

plt.title('Number of NYC Taxi passengers by date January - February 2015: Anomalies highlighted')
plt.show()

So the model seems to be indicating that for all of January 2015 we had a large number of highly unusual passenger number patterns. What might have been going on here?

Researching January 2015 in New York brings up many articles about the North American blizzard of January 2015:

The January 2015 North American blizzard was a powerful and severe blizzard that dumped up to 3 feet (910 mm) of snowfall in parts of New England. Originating from a disturbance just off the coast of the Northwestern United States on January 23, it initially produced a light swath of snow as it traveled southeastwards into the Midwest as an Alberta clipper on January 24–25. It gradually weakened as it moved eastwards towards the Atlantic Ocean, however, a new dominant low formed off the East Coast of the United States late on January 26, and rapidly deepened as it moved northeastwards towards southeastern New England, producing pronounced blizzard conditions.

Time-lapse satellite images from the period reveal the severe weather patterns that occurred.

Some photos from the New York area at the time of the Blizzard.

So our model seems to have done very well at detecting this highly unusual pattern in taxi passenger behaviour caused by the blizzard.

5 Conclusion

In this article we have looked at the Pycaret Anomaly Detection module as a potential Python Power Tool for Data Science.

With very little code, this module has helped us detect a well-documented anomaly event using just the default configuration.

Some key advantages of using this are:

  • Quick and easy to use with little code; default parameters can work well
  • The model library is kept up to date with the latest anomaly detection models, making it easier to consider a range of different models quickly
  • Despite being simple and easy to use, the library has many configuration options, as well as extra functionality such as data pre-processing, data visualisation tools, and the ability to save and load models together with their data pipeline easily (see the sketch below)
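
As an example of that last point, here is a minimal sketch of saving our trained model together with its pipeline, reloading it, and scoring data with it. The save_model(), load_model(), and predict_model() functions come from pycaret.anomaly; the file name is a hypothetical choice.


# Save the trained model together with its preprocessing pipeline
save_model(iforest, 'iforest_nyc_taxi')

# Later, or in another session: reload the model and pipeline
loaded_iforest = load_model('iforest_nyc_taxi')

# Score data with the reloaded model - adds Anomaly and Anomaly_Score columns
predictions = predict_model(loaded_iforest, data = data)
predictions.head()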

Certainly from this example, we can see that the Pycaret Anomaly Detection module seems a great candidate as a Python Power Tool for Data Science.
