Python Power Tools for Data Science - Pycaret Anomaly Detection
In Python Power Tools for Data Science articles I look at python tools that help automate or simplify common tasks a Data Scientist would need to perform. In this article I look at the Pycaret Anomaly Detection module and see how it can help automate anomaly detection.
Categories: python-power-tools, pycaret
Author: Pranath Fernando
Published: January 2, 2022
1 Python Power Tools for Data Science
In this series of articles, Python Power Tools for Data Science, I will be looking at python tools that can make a significant improvement to common Data Science tasks. In particular, Python Power Tools are python tools that can significantly automate or simplify common tasks a Data Scientist would need to perform.
Automation and simplification of common tasks can bring many benefits, such as:
Less time needed to complete tasks
Reduction of mistakes due to less complex code
Improved readability and understanding of code
Increased consistency of approach to different problems
Easier reproducibility, verification, and comparison of results
2 Pycaret Anomaly Detection Module
Pycaret is a low code python library that aims to automate many tasks required for machine learning. Tasks that would usually take hundreds of lines of code can often be replaced with just a couple of lines. It was inspired by the Caret library in R.
In comparison with the other open-source machine learning libraries, PyCaret is an alternate low-code library that can be used to replace hundreds of lines of code with few words only. This makes experiments exponentially fast and efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and many more. (Pycaret Documentation)
Pycaret has different modules specialised for different machine learning use-cases; these include Classification, Regression, Clustering, Anomaly Detection, Natural Language Processing, and Association Rule Mining.
In this article we will use the Anomaly Detection module of Pycaret, which is an unsupervised machine learning module used for identifying rare items, events, or observations. It has a dozen or so algorithms and plots to analyse the results, plus many other features.
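3 The Dataset

For this example we will use the NYC taxi passenger dataset from the Numenta Anomaly Benchmark (NAB), which records passenger counts at 30-minute intervals from July 2014 to January 2015. First, the imports the code in this article relies on (a minimal sketch - the original import cell is not shown, and this assumes pandas, matplotlib, seaborn and pycaret are installed):

# Core data and plotting libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The Pycaret anomaly module exposes setup(), models(), create_model(), assign_model() etc.
from pycaret.anomaly import *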
# Download taxi passenger data
data = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv')
data['timestamp'] = pd.to_datetime(data['timestamp'])

# Show first few rows
data.head()
   timestamp            value
0  2014-07-01 00:00:00  10844
1  2014-07-01 00:30:00   8127
2  2014-07-01 01:00:00   6210
3  2014-07-01 01:30:00   4656
4  2014-07-01 02:00:00   3820
# Show last few rows
data.tail()
       timestamp            value
10315  2015-01-31 21:30:00  24670
10316  2015-01-31 22:00:00  25721
10317  2015-01-31 22:30:00  27309
10318  2015-01-31 23:00:00  26591
10319  2015-01-31 23:30:00  26288
# Plot dataset
plt.figure(figsize=(20,10))
sns.lineplot(x="timestamp", y="value", data=data)
plt.title('Number of NYC Taxi passengers by date July 2014 - January 2015')
plt.show()
We can't directly use raw timestamp data in anomaly detection models; we first need to convert it into numerical features such as day, hour, and week of year before we can use it - so let's do this.
# Set timestamp to index
data.set_index('timestamp', drop=True, inplace=True)

# Resample timeseries from half-hourly to hourly
data = data.resample('H').sum()

# Create more features from the date index
data['day'] = [i.day for i in data.index]
data['day_name'] = [i.day_name() for i in data.index]
data['day_of_year'] = [i.dayofyear for i in data.index]
data['week_of_year'] = [i.weekofyear for i in data.index]
data['hour'] = [i.hour for i in data.index]
data['is_weekday'] = [i.isoweekday() for i in data.index]  # ISO day of week: Monday=1 ... Sunday=7
data.head()
timestamp            value  day  day_name  day_of_year  week_of_year  hour  is_weekday
2014-07-01 00:00:00  18971    1  Tuesday           182            27     0           2
2014-07-01 01:00:00  10866    1  Tuesday           182            27     1           2
2014-07-01 02:00:00   6693    1  Tuesday           182            27     2           2
2014-07-01 03:00:00   4433    1  Tuesday           182            27     3           2
2014-07-01 04:00:00   4379    1  Tuesday           182            27     4           2
4 Pycaret workflow
4.1 Setup
The Pycaret setup() function is the first part of the workflow and always needs to be performed; it takes our data in the form of a pandas dataframe and performs a number of tasks to get it ready for the machine learning pipeline. Optional pre-processing steps include the following (a sketch of a setup() call follows after the list):
Power transforms (default: false): transform the data to make it more Gaussian; available methods include yeo-johnson and quantile
PCA (default: false): principal components analysis, to reduce the dimensionality of the data down to a specified number of components
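A minimal sketch of a setup() call (assuming the Pycaret 2.x anomaly API; session_id and the flags shown are optional, and the values here are for illustration only):

# Initialise the anomaly detection pipeline on our dataframe
exp = setup(data,
            session_id=123,        # fix the random seed for reproducibility
            transformation=False,  # set True to apply a power transform (default method: yeo-johnson)
            pca=False)             # set True to reduce dimensionality with PCA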
4.2 Selecting and training a model
At the time of writing this article, there are 12 different anomaly detection models available within Pycaret, which we can display with the models() function.
# Check list of available models
models()
ID         Name                               Reference
abod       Angle-base Outlier Detection       pyod.models.abod.ABOD
cluster    Clustering-Based Local Outlier     pyod.models.cblof.CBLOF
cof        Connectivity-Based Local Outlier   pyod.models.cof.COF
iforest    Isolation Forest                   pyod.models.iforest.IForest
histogram  Histogram-based Outlier Detection  pyod.models.hbos.HBOS
knn        K-Nearest Neighbors Detector       pyod.models.knn.KNN
lof        Local Outlier Factor               pyod.models.lof.LOF
svm        One-class SVM detector             pyod.models.ocsvm.OCSVM
pca        Principal Component Analysis       pyod.models.pca.PCA
mcd        Minimum Covariance Determinant     pyod.models.mcd.MCD
sod        Subspace Outlier Detection         pyod.models.sod.SOD
sos        Stochastic Outlier Selection       pyod.models.sos.SOS
We will choose the Isolation Forest model. Isolation Forest is similar to Random Forest in that it is an algorithm based on multiple decision trees; however, rather than aiming to model normal data points, Isolation Forest explicitly tries to isolate anomalous data points.
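For intuition, here is a minimal sketch of the underlying algorithm using scikit-learn's IsolationForest (used here purely for illustration - as the table above shows, Pycaret actually wraps the pyod implementation):

import numpy as np
from sklearn.ensemble import IsolationForest

# A 2D cluster of normal points with two injected outliers
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               [[8.0, 8.0], [-9.0, 7.0]]])

# contamination plays the same role as Pycaret's default 5% proportion of outliers
clf = IsolationForest(contamination=0.05, random_state=0).fit(X)

print(clf.predict(X)[-2:])            # -1 = anomaly, 1 = normal
print(clf.decision_function(X)[-2:])  # lower scores = more anomalous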
There are many configuration hyperparameters for this model, which we can see by creating the model and printing its details, as below.
# Create model and print configuration hyper-parameters
iforest = create_model('iforest')
print(iforest)
One of the key configuration options is contamination, which is the proportion of outliers we believe is in the dataset. This is used when fitting the model to define the threshold on the scores of the samples. By default this is set to 5%, i.e. 0.05.
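In Pycaret this proportion is set via the fraction parameter of create_model() (an assumption based on the Pycaret 2.x API, where fraction is passed through to the underlying estimator's contamination):

# Hypothetical example: treat 10% of points as outliers instead of the default 5%
iforest_10pct = create_model('iforest', fraction=0.10)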
We will now train and assign the model to the dataset.
# Train and assign model to dataset
iforest_results = assign_model(iforest)
iforest_results.head()
timestamp            value  day  day_name  day_of_year  week_of_year  hour  is_weekday  Anomaly  Anomaly_Score
2014-07-01 00:00:00  18971    1  Tuesday           182            27     0           2        0      -0.015450
2014-07-01 01:00:00  10866    1  Tuesday           182            27     1           2        0      -0.006367
2014-07-01 02:00:00   6693    1  Tuesday           182            27     2           2        0      -0.010988
2014-07-01 03:00:00   4433    1  Tuesday           182            27     3           2        0      -0.017091
2014-07-01 04:00:00   4379    1  Tuesday           182            27     4           2        0      -0.017006
This adds 2 new columns to the dataset: an Anomaly column, which gives a binary value indicating whether a datapoint is considered an anomaly or not, and an Anomaly_Score column, which has a float value as a measure of how anomalous a datapoint is (here, higher scores are more anomalous).
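As a quick sanity check we can count how many points were flagged - with the default 5% contamination we would expect roughly 5% of the rows to be labelled 1 (a minimal sketch):

# Count normal (0) vs anomalous (1) datapoints
iforest_results['Anomaly'].value_counts()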
4.3 Model Evaluation
So let's now evaluate our model by examining the datapoints it has labelled as anomalies.
# Show dates for first few anomalies
iforest_results[iforest_results['Anomaly'] == 1].head()
timestamp   value  day  day_name  day_of_year  week_of_year  hour  is_weekday  Anomaly  Anomaly_Score
2014-07-13  50825   13  Sunday            194            28     0           7        1       0.002663
2014-07-27  50407   27  Sunday            208            30     0           7        1       0.009264
2014-08-03  48081    3  Sunday            215            31     0           7        1       0.003045
2014-09-28  53589   28  Sunday            271            39     0           7        1       0.004440
2014-10-05  48472    5  Sunday            278            40     0           7        1       0.000325
# Plot data with anomalies highlighted in red
fig, ax = plt.subplots(figsize=(20,10))

# Create list of outlier dates
outliers = iforest_results[iforest_results['Anomaly'] == 1]
p1 = sns.scatterplot(data=outliers, x=outliers.index, y="value", ax=ax, color='r')
p2 = sns.lineplot(x=iforest_results.index, y="value", data=iforest_results, color='b', ax=ax)
plt.title('Number of NYC Taxi passengers by date July 2014 - January 2015: Anomalies highlighted')
plt.show()
So we can see the model has labelled a few isolated points as anomalies between July 2014 and the end of 2014. However, around the end of 2014 and the start of 2015 we can see a huge number of anomalies, in particular for all of January 2015.
Let’s focus in on the period from January 2015.
# Plot data with anomalies highlighted in red
fig, ax = plt.subplots(figsize=(20,10))

# Focus on dates after Jan 2015
focus = iforest_results[iforest_results.index > '2015-01-01']

# Create list of outlier dates
outliers = focus[focus['Anomaly'] == 1]
p1 = sns.scatterplot(data=outliers, x=outliers.index, y="value", ax=ax, color='r')
p2 = sns.lineplot(x=focus.index, y="value", data=focus, color='b', ax=ax)
plt.title('Number of NYC Taxi passengers by date January - February 2015: Anomalies highlighted')
plt.show()
So the model seems to be indicating that for all of January 2015 we had a large number of highly unusual passenger number patterns. What might have been going on here?
The January 2015 North American blizzard was a powerful and severe blizzard that dumped up to 3 feet (910 mm) of snowfall in parts of New England. Originating from a disturbance just off the coast of the Northwestern United States on January 23, it initially produced a light swath of snow as it traveled southeastwards into the Midwest as an Alberta clipper on January 24–25. It gradually weakened as it moved eastwards towards the Atlantic Ocean, however, a new dominant low formed off the East Coast of the United States late on January 26, and rapidly deepened as it moved northeastwards towards southeastern New England, producing pronounced blizzard conditions.
Time-lapse satellite images from the period reveal the severe weather patterns that occurred.
Some photos from the New York area at the time of the Blizzard.
So our model seems to have detected this highly unusual pattern in taxi passenger behaviour, caused by the blizzard event, very well.
With very little code, this module has helped us detect a well documented anomaly event even just using the default configuration.
Some key advantages of using this module are:
Quick and easy to use with little code; the default parameters can work well
The model library is kept up to date with the latest anomaly detection models, which can help make it easier to consider a range of different models quickly
Despite being simple and easy to use, the library has many configuration options, as well as extra functionality such as data pre-processing, data visualisation tools, and the ability to save and load models together with the data pipeline easily (see the short sketch after this list)
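To illustrate that last point, a minimal sketch of saving and reloading a trained model together with its pipeline, using Pycaret's save_model()/load_model() (the file name here is hypothetical):

# Save the trained model plus its transformation pipeline to disk
save_model(iforest, 'iforest_nyc_taxi')

# Reload it later, e.g. to score new data
loaded_iforest = load_model('iforest_nyc_taxi')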
Certainly from this example, we can see that the Pycaret Anomaly Detection module seems a great candidate as a Python Power Tool for Data Science.