CTR Anomaly Detection

Introduction

Click-through rate (CTR) is the ratio of users who click on a specific link to the number of total users who view a page, email, or advertisement. It is commonly used to measure the success of an online advertising campaign for a particular website as well as the effectiveness of email campaigns.

Wikipedia

The main goal of this repository is to introduce a simple approach to CTR anomaly detection.

Depending on the available data, several approaches to anomaly detection are possible. If the data is labeled (each point in time is marked as anomaly or not anomaly), supervised learning can be used. Anomaly detection then becomes a classification task to which logistic regression, random forest, SVM, boosting, RNNs, etc. can be applied. In that case, pay attention to:

Data imbalance, because anomalies are usually rare (less than 5% of all available data), and
Cross-validation through time, so that a model is always validated on data later than the data it was trained on.
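
The two points above can be illustrated with a small sketch. It is not part of the repository: the function names expanding_window_splits and balanced_class_weights, the fold sizes, and the sample labels are all assumptions made for illustration. The weights follow the common "balanced" heuristic, n_samples / (n_classes * class_count), and the splits use an expanding training window so that every test block lies strictly after its training block.

```python
from collections import Counter


def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices); each test block follows its training block in time."""
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield range(0, train_end), range(train_end, test_end)


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (len(counts) * cnt) for cls, cnt in counts.items()}


# Hypothetical labels in time order: 1 marks an anomalous hour, 0 a normal one.
labels = [0] * 95 + [1] * 5

# The rare class gets a much larger weight than the common one.
weights = balanced_class_weights(labels)

# Three folds, starting with 40 training points and growing by 20 each time.
splits = [(list(tr), list(te)) for tr, te in expanding_window_splits(len(labels), 3, 40)]
```

The resulting weights could be passed to any classifier that accepts per-class weights, and the index pairs can be used to slice time-ordered feature and label arrays.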

If labels are not provided, unsupervised techniques should be applied. There are several such methods, but the general approach is similar: predict a value and compare it with the realized value. If the residual falls outside some threshold in either direction (that is, its absolute value exceeds the threshold), the point is treated as an anomaly.

There are several ways to make the prediction: a moving average (simple or weighted) over the last few hours, days, or same weekdays; an ARIMA model; Prophet; the Seasonal Hybrid Extreme Studentized Deviate (S-H-ESD) technique; and many others.
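
As one illustration of the moving-average family, the sketch below predicts each hour as the average of the same hour on the previous few days. It is an assumption-laden example rather than the repository's code: the function name seasonal_moving_average, its parameters, and the synthetic series are all hypothetical.

```python
from statistics import mean


def seasonal_moving_average(series, season=24, n_seasons=3):
    """Predict each point as the mean of the same position in the previous n_seasons seasons.

    Returns a list aligned with `series`; entries without enough history are None.
    """
    predictions = []
    for t in range(len(series)):
        # Indices of the same hour one, two, ... n_seasons days earlier.
        lags = [t - k * season for k in range(1, n_seasons + 1)]
        if lags[-1] < 0:
            predictions.append(None)
        else:
            predictions.append(mean(series[i] for i in lags))
    return predictions


# Hypothetical hourly CTR with a perfectly repeating daily pattern over four days.
daily_pattern = [0.10 + 0.01 * h for h in range(24)]
ctr = daily_pattern * 4

pred = seasonal_moving_average(ctr, season=24, n_seasons=3)
```

On this idealized series the prediction for the fourth day matches the observed values, so every residual is zero; on real data the residuals are what gets compared against the threshold.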

Requirements

Python 3+
pandas 1.2.1+
numpy 1.19.2+
seaborn 0.11.1+
matplotlib 3.3.2+
scipy 1.3.1+

Structure

There are two notebooks:

The Exploratory Data Analysis notebook covers the data description, variable preparation, and CTR calculation and visualization.
The Time series anomaly detection notebook uses a moving average for prediction and 1.5 standard deviations as the threshold.
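
A minimal sketch of this idea with pandas follows. It is not the notebook's implementation: the function name detect_anomalies, the use of a rolling standard deviation for the band, and the one-step shift (so that a point is never part of its own prediction window) are assumptions, and the hourly series is synthetic with one injected spike.

```python
import numpy as np
import pandas as pd


def detect_anomalies(ctr, period=24, n_std=1.5):
    """Flag points farther than n_std rolling standard deviations from the moving average.

    The window covers the `period` points before each observation.
    """
    history = ctr.shift(1).rolling(window=period)
    ma = history.mean()
    band = n_std * history.std()
    return pd.DataFrame({
        "ctr": ctr,
        "ma": ma,
        "lower": ma - band,
        "upper": ma + band,
        # Comparisons against NaN are False, so the warm-up period is never flagged.
        "anomaly": (ctr - ma).abs() > band,
    })


# Synthetic hourly CTR: a daily sine pattern over four days with one injected spike.
hours = np.arange(96)
values = 0.17 + 0.02 * np.sin(2 * np.pi * hours / 24)
values[80] += 0.10
index = pd.date_range("2014-10-21", periods=96, freq=pd.Timedelta(hours=1))
ctr = pd.Series(values, index=index)

result = detect_anomalies(ctr, period=24, n_std=1.5)
flagged = result[result["anomaly"]]
```

The lower and upper columns can be plotted as a band around the moving average, with the rows in flagged marked as points outside it.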

Data

The underlying data is derived from a former Kaggle competition on predicting click-through rates (CTR). CTR is defined as clicks divided by impressions; it measures how often advertisements are clicked relative to how often they are shown. The data is available on the Kaggle competition page.

CTR over time

The data set includes many dimensional fields with categorical information about the environment (device, location, etc.), but we concentrate only on the fields relevant to CTR and the time series:

click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
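
The two fields above are enough to build an hourly CTR series. The sketch below shows one way to do it with pandas; the sample rows and the output column names (impressions, clicks, ctr) are made up for illustration, and each row is assumed to be one impression.

```python
import pandas as pd

# Hypothetical sample rows; the real data set has many more rows and columns.
df = pd.DataFrame({
    "click": [0, 1, 0, 0, 1, 1],
    "hour": [14091123, 14091123, 14091123, 14091200, 14091200, 14091200],
})

# Parse YYMMDDHH into timestamps (timezone-naive here; the source states the values are UTC).
df["hour"] = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")

# One row per impression: count rows for impressions, sum the 0/1 flag for clicks.
hourly = df.groupby("hour")["click"].agg(impressions="count", clicks="sum")
hourly["ctr"] = hourly["clicks"] / hourly["impressions"]
```

The resulting hourly["ctr"] series, indexed by timestamp, is the kind of input a moving-average detector works on.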

Results

The simple moving average approach works quite well for anomaly detection. By varying the MA period in the plot_anmolies() function, you can try to find the optimal period value. Because the data is hourly, I recommend a period that is a multiple of 24, so that each window covers whole days. To find fewer but more robust anomalies, use a period of 48, 72, or more.

MingalievDinar/adverity

Folders and files

Latest commit

History

Repository files navigation

CTR Anomaly Detection

Introduction

Requirements

Structure

Data

CTR over time

Results

About

Topics

Resources

License

Stars

Watchers

Forks

Languages