This project provides python scripts for modeling trading data using the Negative Binomial Distribution (NBD). Specifically, it performs n-gram analysis on trading data to measure significant n-gram patterns. The NBD parameters are extracted, analyzed, and significant n-grams are filtered to understand patterns in price data, with the end goal of calculating the mean occurrence of these patterns for each asset. The aim of the program is to automatically extract and generate interesting features that are based on price dynamics within trading sessions during designated period (called 'day batches', size of batch is set in EPOCHS var).
Программа для автоматического анализа данных торговых сессий с использованием отрицательного биномиального распределения (NBD): поиск значимых категориальных рядов (н-грамм) заданного размера на основе анализа торгов.
- Project Overview
- Features
- Folder Structure
- Setup
- Usage
- Example Data
- Function Descriptions
- Configuration
- License
The code:
- Categorizes price data into n-grams: N-grams of the specified size are generated from the price data.
- Applies the Negative Binomial Distribution: Each n-gram is tested for goodness-of-fit to the NBD, and parameters are extracted.
- Filters significant n-grams: Only n-grams with specific NBD parameters are retained based on a configurable significance filter.
- Calculates mean occurrences: For each asset (one input file per asset), the code computes the mean occurrence of significant n-grams, returning these as outputs.
- Dynamic Data Loading: Reads all CSV files in the specified folder.
- Flexible Epoch Batching: Allows configurable batching for temporal analysis.
- N-Gram Analysis: Customizable n-gram size to capture various data aggregation levels.
- Negative Binomial Distribution Fitting: Fits n-grams to the NBD and extracts parameters for statistical analysis.
- Significance Filtering: Configurable filtering to retain only significant n-grams based on a Euclidean distance threshold.
- Result Output: Generates CSVs summarizing mean occurrences of significant n-grams features for each asset as a time series
- nbd_results.csv: Mean ngram occurrence features for for all assets together in one file (so that you can profile or cluster assets).
scr/nbd_model.py
: Module with theModel
class implementing NBD fitting logic./input_data
: Folder containing input CSV files, one per asset./results_nbd_mean
: Folder for storing results in CSV format.
- Python 3.x
- Pandas, NumPy
- Ensure the input data files are in the
results_prepared
directory.
Install the required packages:
pip install numpy pandas
In the main script, set paths for input and output folders, and configure:
- NGRAM_SIZE: Set the n-gram size for data aggregation (e.g., 1 for 1-grams, 2 for 2-grams).
- KLJUV_VALUE: NBD params Euclidean distance threshold to filter aggregated n-grams.
- EPOCHS: Number of days for batching data (you can start from 1 day)
Run the script as follows:
python main.py
Each input CSV file should contain the following columns:
future_timestamp | symbol | price | 30_day_hourly_future_return |
---|---|---|---|
2021-01-01 00:00 | ABC | 100.5 | 0.002 |
- future_timestamp: Timestamp for the data point.
- symbol: Asset identifier.
- price: Price at the given timestamp.
- 30_day_hourly_future_return: Future return (used as a target variable in NBD fitting).
You can modify headers in the respective part of the main program.
Reads and combines CSV files from a given folder into a single DataFrame, processing each file for the relevant columns.
Divides date ranges into epochs as defined in the configuration, batching data for analysis.
Processes time series data to estimate NBD parameters for each n-gram in each asset's data.
Filters the NBD parameters to retain only those meeting the Euclidean distance threshold.
Collects and aggregates results for each asset by computing the mean occurrences of significant n-grams.
Saves a summary of the results to a CSV file with each asset's n-gram pattern occurrences.
Saves detailed results with the mean values of each significant n-gram for each asset.
Modify constants at the beginning of the script to customize:
- Data paths: Adjust
INPUT_DATA_FOLDER
andRESULTS_FOLDER
. - Time frame: Define
START_DATE
andEND_DATE
. - Analysis settings: Configure
NGRAM_SIZE
,KLJUV_VALUE
, andEPOCHS
.
This project is licensed under the MIT License.
Модель отрицательно биномиального распределения в анализе категориальных последовательностей