Feature - CFD-Autoperiod (#16)

* Create CFDAutoperiod as a copy of Autoperiod and implement period hints density clustering * Add tests to CFDAutoperiod * Implement most of the periodicity hint validation, simplify detrending options, and tweak periodogram calls * Finish the preliminary implementation of CFD-Autoperiod and update related tests * Update CFDAutoperiod lowpass filter application logic and add more docstrings * Update Autoperiod tests * Refactor power throshold computation logic to tools package * Update CFDAutoperiod tests * Update CFDAutoperiod tests yet again * Improve CFDAutoperiod performance significantly * Improve CFDAutoperiod performance even further! * Clean CFDAutoperiod a little further * Remove some tests for Autoperiod * Replace SciPy's linear regression with fitting a NumPy first degree Polynomial to improve performance * Tweak some docstrings * Improve Autoperiod performance even further! * Fix indexing in Autoperiod hint validation * Tweak ACF and FFT periodicity detector tests * Simplify detrending options in ACFPeriodicityDetector and FFTPeriodicityDetector * Clean tools code * Remove unused code * Address an possible edge case in hint clustering and make small tweaks in CFDAutoperiod * Add more tests * Update README * Improve tests efficiency through the use of fixtures to load the data
iskandergaba · Oct 11, 2024 · 879641c · 879641c
1 parent 23a8b21
commit 879641c
Show file tree

Hide file tree

Showing 14 changed files with 727 additions and 339 deletions.
diff --git a/README.md b/README.md
@@ -12,7 +12,8 @@
 ## About Pyriodicity
 Pyriodicity provides intuitive and easy-to-use Python implementation for periodicity (seasonality) detection in univariate time series. Pyriodicity supports the following detection methods:
 - [Autocorrelation Function (ACF)](https://otexts.com/fpp3/acf.html)
-- [Autoperiod]( https://doi.org/10.1137/1.9781611972757.40)
+- [Autoperiod](https://doi.org/10.1137/1.9781611972757.40)
+- [CFD-Autoperiod](https://doi.org/10.1007/978-3-030-39098-3_4)
 - [Fast Fourier Transform (FFT)](https://otexts.com/fpp3/useful-predictors.html#fourier-series)
 
 ## Installation
@@ -23,35 +24,28 @@ pip install pyriodicity
 ```
 
 ## Example
-Start by loading a the `co2` timeseries emissions sample data from [`statsmodels`](https://www.statsmodels.org)
+Start by loading a the `co2` emissions sample time series data from [`statsmodels`](https://www.statsmodels.org)
 ```python
-from statsmodels.datasets import co2
-data = co2.load().data
+>>> from statsmodels.datasets import co2
+>>> data = co2.load().data
 ```
 
 You can then resample the data to whatever frequency you want. In this example, we downsample the data to a monthly frequency
 ```python
-data = data.resample("ME").mean().ffill()
+>>> data = data.resample("ME").mean().ffill()
 ```
 
 Use `Autoperiod` to find the list of periods based in this data (if any).
 ```python
-from pyriodicity import Autoperiod
-autoperiod = Autoperiod(data)
-periods = autoperiod.fit()
+>>> from pyriodicity import Autoperiod
+>>> autoperiod = Autoperiod(data)
+>>> autoperiod.fit()
+array([12])
 ```
 
-There are multiple parameters you can play with should you wish to. For example, you can specify a lower percentile value for a more lenient detection
-```python
-autoperiod.fit(percentile=90)
-```
-
-Or increase the number of random data permutations for a better power threshold estimation
-```python
-autoperiod.fit(k=300)
-```
+The detected periodicity length is 12 which suggests a strong yearly seasonality given that the data has a monthly frequency.
 
-Alternatively, you can use other periodicity detection methods such as `ACFPeriodicityDetector` and `FFTPeriodicityDetector` and compare results and performances.
+You can also use `CFDAutoperiod` variant of `Autoperiod` or any other supported periodicity detection method such as `ACFPeriodicityDetector` and `FFTPeriodicityDetector` and compare results and performances.
 
 ## Development Environment Setup
 This project is built and published using [Poetry](https://python-poetry.org). To setup a development environment for this project you can follow these steps:
@@ -85,3 +79,4 @@ poetry export --with test --output requirements-dev.txt
 ## References
 - [1] Hyndman, R.J., & Athanasopoulos, G. (2021) Forecasting: principles and practice, 3rd edition, OTexts: Melbourne, Australia. [OTexts.com/fpp3](https://otexts.com/fpp3). Accessed on 09-15-2024.
 - [2] Vlachos, M., Yu, P., & Castelli, V. (2005). On periodicity detection and Structural Periodic similarity. Proceedings of the 2005 SIAM International Conference on Data Mining. [doi.org/10.1137/1.9781611972757.40](https://doi.org/10.1137/1.9781611972757.40).
+- [3] Puech, T., Boussard, M., D'Amato, A., & Millerand, G. (2020). A fully automated periodicity detection in time series. In Advanced Analytics and Learning on Temporal Data: 4th ECML PKDD Workshop, AALTD 2019, Würzburg, Germany, September 20, 2019, Revised Selected Papers 4 (pp. 43-54). Springer International Publishing. [doi.org/10.1007/978-3-030-39098-3_4](https://doi.org/10.1007/978-3-030-39098-3_4).
diff --git a/pyriodicity/__init__.py b/pyriodicity/__init__.py
@@ -1,7 +1,13 @@
 from .detectors import (
     ACFPeriodicityDetector,
     Autoperiod,
+    CFDAutoperiod,
     FFTPeriodicityDetector,
 )
 
-__all__ = ["ACFPeriodicityDetector", "Autoperiod", "FFTPeriodicityDetector"]
+__all__ = [
+    "ACFPeriodicityDetector",
+    "Autoperiod",
+    "CFDAutoperiod",
+    "FFTPeriodicityDetector",
+]
diff --git a/pyriodicity/detectors/__init__.py b/pyriodicity/detectors/__init__.py
@@ -1,9 +1,11 @@
 from .acf import ACFPeriodicityDetector
 from .autoperiod import Autoperiod
+from .cfd_autoperiod import CFDAutoperiod
 from .fft import FFTPeriodicityDetector
 
 __all__ = [
     "ACFPeriodicityDetector",
     "Autoperiod",
+    "CFDAutoperiod",
     "FFTPeriodicityDetector",
 ]
diff --git a/pyriodicity/detectors/acf.py b/pyriodicity/detectors/acf.py
@@ -1,9 +1,9 @@
-from typing import Callable, Optional, Union
+from typing import Optional, Union
 
 from numpy.typing import ArrayLike, NDArray
-from scipy.signal import argrelmax
+from scipy.signal import argrelmax, detrend
 
-from pyriodicity.tools import acf, apply_window, detrend, to_1d_array
+from pyriodicity.tools import acf, apply_window, to_1d_array
 
 
 class ACFPeriodicityDetector:
@@ -55,7 +55,7 @@ def __init__(self, endog: ArrayLike):
     def fit(
         self,
         max_period_count: Optional[int] = None,
-        detrend_func: Optional[Union[str, Callable[[ArrayLike], NDArray]]] = "linear",
+        detrend_func: Optional[str] = "linear",
         window_func: Optional[Union[str, float, tuple]] = None,
         correlation_func: Optional[str] = "pearson",
     ) -> NDArray:
@@ -66,10 +66,9 @@ def fit(
         ----------
         max_period_count : int, optional, default = None
             Maximum number of periods to look for.
-        detrend_func : str, callable, default = None
-            The kind of detrending to be applied on the series. It can either be
-            'linear' or 'constant' if it the parameter is of 'str' type, or a
-            custom function that returns a detrended series.
+        detrend_func : str, default = 'linear'
+            The kind of detrending to be applied on the signal. It can either be
+            'linear' or 'constant'.
         window_func : float, str, tuple optional, default = None
             Window function to be applied to the time series. Check
             'window' parameter documentation for scipy.signal.get_window
@@ -99,13 +98,18 @@ def fit(
             List of detected periods.
         """
         # Detrend data
-        self.y = self.y if detrend_func is None else detrend(self.y, detrend_func)
+        self.y = self.y if detrend_func is None else detrend(self.y, type=detrend_func)
 
         # Apply window on data
         self.y = self.y if window_func is None else apply_window(self.y, window_func)
 
         # Compute the ACF
-        acf_arr = acf(self.y, len(self.y) // 2, correlation_func)
+        acf_arr = acf(
+            self.y,
+            lag_start=0,
+            lag_stop=len(self.y) // 2,
+            correlation_func=correlation_func,
+        )
 
         # Find the local argmax of the first half of the ACF array
         local_argmax = argrelmax(acf_arr)[0]

diff --git a/pyriodicity/detectors/autoperiod.py b/pyriodicity/detectors/autoperiod.py
@@ -1,11 +1,10 @@
-from typing import Callable, Optional, Union
+from typing import Optional, Union
 
 import numpy as np
 from numpy.typing import ArrayLike, NDArray
-from scipy.signal import argrelmax, periodogram
-from scipy.stats import linregress
+from scipy.signal import argrelmax, detrend, periodogram
 
-from pyriodicity.tools import acf, apply_window, detrend, to_1d_array
+from pyriodicity.tools import acf, apply_window, power_threshold, to_1d_array
 
 
 class Autoperiod:
@@ -22,9 +21,9 @@ class Autoperiod:
     References
     ----------
     .. [1] Vlachos, M., Yu, P., & Castelli, V. (2005).
-    On periodicity detection and Structural Periodic similarity.
-    Proceedings of the 2005 SIAM International Conference on Data Mining.
-    https://doi.org/10.1137/1.9781611972757.40
+       On periodicity detection and Structural Periodic similarity.
+       Proceedings of the 2005 SIAM International Conference on Data Mining.
+       https://doi.org/10.1137/1.9781611972757.40
 
     Examples
     --------
@@ -58,7 +57,7 @@ def fit(
         self,
         k: int = 100,
         percentile: int = 95,
-        detrend_func: Optional[Union[str, Callable[[ArrayLike], NDArray]]] = "linear",
+        detrend_func: Optional[str] = "linear",
         window_func: Optional[Union[str, float, tuple]] = None,
         correlation_func: Optional[str] = "pearson",
     ) -> NDArray:
@@ -73,10 +72,9 @@ def fit(
         percentile : int, optional, default = 95
             Percentage for the percentile parameter used in computing the power
             threshold. Value must be between 0 and 100 inclusive.
-        detrend_func : str, callable, default = None
-            The kind of detrending to be applied on the series. It can either be
-            'linear' or 'constant' if it the parameter is of 'str' type, or a
-            custom function that returns a detrended series.
+        detrend_func : str, default = 'linear'
+            The kind of detrending to be applied on the signal. It can either be
+            'linear' or 'constant'.
         window_func : float, str, tuple optional, default = None
             Window function to be applied to the time series. Check
             'window' parameter documentation for scipy.signal.get_window
@@ -105,101 +103,104 @@ def fit(
             List of detected periods.
         """
         # Detrend data
-        self.y = self.y if detrend_func is None else detrend(self.y, detrend_func)
+        self.y = self.y if detrend_func is None else detrend(self.y, type=detrend_func)
         # Apply window on data
         self.y = self.y if window_func is None else apply_window(self.y, window_func)
 
         # Compute the power threshold
-        p_threshold = self._power_threshold(self.y, k, percentile)
+        p_threshold = power_threshold(self.y, detrend_func, k, percentile)
 
         # Find period hints
-        freq, power = periodogram(self.y, window=None, detrend=None)
-        period_hints = np.array(
+        freq, power = periodogram(self.y, window=None, detrend=False)
+        hints = np.array(
             [
                 1 / f
                 for f, p in zip(freq, power)
                 if f >= 1 / len(freq) and p >= p_threshold
             ]
         )
 
-        # Compute the ACF
-        length = len(self.y)
-        acf_arr = acf(self.y, nlags=length, correlation_func=correlation_func)
-
         # Validate period hints
-        period_hints_valid = []
-        for p in period_hints:
-            q = length / p
-            start = np.floor((p + length / (q + 1)) / 2 - 1).astype(int)
-            end = np.ceil((p + length / (q - 1)) / 2 + 1).astype(int)
-
-            splits = [
-                self._split(np.arange(len(acf_arr)), acf_arr, start, end, i)
-                for i in range(start + 2, end)
-            ]
-            line1, line2, _ = splits[
-                np.array([error for _, _, error in splits]).argmin()
-            ]
-
-            if line1.slope > 0 > line2.slope:
-                period_hints_valid.append(p)
+        valid_hints = [
+            h for h in hints if self._is_hint_valid(self.y, h, correlation_func)
+        ]
 
-        period_hints_valid = np.array(period_hints_valid)
-
-        # Return the closest ACF peak for each valid period hint
-        local_argmax = argrelmax(acf_arr)[0]
+        # Return the closest ACF peak to each valid period hint
+        length = len(self.y)
+        hint_ranges = [
+            np.arange(
+                np.floor((h + length / (length / h + 1)) / 2 - 1),
+                np.ceil((h + length / (length / h - 1)) / 2 + 1),
+                dtype=int,
+            )
+            for h in valid_hints
+        ]
+        acf_arrays = [
+            acf(
+                self.y,
+                lag_start=r[0],
+                lag_stop=r[-1],
+                correlation_func=correlation_func,
+            )
+            for r in hint_ranges
+        ]
         return np.array(
             list(
                 {
-                    min(local_argmax, key=lambda x: abs(x - p))
-                    for p in period_hints_valid
+                    r[0] + min(argrelmax(arr)[0], key=lambda x: abs(x - h))
+                    for h, r, arr in zip(valid_hints, hint_ranges, acf_arrays)
                 }
             )
         )
 
     @staticmethod
-    def _power_threshold(y: ArrayLike, k: int, p: int) -> float:
+    def _is_hint_valid(
+        y: ArrayLike,
+        hint: float,
+        correlation_func: str,
+    ) -> bool:
         """
-        Compute the power threshold as the p-th percentile of the maximum
-        power values of the periodogram of k permutations of the data.
+        Validate the period hint.
 
         Parameters
         ----------
         y : array_like
             Data to be investigated. Must be squeezable to 1-d.
-        k : int
-            The number of times the data is randomly permuted to compute
-            the maximum power values.
-        p : int
-            The percentile value used to compute the power threshold.
-            It determines the cutoff point in the sorted list of the maximum
-            power values from the periodograms of the permuted data.
-            Value must be between 0 and 100 inclusive.
-
-        See Also
-        --------
-        scipy.signal.periodogram
-            Estimate power spectral density using a periodogram.
+        hint : float
+            The period hint to be validated.
+        correlation_func : str, default = 'pearson'
+            The correlation function to be used to calculate the ACF of the series
+            or the signal. Possible values are ['pearson', 'spearman', 'kendall'].
 
         Returns
         -------
-        float
-            Power threshold of the target data.
+        bool
+            Whether the period hint is valid.
         """
-        max_powers = []
-        while len(max_powers) < k:
-            _, power_p = periodogram(
-                np.random.permutation(y), window=None, detrend=None
-            )
-            max_powers.append(power_p.max())
-        max_powers.sort()
-        return np.percentile(max_powers, p)
+        length = len(y)
+        hint_range = np.arange(
+            np.floor((hint + length / (length / hint + 1)) / 2 - 1),
+            np.ceil((hint + length / (length / hint - 1)) / 2 + 1),
+            dtype=int,
+        )
+        acf_arr = acf(
+            y,
+            lag_start=hint_range[0],
+            lag_stop=hint_range[-1],
+            correlation_func=correlation_func,
+        )
+        splits = [
+            Autoperiod._split(hint_range, acf_arr, 0, len(hint_range), i)
+            for i in range(1, len(hint_range) - 1)
+        ]
+        line1, line2, _ = splits[np.array([error for _, _, error in splits]).argmin()]
+        return line1.coef[-1] > 0 > line2.coef[-1]
 
     @staticmethod
     def _split(x: ArrayLike, y: ArrayLike, start: int, end: int, split: int) -> tuple:
         """
         Approximate a function at [start, end] with two line segments at
-        [start, split - 1] and [split, end].
+        [start, split] and [split, end].
 
         Parameters
         ----------
@@ -221,22 +222,27 @@ def _split(x: ArrayLike, y: ArrayLike, start: int, end: int, split: int) -> tupl
 
         Returns
         -------
-        linregress
+        numpy.polynomial.Polynomial
             The first line segment.
-        linregress
+        numpy.polynomial.Polynomial
             The second line segment.
         float
-            The error of the approximation.
+            The approximation error.
         """
+        if not start < split < end:
+            raise ValueError(
+                "Invalid start, split, and end values ({}, {}, {})".format(
+                    start, split, end
+                )
+            )
         x1, y1, x2, y2 = (
-            x[start:split],
-            y[start:split],
-            x[split : end + 1],
-            y[split : end + 1],
-        )
-        line1 = linregress(x1, y1)
-        line2 = linregress(x2, y2)
-        error = np.sum(np.abs(y1 - (line1.intercept + line1.slope * x1))) + np.sum(
-            np.abs(y2 - (line2.intercept + line2.slope * x2))
+            x[start : split + 1],
+            y[start : split + 1],
+            x[split:end],
+            y[split:end],
         )
-        return line1, line2, error
+        line1, stats1 = np.polynomial.Polynomial.fit(x1, y1, deg=1, full=True)
+        line2, stats2 = np.polynomial.Polynomial.fit(x2, y2, deg=1, full=True)
+        resid1 = 0 if len(stats1[0]) == 0 else stats1[0][0]
+        resid2 = 0 if len(stats2[0]) == 0 else stats2[0][0]
+        return line1.convert(), line2.convert(), resid1 + resid2