Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FR] Estimations of ε [$100] #67

Open
Datseris opened this issue Jan 26, 2019 · 14 comments
Open

[FR] Estimations of ε [$100] #67

Datseris opened this issue Jan 26, 2019 · 14 comments
Labels
bounty good first issue An easy issue, suitable for new contributors. wanted feature Feature we would like to have!

Comments

@Datseris
Copy link
Member

Datseris commented Jan 26, 2019

Would be great if we had a function with various methods to estimate ε , the threshold of a recurrence plot.

This could work similarly with estimate_delay which gives delay times for timeseries.

Different methods are described in the book from N. Marwan and citations within, see chapter 1 (we cite the book in most docs). Some of these methods are actually very easy and straightforward and would be a great issue for newcomer contributor.


There is a $100 open bounty on this issue. Add to the bounty at Bountysource.

@Datseris Datseris added good first issue An easy issue, suitable for new contributors. wanted feature Feature we would like to have! labels Jan 26, 2019
@pucicu
Copy link
Contributor

pucicu commented Jan 26, 2019 via email

@Datseris
Copy link
Member Author

Awesome, please do so!

@Datseris Datseris changed the title [FR] Estimations of ε [FR] Estimations of ε [$100] Mar 16, 2019
@Datseris
Copy link
Member Author

This issue now has a 100$ bounty on it. (Which means anyone who solves it gets the 100$)

@Datseris
Copy link
Member Author

@yuxi-liu-wired
Copy link
Member

yuxi-liu-wired commented Feb 11, 2020

Note: The https://aip.scitation.org/doi/10.1063/1.5024914 paper is extremely wordy, and basically says that "counting the shortest N% of distances in the recurrence plot as recurring" is a good way to ensure some Recurrence Quantification Analysis less sensitive to embedding dimensions.

That is, keeping recurrence rate constant as embedding dimensions increase allows other RQA measures to be stabler.

@yuxi-liu-wired
Copy link
Member

I read chapter 1 of Webber and Marwan. It seems that this question is solved? Just pass kwarg to the integrator, and it's done. The only notable

Thus possibly this bounty is outdated.

@Datseris
Copy link
Member Author

@yuxiliu1995 can you please give more detail? What we are talking about here is having a method (a function) that you give in your dataset and then function returns an optimal value for ε, much like the functions estimate_delay which return an optimal value for e.g. τ.

@yuxi-liu-wired
Copy link
Member

yuxi-liu-wired commented Feb 12, 2020

I read Chapter 1 of Marwan, https://aip.scitation.org/doi/10.1063/1.5024914 , as well as many cited papers. It seems there is no good way to do this other than to try a few and see what works. It is just a collection of gut feelings, case studies, and appeal to numerical simulation.

Not only is it unmathematical, it is also simple. There are only a few methods for choosing ε:

  • ε = p * std, with 0.2 < p < 0.4

  • ε < 0.1 * mean.

  • ε = p * median with p ~ 0.2.

  • ε = p * diameter, with p < 0.1, or p ~ 0.05, or 0.2 < p < 0.4.

  • ε = p-quantile, with 0.01< p < 0.05.

  • ε > 5 * std of noise, in the case where the measurement device has a Gaussian noise distribution with a known std.

  • ε = argmin |1 − Np(ε)/Nn(ε)|, where

    • Np is the number of "significant peaks" in the histogram, where each bin counts the total number of black pixels in a diagonal line in the recurrence plot. ("significant" is not well-defined)
    • Nn is the average number of neighbours, that is, the average number of black pixels in each vertical column in the recurrence plot.
  • Draw RR vs ε on a loglog plot, and select ε in the middle of a linear region, since that's where a scaling law is working, and the dynamics is presumably found.


Relevant parts of Chapter 1 of Marwan:

The recurrence threshold is a crucial parameter in the RP analysis. Although several works have contributed to this discussion [9, 22, 24, 25], a general and systematic study on the recurrence threshold selection remains an open task for future work. Nevertheless, recurrence threshold selection is a trade-off of to have a threshold as small as possible but at the same time a sufficient number of recurrences and recurrence structures. In general, the optimal choice of σ depends on the application and the experimental conditions (e.g., noise), but in all cases it is desirable that the smallest threshold possible is chosen.

  • [9] is very similar to this chapter.
    • if there's observational noise, then, citing [22], ε > 5 * σ(dataset).
    • for (quasi-)periodic processes, citing [24], select ε = argmin |1 − Np(ε)/Nn(ε)|, where
      • Np is the number of "significant peaks" in the histogram, where each bin counts the total number of black pixels in a diagonal line in the recurrence plot, and "significant" means "more than average height of the histogram plus three times the standard deviation [of the height of the bars in the histogram? The authors did not specify]".
      • Nn is the average number of neighbours, that is, the average number of black pixels in each vertical column in the recurrence plot.
    • select ε so that the number of neighbours for every point of the trajectory is constant (fixed amount of nearest neighbours). This however breaks the symmetry of recurrence plot, and creates a generalized recurrence plot instead. Section 3.2.4 lists more generalized recurrence plots which can also be read here.
  • [25] finds that setting ε = (0.2~0.4) * diameter(data) maximizes discrimination between chaotic signal and noise using recurrence plot. The authors also says in the conclusion that ε =0.05 * diameter(data) works well.

A “rule of thumb” for the choice of the threshold σ is to select it as a few per cent (not larger than 10%) of the maximum phase space diameter [26–28]. For classification purpose and signal detection, a better choice is to select σ between 20 and 40% of the signal’s standard deviation [25]. However, the influence of noise can necessitate a larger threshold, because noise would distort any existing structure in the RP. Higher threshold may preserve these structures [9]. Suggestions from literature show that this threshold should be a few per cent of the maximum phase space diameter [26] and should not exceed 10% of the mean or the maximum phase space diameter [27,28].

  • ε = p * diameter(dataset), with p < 0.1
  • ε = p * σ(dataset), with 0.2 < p < 0.4
  • ε < 0.1 * mean distance of dataset.
  • [26] simply stated that "The threshold e is fixed, typically at a few percent of the diameter of the attractor". It says "attractor" because this paper only concerns with reconstructing attractors in series that are generated near an attractor.
  • [27] contains no useful information on selecting ε.
  • [28] is an early (1992) paper that introduced RQA. It simply tested a few RQA on Lorenz system and a few others. It only said "we fix r values as small as possible (typically no greater than 10% of the normalized mean distance of the first embedding) relative to the noise level". Here "r value" is ε.

Using the recurrence point density of the RP, the threshold can be chosen from the analysis of this measure in respect to a changing threshold [7]. The threshold can then be found by looking for a scaling region in the recurrence point density. However, this may not work for nonstationary data. For this case Zbilut et al. [7] have suggested to choose σ so that the recurrence point density is approximately 1%. For noisy periodic processes, [24] have suggested to use the diagonal structures within the RP in order to determine an optimal threshold. Their criterion minimizes the fragmentation and thickness of the diagonal lines in respect to the threshold.

  • [7]
    • calculate RR vs ε, plot on a loglog plot, and select ε in the middle of a linear region, since that's where a scaling law is working, and the dynamics is presumably found.
    • If the data are extremely nonstationary, a scaling region may not be found. Then select ε so that RR ~ 0.01.
    • Look at the recurrence plot: if it looks sparse, increase ε.
    • Windowed RQA is especially informative. If a given window fails to achieve RR = 1%, ε should be increased, or the window enlarged.
  • [24] see above.

Recent studies about RPs in our group have revealed a more exact criterion for choosing this threshold. This criterion takes into account that a measurement of a process is a composition of the real signal and some observational noise with standard deviation. In order to get similar results by using RPs, a threshold has to be chosen which is five times larger than the standard deviation of the observational noise [22]. This criterion holds for a wide class of processes.

Nothing new here.

@yuxi-liu-wired
Copy link
Member

yuxi-liu-wired commented Feb 12, 2020

The first five methods are easily implemented in RecurrenceMatrix by using scale keyword, and there's no need for a separate function that selects ε upon receiving a dataset.

  • ε = p * std, with 0.2 < p < 0.4
  • ε < 0.1 * mean.
  • ε = p * median with p ~ 0.2.
  • ε = p * maximum, with p < 0.1, or p ~ 0.05, or 0.2 < p < 0.4. Note that what is often called diameter in the papers is just maximum.
  • ε = p-quantile, with 0.01< p < 0.05.

Of these, maximum and mean are already implemented by using keyword scale. quantile is implemented but with a weird fixedrate argument, rather than using the scale keyword.

The std and median are easy to implement like mean. My current implementation is in matrices.jl (untested, since there's a bug in the RecurrenceMatrix) is

function _computedistances(x, y, metric::Metric)
    if x===y
        distlist = Vector{Real}(undef, length(x)*(length(x)-1)/2)
        index = 1
        @inbounds for i in 1:length(x)-1, j=(i+1):length(x)
            distlist[index] = evaluate(metric, x[i], y[j])
            index += 1
        end
    else
        distlist = Vector{Real}(undef, length(x)*length(y))
        index = 1
        @inbounds for xi in x, yj in y
            distlist[index] += evaluate(metric, xi, yj)
            index += 1
        end
    end
    return distlist
end

function _computescale(scale::typeof(var), x, y, metric::Metric)
    return Statistics.var(_computedistances(x, y, metric))
end

function _computescale(scale::typeof(median), x, y, metric::Metric)
    return Statistics.mean(_computedistances(x, y, metric))
end

  • ε > 5 * std of noise, in the case where the measurement device has a Gaussian noise distribution with a known std.

This one requires the user to know how their measurement device is. It cannot be done in general. However, suppose the user can assure the program that the noise in measurement device is significantly bigger than the evolution of the system over a short timescale, the program can then use a high-pass filter to get what's presumably pure noise from the measurement error, then calculate 5 * std of noise, and advice the user to select ε bigger than that.


  • ε = argmin |1 − Np(ε)/Nn(ε)|, where
    • Np is the number of "significant peaks" in the histogram, where each bin counts the total number of black pixels in a diagonal line in the recurrence plot.
    • Nn is the average number of neighbours, that is, the average number of black pixels in each vertical column in the recurrence plot.

This one can be implemented in a standalone function.

The "significant peaks" might be found by this peak finding algorithm.

Nn = RR * n, where n is the size of the dataset.


  • Draw RR vs ε on a loglog plot, and select ε in the middle of a linear region, since that's where a scaling law is working, and the dynamics is presumably found.

This might be a bit risky to automate, but if the user wants to use this method, it would be useful to have a function that draws the RR vs ε loglog plot, and let the user decide whether there is a scaling law region, and which ε in the region to use.

linear_regions can be used here

@yuxi-liu-wired
Copy link
Member

I just implemented the last two, but they are very slow and don't improve (in fact, they give worse epsilons).

I don't think they are worth it, but if you want, I'll put them into the pull commit too.

@Datseris
Copy link
Member Author

@pucicu are there any other methods we should consider here, besides what @yuxiliu1995 already wrote?

@yuxiliu1995 yes, include everything and put a comment for methods that you think are not good enough after you've tested them.

@pucicu
Copy link
Contributor

pucicu commented Feb 12, 2020 via email

@pucicu
Copy link
Contributor

pucicu commented Feb 12, 2020

I just implemented the last two, but they are very slow and don't improve (in fact, they give worse epsilons).

I don't think they are worth it, but if you want, I'll put them into the pull commit too.

The next-to-the-last one would be really helpful. Most of the others methods can be easily implemented by the user of the package, but this one would be more difficult. Therefore, if the package could provide this, it would be great.

@Datseris
Copy link
Member Author

Datseris commented Mar 3, 2020

Some work on this has started in #89 , anyone willing feel free to take over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bounty good first issue An easy issue, suitable for new contributors. wanted feature Feature we would like to have!
Projects
None yet
Development

No branches or pull requests

3 participants