Rethinking timeseries dimensionality #6090
Replies: 6 comments 15 replies
---
CC @pymc-devs/dev-core
---
I don't think I really have a good overview of this yet, but just a couple of thoughts so far...

```python
y = GRW(drift=1, shape=100)
logps = y.logp(np.random.randn(100))
```

What does …
---
(Drift) parameters aside, I have a very strong prior that we should stick to the BTS result-shape interpretation. This is in line with our univariate RVs, by placing the support dimensions at the end. Did I understand correctly that the reason for the inconvenient need to broadcast the parameters internally is the fact that NumPy treats …
---
From a pure programming point of view, since most of the more complex time series models rely on …
---
Do we want to consider what a mixture of timeseries would mean? That could give strong opinions on TBS vs BTS and on whether we should consider time core or not. If we can mix within time, it must not be core; otherwise we can only mix across the whole timeseries. Note that TBS really becomes problematic if time is core, because there's an implicit contract that core dimensions are always to the right. Although there was some discussion at Aesara of allowing this to be parametrized (for different reasons): aesara-devs/aesara#1040 (comment)
---
I am not sure how relevant this is to this specific discussion (I would guess it isn't), but given that the issue has been closed and I have been tagged, here are my five cents on pointwise logps for timeseries.

The initial post defines "batch", "time" and "support" dimensions. There is also mention of "core" dimensions, which I am not completely sure I understand. To me, core dimensions are purely conceptual and model dependent, mostly related to how we do model comparison. A timeseries-unrelated example: we could have a …

In time series, the time dimension can be a MvNormal or a GP, but it can also be an AR or even a simple linear regression. In time series the logical cross-validation strategy is leave-future-out (LFO), which can't be approximated as well with PSIS, but if we are able to compute the pointwise log likelihood conditioned on past data, we can estimate LFO with, say, 4 refits instead of the 200 that would be needed for brute-force LFO-CV. Reference: https://doi.org/10.1080/00949655.2020.1783262 (Table 1 has empirical results about the number of refits needed in multiple cases).

I want this data to be easily accessible (maybe even retrieved directly by the converter to InferenceData), because even if there are many cases where this can't be used, there are many where it can, and it allows for much faster and cheaper results.
---
If a single univariate gaussian random walk (GRW) with 100 time steps has a shape of `(100,)`, what is the shape of three such GRWs? `(3, 100)` or `(100, 3)`? PyMC (V4 at least) says `(3, 100)`.

What about a 2D multivariate gaussian random walk (MvGRW)? I assume it would have a shape of `(100, 2)`. And three of them? `(100, 3, 2)` or `(3, 100, 2)`? I think PyMC right now would say `(3, 100, 2)`, but we haven't refactored it yet.

Let's abbreviate the current PyMC approach as BTS (batched dimensions, time dimension, support dimensions) and the alternative one as TBS (time dimension, batched dimensions, support dimensions).
Why does it matter? Because broadcasting pads missing dimensions to the left, BTS leads one to consider the time dimension a support dimension, which means parameters cannot change over time. Let's see why.
The univariate case
Just to refresh, the simplest case is:
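Something like this, in the `GRW(drift=..., shape=...)` shorthand used elsewhere in the thread (illustrative pseudo-code, not the actual PyMC signature):

```python
# One GRW with 100 steps and a scalar drift; draws have shape (100,)
y = GRW(drift=1, shape=(100,))
```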
We can create batches by doing:
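In the same pseudo-notation:

```python
# Three GRWs with drifts 0, 1 and 2; in BTS, draws have shape (3, 100)
y = GRW(drift=[0, 1, 2], shape=(3, 100))
```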
Note that `[0, 1, 2]` can't broadcast to `(3, 100)` naturally, but because we consider time a core dimension we never attempt to do that. Under the hood we add a degenerate dimension to the right so that drift has shape `(3, 1)`:
`pymc/pymc/distributions/timeseries.py`, lines 209–213 @ 8f02bea
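Schematically, the linked lines amount to something like this (plain NumPy, for illustration):

```python
import numpy as np

drift = np.asarray([0, 1, 2])  # shape (3,)
drift = drift[..., None]       # shape (3, 1), which broadcasts against (3, 100)
```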
And we have to do that everywhere...
`pymc/pymc/distributions/timeseries.py`, lines 299–301 @ 8f02bea
`pymc/pymc/distributions/timeseries.py`, lines 318–320 @ 8f02bea
Why don't we let parameters change over time? If the following worked
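Presumably something like this (pseudo-notation, reconstructed from context), with the drift aligned against the time dimension:

```python
import numpy as np

# A drift that changes at every time step of a single 100-step GRW
y = GRW(drift=np.arange(100), shape=(100,))
```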
Then it would make batching with constant parameters cumbersome, or at least less intuitive. To obtain the same result as in the batching example above, the user would now need to manually add the degenerate dimension themselves:
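In pseudo-notation:

```python
import numpy as np

# Constant per-batch drifts now need an explicit degenerate time axis,
# so they don't get aligned with the time dimension
y = GRW(drift=np.array([0, 1, 2])[:, None], shape=(3, 100))  # drift is (3, 1)
```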
What about TBS? Due to broadcasting automatically to the left, both cases would be relatively intuitive:
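Again in pseudo-notation:

```python
import numpy as np

# Batching with constant parameters: the (3,) drift aligns with the
# rightmost (batch) dimension and broadcasts naturally to (100, 3)
y = GRW(drift=[0, 1, 2], shape=(100, 3))

# Time-varying parameters need no tricks either
y = GRW(drift=np.arange(100), shape=(100,))
```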
The multivariate case
Spoiler: it's the same as in the univariate case; you can skip this section if multivariates don't confuse you.
The advantage of TBS over BTS is similar in the MvGRW case.
Again the simplest case (which, for the user, is the same in BTS and TBS) is:
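In the same spirit, with a hypothetical `MvGRW(mu=..., cov=..., shape=...)` constructor (again pseudo-code):

```python
import numpy as np

# One 2D MvGRW with 100 steps; draws have shape (100, 2)
y = MvGRW(mu=np.zeros(2), cov=np.eye(2), shape=(100, 2))
```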
Assuming we don't allow parameters to change over the time dimension, creating batches in BTS is relatively intuitive.
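For example:

```python
# Three 2D MvGRWs with different means; BTS draws have shape (3, 100, 2)
y = MvGRW(mu=np.zeros((3, 2)), cov=np.eye(2), shape=(3, 100, 2))
```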
Note again that, under the hood, we have to add a degenerate dimension at `axis=-2` to be able to take `(3, 100, 2)` draws.

Instead, if we were to allow parameters to change over time like this:
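(pseudo-notation, reconstructed from context)

```python
# A mean that changes at every time step, aligned as (time, support)
y = MvGRW(mu=np.random.randn(100, 2), cov=np.eye(2), shape=(100, 2))
```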
Users would need to manually add the degenerate dimension for batching with constant drift:
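In pseudo-notation:

```python
# Constant per-batch means need an explicit degenerate time axis at axis=-2
mu = np.zeros((3, 2))[:, None, :]  # shape (3, 1, 2)
y = MvGRW(mu=mu, cov=np.eye(2), shape=(3, 100, 2))
```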
In contrast, in TBS the two cases would simply look like:
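(pseudo-notation)

```python
# TBS batching with constant parameters: mu (3, 2) broadcasts to (100, 3, 2)
y = MvGRW(mu=np.zeros((3, 2)), cov=np.eye(2), shape=(100, 3, 2))

# TBS time-varying parameters: mu (100, 2) matches (100, 2) directly
y = MvGRW(mu=np.random.randn(100, 2), cov=np.eye(2), shape=(100, 2))
```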
This applies to all timeseries
This distinction is also important for other timeseries. For instance, in `AR` we use Scan, which naturally accumulates results along the leftmost dimension. In order to keep BTS, we have to shuffle and add degenerate dimensions all over the place:

`pymc/pymc/distributions/timeseries.py`, lines 558–567 @ 8f02bea
`pymc/pymc/distributions/timeseries.py`, line 641 @ 8f02bea
`pymc/pymc/distributions/timeseries.py`, lines 622–632 @ 8f02bea
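To see why scan is naturally TBS, here is a small self-contained Aesara sketch of a batched AR(1) recursion (not PyMC's actual implementation, just an illustration of how scan stacks results):

```python
import numpy as np
import aesara
import aesara.tensor as at

innovations = at.matrix("innovations")  # (time, batch): TBS layout
rho = at.vector("rho")                  # one AR coefficient per batch

def step(eps_t, y_prev, rho):
    # y_t = rho * y_{t-1} + eps_t, vectorized over the batch dimension
    return rho * y_prev + eps_t

# scan iterates over the leftmost axis of `sequences` and stacks the
# per-step outputs along a new leftmost axis, so time ends up first
ys, _ = aesara.scan(
    fn=step,
    sequences=[innovations],
    outputs_info=[at.zeros_like(rho)],
    non_sequences=[rho],
)

f = aesara.function([innovations, rho], ys)
out = f(np.random.randn(100, 3), np.array([0.5, 0.9, 0.99]))
print(out.shape)  # (100, 3): time first, batch second
```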
With TBS, we could again allow for parameters (rho, sigma) to change over time without making batching with constant parameters more cumbersome.
Logp considerations
A slightly more technical consequence of BTS, when we consider time a support dimension (again, to avoid cumbersome batching with constant parameters), is that we must also collapse the logp across that dimension. Otherwise, other distributions that rely on this contract, like Mixture, would eventually fail.
But this means we never have access to the logp per time step, which may be useful for comparing timeseries models.
This also means we cannot simply use the logp derived automatically by Aeppl for timeseries graphs. If we followed TBS, the following hack in #6072 would not be needed:
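(The actual diff lives in #6072 and isn't reproduced here. Schematically, the BTS workaround boils down to reducing the pointwise logp over the time axis:)

```python
# Schematic only: collapse the per-time-step logp so that time behaves
# like a support dimension, as the BTS contract requires
logp = logp.sum(axis=-1)  # the per-step logp is lost at this point
```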
Steps parameter
Not so important, feel free to skip
The steps argument, on its own, is useless in TBS. We would not know whether it is supposed to match the first dimension of the parameters or to batch them. If it always matched the first dimension, batching with constant parameters would become cumbersome in TBS, because users would have to add a degenerate dimension again. If it never matched, we would lose support for time-varying parameters. In order to safely interpret steps, we need to know the explicit `shape`/`size`, in which case we no longer need steps, since it's simply `shape[0]`/`size[0]`.
Conclusion
TBS has some advantages, but the downside that 3 GRWs of 100 steps would now have a shape of `(100, 3)`, and 3 2D-MvGRWs of 100 steps would have a shape of `(100, 3, 2)`. @aseyboldt suggests it may also be less performant when computing the logp due to memory contiguity. @junpenglao mentions that scan-based timeseries will always need inputs to be TBS internally anyway.
Otherwise, we could keep using BTS but force users to always define the time dimension in the parameters (i.e. cumbersome/error-prone [citation needed] batching of constant parameters). That sounds reasonable as well, but it will hurt users at first. @aseyboldt thinks this is intuitive. I have changed my preference to this approach.
Or we could just keep BTS without the possibility of defining time-varying parameters, as we do now. That sounds unnecessarily restrictive.
What do you think?
This was brought up in #5741, #5972 and #6072 (comment)