
[OTHER] Question: AR coefficients, sklearn lags understanding #65

Open
lucheroni opened this issue Jul 24, 2021 · 13 comments · Fixed by #66
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

@lucheroni

Hello.

A very simple question.

Once I've estimated an autoregression as a wrapped linear model, say as

model_lr = get_sklearn_wrapper(LinearRegression, lags=10)
mlr = model_lr.fit(X[:-10], y[:-10])

how do I get the autoregression coefficients?

I cannot find the equivalent of scikit-learn's .coef_, like

mlr = LinearRegression().fit(X, y)
mlr.coef_

Thanks.

@pavelkrizek
Contributor

pavelkrizek commented Jul 24, 2021

Hello,

Thanks for the question. It's a good sign that we should extend the parts of the documentation which don't cover this.

All models have a model attribute which allows accessing the original estimator. In the case of sklearn the situation is a little bit more complicated, because the actual fit happens only in the predict method (the length of the X passed to predict is not known at the time of fitting). This is an implementation detail, but it means that for get_sklearn_wrapper you can access coef_ only after calling the predict method.

model_lr = get_sklearn_wrapper(LinearRegression, lags=10)
_ = model_lr.fit(X[:-10], y[:-10]).predict(X[-10:])
model_lr.model.coef_

@lucheroni
Author

lucheroni commented Jul 25, 2021 via email

@MichalChromcak added the documentation and question labels on Jul 26, 2021
@MichalChromcak
Collaborator

Hi @lucheroni,

You are right, there should be x1, x2, x3, ... and later on x95, x96, x97; x96, x97, x98, etc. Thanks for spotting it.

And yes, in this case y0, y1, y2 are passed as X to the LinearRegression and y3 as y. One could also formulate it relative to y as follows:

  • X = y(t-3), y(t-2), y(t-1)
  • y = y(t)
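The lag construction can be sketched in plain numpy (a minimal illustration; `make_lag_matrix` is a hypothetical helper name, not part of the library):

```python
import numpy as np

def make_lag_matrix(y, lags):
    # Each row holds `lags` consecutive values; the target is the value
    # immediately following that window.
    y = np.asarray(y, dtype=float)
    X = np.array([y[i:i + lags] for i in range(len(y) - lags)])
    target = y[lags:]
    return X, target

y = np.array([10., 11., 12., 13., 14.])
X, t = make_lag_matrix(y, lags=3)
# X -> [[10, 11, 12],
#       [11, 12, 13]]
# t -> [13, 14]
```

Each row of X is one lag window, and the wrapped LinearRegression is fitted on exactly this kind of matrix.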

@MichalChromcak
Collaborator

@lucheroni, @pavelkrizek I updated the picture, but don't want to close it in case there are still some open questions. Can I close it as is? :)

@MichalChromcak changed the title to "[OTHER] Question: AR coefficients, sklearn lags understanding" on Jul 26, 2021
@lucheroni
Author

I asked because I'm trying to check the in-sample predictions,
and I'm getting a strange behavior.
In the following snippet I define an AR(2) model with two parameters only
(no constant),
hence I'm expecting to get the first in-sample forecast as y2 = f(y0,y1).
I get the following, which I don't understand:

ar_ord = 2
# horizon
hz = 1
# ten values
train = np.array([0.,1.,0.,1.,0.,1.,0.,1.,0.,1.])
y = train
date_index = pd.date_range(start="2020-01-01", periods=len(y), freq="h")
X = pd.DataFrame(index=date_index)
#
model_lry = get_sklearn_wrapper(LinearRegression, lags=ar_ord, fit_intercept=False)
mlry_preds = model_lry.fit(X[:-hz], y[:-hz])
model_lry.model.coef_ # array([0., 1.]) - two coeffs, ok
#
model_lry.predict(X[-1:]) # 1 - 10th val, correct prediction
model_lry.predict(X[-2:]) # 1,0 - shouldn't it be 0,1?
model_lry.predict(X[-3:]) # 1,0,1
model_lry.predict(X[-4:]) # 1,0,1,0
model_lry.predict(X[-5:]) # 1,0,1,0,1
model_lry.predict(X[-6:]) # 1,0,1,0,1,0 - from 5th val on
model_lry.predict(X[-7:]) # ? 0,0,0,0,0,0,0 - getting zeros only

@ambader

ambader commented Jul 26, 2021

It's all right; the formula to predict an AR(2) variable is:

x_t = a1 * x_{t-1} + a2 * x_{t-2}

You might have gotten confused because the x's go back in time, so the array is 'reversed'.
Predicting the first value of your last prediction:

x_3 = a1 * x_2 + a2 * x_1
a1, a2 = [1., 0.]
1 * 0 + 0 * 1 = 0

and from then on, all predictions become 0.

I used this code to show the differences:

for i in range(1,8):
    print("model_lry.predict(X[-"+str(i)+":])")
    print(model_lry.model.coef_)
    print(model_lry.predict(X[-i:]))

There you'll also see that the coefficients change. That is indeed another source of confusion, as they are not labeled with the changing x_{t-1}, x_{t-2}. Hope that was what you wanted to understand.
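The recursion can be illustrated with a small hand-rolled forecaster (a sketch for intuition only; `ar_forecast` is not a library function):

```python
import numpy as np

def ar_forecast(history, coefs, steps):
    # coefs[0] multiplies x_{t-1}, coefs[1] multiplies x_{t-2}, ...
    buf = list(history)
    out = []
    for _ in range(steps):
        nxt = sum(c * buf[-(i + 1)] for i, c in enumerate(coefs))
        out.append(nxt)
        buf.append(nxt)  # each forecast feeds back into the lag window
    return np.array(out)

train = [0., 1., 0., 1., 0.]
print(ar_forecast(train, [0., 1.], 4))  # x_t = x_{t-2}: keeps alternating -> [1. 0. 1. 0.]
print(ar_forecast(train, [1., 0.], 4))  # x_t = x_{t-1}: collapses to zeros -> [0. 0. 0. 0.]
```

With coefficients [1., 0.] the first forecast repeats the last observed 0, and every later step feeds that 0 back in, which is exactly the all-zero tail seen above.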

@lucheroni
Author

Thank you for the clarification.

My point relates exactly to what you embedded in the code.

It seems that the library re-fits the model each time one calls .predict(), and it does so differently each time. That is at least what I get when I run the code fragment above - changing coefficients.

Usually, on the contrary, once you have prepared your design matrix, you fit the model once and the coefficients stay the same whatever you do with the model after fitting. In fact, in statsmodels .predict() is called a post-estimation method.

It seems to me that the library only registers model_lry.model.coef_ in the namespace after .predict(), and at that point it refits the model.

Am I wrong about that?

@ambader

ambader commented Jul 30, 2021

Finally found the problem.
First thing to notice:
"Actual model fitting is done in predict method since the way model is fitted
depends on prediction horizon which is known only during predict call."
(see help(model_lry.fit))
Therefore, the wrapper refits with every prediction, using the length of the prediction target as the horizon. But when I tried
model_lry = get_sklearn_wrapper(LinearRegression, lags=ar_ord, fit_intercept=False, optimize_for_horizon=False)
the coefficients were still altered. Digging into the code of wrappers/_sklearn.py finally revealed what I identify as the problem: when the prediction is called, the bool optimize_for_horizon is tested, but at that point the _X for the fitting regression is already set - always with the length of the X passed to predict as the horizon.
I forked the code and changed it; I hope that it is correct.

@pavelkrizek
Contributor

@ambader Could you please supply an example that reproduces the problem you are facing? Thank you.

@ambader

ambader commented Jul 30, 2021

It's still the same problem this whole thread is about:

ar_ord = 2
hz = 10
train_me = np.array([0.,1.,0.,1.,0.,1.,0.,1.,0.,1.])
train_me=np.tile(train_me,100)
y = train_me
date_index = pd.date_range(start="2020-01-01", periods=len(y), freq="d")
X = pd.DataFrame(index=date_index)

for i in range(1,8):
    model_lry = get_sklearn_wrapper(LinearRegression, lags=ar_ord, fit_intercept=False,optimize_for_horizon=False)
    print("model_lry.predict(X[-9:-" + str(i) + "])")
    model_lry.fit(X[:-11],y[:-11])
    print(model_lry.predict(X[-9:-i]))
    print(model_lry.model.coef_)

@pavelkrizek
Contributor

pavelkrizek commented Jul 30, 2021

Ok, this behavior is expected with the current implementation, as the model is a function of the training data, which are defined by the horizon. There are also other ways to implement a regressor as a time series forecaster, each with its advantages and drawbacks. PRs for new implementations are welcome :)

We also initially had the horizon explicitly passed to get_sklearn_wrapper, so the behavior was more understandable and deterministic, but this made its usage less flexible and added extra complexity (e.g. what to do when the X passed to predict doesn't have the same length as the provided horizon...)
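The trade-off can be sketched with a toy wrapper that mimics the fit-at-predict pattern (a hypothetical class for illustration, not hcrystalball's actual implementation), showing why the coefficients legitimately differ per horizon:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

class LazyARWrapper:
    """Sketch of the fit-at-predict idea: fit() only stores the series;
    the regression is (re)fitted inside predict(), because the design
    matrix depends on the horizon, which is only known at predict time."""

    def __init__(self, lags):
        self.lags = lags
        self.model = None

    def fit(self, y):
        self._y = np.asarray(y, dtype=float)
        return self

    def predict(self, horizon):
        lags, y = self.lags, self._y
        # each row holds `lags` consecutive values; the target sits
        # `horizon` steps after the end of the row
        n_rows = len(y) - lags - horizon + 1
        X = np.array([y[i:i + lags] for i in range(n_rows)])
        t = y[lags + horizon - 1:]
        self.model = LinearRegression(fit_intercept=False).fit(X, t)
        return self.model.predict(y[-lags:].reshape(1, -1))

y = np.array([0., 1.] * 50)
m = LazyARWrapper(lags=2).fit(y)
m.predict(1)
coef_h1 = m.model.coef_.copy()  # horizon 1 learns x_t = x_{t-2}
m.predict(2)
coef_h2 = m.model.coef_.copy()  # horizon 2 learns a different mapping
# the coefficients differ because each horizon fits its own regression
```

On the alternating 0/1 series, the horizon-1 regression recovers roughly [1, 0] and the horizon-2 regression roughly [0, 1], which is the "changing coefficients" effect observed in this thread.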

@lucheroni
Author

lucheroni commented Jul 30, 2021 via email

@ambader

ambader commented Jul 30, 2021

I understand why the horizon is set by default, but if I explicitly try to change it, it should not just be ignored without any comment. And if that behavior is intended, why does predict check for optimize_for_horizon but not _predict?

4 participants