Hi, I'm quite inexperienced regarding Reinforcement Learning so forgive me if my question is trivial :). I have a quick question about the continue predictor.
In a typical Gym environment with an agent following a random policy, I've seen things like
```python
for _ in range(num_episodes):                                    # 1
    # First observation of an episode                            # 2
    obs, info = gym_env.reset()                                  # 3
                                                                 # 4
    done = False                                                 # 5
    while not done:                                              # 6
        action = gym_env.action_space.sample()                   # 7
        observation, reward, done, _, _ = gym_env.step(action)   # 8
```
The continue predictor is supposed to predict whether an episode will terminate or not. The way I see it, for each non-episode-initializing step (lines 7-8) we get (see the small sketch after this list for how I line these up):
- an action, $a_t$
- a reward resulting from the action, $r_t$
- a "next" observation as a result of the action, $x_t$
- a "done" (or, alternatively, continue) flag indicating whether the episode has terminated, $c_t$
My question is: do we use $x_t$ to predict $c_t$? More specifically, does the stochastic posterior incorporate $x_t$ so that the "model state" (concatenation of deterministic state and stochastic state) is used to predict $c_t$?
Another way of asking the question: do we use the observation retrieved at the same step at which we also receive the continue flag to predict that continue flag? I.e., in the line `observation, reward, done, _, _ = gym_env.step(action)`, is `observation` incorporated into the stochastic state, which is then used to help predict `done`?
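To make the question concrete, here is a very rough toy sketch of the structure I have in mind (all names, layer sizes, and the simplification of the posterior to a plain linear layer are my own, not taken from the actual model):

```python
import torch
import torch.nn as nn

# Toy sketch of my mental model of one RSSM-style step (my own names, not the repo's).
# The question is whether the posterior z_t sees x_t, so that the model state
# [h_t, z_t] is what the continue head uses to predict c_t.
class ToyRSSMStep(nn.Module):
    def __init__(self, obs_dim, act_dim, det_dim=64, stoch_dim=16):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + act_dim, det_dim)        # deterministic path
        self.posterior = nn.Linear(det_dim + obs_dim, stoch_dim)   # q(z_t | h_t, x_t)?
        self.continue_head = nn.Linear(det_dim + stoch_dim, 1)     # p(c_t | h_t, z_t)?

    def forward(self, h_prev, z_prev, a_prev, x_t):
        # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        h_t = self.gru(torch.cat([z_prev, a_prev], dim=-1), h_prev)
        # Does the posterior incorporate x_t like this?
        z_t = self.posterior(torch.cat([h_t, x_t], dim=-1))
        # ...so that the concatenated model state predicts the continue flag?
        c_t_logit = self.continue_head(torch.cat([h_t, z_t], dim=-1))
        return h_t, z_t, c_t_logit
```

The line building `z_t` is exactly the part I'm unsure about: whether $x_t$ enters there, so that the concatenated $(h_t, z_t)$ is what the continue head sees when predicting $c_t$.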
Thanks in advance!