Combining (deep) reinforcement learning and goals based investing in QWIM - FinRL
Reinforcement learning is a machine learning training strategy that rewards desirable actions and penalizes undesirable ones. In general, a reinforcement learning agent senses its environment, acts in it, and learns by trial and error.
The RL algorithms in FinRL are based on openai/baselines. The one used in this project is Deep Deterministic Policy Gradient (DDPG), an algorithm that concurrently learns a Q-function and a policy: it uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy. DDPG contains two major networks, an actor and a critic, where the critic estimates the Q-function.
The Q network and policy network are similar to those in a basic Advantage Actor-Critic setup, except that in DDPG the actor maps states directly to actions (the network's output is the action itself) rather than outputting a probability distribution over a discrete action space. The learned networks are slowly tracked by target networks, which are time-delayed copies of the original networks. Using these target networks considerably improves learning stability. This is because, in approaches without target networks, the update targets depend on the values produced by the very network being trained, which makes the updates prone to divergence.
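The "slow tracking" of the target networks is commonly implemented with Polyak averaging. Below is a minimal sketch of that soft update on toy weight arrays; the function name and `tau` value are illustrative, not FinRL's actual internals.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: the target network slowly tracks the online network.

    target <- tau * online + (1 - tau) * target
    """
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]

# Toy example: one "layer" of weights for each network.
online = [np.array([1.0, 1.0])]
target = [np.array([0.0, 0.0])]
target = soft_update(target, online, tau=0.1)  # target moves 10% toward online
```

With a small `tau`, the target values used in the Bellman backup change slowly even when the online network changes quickly, which is the stabilizing effect described above.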
Technical indicators are heuristic or pattern-based indications generated by a security's or contract's price, volume, and/or open interest, which are employed by traders who apply technical analysis. Technical analysts utilize indicators to forecast future price changes by evaluating previous data.
Most of them are momentum indicators. A momentum indicator (oscillator) is a technical indicator that compares current and previous values to illustrate trend direction and assess the rate of price change. It is one of the main tools for determining the pace at which a security moves. In most cases, a momentum indicator is used in conjunction with other indicators.
The technical indicators used in this project are MACD, BOLL, RSI, CCI, and ADX.
- MACD = EMA(TP,12) − EMA(TP,26)
- BOLLU = MA(TP,20)+2∗stdDev(TP,20)
- BOLLD = MA(TP,20)−2∗stdDev(TP,20)
- RSI = 100 − 100 / (1 + avg. gain / avg. loss)
- CCI = (TP − SMA(TP,20)) / (0.015 × Mean Deviation(TP,20))
- DX = |+DI − (−DI)| / (+DI + (−DI)) × 100, with ADX the smoothed moving average of DX
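As a concrete illustration of the formulas above, here is a minimal pandas sketch computing MACD, RSI, and the Bollinger bands on a synthetic price series. The real project derives these features through FinRL's preprocessing pipeline; this standalone version only mirrors the definitions listed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.normal(0, 1, 300)))  # synthetic prices

# MACD: fast EMA minus slow EMA.
macd = (close.ewm(span=12, adjust=False).mean()
        - close.ewm(span=26, adjust=False).mean())

# RSI(14): 100 - 100 / (1 + avg gain / avg loss).
delta = close.diff()
gain = delta.clip(lower=0).rolling(14).mean()
loss = (-delta.clip(upper=0)).rolling(14).mean()
rsi = 100 - 100 / (1 + gain / loss)

# Bollinger bands: 20-period moving average +/- 2 standard deviations.
ma20 = close.rolling(20).mean()
std20 = close.rolling(20).std()
boll_u = ma20 + 2 * std20
boll_d = ma20 - 2 * std20
```

Each indicator becomes one column of the observation the RL agent sees at every step, alongside prices and holdings.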
The data source is yfinance, which pulls data from Yahoo Finance. The asset universe is the Dow Jones 30 stocks. The training set ranges from 2017-01-01 to 2021-01-01, and the trading set from 2021-01-04 to 2021-11-02. The algorithm stops updating after the training period.
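The train/trade split by date can be sketched as below. A synthetic DataFrame stands in for the yfinance download (which would require network access); the split boundaries match the dates stated above.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the yfinance download of the Dow 30.
dates = pd.bdate_range("2017-01-01", "2021-11-02")
df = pd.DataFrame({"date": dates, "close": 100 + 0.01 * np.arange(len(dates))})

# Split as in the project: train on 2017-01-01..2021-01-01,
# trade on 2021-01-04..2021-11-02; the agent is frozen after training.
train = df[(df["date"] >= "2017-01-01") & (df["date"] <= "2021-01-01")]
trade = df[(df["date"] >= "2021-01-04") & (df["date"] <= "2021-11-02")]
```

Keeping the trading window strictly after the training window is what makes the evaluation out-of-sample.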
Under "StockPortfolioEnv", "hmax" defines the maximum number of shares to trade, "initial_amount" the starting capital for training, and "transaction_cost_pct" the transaction cost percentage per trade: ["hmax","initial_amount","transaction_cost_pct"]
Under "MODEL_PARAMS" in each branch, each RL algorithm's parameters can be set separately. For DDPG, for example, the parameters are: ["batch_size","buffer_size","learning_rate","total_timesteps"]
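The two parameter groups above can be written as plain dictionaries in the FinRL style. The key names come from the text; the numeric values below are illustrative placeholders, not the project's actual settings.

```python
# Environment settings for StockPortfolioEnv (values are placeholders).
env_kwargs = {
    "hmax": 100,                    # max shares traded per order
    "initial_amount": 1_000_000,    # starting cash for training
    "transaction_cost_pct": 0.001,  # cost per trade, as a fraction
}

# DDPG hyperparameters under MODEL_PARAMS (values are placeholders).
DDPG_PARAMS = {
    "batch_size": 128,
    "buffer_size": 50_000,
    "learning_rate": 0.001,
    "total_timesteps": 50_000,
}
```

In FinRL these dictionaries are passed when constructing the environment and the agent, so swapping algorithms mostly means swapping the parameter dictionary.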
We use a 60-day lookback window over the 4-year training period. We use QuantStats for the performance report, which compares the strategy to the minimum-variance portfolio of the asset universe.
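For reference, the minimum-variance benchmark can be sketched in closed form. This is the unconstrained, fully-invested version w = Σ⁻¹1 / (1ᵀΣ⁻¹1), which allows short positions; a long-only benchmark would instead need a constrained optimizer.

```python
import numpy as np

def min_variance_weights(returns: np.ndarray) -> np.ndarray:
    """Unconstrained fully-invested minimum-variance weights.

    w = Sigma^{-1} 1 / (1' Sigma^{-1} 1). Shorting is allowed here,
    unlike a long-only optimizer a production report might use.
    """
    cov = np.cov(returns, rowvar=False)   # sample covariance of asset returns
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)        # Sigma^{-1} 1
    return w / w.sum()                    # normalize to sum to 1

rng = np.random.default_rng(42)
rets = rng.normal(0.0005, 0.01, size=(500, 5))  # 500 days, 5 synthetic assets
w = min_variance_weights(rets)
```

QuantStats then compares the RL strategy's daily returns against the returns of this benchmark portfolio over the trading period.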