Online simulator for RL-based recommendation
conda create -n KRL python=3.8
conda activate KRL
conda install pytorch torchvision -c pytorch
conda install pandas matplotlib scikit-learn tqdm ipykernel
python -m ipykernel install --user --name KRL --display-name "KRL"
See preprocess/KuaiRandDataset.ipynb for details
1.1 Immediate User Response Model
Example raw data format in preprocessed KuaiRand:
(session_id, request_id, user_id, video_id, date, time, is_click, is_like, is_comment, is_forward, is_follow, is_hate, long_view)
Example item meta data format in preprocessed KuaiRand:
(video_id, video_type, upload_type, music_type, log_duration, tag)
Example user meta data format in preprocessed KuaiRand:
(user_active_degree, is_live_streamer, is_video_author, follow_user_num_range, fans_user_num_range, friend_user_num_range, register_days_range, onehot_feat{0,1,6,9,10,11,12,13,14,15,16,17})
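For a quick sanity check of the preprocessed tables, a minimal pandas sketch is shown below. The file paths are placeholders; use whatever paths preprocess/KuaiRandDataset.ipynb writes on your machine.

```python
import pandas as pd

# Placeholder paths -- substitute the files produced by preprocess/KuaiRandDataset.ipynb.
log_df = pd.read_csv("dataset/kuairand/log.csv")        # interaction log, one row per exposed item
item_df = pd.read_csv("dataset/kuairand/item_meta.csv") # item meta data
user_df = pd.read_csv("dataset/kuairand/user_meta.csv") # user meta data

# Multi-behavior feedback signals used as immediate user responses
feedback_cols = ["is_click", "is_like", "is_comment", "is_forward",
                 "is_follow", "is_hate", "long_view"]
print(log_df[["session_id", "request_id", "user_id", "video_id"] + feedback_cols].head())
print(item_df[["video_id", "video_type", "upload_type", "music_type",
               "log_duration", "tag"]].head())
```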
bash train_multi_behavior_user_response.sh
Note: the multi-behavior user response model contains a state_encoder that is assumed to be the ground-truth user state transition model.
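To make the role of the state_encoder concrete, here is a minimal PyTorch sketch of a multi-behavior user response model. The class structure, layer choices, and tensor shapes are illustrative assumptions, not the exact architecture in this repository.

```python
import torch
import torch.nn as nn

class MultiBehaviorUserResponse(nn.Module):
    """Illustrative sketch: a state_encoder tracks the user state (treated as the
    ground-truth state transition), and a shared head scores each behavior."""
    def __init__(self, item_dim=32, state_dim=64, n_behaviors=7):
        super().__init__()
        self.state_encoder = nn.GRU(item_dim, state_dim, batch_first=True)
        self.behavior_head = nn.Linear(state_dim + item_dim, n_behaviors)

    def forward(self, history_items, candidate_items):
        # history_items: (B, T, item_dim); candidate_items: (B, K, item_dim)
        _, h = self.state_encoder(history_items)                  # h: (1, B, state_dim)
        state = h[-1].unsqueeze(1).expand(-1, candidate_items.size(1), -1)
        logits = self.behavior_head(torch.cat([state, candidate_items], dim=-1))
        return torch.sigmoid(logits)   # per-item probabilities of click, like, ..., long_view

model = MultiBehaviorUserResponse()
probs = model(torch.randn(4, 10, 32), torch.randn(4, 6, 32))  # shape (4, 6, 7)
```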
1.2 User Retention Model
Pick a multi-behavior user response model for cross-session generation and retention model training, and change the shell script accordingly (by setting the keyword 'KRMB_MODEL_KEY').
Generate user retention data in format:
(session_id, user_id, session_enc, return_day)
bash generate_session_data.sh
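The return_day label can be derived from the interaction log by measuring the gap between consecutive sessions of the same user. The sketch below assumes the date column is stored as YYYYMMDD and uses a placeholder path; the authoritative logic is in generate_session_data.sh and the code it calls.

```python
import pandas as pd

log_df = pd.read_csv("dataset/kuairand/log.csv")                 # placeholder path
log_df["date"] = pd.to_datetime(log_df["date"].astype(str), format="%Y%m%d")  # assumes YYYYMMDD

# First day of each (user, session), then the gap to the same user's next session
sessions = (log_df.groupby(["user_id", "session_id"], as_index=False)["date"].min()
                  .sort_values(["user_id", "date"]))
sessions["return_day"] = (sessions.groupby("user_id")["date"].shift(-1)
                          - sessions["date"]).dt.days
retention_df = sessions.dropna(subset=["return_day"])            # a user's last session has no label
```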
Evaluation metrics and protocol
List-wise reward (L-reward) is the average of the item-wise immediate rewards in a recommended list. We report both the average L-reward and the max L-reward across user requests in a mini-batch.
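Assuming an item-wise reward tensor of shape (batch_size, slate_size), the two L-reward statistics can be computed as in this sketch:

```python
import torch

item_reward = torch.rand(128, 6)         # item-wise immediate rewards (assumed shape)

l_reward = item_reward.mean(dim=1)       # list-wise reward of each request
avg_l_reward = l_reward.mean().item()    # average L-reward over the mini-batch
max_l_reward = l_reward.max().item()     # max L-reward over the mini-batch
```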
Reward-based NDCG (R-NDCG) generalizes the standard NDCG metric: the item-wise reward becomes the relevance label, and the IDCG is agnostic to the model being evaluated. Reward-weighted mean reciprocal rank (R-MRR) generalizes the standard MRR metric by replacing the binary item label with the item-wise reward. For both metrics, a larger value means that the learned policy performs better on the offline data.
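A sketch of both metrics for a single list is given below; the function names are ours and the exact normalization used in this repository may differ.

```python
import numpy as np

def r_ndcg(rewards, scores):
    """Items are ranked by the policy's scores, item-wise rewards serve as relevance
    labels, and the IDCG uses the reward-optimal ordering (model-agnostic)."""
    discounts = np.log2(np.arange(2, len(rewards) + 2))
    dcg = (rewards[np.argsort(-scores)] / discounts).sum()
    idcg = (np.sort(rewards)[::-1] / discounts).sum()
    return dcg / max(idcg, 1e-8)

def r_mrr(rewards, scores):
    """Reciprocal ranks weighted by item-wise rewards instead of binary labels."""
    ranks = np.empty(len(scores), dtype=float)
    ranks[np.argsort(-scores)] = np.arange(1, len(scores) + 1)
    return float((rewards / ranks).mean())
```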
Coverage describes the number of distinct items exposed in a mini-batch.
Intra-list diversity (ILD) estimates the embedding-based dissimilarity between items in each recommended list.
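Both metrics can be computed directly from the recommended item ids and their embeddings; a sketch with assumed shapes follows.

```python
import torch
import torch.nn.functional as F

def coverage(recommended_item_ids):
    """Number of distinct items exposed in a mini-batch; ids: (batch_size, slate_size)."""
    return recommended_item_ids.unique().numel()

def intra_list_diversity(item_embeddings):
    """Average pairwise cosine dissimilarity within one list; embeddings: (slate_size, dim).
    The repository may use a different dissimilarity; this is only an illustration."""
    normed = F.normalize(item_embeddings, dim=-1)
    sim = normed @ normed.T
    k = sim.size(0)
    off_diag_mean = (sim.sum() - sim.diagonal().sum()) / (k * (k - 1))
    return float(1.0 - off_diag_mean)
```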
bash train_{model name}_krpure_requestlevel.sh
Algorithm | Average L-reward | Max L-reward | Coverage | ILD |
---|---|---|---|---|
CF | 2.253 | 4.039 | 100.969 | 0.543 |
ListCVAE | 2.075 | 4.042 | 446.100 | 0.565 |
PRM | 2.174 | 3.811 | 27.520 | 0.53 |
Whole-session user interaction involves multiple request-feedback loops.
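The toy, self-contained rollout below illustrates the request-feedback loop; the dynamics and names are made up and do not reflect KuaiSim's API.

```python
import random

def toy_session(slate_size=6, leave_prob=0.07):
    """One whole-session episode: recommend a list per request, accumulate the
    immediate reward, and stop when the simulated user leaves."""
    depth, total_reward = 0, 0.0
    while True:
        slate = [random.randrange(1000) for _ in range(slate_size)]      # policy's recommendation
        reward = sum(random.random() < 0.3 for _ in slate) / slate_size  # immediate feedback
        total_reward += reward
        depth += 1
        if random.random() < leave_prob:      # user leaves; session terminates
            return depth, total_reward

print(toy_session())
```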
Evaluation metrics and protocol
Whole-session reward: the total reward is the sum of immediate rewards within a session, averaged over sessions; the average reward is the total reward divided by the number of requests in the session (i.e., the per-request reward).
Depth represents how many interactions occur before the user leaves.
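Given a rollout log with one row per request, the per-session metrics can be aggregated as in this sketch (column names are assumptions):

```python
import pandas as pd

rollout = pd.DataFrame({
    "session_id": [0, 0, 0, 1, 1],
    "reward":     [0.5, 0.7, 0.6, 0.4, 0.9],
})
per_session = rollout.groupby("session_id")["reward"].agg(
    total_reward="sum", depth="size", average_reward="mean")
print(per_session.mean())   # metrics averaged over sessions
```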
bash train_{model name}_krpure_wholesession.sh
Algorithm | Depth | Average reward | Total reward | Coverage | ILD |
---|---|---|---|---|---|
TD3 | 14.63 | 0.6476 | 9.4326 | 24.20 | 0.9864 |
A2C | 14.02 | 0.5950 | 8.3905 | 27.41 | 0.9870 |
DDPG | 14.89 | 0.6841 | 10.0873 | 20.95 | 0.9850 |
HAC | 14.98 | 0.6895 | 10.1742 | 35.70 | 0.9874 |
User retention happens after the user leaves the previous session and marks the beginning of the next session.
Evaluation metrics and protocol
Return time is the average time gap between the last request of a session and the first request of the user's next session.
User retention is the average ratio of users returning to the system for another session.
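Both cross-session metrics can be read off the per-session start times; a sketch with made-up data follows (user retention is illustrated here as the next-day return rate, which may differ from the exact definition used in the code).

```python
import pandas as pd

sessions = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "start_day": [0, 3, 4, 0, 7],   # day of each session's first request
}).sort_values(["user_id", "start_day"])

gaps = sessions.groupby("user_id")["start_day"].diff().dropna()
return_time = gaps.mean()            # average gap (in days) before the next session
user_retention = (gaps <= 1).mean()  # fraction of sessions followed by a next-day return
print(return_time, user_retention)
```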
bash train_{model name}_krpure_crosssession.sh
Algorithm | Return time ↓ | User retention ↑ |
---|---|---|
CEM | 3.573 | 0.572 |
TD3 | 3.556 | 0.581 |
RLUR | 3.481 | 0.607 |
To check the training curves, see:
TrainingObservation.ipynb
Training other simulators:
bash train_ddpg_krpure_wholesession_{simulator name}.sh
Training on the ML-1M dataset:
bash train_ddpg_krpure_wholesession_ml.sh
[1] Zhao, K., Liu, S., Cai, Q., Zhao, X., Liu, Z., Zheng, D., ... & Gai, K. (2023). KuaiSim: A comprehensive simulator for recommender systems. arXiv preprint arXiv:2309.12645.
Please cite the paper if you use this code in your work:
@article{zhao2023kuaisim,
title={KuaiSim: A comprehensive simulator for recommender systems},
author={Zhao, Kesen and Liu, Shuchang and Cai, Qingpeng and Zhao, Xiangyu and Liu, Ziru and Zheng, Dong and Jiang, Peng and Gai, Kun},
journal={arXiv preprint arXiv:2309.12645},
year={2023}
}