This repository contains the code for a text-to-gesture generation model based on a convolutional neural network (CNN).
The demonstration video of generated gestures is available at https://youtu.be/JX4Gqy-Rmso.
Eiichi Asakawa, Naoshi Kaneko, Dai Hasegawa, and Shinichi Shirakawa: Evaluation of text-to-gesture generation model using convolutional neural network, Neural Networks, Elsevier, Vol. 151, pp. 365-375, Jul. 2022. [DOI]
If you use this code for your research, please cite our paper:
```bibtex
@article{AsakawaNEUNET2022,
  author  = {Eiichi Asakawa and Naoshi Kaneko and Dai Hasegawa and Shinichi Shirakawa},
  title   = {Evaluation of text-to-gesture generation model using convolutional neural network},
  journal = {Neural Networks},
  volume  = {151},
  pages   = {365--375},
  year    = {2022},
  doi     = {https://doi.org/10.1016/j.neunet.2022.03.041}
}
```
We used PyTorch version 1.7.1 for the neural network implementation. We tested the code in the following environment (a quick environment check follows the list):
- Ubuntu 16.04 LTS
- GPU: NVIDIA GeForce GTX 1080Ti
- Python environment: anaconda3-2020.07
- fasttext
- cv2 (4.4.0)
- ffmpeg
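Before preparing the dataset, the environment can be sanity-checked with a short script like the following (a minimal sketch, not part of the repository; it only confirms that the libraries above import and that a GPU and ffmpeg are visible):

```python
import shutil

import cv2
import fasttext  # noqa: F401  (imported only to confirm the package is installed)
import torch

print("PyTorch:", torch.__version__)               # tested with 1.7.1
print("OpenCV:", cv2.__version__)                  # tested with 4.4.0
print("CUDA available:", torch.cuda.is_available())
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```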
- Our code uses the speech and gesture dataset provided by Ginosar et al. Download the Speech2Gesture dataset by following the instructions under "Download specific speaker data" at https://github.com/amirbar/speech2gesture/blob/master/data/dataset.md.
Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik, "Learning Individual Styles of Conversational Gesture," CVPR 2019.
After downloading the Speech2Gesture dataset, your dataset folder should look like this:
```
Gestures
├── frames_df_10_19_19.csv
├── almaram
│   ├── frames
│   ├── keypoints_all
│   ├── keypoints_simple
│   └── videos
...
└── shelly
    ├── frames
    ├── keypoints_all
    ├── keypoints_simple
    └── videos
```
- Download the text dataset from HERE and extract the zip file.
- Move the `words` directory in each speaker's directory to the corresponding speaker's directory in your dataset folder.
After this step, your dataset folder should look like this:
```
Gestures
├── frames_df_10_19_19.csv
├── almaram
│   ├── frames
│   ├── keypoints_all
│   ├── keypoints_simple
│   ├── videos
│   └── words
...
└── shelly
    ├── frames
    ├── keypoints_all
    ├── keypoints_simple
    ├── videos
    └── words
```
Note that the amount of word data for the speaker Jon is very small; therefore, it should not be used for model training.
- Set up fasttext by following the instructions HERE. Download the pre-trained fasttext model file (`wiki-news-300d-1M-subword.bin`) from HERE.
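Once the pre-trained model file is downloaded, it can be loaded with the `fasttext` Python package and queried for word vectors; a minimal check (the path below is a placeholder) looks like this:

```python
import fasttext

# Load the pre-trained subword model and look up a 300-dimensional word vector.
model = fasttext.load_model("/path_to_your_fasttext_dir/wiki-news-300d-1M-subword.bin")
print(model.get_dimension())                   # 300
print(model.get_word_vector("gesture").shape)  # (300,)
```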
- Run the dataset creation script (`dataset.py`) as follows:

```
python dataset.py --base_path <BASE_PATH> --speaker <SPEAKER_NAME> --wordvec_file <W2V_FILE> --dataset_type <DATASET_TYPE> --frames <FRAMES>
```
- Options
  - `<BASE_PATH>`: Path to the dataset folder (e.g., `/path_to_your_dataset/Gestures/`)
  - `<SPEAKER_NAME>`: Speaker name (directory name of the speaker) (e.g., `almaram`, `oliver`)
  - `<W2V_FILE>`: Path to the pre-trained model of fasttext (e.g., `/path_to_your_fasttext_dir/wiki-news-300d-1M-subword.bin`)
  - `<DATASET_TYPE>`: Dataset type (`train` or `test`)
  - `<FRAMES>`: Number of frames per sample (we used 64 for training data and 192 for test data)
- Example (Create Oliver's data); see also the batch sketch after these commands.
```
# training data
python dataset.py --base_path <BASE_PATH> --speaker oliver --wordvec_file <W2V_FILE> --dataset_type train --frames 64
# test data
python dataset.py --base_path <BASE_PATH> --speaker oliver --wordvec_file <W2V_FILE> --dataset_type test --frames 192
```
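To create training and test data for several speakers in one go, the two commands above can be wrapped in a small driver script (a minimal sketch; the speaker list and both paths are placeholders, and `dataset.py` is assumed to be run from the repository root):

```python
import subprocess

BASE_PATH = "/path_to_your_dataset/Gestures/"                          # placeholder
W2V_FILE = "/path_to_your_fasttext_dir/wiki-news-300d-1M-subword.bin"  # placeholder

# Create training data (64 frames) and test data (192 frames) for each speaker.
for speaker in ["oliver", "almaram"]:
    for dataset_type, frames in [("train", 64), ("test", 192)]:
        subprocess.run(
            ["python", "dataset.py",
             "--base_path", BASE_PATH,
             "--speaker", speaker,
             "--wordvec_file", W2V_FILE,
             "--dataset_type", dataset_type,
             "--frames", str(frames)],
            check=True,
        )
```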
After running the script, directories containing the training or test data are created in your dataset folder. After this step, your dataset folder should look like this (see also the sanity-check sketch after the tree):
```
Gestures
├── frames_df_10_19_19.csv
├── almaram
│   ├── frames
│   ├── keypoints_all
│   ├── keypoints_simple
│   ├── test-192
│   ├── train-64
│   ├── videos
│   └── words
...
└── shelly
    ├── frames
    ├── keypoints_all
    ├── keypoints_simple
    ├── test-192
    ├── train-64
    ├── videos
    └── words
```
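To confirm that the data were actually created, a quick count of the entries in the new directories can help (a sketch; the base path is a placeholder and the internal file layout of `train-64` / `test-192` is not documented here):

```python
from pathlib import Path

base = Path("/path_to_your_dataset/Gestures")  # placeholder
for split in ["train-64", "test-192"]:
    split_dir = base / "oliver" / split
    if split_dir.exists():
        print(split, "->", sum(1 for _ in split_dir.rglob("*")), "entries")
    else:
        print(split, "-> not found; run dataset.py for this split first")
```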
- Run the training script (`train.py`) as follows:

```
python train.py --outdir_path <OUT_DIR> --speaker <SPEAKER_NAME> --gpu_num <GPU> --base_path <BASE_PATH> --train_dir <TRAIN_DIR>
```
- Options
  - `<OUT_DIR>`: Directory for saving the training results (e.g., `./out_training/`)
  - `<SPEAKER_NAME>`: Speaker name (directory name of the speaker) (e.g., `almaram`, `oliver`)
  - `<GPU>`: GPU ID
  - `<BASE_PATH>`: Path to the dataset folder (e.g., `/path_to_your_dataset/Gestures/`)
  - `<TRAIN_DIR>`: Directory name containing the training data (e.g., `train-64`)
The experimental settings (e.g., number of epochs, loss function) can be changed via command-line arguments. Please see the script file `train.py` for details.
- Example (Training using Oliver's data)
```
python train.py --outdir_path ./out_training/ --speaker oliver --gpu_num 0 --base_path <BASE_PATH> --train_dir train-64
```
The resulting files will be created in `./out_training/oliver_YYYYMMDD-AAAAAA/`.
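Because the output directory name is date-stamped, a small helper can list the runs and pick the most recent one, which is handy when filling in `<MODEL_DIR>` for testing later (a sketch; it assumes the `YYYYMMDD-...` suffix sorts chronologically):

```python
from pathlib import Path

# List the training runs for a speaker and pick the most recent one by name.
runs = sorted(Path("./out_training").glob("oliver_*"))
for run in runs:
    print(run.name)
if runs:
    print("latest:", runs[-1].name)  # candidate value for --model_dir in test.py
```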
- Predict the gesture motion for test data using a trained model
- Run the test script (`test.py`) as follows:

```
python test.py --base_path <BASE_PATH> --test_speaker <TEST_SPEAKER> --test_dir <TEST_DIR> --model_dir <MODEL_DIR> --model_path <MODEL_PATH> --outdir_path <OUT_DIR>
```
- Options
  - `<BASE_PATH>`: Path to the dataset folder (e.g., `/path_to_your_dataset/Gestures/`)
  - `<TEST_SPEAKER>`: Speaker name for testing (directory name of the test speaker) (e.g., `almaram`, `oliver`)
  - `<TEST_DIR>`: Directory name containing the test data (e.g., `test-192`)
  - `<MODEL_DIR>`: Directory name of the trained model (e.g., `oliver_YYYYMMDD-AAAAAA`)
  - `<MODEL_PATH>`: Path to the training results (e.g., `./out_training/`)
  - `<OUT_DIR>`: Directory for saving the test results (e.g., `./out_test/`)
- Example (Predict Oliver's test data using Oliver's trained model)
```
python test.py --base_path <BASE_PATH> --test_speaker oliver --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/
```
The resulting files (`.npy` files for the predicted motion) are created in `./out_test/oliver_by_oliver_YYYYMMDD-AAAAAA_test-192/`.
- Example (Predict Rock's test data using Oliver's trained model)
```
python test.py --base_path <BASE_PATH> --test_speaker rock --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/
```
The resulting files (`.npy` files for the predicted motion) will be created in `./out_test/rock_by_oliver_YYYYMMDD-AAAAAA_test-192/`.
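The predicted motions can be inspected directly with NumPy (a sketch; the individual file names and the exact array layout inside the output directories are not documented here):

```python
import glob

import numpy as np

# Load each predicted-motion file and print its shape and dtype.
for path in sorted(glob.glob("./out_test/*_test-192/*.npy")):
    motion = np.load(path)
    print(path, motion.shape, motion.dtype)
```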
- Create gesture movie files
- Run the video creation script (`make_gesture_video.py`) as follows:

```
python make_gesture_video.py --base_path <BASE_PATH> --test_out_path <TEST_OUT_PATH> --test_out_dir <TEST_OUT_DIR> --video_out_path <VIDEO_OUT_PATH>
```
- Options
  - `<BASE_PATH>`: Path to the dataset folder (e.g., `/path_to_your_dataset/Gestures/`)
  - `<TEST_OUT_PATH>`: Directory path of the test output (e.g., `./out_test/`)
  - `<TEST_OUT_DIR>`: Directory name of the output gestures (e.g., `oliver_by_oliver_YYYYMMDD-AAAAAA_test-192`)
  - `<VIDEO_OUT_PATH>`: Directory path of the output videos (e.g., `./out_video/`)
- Example
```
python make_gesture_video.py --base_path <BASE_PATH> --test_out_path ./out_test/ --test_out_dir oliver_by_oliver_YYYYMMDD-AAAAAA_test-192 --video_out_path ./out_video/
```
The gesture videos (side-by-side videos of the ground truth and the text-to-gesture result) will be created in `./out_video/`. The left-side gesture is the ground truth, and the right-side gesture is the one generated by the text-to-gesture generation model. The original videos of the test intervals will also be created in `./out_video/original/oliver/`.
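The generated videos can be checked quickly with `cv2`, which is already listed in the requirements (a sketch; the `.mp4` extension of the output files is an assumption):

```python
import glob

import cv2

# Print basic properties (frame count, fps, resolution) of each generated video.
for path in sorted(glob.glob("./out_video/*.mp4")):
    cap = cv2.VideoCapture(path)
    frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    print(f"{path}: {frames} frames, {fps:.1f} fps, {width}x{height}")
    cap.release()
```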
If you want to use the transformer model, please use `./transformer/dataset.py`, `train_transformer.py`, and `test_transformer.py` instead of `./dataset.py`, `train.py`, and `test.py`.
The training dataset creation code `./transformer/dataset.py` creates data containing both text and audio information for model training. The created files are saved in `train-64-text-audio` instead of `train-64`; this directory should be used for transformer model training.
- Example (Create Oliver's data)
```
python dataset.py --base_path <BASE_PATH> --speaker oliver --wordvec_file <W2V_FILE> --frames 64
```
- Example (Training using Oliver's data with each modality)
```
# Text2Gesture
python train_transformer.py --outdir_path ./out_training/ --speaker oliver --gpu_num 0 --base_path <BASE_PATH> --train_dir train-64-text-audio --modality text
# Speech2Gesture
python train_transformer.py --outdir_path ./out_training/ --speaker oliver --gpu_num 0 --base_path <BASE_PATH> --train_dir train-64-text-audio --modality audio
```
- Example (Predict Oliver's test data using Oliver's trained model)
```
# Text2Gesture
python test_transformer.py --modality text --base_path <BASE_PATH> --test_speaker oliver --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/
# Speech2Gesture
python test_transformer.py --modality audio --base_path <BASE_PATH> --test_speaker oliver --test_dir test-192 --model_dir oliver_YYYYMMDD-AAAAAA --model_path ./out_training/ --outdir_path ./out_test/
```