Statistics and data splitting scheme on PPI datasets #234

sduzxj · 2023-11-09T06:17:06Z

Using the following code, I get different statistics from those in the documentation [1, 2]. How can I get the preprocessed dataset, or the code for the dataset splitting scheme used in [2]?
Code:
import torch
import torchdrug
from torchdrug import core, data, datasets, layers, models, tasks, utils
from torchdrug import transforms as T
from torchdrug.datasets import HumanPPI, YeastPPI, PPIAffinity, Fold

# Load YeastPPI lazily and take its train/valid/test splits.
dataset = YeastPPI('./dataset/PPI', lazy=True)
train_set, valid_set, test_set = dataset.split(['train', 'valid', 'test'])

print(len(train_set))
print(len(valid_set))
print(len(test_set))

Output statistics: 2421, 203, 326

[1] https://torchdrug.ai/docs/api/datasets.html
[2] https://torchprotein.ai/benchmark#leaderboard-for-yeast-ppi-prediction

mahdip72 · 2023-11-13T06:05:58Z

I have the same issue with the subcellular localization dataset. The numbers of samples in the training, validation, and test sets differ from those reported in the PEER paper. What is the problem? Is there an additional post-processing step we need to apply to the dataset?

mahdip72 · 2023-11-20T07:40:40Z

@sduzxj

I ran your code and got different numbers:
4945, 95, 394
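
For comparison, here is a minimal sketch that puts the two sets of numbers reported in this thread side by side. It only does arithmetic on the counts quoted above and does not load torchdrug; the `reported` dictionary and the `summarize` helper are illustrative names, not part of any library. The totals come out different (2950 versus 5434), so the two runs apparently loaded a different number of samples rather than just partitioning the same samples differently. The sketch says nothing about the cause.

```python
# Split sizes (train, valid, test) as reported in this thread for YeastPPI.
reported = {
    "sduzxj": (2421, 203, 326),
    "mahdip72": (4945, 95, 394),
}


def summarize(sizes):
    """Return the total sample count and each split's share of that total."""
    total = sum(sizes)
    fractions = tuple(round(n / total, 3) for n in sizes)
    return total, fractions


# Hypothetical helper usage: summarize every reported run.
summaries = {name: summarize(sizes) for name, sizes in reported.items()}

for name, (total, fractions) in summaries.items():
    print(f"{name}: total={total}, train/valid/test fractions={fractions}")
```

Running the same comparison against the counts in the documentation or the PEER paper would show whether either run matches the published split.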
