Data Repository for PyGOD

Statistics of the available datasets are listed below. #Con. is the number of contextual outliers and #Strct. is the number of structural outliers. For injected and generated datasets, #Outliers is slightly less than #Con. + #Strct. because some nodes are both contextual and structural outliers and are counted only once.

| Dataset | Type | #Nodes | #Edges | #Feat | Avg. Degree | #Con. | #Strct. | #Outliers | Outlier Ratio |
|---|---|---|---|---|---|---|---|---|---|
| 'weibo' | organic | 8,405 | 407,963 | 400 | 48.5 | - | - | 868 | 10.3% |
| 'reddit' | organic | 10,984 | 168,016 | 64 | 15.3 | - | - | 366 | 3.3% |
| 'disney' | organic | 124 | 335 | 28 | 2.7 | - | - | 6 | 4.8% |
| 'books' | organic | 1,418 | 3,695 | 21 | 2.6 | - | - | 28 | 2.0% |
| 'enron' | organic | 13,533 | 176,987 | 18 | 13.1 | - | - | 5 | 0.04% |
| 'inj_cora' | injected | 2,708 | 11,060 | 1,433 | 4.1 | 70 | 70 | 138 | 5.1% |
| 'inj_amazon' | injected | 13,752 | 515,042 | 767 | 37.2 | 350 | 350 | 694 | 5.0% |
| 'inj_flickr' | injected | 89,250 | 933,804 | 500 | 10.5 | 2,240 | 2,240 | 4,414 | 4.9% |
| 'gen_time' | generated | 1,000 | 5,746 | 64 | 5.7 | 100 | 100 | 189 | 18.9% |
| 'gen_100' | generated | 100 | 618 | 64 | 6.2 | 10 | 10 | 18 | 18.0% |
| 'gen_500' | generated | 500 | 2,662 | 64 | 5.3 | 10 | 10 | 20 | 4.0% |
| 'gen_1000' | generated | 1,000 | 4,936 | 64 | 4.9 | 10 | 10 | 20 | 2.0% |
| 'gen_5000' | generated | 5,000 | 24,938 | 64 | 5.0 | 10 | 10 | 20 | 0.4% |
| 'gen_10000' | generated | 10,000 | 49,614 | 64 | 5.0 | 10 | 10 | 20 | 0.2% |
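
The gap between #Con. + #Strct. and #Outliers follows from inclusion-exclusion: the difference is the number of nodes that carry both outlier types. The sketch below works this out for a few rows, with the figures copied by hand from the table. The `stats` dictionary and the `overlap` helper are illustrative names and are not part of PyGOD.

```python
# (#Con., #Strct., #Outliers) copied by hand from the statistics table.
stats = {
    "inj_cora": (70, 70, 138),
    "inj_amazon": (350, 350, 694),
    "inj_flickr": (2240, 2240, 4414),
    "gen_time": (100, 100, 189),
}


def overlap(n_contextual, n_structural, n_outliers):
    """Number of nodes that are both contextual and structural outliers."""
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
    return n_contextual + n_structural - n_outliers


both = {name: overlap(*row) for name, row in stats.items()}

for name, count in both.items():
    print(f"{name}: {count} nodes are outliers of both types")
```

For 'inj_cora', for instance, 70 + 70 - 138 leaves 2 nodes that are outliers of both types.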

To use the datasets:

from pygod.utils import load_data
data = load_data('weibo') # in PyG format

An alternative download source is available on Baidu Disk (Chinese): https://pan.baidu.com/s/1afEZaygCRUYWJPtVbzuRYw (access code: bond)

For injected and generated datasets, the label values have the following meanings:

  • 0: inlier
  • 1: contextual outlier only
  • 2: structural outlier only
  • 3: both contextual outlier and structural outlier

The labels can be converted as follows:

y = data.y.bool()       # binary labels: True for any outlier, False for inliers
yc = (data.y >> 0) & 1  # bit 0: contextual outliers
ys = (data.y >> 1) & 1  # bit 1: structural outliers
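
The conversions work because the label is a two-bit field: bit 0 marks a contextual outlier and bit 1 marks a structural outlier. As a minimal sketch, the same arithmetic is applied here to plain Python integers rather than to the `data.y` tensor, so it runs without PyTorch. The `decode` helper is a hypothetical name used only for illustration.

```python
def decode(label):
    """Split a label in {0, 1, 2, 3} into its three readings."""
    is_outlier = bool(label)          # any non-zero label is an outlier
    is_contextual = (label >> 0) & 1  # bit 0
    is_structural = (label >> 1) & 1  # bit 1
    return is_outlier, is_contextual, is_structural


# Decode every label value listed above.
decoded = {label: decode(label) for label in range(4)}

for label, (outlier, contextual, structural) in decoded.items():
    print(label, outlier, contextual, structural)
```

Label 3 sets both bits, which is why such nodes appear in both the contextual and the structural counts.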