Statistics of the available datasets (#Con. is the number of contextual outliers and #Strct. is the number of structural outliers. #Outliers can be slightly less than #Con. + #Strct. because some nodes are both contextual and structural outliers. A dash marks datasets for which no breakdown by type is given.):
Dataset | Type | #Nodes | #Edges | #Feat | Avg. Degree | #Con. | #Strct. | #Outliers | Outlier Ratio |
---|---|---|---|---|---|---|---|---|---|
'weibo' | organic | 8,405 | 407,963 | 400 | 48.5 | - | - | 868 | 10.3% |
'reddit' | organic | 10,984 | 168,016 | 64 | 15.3 | - | - | 366 | 3.3% |
'disney' | organic | 124 | 335 | 28 | 2.7 | - | - | 6 | 4.8% |
'books' | organic | 1,418 | 3,695 | 21 | 2.6 | - | - | 28 | 2.0% |
'enron' | organic | 13,533 | 176,987 | 18 | 13.1 | - | - | 5 | 0.04% |
'inj_cora' | injected | 2,708 | 11,060 | 1,433 | 4.1 | 70 | 70 | 138 | 5.1% |
'inj_amazon' | injected | 13,752 | 515,042 | 767 | 37.2 | 350 | 350 | 694 | 5.0% |
'inj_flickr' | injected | 89,250 | 933,804 | 500 | 10.5 | 2,240 | 2,240 | 4,414 | 4.9% |
'gen_time' | generated | 1,000 | 5,746 | 64 | 5.7 | 100 | 100 | 189 | 18.9% |
'gen_100' | generated | 100 | 618 | 64 | 6.2 | 10 | 10 | 18 | 18.0% |
'gen_500' | generated | 500 | 2,662 | 64 | 5.3 | 10 | 10 | 20 | 4.0% |
'gen_1000' | generated | 1,000 | 4,936 | 64 | 4.9 | 10 | 10 | 20 | 2.0% |
'gen_5000' | generated | 5,000 | 24,938 | 64 | 5.0 | 10 | 10 | 20 | 0.4% |
'gen_10000' | generated | 10,000 | 49,614 | 64 | 5.0 | 10 | 10 | 20 | 0.2% |
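The overlap between the two outlier types can be recovered from the table by inclusion-exclusion: overlap = #Con. + #Strct. - #Outliers. The following is a minimal sketch of that arithmetic, not part of PyGOD. The `counts` dictionary and the `overlap` helper are illustrative names, and the numbers are copied from a few rows of the table.

```python
# (#Con., #Strct., #Outliers) copied from the table; names are illustrative only.
counts = {
    "inj_cora": (70, 70, 138),
    "inj_amazon": (350, 350, 694),
    "inj_flickr": (2240, 2240, 4414),
    "gen_time": (100, 100, 189),
    "gen_100": (10, 10, 18),
    "gen_500": (10, 10, 20),
}


def overlap(n_con: int, n_strct: int, n_outliers: int) -> int:
    """Number of nodes counted as both contextual and structural outliers."""
    return n_con + n_strct - n_outliers


# Compute the overlap for every listed dataset.
overlaps = {name: overlap(*row) for name, row in counts.items()}

for name, value in overlaps.items():
    print(f"{name}: {value} node(s) are both contextual and structural outliers")
```

A result of zero, as for 'gen_500', means the two outlier sets are disjoint in that dataset.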
To use the datasets:
from pygod.utils import load_data
data = load_data('weibo') # in PyG format
Alternative download source on Baidu Disk (Chinese): https://pan.baidu.com/s/1afEZaygCRUYWJPtVbzuRYw Access Code: bond
For injected/generated datasets, the label meanings are as follows:
- 0: inlier
- 1: contextual outlier only
- 2: structural outlier only
- 3: both contextual outlier and structural outlier
The labels can be converted as follows:
y = data.y.bool() # binary labels (inlier/outlier)
yc = (data.y >> 0) & 1 # contextual outliers (bit 0)
ys = (data.y >> 1) & 1 # structural outliers (bit 1)
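The label is effectively a 2-bit mask: bit 0 flags a contextual outlier and bit 1 flags a structural outlier. The same logic can be checked without PyTorch using plain integers. The sketch below assumes only the label scheme listed above, and `decode_label` is a hypothetical helper, not a PyGOD function.

```python
def decode_label(label: int) -> dict:
    """Split a label in {0, 1, 2, 3} into its outlier flags (hypothetical helper)."""
    if label not in (0, 1, 2, 3):
        raise ValueError(f"unexpected label: {label}")
    return {
        "outlier": bool(label),                 # any non-zero label is an outlier
        "contextual": bool((label >> 0) & 1),   # bit 0
        "structural": bool((label >> 1) & 1),   # bit 1
    }


# Show the decoding for every possible label value.
for label in range(4):
    print(label, decode_label(label))
```

In Python, `>>` binds more tightly than `&`, so `data.y >> 1 & 1` and `(data.y >> 1) & 1` evaluate identically; the parentheses only make the intent explicit.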