IsoKernel - IsoKernel is a Python library for Isolation Kernel. It includes the following kernel methods, which use the isolation mechanism:
- Isolation Kernel (IsoKernel)
- Isolation Distribution Kernel (IsoDisKernel)
PyPI install, presuming you have an up-to-date pip.
For a manual install of the latest code directly from GitHub:
pip install git+https://github.com/xhan97/IsoKernel.git
Alternatively, download the package, install the requirements, and manually run the installer:
wget -O IsoKernel-master.zip https://codeload.github.com/xhan97/IsoKernel/zip/refs/heads/master
unzip IsoKernel-master.zip
rm IsoKernel-master.zip
cd IsoKernel-master
pip install -r requirements.txt
python setup.py install
The IsoKernel package inherits from sklearn classes, and thus drops in neatly
next to other sklearn estimators with an identical calling API. Similarly, it
supports input in a variety of formats: an array (or pandas DataFrame) of shape (num_samples, num_features).
from IsoKernel import IsoKernel
from sklearn.datasets import make_blobs
data, _ = make_blobs(1000)
ik = IsoKernel(n_estimators=200, max_samples=16, method="anne") # method can be "anne" or "inne"
ik = ik.fit(data)
# get Isolation Kernel feature for all points in data.
ik.transform(data)
# get pairwise Isolation Kernel similarity for all points in data.
ik.similarity(data)
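To make the idea behind the calls above concrete, here is a minimal NumPy sketch of an isolation-style kernel. It assumes that the "anne" method refers to nearest-neighbour (Voronoi) partitioning: each estimator draws `max_samples` points from the data, every point is assigned to the cell of its nearest drawn point, and the similarity of two points is the fraction of estimators in which they share a cell. The function names `fit_partitions`, `cell_ids` and `similarity` are hypothetical and are not part of the IsoKernel API; the library's actual implementation may differ.

```python
import numpy as np


def fit_partitions(data, n_estimators=200, max_samples=16, seed=0):
    """Draw n_estimators random subsets of max_samples points each.

    Each subset defines one Voronoi partitioning: a point belongs to the
    cell of its nearest subset member. (Illustrative helper, not library API.)
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    # One row of indices per estimator, sampled without replacement.
    idx = np.stack([rng.choice(n, size=max_samples, replace=False)
                    for _ in range(n_estimators)])
    return data[idx]  # shape (n_estimators, max_samples, n_features)


def cell_ids(centers, X):
    """Return, for every point, the index of its nearest center per estimator."""
    # Squared distances, shape (n_estimators, n_points, max_samples).
    diff = X[None, :, None, :] - centers[:, None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    return d2.argmin(axis=-1).T  # shape (n_points, n_estimators)


def similarity(ids_a, ids_b):
    """Fraction of estimators in which two points fall into the same cell."""
    return (ids_a[:, None, :] == ids_b[None, :, :]).mean(axis=-1)


# Small demo on random 2-D data (sizes kept small to limit memory use).
rng = np.random.default_rng(42)
data = rng.normal(size=(200, 2))
centers = fit_partitions(data, n_estimators=100, max_samples=16)
ids = cell_ids(centers, data)
sim = similarity(ids, ids)  # pairwise similarity matrix, shape (200, 200)
```

In this sketch every point has similarity 1 with itself, and the matrix is symmetric with values between 0 and 1, which is what makes it usable as a precomputed kernel.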
Isolation Distributional Kernel is a new way to measure the similarity between two distributions. It addresses two key issues of kernel mean embedding, where the kernel employed has:
- a feature map with intractable dimensionality, which leads to high computational cost;
- data independence, which leads to poor accuracy.
from IsoKernel import IsoDisKernel
from sklearn.datasets import make_blobs
data, _ = make_blobs(1000)
idk = IsoDisKernel(n_estimators=200, max_samples=16, method="anne") # method can be "anne" or "inne"
idk = idk.fit(data)
D_i = data[:10]
D_j = data[-10:]
# Directly get the similarity between two distributions
# is_normalize: whether to return the normalized similarity matrix, with values in the range [0, 1]. Default: True
sim = idk.similarity(D_i, D_j, is_normalize=True)
# get ik feature of two distributions
ikm_D_i, ikm_D_j = idk.transform(D_i, D_j)
# get kernel mean embedding
kme_D_i = idk.kernel_mean_embedding(ikm_D_i)
kme_D_j = idk.kernel_mean_embedding(ikm_D_j)
# get similarity between two distributions.
sim = idk.kme_similarity(kme_D_i, kme_D_j, is_normalize=True)
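The two-step route above (point features, then kernel mean embedding, then similarity) can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not the library's code: it assumes a nearest-neighbour partitioning as the isolation mechanism, a one-hot feature per partitioning, an embedding that is the mean of the point features, and a normalized similarity computed as the cosine of the two embeddings. The helper `one_hot_features` is hypothetical, and the functions named `kernel_mean_embedding` and `kme_similarity` here are standalone stand-ins that only mirror the method names used above.

```python
import numpy as np


def one_hot_features(centers, X):
    """Map each point to a binary vector with one active cell per partitioning."""
    t, psi, _ = centers.shape
    diff = X[None, :, None, :] - centers[:, None, :, :]
    nearest = (diff ** 2).sum(axis=-1).argmin(axis=-1).T  # (n_points, t)
    feats = np.zeros((X.shape[0], t * psi))
    cols = nearest + np.arange(t) * psi  # offset into each partitioning's block
    feats[np.arange(X.shape[0])[:, None], cols] = 1.0
    return feats


def kernel_mean_embedding(feats):
    """Average the point features of one sample set."""
    return feats.mean(axis=0)


def kme_similarity(kme_a, kme_b, t, normalize=True):
    """Dot product of two embeddings; cosine-normalized to [0, 1] if requested."""
    dot = kme_a @ kme_b
    if normalize:
        return float(dot / (np.linalg.norm(kme_a) * np.linalg.norm(kme_b)))
    return float(dot / t)


# Two well-separated Gaussian blobs, 100 points each.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(100, 2)),
                  rng.normal(6, 1, size=(100, 2))])
t, psi = 100, 16  # number of partitionings and points drawn per partitioning
centers = data[np.stack([rng.choice(len(data), psi, replace=False)
                         for _ in range(t)])]

D_i, D_j = data[:10], data[-10:]  # one sample set from each blob
kme_i = kernel_mean_embedding(one_hot_features(centers, D_i))
kme_j = kernel_mean_embedding(one_hot_features(centers, D_j))
sim_ij = kme_similarity(kme_i, kme_j, t)  # similarity across blobs
sim_ii = kme_similarity(kme_i, kme_i, t)  # similarity of a set with itself
```

Because the feature map has a fixed, finite size (`t * psi`) and the partitionings are drawn from the data, the sketch reflects the two points listed above: the embedding is cheap to compute, and the resulting similarity adapts to the data distribution.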
The package tests can be run after installation with pytest. If pytest is not installed, install it first:
pip install pytest
Then run the tests:
pytest IsoKernel/tests
If one or more of the tests fail, please report a bug at https://github.com/xhan97/IsoKernel/issues
Python 3 is recommended if it is available to you.
If you have used this codebase in a scientific publication and wish to cite it, please use the following publications (BibTeX format):
@inproceedings{10.1145/3219819.3219990,
author = {Ting, Kai Ming and Zhu, Yue and Zhou, Zhi-Hua},
title = {Isolation Kernel and Its Effect on SVM},
year = {2018},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
pages = {2329--2337},
numpages = {9},
location = {London, United Kingdom},
series = {KDD '18}
}
@inproceedings{ting2020Isolation,
author = {Ting, Kai Ming and Xu, Bi-Cun and Washio, Takashi and Zhou, Zhi-Hua},
title = {Isolation Distributional Kernel: A New Tool for Kernel Based Anomaly Detection},
year = {2020},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
doi = {10.1145/3394486.3403062},
pages = {198--206},
numpages = {9},
series = {KDD '20}
}
@inproceedings{HZTZL22Streaming,
author = {Han, Xin and Zhu, Ye and Ting, Kai Ming and Zhan, De-Chuan and Li, Gang},
title = {Streaming Hierarchical Clustering Based on Point-Set Kernel},
year = {2022},
isbn = {9781450393850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3534678.3539323},
doi = {10.1145/3534678.3539323},
pages = {525--533},
numpages = {9},
keywords = {streaming data, hierarchical clustering, isolation kernel},
location = {Washington DC, USA},
series = {KDD '22}
}
Apache license