This repository serves as the official codebase for the "PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies" paper, published in the MDPI Applied Sciences special issue on NLP and applications.
PrivacyGLUE is the first comprehensive privacy-oriented NLP benchmark comprising 7 relevant and high-quality privacy tasks for measuring general language understanding in the privacy language domain. We release performances from the BERT, RoBERTa, Legal-BERT, Legal-RoBERTa and PrivBERT pretrained language models and perform model-pair agreement analysis to detect examples where models benefited from domain specialization. Our findings show that PrivBERT, the only model pretrained on privacy policies, outperforms other models by an average of 2–3% over all PrivacyGLUE tasks, shedding light on the importance of in-domain pretraining for privacy policies.
Note that a previous version of this paper was submitted to the ACL Rolling Review (ARR) on 16th December 2022 before resubmission to the MDPI Applied Sciences special issue on NLP and applications on 3rd February 2023.
Task | Study | Type | Train/Dev/Test Instances | Classes |
---|---|---|---|---|
OPP-115 | Wilson et al. (2016)* | Multi-label sequence classification | 2,185/550/697 | 12 |
PI-Extract | Bui et al. (2021) | Multi-task token classification | 2,579/456/1,029 | 3/3/3/3** |
Policy-Detection | Amos et al. (2021) | Binary sequence classification | 773/137/391 | 2 |
PolicyIE-A | Ahmad et al. (2021) | Multi-class sequence classification | 4,109/100/1,041 | 5 |
PolicyIE-B | Ahmad et al. (2021) | Multi-task token classification | 4,109/100/1,041 | 29/9** |
PolicyQA | Ahmad et al. (2020) | Reading comprehension | 17,056/3,809/4,152 | -- |
PrivacyQA | Ravichander et al. (2019) | Binary sequence classification | 157,420/27,780/62,150 | 2 |
*Data splits were not defined in Wilson et al. (2016) and were instead taken from Mousavi et al. (2020)
**PI-Extract and PolicyIE-B consist of four and two subtasks respectively, and the number of BIO token classes per subtask is separated by a forward-slash character.
Our current leaderboard consists of the BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), Legal-BERT (Chalkidis et al., 2020), Legal-RoBERTa (Geng et al., 2021) and PrivBERT (Srinath et al., 2021) models.
Task | Metric* | BERT | RoBERTa | Legal-BERT | Legal-RoBERTa | PrivBERT |
---|---|---|---|---|---|---|
OPP-115 | m-F1 | 78.4±0.6 | 79.5±1.1 | 79.6±1.0 | 79.1±0.7 | 82.1±0.5 |
 | µ-F1 | 84.0±0.5 | 85.4±0.5 | 84.3±0.7 | 84.7±0.3 | 87.2±0.4 |
PI-Extract | m-F1 | 60.0±2.7 | 62.4±4.4 | 59.5±3.0 | 60.5±3.9 | 66.4±3.4 |
 | µ-F1 | 60.0±2.7 | 62.4±4.4 | 59.5±3.0 | 60.5±3.9 | 66.4±3.4 |
Policy-Detection | m-F1 | 85.3±1.8 | 86.9±1.3 | 86.6±1.0 | 86.4±2.0 | 87.3±1.1 |
 | µ-F1 | 92.1±1.2 | 92.7±0.8 | 92.7±0.5 | 92.4±1.3 | 92.9±0.8 |
PolicyIE-A | m-F1 | 72.9±1.7 | 73.2±1.6 | 73.2±1.5 | 73.5±1.5 | 75.3±2.2 |
 | µ-F1 | 84.7±1.0 | 84.8±0.6 | 84.7±0.5 | 84.8±0.3 | 86.2±1.0 |
PolicyIE-B | m-F1 | 50.3±0.7 | 52.8±0.6 | 51.5±0.7 | 53.5±0.5 | 55.4±0.7 |
 | µ-F1 | 50.3±0.5 | 54.5±0.7 | 52.2±1.0 | 53.6±0.9 | 55.7±1.3 |
PolicyQA | s-F1 | 55.7±0.5 | 57.4±0.4 | 55.3±0.7 | 56.3±0.6 | 59.3±0.5 |
 | EM | 28.0±0.9 | 30.0±0.5 | 27.5±0.6 | 28.6±0.9 | 31.4±0.6 |
PrivacyQA | m-F1 | 53.6±0.8 | 54.4±0.3 | 53.6±0.8 | 54.4±0.5 | 55.3±0.6 |
 | µ-F1 | 90.0±0.1 | 90.2±0.0 | 90.1±0.1 | 90.2±0.1 | 90.2±0.1 |
*m-F1, µ-F1, s-F1 and EM refer to the Macro-F1, Micro-F1, Sample-F1 and Exact Match metrics respectively
- This repository was tested against Python version `3.8.13` and CUDA version `11.7`. Create a virtual environment with the same Python version and install dependencies with `poetry` (a combined setup sketch follows this list):

      $ poetry install

  Alternatively, install dependencies in the virtual environment using `pip`:

      $ pip install -r requirements.txt

- Install Git LFS to access upstream task data. We utilized version `3.2.0` in our implementation.

- Optional: To further develop this repository, install `pre-commit` to set up pre-commit hooks for code-checks.

- To prepare git submodules and data, execute:

      $ bash scripts/prepare.sh

- Optional: To install pre-commit hooks for further development of this repository, execute:

      $ pre-commit install
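Taken together, a fresh setup typically chains the steps above as sketched below. This is only an illustrative sequence, not a script shipped with this repository; it assumes `poetry` and the Git LFS binary are already installed on the system, and uses `git lfs install` as the usual way of enabling LFS for the current user.

    # install project dependencies into a poetry-managed virtual environment
    $ poetry install

    # enable Git LFS so that upstream task data can be pulled (assumed step)
    $ git lfs install

    # prepare git submodules and task data
    $ bash scripts/prepare.sh

    # optional: enable pre-commit hooks for development
    $ pre-commit install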
We use the `run_privacy_glue.sh` script to run PrivacyGLUE benchmark experiments:
    usage: run_privacy_glue.sh [option...]

    optional arguments:
      --cuda_visible_devices <str>
                              comma separated string of integers passed
                              directly to the "CUDA_VISIBLE_DEVICES"
                              environment variable
                              (default: 0)
      --fp16                  enable 16-bit mixed precision computation
                              through NVIDIA Apex for training
                              (default: False)
      --model_name_or_path <str>
                              model to be used for fine-tuning. Currently only
                              the following are supported:
                              "bert-base-uncased",
                              "roberta-base",
                              "nlpaueb/legal-bert-base-uncased",
                              "saibo/legal-roberta-base",
                              "mukund/privbert"
                              (default: bert-base-uncased)
      --no_cuda               disable CUDA even when available (default: False)
      --overwrite_cache       overwrite caches used in preprocessing
                              (default: False)
      --overwrite_output_dir  overwrite run directories and saved checkpoint(s)
                              (default: False)
      --preprocessing_num_workers <int>
                              number of workers to be used for preprocessing
                              (default: None)
      --task <str>            task to be worked on. The following values are
                              accepted: "opp_115", "piextract",
                              "policy_detection", "policy_ie_a", "policy_ie_b",
                              "policy_qa", "privacy_qa", "all"
                              (default: all)
      --wandb                 log metrics and results to wandb
                              (default: False)
      -h, --help              show this help message and exit
To run the PrivacyGLUE benchmark for a supported model against all tasks, execute:

    $ bash scripts/run_privacy_glue.sh --cuda_visible_devices <device_id> \
                                       --model_name_or_path <model> \
                                       --fp16

Note: Replace the `<device_id>` argument with a GPU ID or comma-separated GPU IDs to run single-GPU or multi-GPU fine-tuning respectively. Correspondingly, replace the `<model>` argument with one of our supported models listed in the usage documentation above.
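As a concrete illustration, the placeholders can be filled in as follows; the chosen GPU ID, model and task are examples only, and any supported value from the usage text above can be substituted:

    # fine-tune PrivBERT on all PrivacyGLUE tasks using GPU 0 with mixed precision
    $ bash scripts/run_privacy_glue.sh --cuda_visible_devices 0 \
                                       --model_name_or_path mukund/privbert \
                                       --fp16

    # fine-tune RoBERTa on the OPP-115 task only
    $ bash scripts/run_privacy_glue.sh --cuda_visible_devices 0 \
                                       --model_name_or_path roberta-base \
                                       --task opp_115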
We utilize the following `ipynb` notebooks for analyses outside of the PrivacyGLUE benchmark:
Notebook | Description |
---|---|
visualize_domain_embeddings.ipynb | Compute and visualize BERT embeddings for Wikipedia, EURLEX and Privacy Policies using t-SNE and UMAP |
visualize_results.ipynb | Plot benchmark results and perform significance testing |
inspect_predictions.ipynb | Inspect test-set predictions for model-pair agreement analysis |
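The notebooks can be opened with any Jupyter front-end. As a minimal sketch, assuming JupyterLab has been installed into the project environment (it is not listed as an explicit dependency above), they can be launched from the repository root with:

    # hypothetical: open a Jupyter front-end inside the poetry environment
    $ poetry run jupyter lab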
- To run unit tests, execute:

      $ make test

- To run integration tests, execute:

      $ CUDA_VISIBLE_DEVICES=<device_id> make integration

  Note: Replace the `<device_id>` argument with a GPU ID or comma-separated GPU IDs to run single-GPU or multi-GPU integration tests respectively. Alternatively, pass an empty string to run CPU integration tests (concrete examples follow this list).
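To make the note above concrete, the placeholder can be filled in as follows (GPU ID 0 is purely illustrative):

    # run integration tests on a single GPU (GPU 0)
    $ CUDA_VISIBLE_DEVICES=0 make integration

    # run integration tests on CPU only by passing an empty string
    $ CUDA_VISIBLE_DEVICES="" make integration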
If you found PrivacyGLUE useful, we kindly ask you to cite our paper as follows:
    @Article{app13063701,
      AUTHOR = {Shankar, Atreya and Waldis, Andreas and Bless, Christof and
                Andueza Rodriguez, Maria and Mazzola, Luca},
      TITLE = {PrivacyGLUE: A Benchmark Dataset for General Language
               Understanding in Privacy Policies},
      JOURNAL = {Applied Sciences},
      VOLUME = {13},
      YEAR = {2023},
      NUMBER = {6},
      ARTICLE-NUMBER = {3701},
      URL = {https://www.mdpi.com/2076-3417/13/6/3701},
      ISSN = {2076-3417},
      ABSTRACT = {Benchmarks for general language understanding have been
                  rapidly developing in recent years of NLP research,
                  particularly because of their utility in choosing
                  strong-performing models for practical downstream
                  applications. While benchmarks have been proposed in the legal
                  language domain, virtually no such benchmarks exist for
                  privacy policies despite their increasing importance in modern
                  digital life. This could be explained by privacy policies
                  falling under the legal language domain, but we find evidence
                  to the contrary that motivates a separate benchmark for
                  privacy policies. Consequently, we propose PrivacyGLUE as the
                  first comprehensive benchmark of relevant and high-quality
                  privacy tasks for measuring general language understanding in
                  the privacy language domain. Furthermore, we release
                  performances from multiple transformer language models and
                  perform model-pair agreement analysis to detect tasks
                  where models benefited from domain specialization. Our
                  findings show the importance of in-domain pretraining for
                  privacy policies. We believe PrivacyGLUE can accelerate NLP
                  research and improve general language understanding for humans
                  and AI algorithms in the privacy language domain, thus
                  supporting the adoption and acceptance rates of solutions
                  based on it.},
      DOI = {10.3390/app13063701}
    }