
PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies

This repository is the official codebase for the paper "PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies", published in the MDPI Applied Sciences special issue on NLP and Applications.

PrivacyGLUE is the first comprehensive privacy-oriented NLP benchmark comprising 7 relevant and high-quality privacy tasks for measuring general language understanding in the privacy language domain. We release performances from the BERT, RoBERTa, Legal-BERT, Legal-RoBERTa and PrivBERT pretrained language models and perform model-pair agreement analysis to detect examples where models benefited from domain specialization. Our findings show that PrivBERT, the only model pretrained on privacy policies, outperforms other models by an average of 2–3% over all PrivacyGLUE tasks, shedding light on the importance of in-domain pretraining for privacy policies.
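To make the model-pair agreement analysis concrete, below is a minimal sketch that partitions test examples by which of two models predicted them correctly; the file names and JSON-lines label format are hypothetical and not part of this codebase.

    # Hypothetical sketch of a model-pair agreement check: find the test
    # examples that exactly one of two models classified correctly.
    # File names and the JSON-lines format are assumptions for illustration.
    import json

    def load_labels(path):
        with open(path) as f:
            return [json.loads(line)["label"] for line in f]

    gold = load_labels("gold.jsonl")               # hypothetical paths
    preds_a = load_labels("privbert_preds.jsonl")
    preds_b = load_labels("bert_preds.jsonl")

    only_a = [i for i, (g, a, b) in enumerate(zip(gold, preds_a, preds_b))
              if a == g != b]
    only_b = [i for i, (g, a, b) in enumerate(zip(gold, preds_a, preds_b))
              if b == g != a]
    print(f"only model A correct: {len(only_a)}, only model B correct: {len(only_b)}")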

Note that a previous version of this paper was submitted to the ACL Rolling Review (ARR) on 16th December 2022 before resubmission to the MDPI Applied Sciences special issue on NLP and applications on 3rd February 2023.

Table of Contents

  1. Tasks
  2. Leaderboard
  3. Dependencies
  4. Initialization
  5. Usage
  6. Notebooks
  7. Test
  8. Citation

Tasks 🏃

| Task | Study | Type | Train/Dev/Test Instances | Classes |
|------|-------|------|--------------------------|---------|
| OPP-115 | Wilson et al. (2016)* | Multi-label sequence classification | 2,185/550/697 | 12 |
| PI-Extract | Bui et al. (2021) | Multi-task token classification | 2,579/456/1,029 | 3/3/3/3** |
| Policy-Detection | Amos et al. (2021) | Binary sequence classification | 773/137/391 | 2 |
| PolicyIE-A | Ahmad et al. (2021) | Multi-class sequence classification | 4,109/100/1,041 | 5 |
| PolicyIE-B | Ahmad et al. (2021) | Multi-task token classification | 4,109/100/1,041 | 29/9** |
| PolicyQA | Ahmad et al. (2020) | Reading comprehension | 17,056/3,809/4,152 | -- |
| PrivacyQA | Ravichander et al. (2019) | Binary sequence classification | 157,420/27,780/62,150 | 2 |

*Data splits were not defined in Wilson et al. (2016) and were instead taken from Mousavi et al. (2020).

**PI-Extract and PolicyIE-B consist of four and two subtasks, respectively; the number of BIO token classes per subtask is separated by a forward slash (see the sketch below).
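For intuition on these counts, the following sketch shows how one BIO-tagged span type yields three token classes per subtask; the subtask names are illustrative rather than the exact labels used in the codebase.

    # Why PI-Extract reports "3/3/3/3": each subtask tags a single span
    # type with the BIO scheme, i.e. the classes B-<type>, I-<type> and O.
    # Subtask names here are illustrative.
    subtasks = ["COLLECT", "NOT_COLLECT", "SHARE", "NOT_SHARE"]
    for name in subtasks:
        labels = {f"B-{name}", f"I-{name}", "O"}
        print(name, len(labels))  # -> 3 token classes per subtask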

Leaderboard 🏁

Our current leaderboard consists of the BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), Legal-BERT (Chalkidis et al., 2020), Legal-RoBERTa (Geng et al., 2021) and PrivBERT (Srinath et al., 2021) models.

| Task | Metric* | BERT | RoBERTa | Legal-BERT | Legal-RoBERTa | PrivBERT |
|------|---------|------|---------|------------|---------------|----------|
| OPP-115 | m-F1 | 78.4±0.6 | 79.5±1.1 | 79.6±1.0 | 79.1±0.7 | 82.1±0.5 |
| OPP-115 | µ-F1 | 84.0±0.5 | 85.4±0.5 | 84.3±0.7 | 84.7±0.3 | 87.2±0.4 |
| PI-Extract | m-F1 | 60.0±2.7 | 62.4±4.4 | 59.5±3.0 | 60.5±3.9 | 66.4±3.4 |
| PI-Extract | µ-F1 | 60.0±2.7 | 62.4±4.4 | 59.5±3.0 | 60.5±3.9 | 66.4±3.4 |
| Policy-Detection | m-F1 | 85.3±1.8 | 86.9±1.3 | 86.6±1.0 | 86.4±2.0 | 87.3±1.1 |
| Policy-Detection | µ-F1 | 92.1±1.2 | 92.7±0.8 | 92.7±0.5 | 92.4±1.3 | 92.9±0.8 |
| PolicyIE-A | m-F1 | 72.9±1.7 | 73.2±1.6 | 73.2±1.5 | 73.5±1.5 | 75.3±2.2 |
| PolicyIE-A | µ-F1 | 84.7±1.0 | 84.8±0.6 | 84.7±0.5 | 84.8±0.3 | 86.2±1.0 |
| PolicyIE-B | m-F1 | 50.3±0.7 | 52.8±0.6 | 51.5±0.7 | 53.5±0.5 | 55.4±0.7 |
| PolicyIE-B | µ-F1 | 50.3±0.5 | 54.5±0.7 | 52.2±1.0 | 53.6±0.9 | 55.7±1.3 |
| PolicyQA | s-F1 | 55.7±0.5 | 57.4±0.4 | 55.3±0.7 | 56.3±0.6 | 59.3±0.5 |
| PolicyQA | EM | 28.0±0.9 | 30.0±0.5 | 27.5±0.6 | 28.6±0.9 | 31.4±0.6 |
| PrivacyQA | m-F1 | 53.6±0.8 | 54.4±0.3 | 53.6±0.8 | 54.4±0.5 | 55.3±0.6 |
| PrivacyQA | µ-F1 | 90.0±0.1 | 90.2±0.0 | 90.1±0.1 | 90.2±0.1 | 90.2±0.1 |

*m-F1, µ-F1, s-F1 and EM refer to the Macro-F1, Micro-F1, Sample-F1 and Exact Match metrics, respectively.
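For a quick reference on how the averaged F1 variants differ, here is a minimal sketch using scikit-learn; the toy labels are invented purely for demonstration.

    # Toy comparison of macro- vs. micro-averaged F1 with scikit-learn;
    # the labels below are made up for demonstration only.
    from sklearn.metrics import f1_score

    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 1, 1, 2, 1, 0]

    m_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean over classes
    mu_f1 = f1_score(y_true, y_pred, average="micro")  # pooled over all instances
    print(f"m-F1={m_f1:.3f}, micro-F1={mu_f1:.3f}")
    # Sample-F1 (average="samples") applies to multi-label tasks such as
    # OPP-115, while EM in PolicyQA is an exact string match between
    # predicted and gold answer spans.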

Dependencies 🔍

  1. This repository was tested against Python version 3.8.13 and CUDA version 11.7. Create a virtual environment with the same Python version and install dependencies with poetry:

    $ poetry install
    

    Alternatively, install dependencies in the virtual environment using pip:

    $ pip install -r requirements.txt
    
  2. Install Git LFS to access upstream task data; we utilized version 3.2.0 in our implementation. A typical post-installation setup is shown below this list.

  3. Optional: To further develop this repository, install pre-commit to setup pre-commit hooks for code-checks.
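After installing the Git LFS binary for your platform, its Git hooks are typically enabled once per user via:

    $ git lfs install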

Initialization 🔥

  1. To prepare git submodules and data, execute:

    $ bash scripts/prepare.sh
    
  2. Optional: To install pre-commit hooks for further development of this repository, execute:

    $ pre-commit install
    

Usage ❄️

We use the run_privacy_glue.sh script to run PrivacyGLUE benchmark experiments:

usage: run_privacy_glue.sh [option...]

optional arguments:
  --cuda_visible_devices       <str>
                               comma separated string of integers passed
                               directly to the "CUDA_VISIBLE_DEVICES"
                               environment variable
                               (default: 0)

  --fp16                       enable 16-bit mixed precision computation
                               through NVIDIA Apex for training
                               (default: False)

  --model_name_or_path         <str>
                               model to be used for fine-tuning. Currently only
                               the following are supported:
                               "bert-base-uncased",
                               "roberta-base",
                               "nlpaueb/legal-bert-base-uncased",
                               "saibo/legal-roberta-base",
                               "mukund/privbert"
                               (default: bert-base-uncased)

  --no_cuda                    disable CUDA even when available (default: False)

  --overwrite_cache            overwrite caches used in preprocessing
                               (default: False)

  --overwrite_output_dir       overwrite run directories and saved checkpoint(s)
                               (default: False)

  --preprocessing_num_workers  <int>
                               number of workers to be used for preprocessing
                               (default: None)

  --task                       <str>
                               task to be worked on. The following values are
                               accepted: "opp_115", "piextract",
                               "policy_detection", "policy_ie_a", "policy_ie_b",
                               "policy_qa", "privacy_qa", "all"
                               (default: all)

  --wandb                      log metrics and results to wandb
                               (default: False)

  -h, --help                   show this help message and exit

To run the PrivacyGLUE benchmark for a supported model against all tasks, execute:

$ bash scripts/run_privacy_glue.sh --cuda_visible_devices <device_id> \
                                   --model_name_or_path <model> \
                                   --fp16

Note: Replace the <device_id> argument with a GPU ID or comma-separated GPU IDs to run single-GPU or multi-GPU fine-tuning respectively. Correspondingly, replace the <model> argument with one of our supported models listed in the usage documentation above.

Notebooks 📖

We utilize the following Jupyter notebooks for analyses outside of the PrivacyGLUE benchmark:

| Notebook | Description |
|----------|-------------|
| visualize_domain_embeddings.ipynb | Compute and visualize BERT embeddings for Wikipedia, EURLEX and Privacy Policies using t-SNE and UMAP |
| visualize_results.ipynb | Plot benchmark results and perform significance testing |
| inspect_predictions.ipynb | Inspect test-set predictions for model-pair agreement analysis |
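As a condensed taste of the domain-embedding analysis, the sketch below embeds a few texts with BERT's [CLS] representation and projects them to 2D with t-SNE; the three sentences are placeholders, whereas the notebook operates on the full Wikipedia, EURLEX and privacy-policy corpora.

    # Condensed sketch of the visualize_domain_embeddings.ipynb idea:
    # embed texts with BERT's [CLS] token and project to 2D via t-SNE.
    # The sentences are placeholders for the actual corpora.
    import torch
    from sklearn.manifold import TSNE
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    texts = [
        "We collect your email address and usage data.",     # privacy-like
        "The directive shall apply to all member states.",   # legal-like
        "Paris is the capital and largest city of France.",  # wiki-like
    ]

    with torch.no_grad():
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        cls_embeddings = model(**enc).last_hidden_state[:, 0]  # one [CLS] vector per text

    points = TSNE(n_components=2, perplexity=2.0).fit_transform(cls_embeddings.numpy())
    print(points)  # 2D coordinates, ready for a scatter plot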

Test 🔬

  1. To run unit tests, execute:

    $ make test
    
  2. To run integration tests, execute:

    $ CUDA_VISIBLE_DEVICES=<device_id> make integration
    

    Note: Replace the <device_id> argument with a GPU ID or comma-separated GPU IDs to run single-GPU or multi-GPU integration tests respectively. Alternatively, pass an empty string to run CPU integration tests.

Citation 🏛️

If you found PrivacyGLUE useful, we kindly ask you to cite our paper as follows:

@Article{app13063701,
  AUTHOR =       {Shankar, Atreya and Waldis, Andreas and Bless, Christof and
                  Andueza Rodriguez, Maria and Mazzola, Luca},
  TITLE =        {PrivacyGLUE: A Benchmark Dataset for General Language
                  Understanding in Privacy Policies},
  JOURNAL =      {Applied Sciences},
  VOLUME =       {13},
  YEAR =         {2023},
  NUMBER =       {6},
  ARTICLE-NUMBER = {3701},
  URL =          {https://www.mdpi.com/2076-3417/13/6/3701},
  ISSN =         {2076-3417},
  ABSTRACT =     {Benchmarks for general language understanding have been
                  rapidly developing in recent years of NLP research,
                  particularly because of their utility in choosing
                  strong-performing models for practical downstream
                  applications. While benchmarks have been proposed in the legal
                  language domain, virtually no such benchmarks exist for
                  privacy policies despite their increasing importance in modern
                  digital life. This could be explained by privacy policies
                  falling under the legal language domain, but we find evidence
                  to the contrary that motivates a separate benchmark for
                  privacy policies. Consequently, we propose PrivacyGLUE as the
                  first comprehensive benchmark of relevant and high-quality
                  privacy tasks for measuring general language understanding in
                  the privacy language domain. Furthermore, we release
                  performances from multiple transformer language models and
perform model--pair agreement analysis to detect tasks
                  where models benefited from domain specialization. Our
                  findings show the importance of in-domain pretraining for
                  privacy policies. We believe PrivacyGLUE can accelerate NLP
                  research and improve general language understanding for humans
                  and AI algorithms in the privacy language domain, thus
                  supporting the adoption and acceptance rates of solutions
                  based on it.},
  DOI =          {10.3390/app13063701}
}
