Probe&Detector/SciSafeEval: The State-of-the-Art Benchmark for Safety Alignment of Large Language Models in Scientific Tasks #948
Conversation
Feature/LLaMa3.1-8B as refuse to answer detector
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
Hi @500BoiledPotatoes, can you sign the DCO plz?
I have read the DCO Document and I hereby sign the DCO
recheck
Thank you, will take a look! And congratulations on the paper. In the interim, could brief documentation be added so the tests pass?
This looks interesting and provides a significant number of unique probe classes.
I wonder if there is value in refactoring the probe to be more generic and accept a list of known categories to test vs a class per category. By setting the default to a limited number of prompts via sample_size, and allowing for null to represent all prompts, this could further reduce the number of probes to track. For example:
plugins:
  probes:
    sci_safe_eval:
      ByCategory:
        categories:
          - BiologyProteinFunctionPrediction
          - BiologyProteinSequenceGeneration
          - BiologyProteinStructurePrediction
          - BiologyGeneClassification
          - BiologyGeneGeneration
          - ChemistryMoleculeGeneration
          - ChemistryReactionPrediction
          - MedicineInferenceReasoning
          - MedicineKnowledgeRetrieval
          - PhysicsKnowledgeRetrieval
        sample_size: 80
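A rough sketch of what such a single configurable probe could look like (the class, parameter, and helper names here are hypothetical, not taken from the PR):

from garak import _config
from garak.probes.base import Probe


class ByCategory(Probe):
    """Hypothetical sketch: one probe covering any subset of SciSafeEval categories."""

    DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
        "categories": ["ChemistryMoleculeGeneration"],
        "sample_size": None,  # None -> take every prompt in each category
    }

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        self.prompts = []
        for category in self.categories:
            prompts = self._load_category(category)  # hypothetical per-category loader
            if self.sample_size is not None:
                # a real implementation would likely sample rather than truncate
                prompts = prompts[: self.sample_size]
            self.prompts.extend(prompts)

    def _load_category(self, category):
        # placeholder: map a category name to its SciSafeEval JSONL file and build prompts
        raise NotImplementedError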
The above ideas likely need to be tempered with the difference in resolution that would be generated in the report, as there would not be a breakdown per category in the report summary.
The concepts in the PR related to model as a judge will also inform the in-progress work on #419, which may present a method to provide a more flexible detector for use with the probe responses.
This detector requires modification of hard-coded values as a default requirement; this project is distributed via pypi and, in a default installation, should not expect users to have permissions to modify source code.
Also, would refuse_to_answer fit as a mitigation detector? I get that the act of refusal is not a specific known mitigation string-based response, however it does seem like mitigation. mitigation.refuse_to_answer or mitigation.refusal would be in line with what is being detected. This could still utilize model as a judge, similar to how misleading.MustRefuteClaimModel or misleading.MustContradictNLI use a model to detect.
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Infer the device map for multi-GPU
device_map = infer_auto_device_map(model, max_memory={0: "24GiB"}, no_split_module_classes=["LlamaDecoderLayer"])
Do not use hard-coded GPU device expectations.
Usage of device_map seems like something that could be shared in HFCompatible to enable more complex auto-detection of resources; however, this must be configurable by the user without code changes to be a viable use case.
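For illustration, a minimal sketch of loading the judge model with a user-configurable device map rather than a fixed single-GPU assumption (the default value below is illustrative, not the PR's code):

import torch
import transformers

# Sketch only: device_map stands in for a user-configurable value (e.g. surfaced
# through the detector's configuration) instead of a hard-coded {0: "24GiB"} mapping.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device_map = "auto"  # a user could override this with an explicit mapping, e.g. {"": "cpu"}

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=device_map,  # accelerate places layers across whatever devices are available
)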
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct" | ||
|
||
# Ensure correct checkpoint path (you may need to download and specify the correct checkpoint file) | ||
checkpoint_path = "/home/tianhao.li/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f" # Modify this with the actual path to your checkpoint files |
Cannot rely on code manipulation for configuration. The detector class implements Configurable and can accept configuration values via DEFAULT_PARAMS.
The singleton class model could be retained if there is an expectation that this detector will be used for multiple probes; however, it would need to provide a registry of loaded models to be sure the instance model is the one required for the probe/detector combination, or at least instance-level validation that the model object is configured as expected when accessed.
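A hedged sketch of surfacing these values through DEFAULT_PARAMS instead of a hard-coded path (the parameter names are hypothetical, and the assumption is that Configurable projects them onto the instance):

from garak.detectors.base import Detector


class RefuseToAnswer(Detector):
    """Sketch: the judge model location comes from configuration, not source edits."""

    # assumption: these values become self.model_id / self.checkpoint_path when the
    # detector is loaded, and can be overridden from user config without code changes
    DEFAULT_PARAMS = Detector.DEFAULT_PARAMS | {
        "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "checkpoint_path": None,  # None -> resolve from the local huggingface cache
    }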
print(f"\n\n[Response]: {response}") | ||
print(f"\n[Evaluation]: {r}") |
Detectors should not print; prefer logging.debug() or remove these.
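For example (using response and r from the snippet above):

import logging

# module-level logging keeps detector output out of stdout
logging.debug("[Response]: %s", response)
logging.debug("[Evaluation]: %s", r)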
with open("detection_log.json", "a") as log_file: | ||
json.dump(log_entry, log_file) | ||
log_file.write("\n") |
Detectors should not write their own log files; as written, this would write a log to the current working directory with no context of why it exists. The parent detector class already logs results into report.jsonl and hitlog.jsonl for the specific run when detectors are evaluated.
    data_keys = ['name', 'smiles/selfies']
    use_smiles = True


class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction, SciSafeEval, Probe):
Suggested change:
- class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction, SciSafeEval, Probe):
+ class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction):
The parent already provides the other mixin classes. Again, this applies to all *Tiny versions.
hf_raw_filename = "chemistry_molecule-generation.jsonl"
placeholders = ['<name>', '<smiles/selfies>']
data_keys = ['name', 'smiles/selfies']
use_smiles = True
These can be exposed as instance-specific configurable values using the plugin Configurable patterns.
Suggested change:
- hf_raw_filename = "chemistry_molecule-generation.jsonl"
- placeholders = ['<name>', '<smiles/selfies>']
- data_keys = ['name', 'smiles/selfies']
- use_smiles = True
+ DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
+     "filename": "chemistry_molecule-generation.jsonl",
+     "placeholders": ['<name>', '<smiles/selfies>'],
+     "data_keys": ['name', 'smiles/selfies'],
+     "use_smiles": True,
+ }
This would result in the default values projected on the object as:
self.filename
self.placeholders
self.data_keys
self.use_smiles
if retries >= max_retries:
    raise Exception("Failed to get a valid annotation after 5 retries.")
Raising a general exception will bring down the whole run; this should log the error and continue, returning a value that will be interpreted as not able to detect.
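A small sketch of that shape, assuming a None return is treated downstream as "unable to score" (the helper name and callable are hypothetical):

import logging


def annotate_with_retries(get_annotation, response, max_retries=5):
    """Try to obtain a valid annotation; on repeated failure, log and give up quietly."""
    for _ in range(max_retries):
        annotation = get_annotation(response)
        if annotation is not None:
            return annotation
    logging.error("refuse_to_answer: no valid annotation after %d retries", max_retries)
    return None  # assumption: the caller maps None to "could not detect" rather than crashing the run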
import garak.attempt
from garak.detectors.base import Detector


class Llama31AsJudge(Detector):
There is currently work in progress to define a generic ModelAsAJudge detector in #419; creating a detector that is coupled to a specific model by class name will likely be problematic. We can either wait for the generic detector or rename this to something that would allow for other models based on the same inference stack, something like refusalModel or maybe refusalCheckpoint.
If this detector is not moved under a different package as suggested in other comments, it might also be valuable to provide a refusalKW detector that can evaluate based on English string detection; while less resilient, this would offer tooling for extracting results in more resource-constrained execution environments.
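As a rough illustration, assuming the keyword-matching behaviour of garak's StringDetector base, such a refusalKW detector might look like this (the phrase list and scoring polarity are placeholders that would need checking against existing detectors such as mitigation.MitigationBypass):

from garak import _config
from garak.detectors.base import StringDetector


class RefusalKW(StringDetector):
    """Sketch: lightweight refusal check via English keyword matching."""

    def __init__(self, config_root=_config):
        substrings = [
            "I cannot",
            "I can't help",
            "I'm sorry",
            "I am not able to",
            "cannot assist with",
        ]
        super().__init__(substrings, config_root=config_root)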
def read_sci_safe_eval(self, file_path, combine=True):
    data = []
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            for line in file:
                json_obj = json.loads(line.strip())
                if combine:
                    if self.placeholders and self.data_keys:
                        prompt = json_obj['instruction']
                        for placeholder, data_key in zip(self.placeholders, self.data_keys):
                            if data_key == 'smiles/selfies':
                                data_value = json_obj['smiles'] if self.use_smiles else json_obj['selfies']
                            else:
                                data_value = json_obj.get(data_key, '')
                            prompt = prompt.replace(placeholder, data_value)
                    else:
                        prompt = json_obj['instruction']
                    data.append({
                        "idx": json_obj['idx'],
                        "prompt": prompt,
                        "few-shot": json_obj.get('few-shot', []),
                        "cot": json_obj.get('cot', None),
                        "jailbreak": json_obj.get('jailbreak', None),
                        "tags": json_obj.get('tags', [])
                    })
                else:
                    data.append({
                        "idx": json_obj['idx'],
                        "instruction": json_obj['instruction'],
                        "few-shot": json_obj.get('few-shot', []),
                        "cot": json_obj.get('cot', None),
                        "jailbreak": json_obj.get('jailbreak', None),
                        "tags": json_obj.get('tags', [])
                    })
        return data
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
    except json.JSONDecodeError:
        print(f"Error decoding JSON in file: {file_path}")
        return None
I suspect this code is here because the dataset files do not meet the criteria for huggingface dataset parsing.
This error is seen when attempting to explore the data on huggingface:
Error code: DatasetGenerationCastError
Exception: DatasetGenerationCastError
Message: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 2 new columns ({'smiles', 'selfies'}) and 1 missing columns ({'sequence'}).
This happened while the json dataset builder was generating data using
hf://datasets/Tianhao0x01/SciSafeEval/chemistry_molecule-generation.jsonl (at revision 1751327df6dfc640571fa2d24cdae31522eb1bfe)
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback: Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
writer.write_table(table)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 580, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2292, in table_cast
return cast_table_to_schema(table, schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
raise CastError(
datasets.table.CastError: Couldn't cast
idx: int64
instruction: string
name: string
smiles: string
selfies: string
tags: list<item: string>
child 0, item: string
jailbreak: string
to
{'idx': Value(dtype='int64', id=None), 'instruction': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'sequence': Value(dtype='string', id=None), 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'jailbreak': Value(dtype='string', id=None)}
because column names don't match
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1392, in compute_config_parquet_and_info_response
parquet_operations = convert_to_parquet(builder)
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1041, in convert_to_parquet
builder.download_and_prepare(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 924, in download_and_prepare
self._download_and_prepare(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 999, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1740, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1871, in _prepare_split_single
raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 2 new columns ({'smiles', 'selfies'}) and 1 missing columns ({'sequence'}).
This happened while the json dataset builder was generating data using
Can the dataset be updated to conform to the required format? This would allow processing using huggingface's datasets package.
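For reference, once the schema is uniform the probe could plausibly load prompts via the datasets package directly; an untested sketch (exact load_dataset arguments may need adjusting for this repository's layout):

from datasets import load_dataset

# assumption: each SciSafeEval JSONL file can be loaded on its own once the
# column mismatch ('sequence' vs 'smiles'/'selfies') is resolved
ds = load_dataset(
    "Tianhao0x01/SciSafeEval",
    data_files="chemistry_molecule-generation.jsonl",
    split="train",
)
for row in ds.select(range(3)):
    print(row["instruction"])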
The non-commercial license on the dataset (https://huggingface.co/datasets/Tianhao0x01/SciSafeEval) may restrict who can use this probe. We prefer garak to be open and use open datasets and models. Can you discuss your position on the license and how this probe can be made available to all garak users?
Hi Leon - Thanks for the comment! We do aware that the licence in |
This sounds amazing, yes please, that would work great!
I am glad they're keeping you busy :) Hope things are well!
We will be working on this soon, all the best 🫡
Hi Garak Community,
We introduced SciSafeEval, the state-of-the-art benchmark for safety alignment of large language models in scientific tasks. More info can be found at https://arxiv.org/abs/2410.03769 and https://huggingface.co/datasets/Tianhao0x01/SciSafeEval .
In this PR, we add a new probe garak/probes/sci_safe_eval.py and corresponding detector garak/detectors/refuse_to_answer.py to garak. This new probe will enable better assessment of the safety alignment of large language models in scientific tasks. Thanks for reviewing!
Best,
Tianhao