Probe&Detector/SciSafeEval: The State-of-the-Art Benchmark for Safety Alignment of Large Language Models in Scientific Tasks #948
Conversation
Feature/LLaMa3.1-8B as refuse to answer detector
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅
Hi @500BoiledPotatoes, can you sign the DCO plz?
I have read the DCO Document and I hereby sign the DCO
recheck
Thank you, will take a look! And congratulations on the paper. In the interim, could brief documentation be added so the tests pass?
This looks interesting and provides a significant number of unique probe classes.
I wonder if there is value in refactoring the probe to be more generic and accept a list of known categories to test vs a class per category. By setting the default to a limited number of prompts via sample_size, and allowing for null to represent all prompts, this could further reduce the number of probes to track. For example:
plugins:
  probes:
    sci_safe_eval:
      ByCategory:
        categories:
          - BiologyProteinFunctionPrediction
          - BiologyProteinSequenceGeneration
          - BiologyProteinStructurePrediction
          - BiologyGeneClassification
          - BiologyGeneGeneration
          - ChemistryMoleculeGeneration
          - ChemistryReactionPrediction
          - MedicineInferenceReasoning
          - MedicineKnowledgeRetrieval
          - PhysicsKnowledgeRetrieval
        sample_size: 80
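A rough sketch of what such a single configurable probe could look like (the class, parameter, and helper names here are hypothetical, not taken from the PR):

from garak import _config
from garak.probes.base import Probe


class ByCategory(Probe):
    """Hypothetical sketch: one probe covering any subset of SciSafeEval categories."""

    DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
        "categories": ["ChemistryMoleculeGeneration"],
        "sample_size": None,  # None -> take every prompt in each category
    }

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        self.prompts = []
        for category in self.categories:
            prompts = self._load_category(category)  # hypothetical per-category loader
            if self.sample_size is not None:
                # a real implementation would likely sample rather than truncate
                prompts = prompts[: self.sample_size]
            self.prompts.extend(prompts)

    def _load_category(self, category):
        # placeholder: map a category name to its SciSafeEval JSONL file and build prompts
        raise NotImplementedError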
The above ideas likely need to be tempered with the difference in resolution that would be generated in the report, as there would not be a breakdown per category in the report summary.
The concepts in the PR related to model as a judge will also inform the in-progress work on #419, which may present a method to provide a more flexible detector for use with the probe responses.
This detector requires modification of hard-coded values as a default requirement; this project is distributed via pypi and, in a default installation, should not expect users to have permissions to modify source code.
Also, would refuse_to_answer fit as a mitigation detector? I get that the act of refusal is not a specific known mitigation string-based response, however it does seem like mitigation. mitigation.refuse_to_answer or mitigation.refusal would be in line with what is being detected. This could still utilize model as a judge, similar to how misleading.MustRefuteClaimModel or misleading.MustContradictNLI use a model to detect.
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Infer the device map for multi-GPU
device_map = infer_auto_device_map(model, max_memory={0: "24GiB"}, no_split_module_classes=["LlamaDecoderLayer"])
Do not use hard-coded GPU device expectations.
Usage of device_map seems like something that could be shared in HFCompatible to enable more complex auto-detection of resources; however, this must be configurable by the user without code changes to be a viable use case.
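For illustration, a minimal sketch of loading the judge model with a user-configurable device map rather than a fixed single-GPU assumption (the default value below is illustrative, not the PR's code):

import torch
import transformers

# Sketch only: device_map stands in for a user-configurable value (e.g. surfaced
# through the detector's configuration) instead of a hard-coded {0: "24GiB"} mapping.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
device_map = "auto"  # a user could override this with an explicit mapping, e.g. {"": "cpu"}

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map=device_map,  # accelerate places layers across whatever devices are available
)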
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct" | ||
|
||
# Ensure correct checkpoint path (you may need to download and specify the correct checkpoint file) | ||
checkpoint_path = "/home/tianhao.li/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f" # Modify this with the actual path to your checkpoint files |
Cannot rely on code manipulation for configuration. The detector class implements Configurable and can accept configuration values via DEFAULT_PARAMS.
The singleton class model could be retained if there is an expectation that this detector will be used for multiple probes; however, it would need to provide a registry of loaded models to be sure the instance model is the one required for the probe/detector combination, or at least instance-level validation that the model object is configured as expected when accessed.
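A hedged sketch of surfacing these values through DEFAULT_PARAMS instead of a hard-coded path (the parameter names are hypothetical, and the assumption is that Configurable projects them onto the instance):

from garak.detectors.base import Detector


class RefuseToAnswer(Detector):
    """Sketch: the judge model location comes from configuration, not source edits."""

    # assumption: these values become self.model_id / self.checkpoint_path when the
    # detector is loaded, and can be overridden from user config without code changes
    DEFAULT_PARAMS = Detector.DEFAULT_PARAMS | {
        "model_id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "checkpoint_path": None,  # None -> resolve from the local huggingface cache
    }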
print(f"\n\n[Response]: {response}") | ||
print(f"\n[Evaluation]: {r}") |
Detectors should not print; prefer logging.debug() or remove these.
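For example (using response and r from the snippet above):

import logging

# module-level logging keeps detector output out of stdout
logging.debug("[Response]: %s", response)
logging.debug("[Evaluation]: %s", r)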
with open("detection_log.json", "a") as log_file: | ||
json.dump(log_entry, log_file) | ||
log_file.write("\n") |
Detectors should not write their own log files; as written, this would write a log to the current working directory with no context of why it exists. The parent detector class already logs results into report.jsonl and hitlog.jsonl for the specific run when detectors are evaluated.
    data_keys = ['name', 'smiles/selfies']
    use_smiles = True


class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction, SciSafeEval, Probe):
Suggested change:
- class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction, SciSafeEval, Probe):
+ class ChemistryPropertyPredictionTiny(ChemistryPropertyPrediction):
The parent already provides the other mixin classes. Again, this applies to all *Tiny versions.
hf_raw_filename = "chemistry_molecule-generation.jsonl"
placeholders = ['<name>', '<smiles/selfies>']
data_keys = ['name', 'smiles/selfies']
use_smiles = True
These can be exposed as instance-specific configurable values using the plugin Configurable patterns.
Suggested change:
- hf_raw_filename = "chemistry_molecule-generation.jsonl"
- placeholders = ['<name>', '<smiles/selfies>']
- data_keys = ['name', 'smiles/selfies']
- use_smiles = True
+ DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
+     "filename": "chemistry_molecule-generation.jsonl",
+     "placeholders": ['<name>', '<smiles/selfies>'],
+     "data_keys": ['name', 'smiles/selfies'],
+     "use_smiles": True,
+ }
This would result in the default values projected on the object as:
self.filename
self.placeholders
self.data_keys
self.use_smiles
if retries >= max_retries:
    raise Exception("Failed to get a valid annotation after 5 retries.")
Raising a general exception will bring down the whole run; this should log the error and continue, returning a value that will be interpreted as not able to detect.
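A small sketch of that shape, assuming a None return is treated downstream as "unable to score" (the helper name and callable are hypothetical):

import logging


def annotate_with_retries(get_annotation, response, max_retries=5):
    """Try to obtain a valid annotation; on repeated failure, log and give up quietly."""
    for _ in range(max_retries):
        annotation = get_annotation(response)
        if annotation is not None:
            return annotation
    logging.error("refuse_to_answer: no valid annotation after %d retries", max_retries)
    return None  # assumption: the caller maps None to "could not detect" rather than crashing the run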
import garak.attempt
from garak.detectors.base import Detector


class Llama31AsJudge(Detector):
There is currently work in progress to define a generic ModelAsAJudge detector in #419; creating a detector that is coupled to a specific model by class name will likely be problematic. We can either wait for the generic detector or rename this to something that would allow for other models based on the same inference stack, something like refusalModel or maybe refusalCheckpoint.
If this detector is not moved under a different package as suggested in other comments, it might also be valuable to provide a refusalKW detector that can evaluate based on English string detection; while less resilient, this would offer tooling for extracting results in more resource-constrained execution environments.
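As a rough illustration, assuming the keyword-matching behaviour of garak's StringDetector base, such a refusalKW detector might look like this (the phrase list and scoring polarity are placeholders that would need checking against existing detectors such as mitigation.MitigationBypass):

from garak import _config
from garak.detectors.base import StringDetector


class RefusalKW(StringDetector):
    """Sketch: lightweight refusal check via English keyword matching."""

    def __init__(self, config_root=_config):
        substrings = [
            "I cannot",
            "I can't help",
            "I'm sorry",
            "I am not able to",
            "cannot assist with",
        ]
        super().__init__(substrings, config_root=config_root)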
def read_sci_safe_eval(self, file_path, combine=True):
    data = []
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            for line in file:
                json_obj = json.loads(line.strip())
                if combine:
                    if self.placeholders and self.data_keys:
                        prompt = json_obj['instruction']
                        for placeholder, data_key in zip(self.placeholders, self.data_keys):
                            if data_key == 'smiles/selfies':
                                data_value = json_obj['smiles'] if self.use_smiles else json_obj['selfies']
                            else:
                                data_value = json_obj.get(data_key, '')
                            prompt = prompt.replace(placeholder, data_value)
                    else:
                        prompt = json_obj['instruction']
                    data.append({
                        "idx": json_obj['idx'],
                        "prompt": prompt,
                        "few-shot": json_obj.get('few-shot', []),
                        "cot": json_obj.get('cot', None),
                        "jailbreak": json_obj.get('jailbreak', None),
                        "tags": json_obj.get('tags', [])
                    })
                else:
                    data.append({
                        "idx": json_obj['idx'],
                        "instruction": json_obj['instruction'],
                        "few-shot": json_obj.get('few-shot', []),
                        "cot": json_obj.get('cot', None),
                        "jailbreak": json_obj.get('jailbreak', None),
                        "tags": json_obj.get('tags', [])
                    })
        return data
    except FileNotFoundError:
        print(f"File not found: {file_path}")
        return None
    except json.JSONDecodeError:
        print(f"Error decoding JSON in file: {file_path}")
        return None
I suspect this code is here because the dataset files do not meet the criteria for huggingface dataset parsing.
This error is seen when attempting to explore the data on huggingface:
Error code: DatasetGenerationCastError
Exception: DatasetGenerationCastError
Message: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 2 new columns ({'smiles', 'selfies'}) and 1 missing columns ({'sequence'}).
This happened while the json dataset builder was generating data using
hf://datasets/Tianhao0x01/SciSafeEval/chemistry_molecule-generation.jsonl (at revision 1751327df6dfc640571fa2d24cdae31522eb1bfe)
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback: Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
writer.write_table(table)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 580, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2292, in table_cast
return cast_table_to_schema(table, schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
raise CastError(
datasets.table.CastError: Couldn't cast
idx: int64
instruction: string
name: string
smiles: string
selfies: string
tags: list<item: string>
child 0, item: string
jailbreak: string
to
{'idx': Value(dtype='int64', id=None), 'instruction': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'sequence': Value(dtype='string', id=None), 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'jailbreak': Value(dtype='string', id=None)}
because column names don't match
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1392, in compute_config_parquet_and_info_response
parquet_operations = convert_to_parquet(builder)
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1041, in convert_to_parquet
builder.download_and_prepare(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 924, in download_and_prepare
self._download_and_prepare(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 999, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1740, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1871, in _prepare_split_single
raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 2 new columns ({'smiles', 'selfies'}) and 1 missing columns ({'sequence'}).
This happened while the json dataset builder was generating data using
Can the dataset be updated to conform to the required format? This would allow processing using huggingface's datasets package.
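For reference, once the schema is uniform the probe could plausibly load prompts via the datasets package directly; an untested sketch (exact load_dataset arguments may need adjusting for this repository's layout):

from datasets import load_dataset

# assumption: each SciSafeEval JSONL file can be loaded on its own once the
# column mismatch ('sequence' vs 'smiles'/'selfies') is resolved
ds = load_dataset(
    "Tianhao0x01/SciSafeEval",
    data_files="chemistry_molecule-generation.jsonl",
    split="train",
)
for row in ds.select(range(3)):
    print(row["instruction"])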
The non-commercial license on the dataset (https://huggingface.co/datasets/Tianhao0x01/SciSafeEval) may restrict who can use this probe. We prefer garak to be open and use open datasets and models. Can you discuss your position on the license and how this probe can be made available to all garak users?
Hi Leon - Thanks for the comment! We do aware that the licence in |
This sounds amazing, yes please, that would work great!
I am glad they're keeping you busy :) Hope things are well!
We will be working on this soon, all the best 🫡
Hi Garak Community,
We introduced SciSafeEval, the state-of-the-art benchmark for safety alignment of large language models in scientific tasks. More info can be found at https://arxiv.org/abs/2410.03769 and https://huggingface.co/datasets/Tianhao0x01/SciSafeEval .
In this PR, we add a new probe garak/probes/sci_safe_eval.py and corresponding detector garak/detectors/refuse_to_answer.py to garak. This new probe will enable better assessment of the safety alignment of large language models in scientific tasks. Thanks for reviewing!
Best,
Tianhao