Feature/multilingual #943
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -66,6 +66,7 @@ Code reference | |
payloads | ||
_config | ||
_plugins | ||
translator | ||
|
||
Plugin structure | ||
^^^^^^^^^^^^^^^^ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
The `translator.py` module in the Garak framework is designed to handle text translation tasks using various translation services and models. | ||
It provides several classes, each implementing different translation strategies and models, including both cloud-based services like DeepL and NIM, and local models like m2m100 from Hugging Face. | ||
|
||
garak.translator | ||
================ | ||
|
||
.. automodule:: garak.translator | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
Multilingual support | ||
==================== | ||
|
||
This feature adds multilingual support for probes and for detector keywords and triggers. | ||
You can use it to check a model's vulnerability in languages other than English. | ||
|
||
* Limitations: | ||
- This feature only supports probes and detectors whose `bcp47` code is "en". | ||
- The Hugging Face detectors only support English. To use them with another language, you need to supply an NLI model for that target language. | ||
- Some other detectors, such as the `snowball` detector, also support only English. | ||
- If probes or detectors fail to load, choose a smaller translation model. | ||
|
||
Prerequisites | ||
---------------- | ||
|
||
.. code-block:: bash | ||
|
||
pip install nvidia-riva-client==2.16.0 pyenchant==3.2.2 | ||
|
||
Supported translation services | ||
------------------------------ | ||
|
||
- Huggingface | ||
- This code uses the following translation models: | ||
- `Helsinki-NLP/opus-mt-en-{lang} <https://huggingface.co/docs/transformers/model_doc/marian>`_ | ||
- `facebook/m2m100_418M <https://huggingface.co/facebook/m2m100_418M>`_ | ||
- `facebook/m2m100_1.2B <https://huggingface.co/facebook/m2m100_1.2B>`_ | ||
- `DeepL <https://www.deepl.com/docs-api>`_ | ||
- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_ | ||
|
||
API key | ||
------- | ||
|
||
You can use the DeepL API or the NIM API to translate probe and detector keywords and triggers. | ||
|
||
You need an API key for your preferred service: | ||
- `DeepL <https://www.deepl.com/en/pro-api>`_ | ||
- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt>`_ | ||
|
||
Supported languages: | ||
- `DeepL <https://developers.deepl.com/docs/resources/supported-languages>`_ | ||
- `NIM <https://build.nvidia.com/nvidia/megatron-1b-nmt/modelcard>`_ | ||
|
||
Set up the API key with the following command: | ||
|
||
DeepL | ||
~~~~~ | ||
|
||
.. code-block:: bash | ||
|
||
export DEEPL_API_KEY=xxxx | ||
|
||
NIM | ||
~~~ | ||
|
||
.. code-block:: bash | ||
|
||
export NIM_API_KEY=xxxx | ||
|
||
config file | ||
----------- | ||
|
||
You can pass the translation service and the source and target languages as command-line arguments: | ||
|
||
- translation_service: "nim", "deepl", or "local" | ||
- lang_spec: a single language code such as "ja", or a comma-separated list such as "ja,fr"
Comment on lines +76 to +77
where in the config does this go? recommend something like:
It sets up value by the
I follow this advice | ||
||
|
||
* Note: The `Helsinki-NLP/opus-mt-en-{lang}` models use different language code formats, and the codes used in model names are inconsistent. Two-letter codes are usually easy to look up, while three-letter codes may require a search such as "language code {code}". More details can be found `here <https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models>`_. | ||
|
||
You can also configure this via a config file: | ||
|
||
.. code-block:: yaml | ||
|
||
run: | ||
  translation_service: {translation service: "nim", "deepl", or "local"} | ||
  lang_spec: {language code, for example "ja"} | ||
|
||
Multilingual examples | ||
------------------------- | ||
|
||
DeepL | ||
~~~~~ | ||
|
||
To use the translation option for garak, run the following command: | ||
|
||
.. code-block:: bash | ||
|
||
export DEEPL_API_KEY=xxxx | ||
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --translation_service deepl --lang_spec ja | ||
|
||
If you save the config file as "garak/configs/simple_translate_config_deepl.yaml", use this command: | ||
|
||
.. code-block:: bash | ||
|
||
export DEEPL_API_KEY=xxxx | ||
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config garak/configs/simple_translate_config_deepl.yaml | ||
|
||
Example config file: | ||
|
||
.. code-block:: yaml | ||
|
||
run: | ||
  translation_service: "deepl" | ||
  lang_spec: "ja" | ||
|
||
NIM | ||
~~~ | ||
|
||
For NIM, run the following command: | ||
|
||
.. code-block:: bash | ||
|
||
export NIM_API_KEY=xxxx | ||
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --translation_service nim --lang_spec ja | ||
|
||
If you save the config file as "garak/configs/simple_translate_config_nim.yaml", use this command: | ||
|
||
.. code-block:: bash | ||
|
||
export NIM_API_KEY=xxxx | ||
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config garak/configs/simple_translate_config_nim.yaml | ||
|
||
Example config file: | ||
|
||
.. code-block:: yaml | ||
|
||
run: | ||
  translation_service: "nim" | ||
  lang_spec: "ja" | ||
|
||
Local | ||
~~~~~ | ||
|
||
For local translation, use the following command: | ||
|
||
.. code-block:: bash | ||
|
||
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --translation_service local --lang_spec ja | ||
|
||
If you save the config file as "garak/configs/simple_translate_config_local.yaml", use this command: | ||
|
||
.. code-block:: bash | ||
|
||
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config garak/configs/simple_translate_config_local.yaml | ||
|
||
Example config file: | ||
|
||
.. code-block:: yaml | ||
|
||
run: | ||
  translation_service: local | ||
  local_model_name: "facebook/m2m100_418M" | ||
  local_tokenizer_name: "facebook/m2m100_418M" | ||
  lang_spec: "ja" |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -247,6 +247,22 @@ def main(arguments=None) -> None: | |
action="store_true", | ||
help="Launch garak in interactive.py mode", | ||
) | ||
parser.add_argument('--lang_spec', type=str, help='Target language for translation') | ||
parser.add_argument( | ||
"--translation_service", | ||
choices=["deepl", "nim", "local"], | ||
help="Choose the translation service to use (overrides config file setting)", | ||
) | ||
parser.add_argument( | ||
"--local_model_name", | ||
type=str, | ||
help="Model name", | ||
) | ||
parser.add_argument( | ||
"--local_tokenizer_name", | ||
type=str, | ||
help="Tokenizer name", | ||
) | ||
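For readers who want to try the new flags in isolation, here is a minimal, self-contained sketch that mirrors the four arguments added in this hunk. The `build_parser` and `split_lang_spec` names are hypothetical and are not part of garak. Splitting `lang_spec` on commas follows the `self.target_lang.split(",")` call in the detector hunk, while stripping whitespace is an extra assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-alone parser exposing only the translation flags."""
    parser = argparse.ArgumentParser(prog="translation-flags-sketch")
    parser.add_argument("--lang_spec", type=str, help="Target language for translation")
    parser.add_argument(
        "--translation_service",
        choices=["deepl", "nim", "local"],
        help="Choose the translation service to use (overrides config file setting)",
    )
    parser.add_argument("--local_model_name", type=str, help="Model name")
    parser.add_argument("--local_tokenizer_name", type=str, help="Tokenizer name")
    return parser


def split_lang_spec(lang_spec: Optional[str]) -> List[str]:
    """Hypothetical helper: turn "ja,fr" into ["ja", "fr"]; empty input gives []."""
    if not lang_spec:
        return []
    # Stripping whitespace is an assumption; the PR code splits on "," only.
    return [code.strip() for code in lang_spec.split(",") if code.strip()]


args = build_parser().parse_args(
    ["--translation_service", "local", "--lang_spec", "ja,fr"]
)
targets = split_lang_spec(args.lang_spec)
```

Flags that are not passed default to `None`, so a caller can tell "not set on the command line" apart from a value in the config file.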
Do these need to be exposed here? Should we accept a json
I agree. I think it would make it easier to expand to other ecosystems such as ollama, ggml and so on. | ||
||
|
||
logging.debug("args - raw argument string received: %s", arguments) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,6 +14,7 @@ | |
from garak.configurable import Configurable | ||
from garak.generators.huggingface import HFCompatible | ||
import garak.attempt | ||
from garak.translator import SimpleTranslator, LocalTranslator, is_english | ||
factoring this up to the harness keeps base class module-level imports light, which they must be
I removed it. | ||
||
|
||
|
||
class Detector(Configurable): | ||
|
@@ -61,7 +62,38 @@ def __init__(self, config_root=_config): | |
) | ||
|
||
logging.info(f"detector init: {self}") | ||
|
||
translation_service = "" | ||
if hasattr(config_root.run, 'translation_service'): | ||
translation_service = config_root.run.translation_service | ||
if translation_service == "local": | ||
self.translator = LocalTranslator(config_root) | ||
else: | ||
self.translator = SimpleTranslator(config_root) | ||
if hasattr(config_root.run, 'lang_spec'): | ||
self.target_lang = config_root.run.lang_spec | ||
if hasattr(self, 'substrings'): | ||
self.substrings = self.translate_keywords(self.substrings) | ||
|
||
def _translate(self, words: List[str]) -> List[str]: | ||
if hasattr(self, 'target_lang') is False or self.bcp47 == "*": | ||
return words | ||
translated_keywords = [] | ||
for lang in self.target_lang.split(","): | ||
if self.bcp47 == lang: | ||
continue | ||
for word in words: | ||
mean_word_judge = is_english(word) | ||
if mean_word_judge: | ||
translated_keywords.append(self.translator._get_response(word, self.bcp47, lang)) | ||
else: | ||
translated_keywords.append(word) | ||
words = list(words) | ||
words.extend(translated_keywords) | ||
return words | ||
|
||
def translate_keywords(self, keywords: List[str]) -> List[str]: | ||
return self._translate(keywords) | ||
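To make the keyword-expansion behaviour easier to review, here is a stand-alone sketch of what `_translate` appears to do, written as a plain function. The nesting is one reading of the flattened diff above. `expand_keywords`, `fake_translate`, and the `needs_translation` callback are hypothetical stand-ins for the method, `self.translator._get_response`, and `is_english`; none of them are garak APIs.

```python
from typing import Callable, List


def expand_keywords(
    words: List[str],
    source_lang: str,
    lang_spec: str,
    translate: Callable[[str, str, str], str],
    needs_translation: Callable[[str], bool],
) -> List[str]:
    """Return the original words followed by one entry per word per target language."""
    # A wildcard source language means there is nothing to translate from.
    if source_lang == "*":
        return list(words)
    translated = []
    for lang in lang_spec.split(","):
        # Skip targets that match the source language.
        if lang == source_lang:
            continue
        for word in words:
            if needs_translation(word):
                translated.append(translate(word, source_lang, lang))
            else:
                # Mirrors the diff: untranslated words are appended again as-is.
                translated.append(word)
    return list(words) + translated


def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Stub translator so the sketch runs without any external service."""
    return f"[{target_lang}] {text}"


result = expand_keywords(["hello", "1234"], "en", "en,ja", fake_translate, str.isalpha)
# result == ["hello", "1234", "[ja] hello", "1234"]
```

Note that a word which is not translated ends up in the list twice, once as an original and once in the per-language pass. That is harmless for substring matching but may be worth deduplicating.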
|
||
def detect(self, attempt: garak.attempt.Attempt) -> Iterable[float]: | ||
"""Takes a list of Attempts; classifies them; returns a list of results | ||
in the range 0.0-1.0 where 0.0 represents no hit, 1.0 represents successful hit | ||
|
@@ -169,6 +201,7 @@ class StringDetector(Detector): | |
def __init__(self, substrings, config_root=_config): | ||
super().__init__(config_root=config_root) | ||
self.substrings = substrings | ||
self.substrings = self.translate_keywords(self.substrings) | ||
|
||
def detect( | ||
self, attempt: garak.attempt.Attempt, case_sensitive=False | ||
|
@@ -206,6 +239,7 @@ def detect( | |
detector_results = [] | ||
if "triggers" in attempt.notes: | ||
triggers = attempt.notes["triggers"] | ||
triggers = self.translate_keywords(triggers) | ||
if isinstance(triggers, str): | ||
triggers = [triggers] | ||
for output in attempt.all_outputs: | ||
|
this is non-standard - we should follow the same pattern as in garak.generators.base.Generator
Sorry, could you share example code showing how to set up the API_KEY value?
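For illustration only, here is a minimal sketch of resolving a translation-service key from the environment. The `DEEPL_API_KEY` and `NIM_API_KEY` names come from the documentation in this PR; the `resolve_api_key` helper and its error handling are hypothetical, and this is not necessarily the pattern used in `garak.generators.base.Generator`.

```python
import os
from typing import Mapping, Optional

# Environment variable names taken from the documentation in this PR.
ENV_VARS = {"deepl": "DEEPL_API_KEY", "nim": "NIM_API_KEY"}


def resolve_api_key(service: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: look up the API key for a translation service."""
    # Allow a mapping to be injected so the lookup can be tested without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    if service not in ENV_VARS:
        raise ValueError(f"no API key is defined for service {service!r}")
    env_var = ENV_VARS[service]
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"set the {env_var} environment variable to use the {service!r} service"
        )
    return key
```

Failing early with the variable name in the message keeps a missing key from surfacing later as an opaque HTTP error from the translation service.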