Feature/multilingual #943

Open
wants to merge 24 commits into main

Conversation

SnowMasaya

Add multilingual support

Verification

  • [pass] Run the tests and ensure they pass: python -m pytest tests/
  • [pass] Verify the feature does what it should: added docs/source/translator.rst

Feature explanation

You need an API key for your preferred translation service.

DeepL

export DEEPL_API_KEY=xxxx

Config file

You can pass the translation service, source language, and target language as arguments.

  • translation_service: "nim", "deepl", or "local"
  • lang_spec: "ja", "ja,fr", etc. (you can set multiple language codes)

Note: The Helsinki-NLP/opus-mt-en-{lang} models use a different language-code format, and the codes used to name these models are inconsistent. Two-letter codes can usually be found here, while three-letter codes require a search such as "language code {code}".

run:
  translation_service: {your chosen translation service "nim" or "deepl", "local"}
  lang_spec: {your chosen language code}

Examples for multilingual support

DeepL

To use the translation option for garak, run the following command:

export DEEPL_API_KEY=xxxx
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --translation_service deepl --lang_spec ja

If you save the config file as "garak/configs/simple_translate_config_deepl.yaml", use this command:

export DEEPL_API_KEY=xxxx
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config garak/configs/simple_translate_config_deepl.yaml

Example config file:

run:
  translation_service: "deepl"
  lang_spec: "ja"

NIM

For NIM, run the following command:

export NIM_API_KEY=xxxx
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --translation_service nim --lang_spec ja

If you save the config file as "garak/configs/simple_translate_config_nim.yaml", use this command:

export NIM_API_KEY=xxxx
python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config garak/configs/simple_translate_config_nim.yaml

Example config file:

run:
  translation_service: "nim"
  lang_spec: "ja"

Local

For local translation, use the following command:

python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --translation_service local --lang_spec ja

If you save the config file as "garak/configs/simple_translate_config_local.yaml", use this command:

python3 -m garak --model_type nim --model_name meta/llama-3.1-8b-instruct --probes encoding --config garak/configs/simple_translate_config_local.yaml

Example config file:

run:
  translation_service: local
  local_model_name: "facebook/m2m100_418M"
  local_tokenizer_name: "facebook/m2m100_418M"
  lang_spec: "ja"

Specific Hardware Examples:

Local translation requires a GPU

  • GPU related
    • Requires CUDA support
    • Minimum GPU memory:
      • Helsinki-NLP/opus-mt-en-{}: around 1500 MB
      • facebook/m2m100_418M: around 8000 MB
      • facebook/m2m100_1.2B: around 15000 MB

Contributor

github-actions bot commented Oct 9, 2024

DCO Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by posting a Pull Request comment in the format below.


I have read the DCO Document and I hereby sign the DCO


1 out of 2 committers have signed the DCO.
✅ [SnowMasaya](https://github.com/SnowMasaya)
@masayaOgushi
You can retrigger this bot by commenting recheck in this Pull Request

- Integrated support for multiple translation services including local and external APIs.
  - local: a Hugging Face model is used for translation
  - deepl: DeepL is used for translation
  - nim: NIM is used for translation
- Implemented utility functions for language detection and text processing.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Add translation function for base probe class
- prompts and triggers translated by base class method
- attempt_descr translation

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Translation handling for detector keywords and substrings, triggers.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Added support for specifying translation services directly from the CLI.
- Implemented options to set target languages for translation.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Added new dependencies required for enhanced translation features.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Added detailed explanations of the translation method
- Included examples of how translation services are configured and utilized within the codebase.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
@SnowMasaya
Author

SnowMasaya commented Oct 9, 2024

I have read the DCO Document and I hereby sign the DCO

Collaborator

@jmartin-tech jmartin-tech left a comment

Partial review, testing is still in progress.

The test failures on macOS look to be caused by an incomplete dependency requirement that may need to be reworked or removed. A default installation should not require installing an external library; hence the dependency on pyenchant here may be problematic.

Traceback:
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/detectors/test_detectors_riskywords.py:8: in <module>
    import garak.detectors.base
garak/detectors/__init__.py:1: in <module>
    from .base import *
garak/detectors/base.py:17: in <module>
    from garak.translator import SimpleTranslator, LocalTranslator, is_english
garak/translator.py:15: in <module>
    import enchant
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/enchant/__init__.py:81: in <module>
    from enchant import _enchant as _e
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/enchant/_enchant.py:157: in <module>
    raise ImportError(msg)
E   ImportError: The 'enchant' C library was not found and maybe needs to be installed.
E   See  https://pyenchant.github.io/pyenchant/install.html
E   for details

requirements.txt Outdated
Comment on lines 48 to 49
pyenchant==3.2.2
Collaborator

@jmartin-tech jmartin-tech Oct 9, 2024

These entries are new direct requirements and need to be above the # tests section in this file.

Author

OK, I will move it.

Comment on lines 82 to 117
if hasattr(config_root, 'run'):
    if hasattr(config_root.run, 'translation_service'):
        translation_service = config_root.run.translation_service
        if translation_service == "local":
            if probename == "encoding":
                self.translator = LocalEncodingTranslator(config_root)
            elif probename == "goodside":
                self.translator = LocalGoodsideTranslator(config_root)
            elif probename == "dan":
                self.translator = LocalDanTranslator(config_root)
            else:
                self.translator = LocalTranslator(config_root)
        elif translation_service == "deepl" or translation_service == "nim":
            if probename == "encoding":
                self.translator = EncodingTranslator(config_root)
            elif probename == "goodside":
                self.translator = GoodsideTranslator(config_root)
            elif probename == "dan":
                self.translator = DanTranslator(config_root)
            else:
                self.translator = SimpleTranslator(config_root)

if hasattr(config_root, 'run'):
    if hasattr(config_root.run, 'lang_spec'):
        self.target_lang = config_root.run.lang_spec

if hasattr(self, 'triggers') and len(self.triggers) > 0:
    if self.is_nested_list(self.triggers):
        trigger_list = []
        for trigger in self.triggers:
            trigger_words = self._translate(trigger)
            for word in trigger_words:
                trigger_list.append([word])
        self.triggers = trigger_list
    else:
        self.triggers = self._translate(self.triggers)
Collaborator

@jmartin-tech jmartin-tech Oct 9, 2024

This assumes config_root will be the global _config object, which will not always be valid. The _translate() method below assumes self.translator will always exist.

Probes should not be responsible for creating the translator object, since the service will be utilized by both probes and detectors. This may be important for impacts to local translation, as it looks like the current implementation would instantiate and load the model object for each probe.

One option may be to have the translator package retain a singleton translator built from _config.run on first access request by passing the probe.bcp47 value and the target language to get a translator object.
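For illustration, a singleton-style accessor along those lines could be sketched as follows; the function names, cache key, and config shape here are assumptions for the sketch, not garak's actual API.

# Sketch only: one shared translator per language pair, built lazily on first request.
_translator_cache = {}

def _build_translator(source_lang, target_lang, run_config):
    # placeholder factory; a real one would pick local/deepl/nim from the run config
    return {"source": source_lang, "target": target_lang,
            "service": run_config.get("translation_service", "local")}

def get_translator(source_lang, target_lang, run_config):
    """Return one translator per (source, target) pair, creating it only once."""
    key = (source_lang, target_lang)
    if key not in _translator_cache:
        _translator_cache[key] = _build_translator(source_lang, target_lang, run_config)
    return _translator_cache[key]

# usage sketch: a probe passes its bcp47 value and the run's target language
translator = get_translator("en", "ja", {"translation_service": "local"})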

Author

@SnowMasaya SnowMasaya Oct 17, 2024

OK. I will separate translation responsibility from probes and detectors.
I will add a check on each instance for whether self.translator exists.
I will develop the singleton translator built from _config.run.

for prompt in prompts:
    mean_word_judge = is_english(prompt)
    if mean_word_judge:
        translate_prompt = self.translator._get_response(prompt, self.bcp47, lang)
Collaborator

@jmartin-tech jmartin-tech Oct 9, 2024

There is currently no guarantee self.translator will exist. Probes are not required to call super().__init__(); while they should in most cases, this needs a guard to ensure the attribute is set.
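For illustration, such a guard could look roughly like this; the attribute and helper names follow the PR's diff, but the exact shape is an assumption.

# defensive lookup that tolerates probes which skip super().__init__()
translator = getattr(self, "translator", None)
for prompt in prompts:
    if translator is not None and is_english(prompt):
        translate_prompt = translator._get_response(prompt, self.bcp47, lang)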

Author

OK. I will set up the check function for self.translator

Comment on lines 153 to 158
def detector():
    _config.run.translation_service = 'local'
    detector = Detector(_config)
    detector.bcp47 = "en"
    detector.target_lang = "jap"
    return detector
Collaborator

This will need to reset the _config via importlib.reload after the request completes.

Author

OK, I will use importlib.reload to reset _config.
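For illustration, that reset could be sketched as an autouse pytest fixture in conftest.py; the fixture name and the autouse choice are assumptions.

import importlib

import pytest

from garak import _config


@pytest.fixture(autouse=True)
def reset_config():
    # run the test, then restore _config to a freshly imported state
    yield
    importlib.reload(_config)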



def test_startswith_detect():
    _config.run.translation_service = 'local'
Collaborator

Impacts to _config inside a test need to be reset after the test completes; this may be a case where a shared fixture could be created in a conftest.py here in the detectors path, or added to the existing global conftest.py.

Author

OK, I will move the _config setup to conftest.py.

Comment on lines 22 to 25
@pytest.fixture
def config():
    config = _config
    return config
Collaborator

More _config impacts that need to be reset after test completion. Likely will benefit from a shared fixture.

Author

I will move the _config setup to conftest.py.

Comment on lines 65 to 88
@pytest.mark.parametrize("translator_class, lang_specs, model_name, tokenizer_name", [
    (LocalTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalDanTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalEncodingTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalGoodsideTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
    (LocalDanTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
    (LocalEncodingTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
    (LocalGoodsideTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
])
def test_local_translate_multiple_languages(translator_class, lang_specs, model_name, tokenizer_name):
    config = _config
    input_text = "Hello, how are you?"
    config.run.local_model_name = model_name
    config.run.local_tokenizer_name = tokenizer_name

    for lang_spec in lang_specs:
        config.run.lang_spec = lang_spec
        translator = translator_class(config_root=config)
        translated_text = translator._get_response(input_text,
                                                   source_lang="en",
                                                   target_lang=lang_spec)
        assert isinstance(translated_text, str)
        assert translated_text != input_text
Collaborator

This test requires downloading three models: facebook/m2m100_1.2B (~10 GB), opus-mt-en-jap (~800 MB), and opus-mt-en-fr (~1 GB).

Author

I will add a storage check; if the environment does not have enough storage, those tests will be skipped.
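For illustration, that skip could be expressed with pytest.mark.skipif; the 10 GB threshold, the root filesystem, and the placeholder test name are assumptions.

import shutil

import pytest

# assumed threshold: enough space for facebook/m2m100_1.2B plus the opus-mt models
ENOUGH_STORAGE = shutil.disk_usage("/").free > 10 * 1024**3


@pytest.mark.skipif(
    not ENOUGH_STORAGE,
    reason="not enough free disk space to download translation models",
)
def test_requires_model_download():  # placeholder name for the sketch
    ...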

garak/cli.py Outdated
Comment on lines 256 to 265
parser.add_argument(
    "--local_model_name",
    type=str,
    help="Model name",
)
parser.add_argument(
    "--local_tokenizer_name",
    type=str,
    help="Tokenizer name",
)
Collaborator

Do these need to be exposed here? Should we accept a JSON option_file to allow a generic generator to be used instead of restricting this to HF models?

Author

I agree. I think this can be expanded to other ecosystems such as ollama, ggml, and so on.

Comment on lines 65 to 74
@pytest.mark.parametrize("translator_class, lang_specs, model_name, tokenizer_name", [
    (LocalTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalDanTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalEncodingTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalGoodsideTranslator, ["ja", "fr"], "facebook/m2m100_1.2B", "facebook/m2m100_1.2B"),
    (LocalTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
    (LocalDanTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
    (LocalEncodingTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
    (LocalGoodsideTranslator, ["jap", "fr"], "Helsinki-NLP/opus-mt-en-{}", "Helsinki-NLP/opus-mt-en-{}"),
])
Collaborator

Testing in GitHub Actions does not have access to GPU-enabled systems; also, this project cannot expect a GPU to be available in default environments.

Author

I will use CPU as the default value for self.device in translator.py.

Comment on lines 251 to 252
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lang_specs = getattr(config_root.run, 'lang_spec', "jap")
Collaborator

cuda capability is not a default expectation for this project, while this does fall back to "cpu" there is likely work needed to enable this object on all supported platforms.

lang_spec by default based on this PR should just be "en" and should not load a model when no translation will ever occur. The translator services should be agnostic of any actual translation and source_lang to target_lang should just pass thru when both values are equal returning the original values.

Author

cuda capability is not a default expectation for this project, while this does fall back to "cpu" there is likely work needed to enable this object on all supported platforms.

I will change the default value to "cpu". I think it can work when the machine does not have a GPU.

lang_spec by default based on this PR should just be "en" and should not load a model when no translation will ever occur.

OK. I will change the default value of lang_spec, and if lang_spec is "en" it will skip loading a model.

source_lang to target_lang should just pass thru when both values are equal returning the original values.

OK. I will add that check.
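For illustration, the pass-through and deferred model load could be sketched like this; the class and method names are invented for the sketch and are not the PR's code.

class PassthroughAwareTranslator:
    """Sketch: no call and no model load when source and target languages match."""

    def __init__(self, source_lang="en", target_lang="en"):
        self.source_lang = source_lang
        self.target_lang = target_lang
        self.model = None  # a real implementation would lazily load an HF model, CPU by default

    def translate(self, text):
        if self.target_lang == self.source_lang:
            return text  # pass through unchanged; the model is never loaded
        if self.model is None:
            self.model = self._load_model()  # deferred until translation is actually needed
        return self.model(text)

    def _load_model(self):
        # placeholder standing in for the actual model load
        return lambda s: s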

update remove punctuation
update english judge
add translate function
add logging translate result
add Reverse translate for hf detector and snowball probes

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
check translator instance
remove translate function
reset config

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
check translator instance
add reverse translator
add test reverse translator

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
remove argument
using generator_option_file

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
add load translator instance

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
check storage size
set up each instance for each test

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
remove pyenchant
Using nltk instead of pyenchant

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
update how to use translation function

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
github-actions bot added a commit that referenced this pull request Oct 23, 2024
SnowMasaya and others added 3 commits October 23, 2024 14:13
Signed-off-by: SnowGushiGit <snow.akogi.pgel@gmail.com>
Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
Collaborator

@leondz leondz left a comment

Thank you, this is a significant amount of work.

This is for translation-based multilingual, rather than general multilingual support, so we should be careful to separate these two functions. The ability to select which languages garak uses is distinct from how that language is achieved (in prompts, detectors, and so on).

There are some refactorings needed to get the PR to a sustainable state. We may need a couple more hooks, either injected into base classes or added there. It might be simplest to add these hooks to the harness. Happy to set up a call or instant message chat to discuss this. Given the breadth of the changes, it is likely to be beneficial to discuss plans and get good alignment while doing the rest of the changes.

Comment on lines +65 to +70
if hasattr(config_root, 'plugins'):
    if hasattr(config_root.plugins, 'generators'):
        if "translation_service" in config_root.plugins.generators.keys():
            translation_service = config_root.plugins.generators["translation_service"]
            self.translator = _config.load_translator(translation_service=translation_service,
                                                       classname="")
Collaborator

Why does the base detector need to look for generators configurations?

Author

I defined the translator instance in _config.py.
It is defined by the generators configuration.
I think your advice to add these hooks to the harness is better.

Comment on lines +147 to +152
if hasattr(config_root, 'plugins'):
    if hasattr(config_root.plugins, 'generators'):
        if "translation_service" in config_root.plugins.generators.keys():
            translation_service = config_root.plugins.generators["translation_service"]
            self.reverse_translator = _config.load_translator(translation_service=translation_service,
                                                              classname="reverse")
Collaborator

this should be factored out

Author

I will move this part to the harness.

Comment on lines +161 to +163
if hasattr(self, 'reverse_translator'):
    if self.reverse_translator is not None:
        non_none_outputs = self.reverse_translator.translate_prompts(non_none_outputs)
Collaborator

Can we handle this in the harness rather than the detector? It's an orchestration issue

Author

OK. I try it.

Comment on lines 200 to 202
if hasattr(self, 'translator'):
    if self.translator is not None:
        self.substrings = self.translator.translate_prompts(self.substrings)
Collaborator

can we handle this in the harness?

Author

OK. I try it.

Comment on lines 21 to 23
if hasattr(self, 'translator'):
    if self.translator is not None:
        triggers = self.translator.translate_prompts(triggers)
Collaborator

this piece of code appears a lot. we should really avoid this, and use/add/inject a hook in the base class
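For illustration only, the suggested hook might take a shape like the following; the names are invented for the sketch and are not garak's API.

class TranslatableMixinSketch:
    """One inherited helper instead of repeating the hasattr/None checks everywhere."""

    translator = None  # set by whatever owns translator creation, e.g. the harness

    def _translate_if_configured(self, items):
        if self.translator is None:
            return items
        return self.translator.translate_prompts(items)

# call sites then reduce to a single line, e.g.:
# triggers = self._translate_if_configured(triggers)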

Author

Sorry, I don't understand how to implement this.
Could you share example code?

Collaborator

consider factoring up to harness

Author

OK.

Collaborator

consider factoring up to harness

Author

OK.

@@ -14,6 +14,7 @@
from garak.configurable import Configurable
from garak.generators.huggingface import HFCompatible
import garak.attempt
from garak.translator import SimpleTranslator, LocalTranslator, is_english
Collaborator

factoring this up to the harness keeps base class module-level imports light, which they must be

Author

SimpleTranslator and LocalTranslator do not need to be imported into garak/detectors/base.py.

I will remove the import.

Comment on lines +57 to +70
DeepL
~~~~~

.. code-block:: bash

export DEEPL_API_KEY=xxxx

NIM
~~~

.. code-block:: bash

export NIM_API_KEY=xxxx

Collaborator

this is non-standard - we should follow the same pattern as in garak.generators.base.Generator

Author

Sorry, please share example code showing how to set up the API_KEY value.
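As a rough illustration of the env-var pattern being referred to (class and attribute names are assumptions, not garak's actual implementation):

import os


class DeepLTranslatorSketch:
    """Sketch: the service declares the environment variable it needs and reads it on construction."""

    ENV_VAR = "DEEPL_API_KEY"

    def __init__(self):
        self.api_key = os.getenv(self.ENV_VAR)
        if self.api_key is None:
            raise ValueError(f"Set the {self.ENV_VAR} environment variable to use this service")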

Comment on lines +76 to +77
- translation_service: "nim" or "deepl", "local"
- lang_spec: "ja", "ja,fr" etc. (you can set multiple language codes)
Collaborator

where in the config does this go?

recommend something like:

run:
  language:
    translation:
      translation_service: nim
      api_key: xxx
      lang_from: en
      lang_to: ja,fr

Author

@SnowMasaya SnowMasaya Oct 31, 2024

The value is set via the generator_option_file.
It needs a generator JSON file:

{
  "lang_spec": "ja",
  "translation_service": "local",
  "local_model_name": "facebook/m2m100_418M",
  "local_tokenizer_name": "facebook/m2m100_418M"
}

I follow this advice

Collaborator

@jmartin-tech jmartin-tech left a comment

@SnowMasaya thank you, this is a significant benefit for the project.

Adding a number of my pending comments here; please be aware these are likely similar to many @leondz made, although possibly from different reasoning. Happy to continue iterating.

Comment on lines 65 to 82
if hasattr(config_root, 'plugins'):
    if hasattr(config_root.plugins, 'generators'):
        if "translation_service" in config_root.plugins.generators.keys():
            translation_service = config_root.plugins.generators["translation_service"]
            self.translator = _config.load_translator(translation_service=translation_service,
                                                       classname="")
if hasattr(self, 'substrings'):
    if hasattr(self, 'translator'):
        if self.translator is not None:
            self.substrings = self.translator.translate_prompts(self.substrings)

if hasattr(config_root, 'plugins'):
    if hasattr(config_root.plugins, 'generators'):
        if "translation_service" in config_root.plugins.generators.keys():
            translation_service = config_root.plugins.generators["translation_service"]
            self.reverse_translator = _config.load_translator(translation_service=translation_service,
                                                              classname="reverse")

Collaborator

As noted before, base.Detector should not access _config directly.

Additionally, the location where language configuration is being placed in the config file structure is inconsistent with its intended use: config_root.plugins.generators is where plugins of type Generator obtain injected properties for the target application under test, and it should not be mixed with language selection for the run.

Author

Should I follow the following code example?

Comment on lines +81 to +82
The translator config writes to a file and the path passed, with
`--generator_option_file` as JSON. An example
Collaborator

--generator_option_file is for the target generator under test and should not be used to pass run-specific data.

Author

I followed this advice:

#943 (comment)

But I think I misunderstood it...
Could you tell me the best way to set up the config value?

Comment on lines +147 to +153
if hasattr(config_root, 'plugins'):
    if hasattr(config_root.plugins, 'generators'):
        if "translation_service" in config_root.plugins.generators.keys():
            translation_service = config_root.plugins.generators["translation_service"]
            self.reverse_translator = _config.load_translator(translation_service=translation_service,
                                                              classname="reverse")

Collaborator

Detectors should not access _config directly. The detectors also should not load the translator at instantiation. The detector should be able to request a compatible translator from a service that owns the responsibility for handling translation.

Consider:

def detector_translator(self):
    from garak import translator

    translator.getInstance(self)

and in translator:

@staticmethod
def getInstance(klass):
    service = _translator_instance.get(klass, None)
    if service is None:
        service = load_translator(klass)
        _translator_instance[klass] = service
    return service

Author

Thanks. I try it.

Comment on lines +143 to +147
if hasattr(self, 'translator'):
    if self.translator is not None:
        challenges = self.translator.translate_prompts([challenge])
    else:
        challenges = [challenge]
Collaborator

Why does this need to create a list and then iterate inside the turn?

I suspect this is misplaced; translation in this probe should only need to happen for the initial challenge, with all generators used to perform red_team actions manipulating that prompt using a model in the target language.

Author

I got the same review comment.

I will recheck the code and consider where the translation function is best placed.

If you have a recommended way to implement it, please let me know.

Comment on lines +246 to +251
if hasattr(config_root, 'run'):
    if hasattr(config_root.run, 'translation_service'):
        translation_service = config_root.run.translation_service
        class_name = self.probename.split(".")[-2]
        self.translator = _config.load_translator(translation_service=translation_service,
                                                   classname=class_name)
Collaborator

Same comment as detectors, the probe plugins should not own translator configuration logic.

Author

OK.

Comment on lines +5 to +13
from garak.probes.encoding import BaseEncodingProbe, InjectBase16, InjectBase32, InjectAscii85, \
InjectHex, InjectQP, InjectUU, InjectMime, \
InjectROT13, InjectBase2048, InjectBraille, \
InjectMorse, InjectNato, InjectEcoji, InjectZalgo
from garak import _config, _plugins
from garak.translator import is_english
import pytest
import garak
import importlib
Collaborator

Limit imports; the encoding classes here are not needed, as they are enumerated via the PROBES constant.

Author

Sure. I will rewrite it.

@@ -12,6 +23,7 @@ def test_InjectBase64_len_cap():
assert len(p.prompts) < num_payloads * num_templates * num_encoders


@pytest.fixture(scope="function")
Collaborator

What is this doing? Tests should not be annotated as fixtures.

Author

Sorry, this test does not need it. I will remove this part.

@@ -20,6 +32,7 @@ def test_InjectBase64_prompt_trigger_match():
assert len(p.prompts) == len(p.triggers)


@pytest.fixture(scope="function")
Collaborator

What is this doing? Tests should not be annotated as fixtures.

Author

Sorry, this part is not needed. I will remove it.

Comment on lines +78 to +80
_config.plugins.generators["translation_service"] = 'local'
_config.plugins.generators["local_model_name"] = model_name
_config.plugins.generators["local_tokenizer_name"] = tokenizer_name
Collaborator

As noted, lang_spec configuration should not be in _config.plugins.generators. These values will come from config; however, they should not be keys under generators, since the plugins.<plugin_type> structure is meant to encapsulate shared configuration for each plugin of that type, in this case Generator.

Possible suggestion:

_config.run.lang_spec = "ja"
_config.run.translation_service = { "type": "local", "model_name", "model_config" }

Also, is there a reason for both local_model_name and local_tokenizer_name? They look to be equal in all supplied tests, which suggests there is no need to specify each independently. If we need to expand on the tokenizer configuration, it should likely go inside a model_config object that can supply parameters to a Configurable object.

Collaborator

Prefer to place this in a dedicated sub-key of _config.run, perhaps one related to multilingual config using translate (suggestion above #943 (comment))

Author

@SnowMasaya SnowMasaya Oct 31, 2024

I followed this advice and set the value via the generator_option_file.
I think I misunderstood it.

#943 (comment)

I will try to use _config.run to set each property.

Also is there a reason for local_model_name & local_tokenizer_name?

Most Hugging Face models use the same name for the tokenizer and the model.
I think model_config would be better.

Collaborator

Do not mix configuration and runtime service creation in the same class; the translator package can consume _config, but _config should not instantiate any object.

@leondz
Collaborator

leondz commented Oct 30, 2024

Adding a number of my pending comments here, please be aware these are likely similar to many @leondz made, although possibly from different reasoning. Happy to continue iterating.

Indeed, sorry for the duplicate review; there was quite a lot of code, so I reviewed directly instead of processing @jmartin-tech's comments as well.

add mean judge for reverse translation
change translation model size

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
translate trigger words

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
add reverse translation
remove trigger translation
fix test code

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
add translation instance

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
add library for reverse translation

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
@SnowMasaya
Author

SnowMasaya commented Oct 31, 2024

Thank you, this is a significant amount of work.

This is for translation-based multilingual, rather than general multilingual support, so we should be careful to separate these two functions. The ability to select which languages garak uses is distinct from how that language is achieved (in prompts, detectors, and so on).

There are some refactorings needed to get the PR to a sustainable state. We may need a couple more hooks, either injected into base classes or added there. It might be simplest to add these hooks to the harness. Happy to set up a call or instant message chat to discuss this. Given the breadth of the changes, it is likely to be beneficial to discuss plans and get good alignment while doing the rest of the changes.

Thank you for the review.
I would like to set up a call to understand the best way to implement this feature.
Please hold off on reviewing the following commits, because they do not yet apply your review feedback.
The following commits were made at my coworker's request.
