We support three methods for customizing datasets.
- [Recommended] Use the command line argument directly to specify
--dataset xxx.json yyy.jsonl zzz.csv
, which is more convenient for supporting custom datasets. It supports five data formats (usingSmartPreprocessor
, supported dataset formats are listed below) and supportsdataset_id
anddataset_path
. No need to modify thedataset_info.json
file. This method is suitable for users who are new to ms-swift, while the following two methods are suitable for developers who want to extend ms-swift. - Adding datasets to
dataset_info.json
is more flexible but cumbersome compared to the first method, and supports using two preprocessors and specifying their parameters:RenameColumnsPreprocessor
,ConversationsPreprocessor
(default is to useSmartPreprocessor
). You can directly modify the built-indataset_info.json
in Swift, or pass in an external json file using--custom_dataset_info xxx.json
(for users who prefer pip install over git clone to expand datasets). - Registering datasets: More flexible but cumbersome compared to the first and second methods, it supports using functions to preprocess datasets. Methods 1 and 2 are implemented by leveraging method 3. You can directly modify the source code for expansion, or pass in a custom registration path using
--custom_register_path xxx.py
, where the script will parse the py file (for pip install users).
Supports directly passing in custom dataset_id
(compatible with MS and HF) and dataset_path
, as well as simultaneously passing in multiple custom datasets and their respective sample sizes. The script will automatically preprocess and concatenate the datasets. If a dataset_id
is passed in, it will default to using the 'default' subset in the dataset_id and set the split to 'train'. If the dataset_id has already been registered, it will use the subsets, split, and preprocessing functions that were passed in during registration. If a dataset_path
is passed in, it can be specified as a relative path or an absolute path, where the relative path is relative to the current running directory.
The specified format for each dataset is as follows: [HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]
. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path.
# Defaulting to using the dataset_id from modelscope, while also supporting the dataset_id from huggingface.
--dataset {dataset_id} {dataset_path} HF::{dataset_id}
# Dataset Mixing: the following command takes subset1 and subset2 from dataset_id and randomly samples 20,000 records. If `#{dataset_sample}` is not used, all samples from the dataset will be used.
--dataset {dataset_name}#20000 {dataset_id}:{subset1}/{subset2}#20000 {dataset_path}#10000
The supported file formats for the script include csv
, json
, and jsonl
. You need to ensure that the incoming file conforms to the following dataset formats (only a partial list is provided). All of these formats support the system
field (it is important to note that if the system
field is specified in the csv format, it cannot be set to None
and can only be specified as an empty string. There is no such restriction for the json and jsonl formats). Files in json
and jsonl
formats support multi-turn dialogue (csv
does not support this).
Format 1:
Pre-Training
response
11111
aaaaa
AAAAA
{"response": "11111"}
{"response": "aaaaa"}
{"response": "AAAAA"}
Single-Round Dialogue
system,query,response
00000,11111,22222
00001,aaaaa,bbbbb
00002,AAAAA,BBBBB
{"system": "00000", "query": "11111", "response": "22222"}
{"query": "aaaaa", "response": "bbbbb"}
{"system": "00001", "query": "AAAAA", "response": "BBBBB"}
Multi-Round Dialogue
{"system": "00000", "query": "55555", "response": "66666"}
{"query": "eeeee", "response": "fffff", "history": []}
{"system": "00001", "query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}
[{"system": "00000", "query": "55555", "response": "66666"},
{"query": "eeeee", "response": "fffff", "history": []},
{"system": "00001", "query": "EEEEE", "response": "FFFFF", "history": [["query1", "response1"], ["query2", "response2"]]}]
Format 2:
{"conversations": [{"from": "system", "value": "00000"}, {"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}]}
{"conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "user", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]}
{"conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "user", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]}
Format 3:
{"messages": [{"role": "system", "content": "00000"}, {"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}]}
{"messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "user", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]}
{"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
Format 4:
{"system": "00000", "conversation": [{"human": "11111", "assistant": "22222"}]}
{"conversation": [{"human": "aaaaa", "assistant": "bbbbb"}]}
{"system": "00001", "conversation": [{"human": "AAAAA", "assistant": "BBBBB"}, {"human": "CCCCC", "assistant": "DDDDD"}, {"human": "EEEEE", "assistant": "FFFFF"}]}
Format 5:
system,instruction,input,output
00000,11111,22222,33333
00001,aaaaa,bbbbb,ccccc
00002,AAAAA,BBBBB,CCCCC
Extra pre-training format:
{"text": "11111"}
{"text": "aaaaa"}
{"text": "AAAAA"}
Human preference alignment:
Language model (DPO/ORPO/SimPO/CPO)
{"system": "123", "query": "11111", "response": "22222", "rejected_response": "33333", "history": [["query1", "response1"], ["query2", "response2"]]}
{"system": "123", "query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "history": [["query1", "response1"], ["query2", "response2"]]}
{"system": "123", "query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "history": [["query1", "response1"], ["query2", "response2"]]}
- system and history are optional.
Language model (KTO)
{"query": "11111", "response": "22222", "label": true}
{"query": "aaaaa", "response": "bbbbb", "label": false}
{"system": "123", "query": "AAAAA", "response": "BBBBB", "label": true, "history": [["query1", "response1"], ["query2", "response2"]]}
Note: label
needs to be of type bool, not str.
- system and history are optional.
Vision MLLM (DPO/ORPO/SimPO/CPO)
{"system": "123", "query": "11111", "response": "22222", "rejected_response": "33333", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
{"system": "123", "query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
{"system": "123", "query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC", "images": ["image_path"], "history": [["query1", "response1"], ["query2", "response2"]]}
-
different models have varying support for the number of images. Please refer to the corresponding best practices document for each model.
-
system and history are optional.
Tool-Calling Agent
Format 1
{"tools":"{API_LIST}","conversations": [{"from": "system", "value": "00000"}, {"from": "user", "value": "11111"}, {"from": "assistant", "value": "22222"}]}
{"tools":"{API_LIST}","conversations": [{"from": "user", "value": "aaaaa"}, {"from": "assistant", "value": "bbbbb"}, {"from": "tool", "value": "ccccc"}, {"from": "assistant", "value": "ddddd"}]}
{"tools":"{API_LIST}","conversations": [{"from": "user", "value": "AAAAA"}, {"from": "assistant", "value": "BBBBB"}, {"from": "tool", "value": "CCCCC"}, {"from": "assistant", "value": "DDDDD"}]}
Format 2
{"tools":"{API_LIST}","messages": [{"role": "system", "content": "00000"}, {"role": "user", "content": "11111"}, {"role": "assistant", "content": "22222"}]}
{"tools":"{API_LIST}","messages": [{"role": "user", "content": "aaaaa"}, {"role": "assistant", "content": "bbbbb"}, {"role": "tool", "content": "ccccc"}, {"role": "assistant", "content": "ddddd"}]}
{"tools":"{API_LIST}","messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "tool", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
For the tools format, please refer to Agent-Deoloyment Document You can choose the corresponding prompt by setting --tools_prompt
.
The tool
field represents the return result of the tool calling.
You can refer to the builtin dataset_info.json in Swift to expand datasets. You can directly add it in the built-in dataset_info.json, or you can pass in the path to an external dataset_info.json, a JSON string, or a dictionary using --custom_dataset_info 1.json
.
Adding dataset_id:
# MS
# Usage: `--dataset <dataset_name>`
"<dataset_name>": {
"dataset_id": "xxx/xxx"
}
# HF
# Usage: `--dataset HF::<dataset_name>` or directly use the `USE_HF` environment variable.
"<dataset_name>": {
"hf_dataset_id": "xxx/xxx"
}
Adding dataset_path:
# You can specify relative and absolute paths. Relative paths are relative to the directory where dataset_info.json is located.
# Usage: `--dataset <dataset_name>`
"<dataset_name>": {
"dataset_path": "xxx"
}
Supported parameters include:
- dataset_id: The corresponding ModelScope dataset_id, default is
None
. The simplest setup requires specifying one ofdataset_id
,hf_dataset_id
, ordataset_path
. - subsets: A list of names of the subsets, default is
[]
, which means using the 'default' subset. - split: Default is ['train'], usually not necessary to set.
- hf_dataset_id: The corresponding HuggingFace dataset_id, default is
None
. - dataset_path: Used to specify the local path of the dataset, e.g. 1.jsonl, default is
None
. It can take relative or absolute paths. If using a relative path, it is relative to the directory where the dataset_info.json is located. If dataset_path is set, then dataset_id, subsets, and hf_dataset_id parameters are ignored. - columns: The default preprocessor used is
SmartPreprocessor
. Specifying this parameter sets it toRenameColumnsPreprocessor
. You need to rename the columns in the dataset and convert them to the style of format 1 mentioned above. - conversations: Specifying this parameter sets the preprocessor to
ConversationsPreprocessor
('columns' takes priority over 'conversations'). - remove_useless_columns: Specifies whether to remove unnecessary columns (including: 'query', 'response', 'rejected_response', 'system', 'history', 'images'), default is
True
, usually not necessary to set. - tags: Used to annotate the dataset, default is
[]
, usually not necessary to set.
If the parameters in dataset_info.json
are not sufficient for your needs, such as adding custom prompts, requiring advanced dataset cleaning, or complex dataset retrieval and preprocessing, you can use the method of registering datasets using functions for data retrieval and preprocessing.
The following is an example of registering datasets. The complete py file can be viewed at custom.py, and the sh script can be viewed at custom. You can parse the registered content by specifying --custom_register_path xxx.py
.
from typing import Optional, Tuple
from datasets import Dataset as HfDataset
from modelscope import MsDataset
from swift.llm import get_dataset, register_dataset, get_dataset_from_repo
from swift.utils import get_logger
logger = get_logger()
class CustomDatasetName:
stsb_en = 'stsb-en'
def _preprocess_stsb(dataset: HfDataset) -> HfDataset:
prompt = """Task: Based on the given two sentences, provide a similarity score between 0.0 and 5.0.
Sentence 1: {text1}
Sentence 2: {text2}
Similarity score: """
query = []
response = []
for d in dataset:
query.append(prompt.format(text1=d['text1'], text2=d['text2']))
response.append(f"{d['label']:.1f}")
return HfDataset.from_dict({'query': query, 'response': response})
register_dataset(CustomDatasetName.stsb_en, 'swift/stsb', None, _preprocess_stsb, get_dataset_from_repo)
if __name__ == '__main__':
# test dataset
train_dataset, val_dataset = get_dataset([CustomDatasetName.stsb_en],
check_dataset_strategy='warning')
print(f'train_dataset: {train_dataset}')
print(f'val_dataset: {val_dataset}')
The register_dataset
function will register the dataset in the DATASET_MAPPING
. The parameters of this function are as follows:
-
dataset_name
: Required, representing the name of the dataset, which is also the unique ID of the dataset. -
dataset_id_or_path
: Required, representing thedataset_id
on the ModelScope Hub or the localdataset_dir
. -
subsets
: List of subsets of the dataset, default is[]
. -
split
: Default is ['train']. -
preprocess_func
: Preprocessing function. -
get_function
: Default value isNone
. The function to get the dataset. If passedNone
, the decorator approach will be used to register the dataset. If passed a function, the normal approach will be used to register.get_function
should returnHfDataset
orTuple[HfDataset, Optional[HfDataset]]
. If only one dataset is returned, it will be the train_dataset. If two datasets are returned, they will be the train_dataset and val_dataset, respectively. Theget_dataset
function supports obtaining multiple datasets, for example:get_dataset(['dataset1', 'dataset2'])
. We will concatenate the training and validation parts of each subset and return the merged train_dataset and val_dataset.The
HfDataset
returned by the function needs to follow certain specifications. If you want to do pre-training, you only need to include theresponse
field, please refer to the'tigerbot-law-zh'
dataset for details. For instruction tuning (single-round dialogue), thequery
andresponse
fields need to be included, representing the user's query and the AI assistant's answer in instruction tuning respectively, please refer to the'alpaca-zh'
dataset for details. For multi-round dialogue, an additionalhistory
field needs to be added, representing the historical information of the dialogue, please refer to the'damo-agent-mini-zh'
dataset for details. If each dataset sample has a differentsystem
, an additional system field needs to be added, you can also refer to the'damo-agent-mini-zh'
dataset for details. -
**kwargs
: Other parameters used to annotate the dataset. This parameter generally does not need to be set.
The following is an example of custom models. The complete py file can be viewed at custom.py, and the sh script can be viewed at custom. You can parse the registered content by specifying --custom_register_path xxx.py
.
from typing import Any, Dict
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.utils.versions import require_version
from swift.llm import LoRATM, TemplateType, get_model_tokenizer, register_model
from swift.utils import get_logger
logger = get_logger()
class CustomModelType:
tigerbot_7b = 'tigerbot-7b'
tigerbot_13b = 'tigerbot-13b'
tigerbot_13b_chat = 'tigerbot-13b-chat'
class CustomTemplateType:
tigerbot = 'tigerbot'
@register_model(CustomModelType.tigerbot_7b,
'TigerResearch/tigerbot-7b-base-v3', LoRATM.llama,
TemplateType.default_generation)
@register_model(CustomModelType.tigerbot_13b,
'TigerResearch/tigerbot-13b-base-v2', LoRATM.llama,
TemplateType.default_generation)
@register_model(CustomModelType.tigerbot_13b_chat,
'TigerResearch/tigerbot-13b-chat-v4', LoRATM.llama,
CustomTemplateType.tigerbot)
def get_tigerbot_model_tokenizer(model_dir: str,
torch_dtype: torch.dtype,
model_kwargs: Dict[str, Any],
load_model: bool = True,
**kwargs):
use_flash_attn = kwargs.pop('use_flash_attn', False)
if use_flash_attn:
require_version('transformers>=4.34')
logger.info('Setting use_flash_attention_2: True')
model_kwargs['use_flash_attention_2'] = True
model_config = AutoConfig.from_pretrained(
model_dir, trust_remote_code=True)
model_config.pretraining_tp = 1
model_config.torch_dtype = torch_dtype
logger.info(f'model_config: {model_config}')
tokenizer = AutoTokenizer.from_pretrained(
model_dir, trust_remote_code=True)
model = None
if load_model:
model = AutoModelForCausalLM.from_pretrained(
model_dir,
config=model_config,
torch_dtype=torch_dtype,
trust_remote_code=True,
**model_kwargs)
return model, tokenizer
if __name__ == '__main__':
# test model base
model, tokenizer = get_model_tokenizer(
CustomModelType.tigerbot_7b, use_flash_attn=False)
print(model.__class__.__name__)
# test model chat
model, tokenizer = get_model_tokenizer(
CustomModelType.tigerbot_13b_chat, use_flash_attn=False)
print(model.__class__.__name__)
register_model
will register the model in MODEL_MAPPING
. The meaning of the parameters of this function are as follows:
model_type
: Required field. Represents the name of the model, and is also the unique id.model_id_or_path
: Required field. Represents themodel_id
of the model in ModelScope Hub, or the local model directorymodel_dir
.lora_target_modules
: Default isNone
. Represents the default lora_target_modules to use when--lora_target_modules DEFAULT
or--lora_target_modules AUTO
is specified in the sh script, or when--lora_target_modules
is not specified.template
: Default isTemplateType.default
. Represents the default dialogue template to use when--template_type AUTO
is specified in the sh script, or when--template_type
is not specified.get_function
: Default value isNone
. The function to get model and tokenizer. If passedNone
, the decorator approach will be used to register the model. If passed a function, the normal approach will be used to register.requires
: Default is[]
. Represents the dependencies required by the model that differ from other models. This parameter generally does not need to be set.torch_dtype
: Default isNone
. Represents the recommended torch_dtype for the model to use. This parameter generally does not need to be set.revision
: Default isNone
. Used to specify the version number of the model. Ifmodel_id_or_path
is a local model directory, this parameter is not effective. This parameter generally does not need to be set.ignore_file_pattern
: Default isNone
. Represents the regular pattern of file names to be ignored when downloading, this parameter will be passed tosnapshot_download
. For example,r'.+\.bin$'
,r'.+\.savetensors$'
, etc. This parameter generally does not need to be set.**kwargs
: Other parameters used to annotate model capabilities. This parameter generally does not need to be set.
The following is an example of custom models. The complete py file can be viewed at custom.py, and the sh script can be viewed at custom.
from swift.llm import (Template, ModelType, dataset_map,
get_model_tokenizer, get_template, get_dataset,
print_example, register_template, DatasetName)
from swift.utils import get_logger
logger = get_logger()
class CustomTemplateType:
tigerbot = 'tigerbot'
# Ref: https://github.com/TigerResearch/TigerBot/blob/main/infer.py
register_template(
CustomTemplateType.tigerbot,
Template(['{{SYSTEM}}'], ['\n\n### Instruction:\n{{QUERY}}\n\n### Response:\n'], [],
[['eos_token_id']]))
if __name__ == '__main__':
# test template
train_dataset, _ = get_dataset(DatasetName.blossom_math_zh)
_, tokenizer = get_model_tokenizer(ModelType.qwen_7b_chat, load_model=False)
template = get_template(CustomTemplateType.tigerbot, tokenizer)
train_dataset = dataset_map(train_dataset, template.encode)
print_example(train_dataset[0], tokenizer)
register_template
will register the dialogue template in TEMPLATE_MAPPING
. The meaning of the parameters of this function are as follows:
template_type
: Required field, represents the name of the dialogue template, and is also the unique id of the template.template
: Required field, needs to pass in aTemplate
. To initializeTemplate
, the following parameters need to be passed in:prefix
,prompt
,chat_sep
,suffix
,default_system
.
The template initialization function will obtain the complete chat template based on these four contents. The meaning of these four configuration contents are as follows.
prefix
: Represents the prefix part of the dialogue template, generally including system part, prefix tokens, bos tokens, etc. We use{{SYSTEM}}
as the placeholder for the system. If{{SYSTEM}}
does not exist in the prefix, then this Template does not support system, e.g.damo-agent-mini-zh
dataset.prompt
: Represents a round of dialogue in the dialogue template. We use{{QUERY}}
as the placeholder for the human query part in each round of dialogue,{{ROUND0}}
represents the placeholder for which round of dialogue this is, starting from 0, and{{ROUND1}}
starts from 1. The AI assistant's reply part will be concatenated afterprompt
, so we have not designed a placeholder for it. We will only calculate the loss for the AI assistant's reply part.chat_sep
: If multi-round dialogue is needed,chat_sep
will be used as the separator between each round of dialogue, such as: newline, etc. If set to None, then this Template does not support multi-round dialogue.suffix
: Used as the suffix part of the dialogue template, generally eos token. Will be concatenated after the last round of dialogue.default_system
: The default system.