arekit-ss
[AREkit double "s"] -- an object-pair context sampler for datasources, powered by AREkit.
NOTE: For custom text sampling, please follow the ARElight project.
Install dependencies:
pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0
Download the AREkit-related data that the sources depend on:
python -m arekit.download_data
Example of composing prompts:
python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
--prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
--dest_lang en --docs_limit 1
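To make the placeholders of the --prompt template concrete, here is a minimal Python sketch that fills the same template with `str.format`. It only illustrates the substitution and is not arekit-ss's internal implementation; the helper name `fill_prompt` and the sample values are hypothetical.

```python
# Illustration only: how the placeholders of the --prompt template expand.
# The sample values below are made up and do not come from RuSentRel.
PROMPT = ("For text: '{text}', the attitude between '{s_val}' "
          "and '{t_val}' is: '{label_val}'")


def fill_prompt(template, text, s_val, t_val, label_val):
    """Substitute the four documented placeholders into the template."""
    return template.format(text=text, s_val=s_val, t_val=t_val,
                           label_val=label_val)


example = fill_prompt(PROMPT,
                      text="A criticized B over the new policy",
                      s_val="A", t_val="B", label_val="negative")
print(example)
```

Each sampled object pair would yield one such filled line in the output.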
Note (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops the context to a fixed, predefined number of words.
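One way to picture this effect is a simplified window model: if a translation puts more words between the two objects, the pair may no longer fit into the fixed window. The sketch below is an assumption made for illustration; the actual cropping logic of arekit-ss may differ, and `pair_fits`, the sentences, and the limit are hypothetical.

```python
# Illustration only: a simplified model of a fixed-size context window.
# The real cropping logic of arekit-ss may differ; all names are hypothetical.

def pair_fits(tokens, source, target, terms_per_context):
    """Return True if both objects fit in a window of terms_per_context words."""
    s_idx, t_idx = tokens.index(source), tokens.index(target)
    span = abs(t_idx - s_idx) + 1  # words from one object to the other, inclusive
    return span <= terms_per_context


# The same made-up statement in two wordings of different length,
# standing in for an original text and its translation.
shorter = "A sharply criticized B".split()
longer = "A has expressed very sharp criticism towards the actions of B".split()

LIMIT = 6  # hypothetical window size
print(pair_fits(shorter, "A", "B", LIMIT))
print(pair_fits(longer, "A", "B", LIMIT))
```

Under this model the shorter wording keeps the pair while the longer one loses it, which is why the number of extracted samples can change with the language.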
Parameters:
- source -- source name from the list of the supported sources.
    - terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
    - object-source-types -- filter by specific source object types.
    - object-target-types -- filter by specific target object types.
    - relation_types -- list of types, with items separated by the | character; all by default.
    - splits -- manual selection of the data splits to use for sampling; split names are separated by ':', for example: 'train:test'.
- sampler -- one of the supported samplers:
    - nn -- CNN/LSTM architecture related, including frames annotation from RuSentiFrames.
        - no-vectorize -- flag applicable only to nn; indicates that embeddings for features need not be generated.
    - bert -- BERT-based, single-input sequence.
    - prompt -- prompt-based sampler for LLM systems [prompt engineering guide].
        - prompt -- text of the prompt, which may include the following parameters:
            - {text} -- the original text of the sample
            - {s_val} and {t_val} -- values of the source and target of the pair, respectively
            - {label_val} -- value of the label
- writer -- the output format of samples (csv in the example above).
- mask_entities -- mask entity mode.
- Text translation parameters:
    - src_lang -- original language of the text.
    - dest_lang -- target language of the text.
- output_dir -- target directory for storing samples.
- Limiting the number of documents from the source:
    - docs_limit -- number of documents to consider for sampling from the whole source.
    - doc_ids -- list of the document IDs.
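For scripted runs, the command line can be assembled from a dictionary of parameters. The following is a minimal sketch, not part of arekit-ss: the helpers `build_command`, `join_splits`, and `join_relation_types` are hypothetical, only flags that appear in the example command above are used, and the relation type names are placeholders.

```python
# Sketch: assemble the argument list for `python -m arekit_ss.sample`.
# Only flags shown in the example command are used; helper names are hypothetical.
import sys


def join_splits(splits):
    """Join data splits with ':' as described for the splits parameter."""
    return ":".join(splits)


def join_relation_types(types):
    """Join relation types with '|' as described for relation_types."""
    return "|".join(types)


def build_command(options):
    """Turn {'name': value} into ['--name', 'value', ...] after the module call."""
    cmd = [sys.executable, "-m", "arekit_ss.sample"]
    for name, value in options.items():
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_command({
    "writer": "csv",
    "source": "rusentrel",
    "sampler": "prompt",
    "prompt": "For text: '{text}', the attitude between '{s_val}' "
              "and '{t_val}' is: '{label_val}'",
    "dest_lang": "en",
    "docs_limit": 1,
})
print(cmd[1:])
print(join_splits(["train", "test"]))
print(join_relation_types(["pos", "neg"]))  # placeholder type names
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the sampler; building the list first avoids shell quoting issues with the prompt text.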