Low-resource sampler of contexts with relations, for fact-checking and fine-tuning your LLMs, powered by AREkit

nicolay-r/arekit-ss

arekit-ss 0.24.0

arekit-ss [AREkit double "s"] is an object-pair context sampler for data sources, powered by AREkit.

NOTE: For custom text sampling, please follow the ARElight project.

Installation

Install dependencies:

pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0

Download the AREkit-related data that the sources require:

python -m arekit.download_data

Usage

Example of composing prompts:

python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1
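
The prompt template above uses four placeholders. As a rough illustration of how such a template expands for a single sample, here is a minimal Python sketch based on plain `str.format`. The helper name `fill_prompt` and the sample values are hypothetical, and arekit-ss may perform the substitution differently internally.

```python
# Prompt template copied from the command above.
PROMPT = "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'"


def fill_prompt(template, text, s_val, t_val, label_val):
    """Substitute the four documented placeholders into the template.

    Hypothetical helper for illustration; not part of the arekit-ss API.
    """
    return template.format(text=text, s_val=s_val, t_val=t_val, label_val=label_val)


# Made-up sample values, for illustration only.
example = fill_prompt(
    PROMPT,
    text="A criticized B at the summit",
    s_val="A",
    t_val="B",
    label_val="negative",
)
print(example)
```

Because the template itself is wrapped in double quotes on the command line, single quotes around each placeholder keep the filled-in values visually separated in the output.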

Note (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops each context to a fixed, predefined number of words.
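
To see why a fixed word limit interacts with translation, consider the simplified sketch below. It assumes that a pair is kept only when the number of terms between the two objects does not exceed the limit. The functions `terms_between` and `pair_fits`, the limit value, and both sentences are hypothetical; the actual cropping logic in AREkit may differ.

```python
def terms_between(s_index, t_index):
    """Number of terms strictly between the source and target positions."""
    return abs(s_index - t_index) - 1


def pair_fits(s_index, t_index, terms_per_context):
    """True when the gap between the two objects is within the limit.

    Simplified, assumed rule; not the AREkit implementation.
    """
    return terms_between(s_index, t_index) <= terms_per_context


# Two made-up sentences standing in for a text and a wordier translation of it.
short_terms = "A sharply criticized B".split()
long_terms = "A has expressed very sharp criticism towards B".split()

LIMIT = 3  # hypothetical terms_per_context value

short_ok = pair_fits(short_terms.index("A"), short_terms.index("B"), LIMIT)
long_ok = pair_fits(long_terms.index("A"), long_terms.index("B"), LIMIT)
print(short_ok, long_ok)
```

Under this assumed rule the same object pair is kept in the shorter wording and dropped in the longer one, which is the kind of difference in extracted data that the note describes.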

Parameters

  • source -- source name from the list of the supported sources.
    • terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
    • object-source-types -- filter by specific source object types.
    • object-target-types -- filter by specific target object types.
    • relation_types -- list of relation types, separated by the | character; all types by default.
    • splits -- manual selection of the data splits to use for sampling; split names are separated by the ':' sign, for example: 'train:test'.
  • sampler -- one of the supported samplers:
    • nn -- for CNN/LSTM architectures, including frame annotation from RuSentiFrames.
      • no-vectorize -- flag applicable only to nn; indicates that embeddings for features do not need to be generated.
    • bert -- BERT-based, single-input sequence.
    • prompt -- prompt-based sampler for LLM systems [prompt engineering guide]
      • prompt -- text of the prompt, which may include the following parameters:
        • {text} is the original text of the sample
        • {s_val} and {t_val} are the values of the source and target of the pair, respectively
        • {label_val} is the value of the label
  • writer -- the output format of samples:
    • csv -- for the AREnets framework.
    • jsonl -- for the OpenNRE framework.
    • sqlite -- SQLite-3.0 database.
  • mask_entities -- mask entity mode.
  • Text translation parameters:
    • src_lang -- original language of the text.
    • dest_lang -- target language of the text.
  • output_dir -- target directory for storing samples.
  • Limiting the number of documents from the source:
    • docs_limit -- number of documents from the whole source to consider for sampling.
    • doc_ids -- list of the document IDs.
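
Two of the parameters above take several values packed into one string: splits uses ':' as a separator and relation_types uses '|'. The following Python sketch shows one way such values could be split into lists. The function names are hypothetical, the relation type names in the usage line are made up, and arekit-ss may parse these arguments differently.

```python
def parse_splits(value):
    """Split a value such as 'train:test' into a list of split names."""
    return [item for item in value.split(":") if item]


def parse_relation_types(value=None):
    """Split a '|'-separated list of relation types.

    Returns None when no value is given; here None stands for
    "all types", mirroring the documented default.
    """
    if value is None:
        return None
    return [item for item in value.split("|") if item]


print(parse_splits("train:test"))
# Hypothetical relation type names, for illustration only.
print(parse_relation_types("TYPE_A|TYPE_B"))
```

When passing a '|'-separated value in a shell, quote it so that the shell does not interpret the character as a pipe.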


Powered by