From 7c796bdb57e8577369655239a7a24824496c375b Mon Sep 17 00:00:00 2001
From: Martin Perier-Dulhoste
Date: Tue, 14 Sep 2021 11:47:39 +0200
Subject: [PATCH] docs(data_utils): fix docstring of get_batch_indices

---
 biotransformers/lightning_utils/data.py | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/biotransformers/lightning_utils/data.py b/biotransformers/lightning_utils/data.py
index 3cb808f..79235d3 100755
--- a/biotransformers/lightning_utils/data.py
+++ b/biotransformers/lightning_utils/data.py
@@ -193,24 +193,23 @@ def get_batch_indices(
     but rather constant number of tokens. Some the batch can contain a few long
     sequences or multiple small ones.
-
     This sampler returns batches of indices to achieve this property. It also decides
     if sequences must be cropped and return the desired length. The cropping length is
     sampled randomly for each sequence at each epoch in the range of crop_sizes values.
 
-    THis sampler computes a list of list of tuple which contains indices and
-    lengths of sequences inside the batch.
+    This sampler computes a list of list of tuple which contains indices and
+    lengths of sequences inside the batch.
+
     Example:
         returning [[(1, 100), (3, 600)],[(4, 100), (7, 1200), (10, 600)], [(12, 1000)]]
-        means that the first batch will be composed of sequence at index 1 and 8 with
-        lengths 100 and 600. The third batch contains only sequence 12 with a length
+        means that the first batch will be composed of sequence at index 1 and 3 with
+        lengths 100 and 600. The third batch contains only sequence 12 with a length
         of 1000.
 
     Args:
         sequence_strs: list of string
-        toks_per_batch (int): Maximum number of token per batch
-        extra_toks_per_seq (int, optional): . Defaults to 0.
-        crop_sizes (Tuple[int, int]): min and max sequence lengths when cropping
+        toks_per_batch: maximum number of token per batch
+        crop_sizes: min and max sequence lengths when cropping
 
     Returns:
         List: List of batches indexes and lengths