
Truncating should affect only the train set #166

Open
JohnGiorgi opened this issue Aug 6, 2019 · 0 comments
JohnGiorgi commented Aug 6, 2019

When batching data, Saber truncates / right-pads each sequence to a length of saber.constants.MAX_SENT_LEN.

Truncating sequences should happen only on the train set, ensuring that we don't drop tokens from examples in the evaluation partitions (dataset_folder/valid.* and dataset_folder/test.*).
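
A minimal sketch of the intended behaviour, using Keras' `pad_sequences` (the function and argument names below are illustrative, not Saber's actual API):

```python
from keras.preprocessing.sequence import pad_sequences

def pad_and_truncate(train_seqs, eval_seqs, max_sent_len):
    """Right-pad every split, but truncate only the training split.

    `train_seqs` and `eval_seqs` are lists of token-index sequences;
    names are hypothetical, not Saber's actual API.
    """
    # Train: pad *and* truncate to the configured maximum length.
    train = pad_sequences(
        train_seqs, maxlen=max_sent_len, padding="post", truncating="post"
    )
    # Valid/test: pad to the longest sequence in the split so nothing is cut off.
    eval_maxlen = max(len(seq) for seq in eval_seqs)
    evaluation = pad_sequences(eval_seqs, maxlen=eval_maxlen, padding="post")
    return train, evaluation
```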

Furthermore, a user should be able to specify a percentile (e.g. 0.99), which would set the max sequence length to the smallest length that truncates only 1% of all training examples. This would be a principled way to choose the value, and it could substantially reduce training time when a handful of very long sentences would otherwise dominate the padded length.
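
A sketch of how the percentile could be turned into a max length (assuming tokenized training sentences are available as a list; the names are hypothetical):

```python
import numpy as np

def max_len_from_percentile(train_seqs, percentile=0.99):
    """Return the sequence length at the given percentile of training lengths.

    With percentile=0.99, roughly 1% of training sentences would be truncated.
    `train_seqs` is a list of tokenized training sentences (hypothetical name).
    """
    lengths = [len(sentence) for sentence in train_seqs]
    return int(np.percentile(lengths, percentile * 100))

# e.g. max_len_from_percentile(train_seqs, percentile=0.99) could replace a
# hard-coded saber.constants.MAX_SENT_LEN.
```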

JohnGiorgi added the invalid label on Aug 6, 2019
JohnGiorgi self-assigned this on Aug 6, 2019