
Potentially undetected count overflow in do_count() #1325

Open
ampli opened this issue Jul 12, 2022 · 7 comments
@ampli
Member

ampli commented Jul 12, 2022

While running tests on "amy" with ASAN/UBSAN, I got this:

... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: 2147483647 * 5569944920 cannot be represented in type 'long int'
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: -7518702831433443131 + -6485378441171344729 cannot be represented in type 'long int'

This just happens to never occur with the other languages and the various corpus-* files, but theoretically, it could happen due to the current code.

All the calculations are done in signed 64-bit (the storage is in signed 32-bit due to a recent change, but this has no implications here).
It is signed 64-bit due to historical reasons, but it can be a good idea to keep it signed because overflows could then be detected by the compiler's UBSAN checks.

The problem arises from the clipping value. It is INT_MAX, which is 2^31-1.
Observe the following code:

CACHE_COUNT(l_any, leftcount = count,
            do_count(ctxt, lw, w, le->next, d->left->next, lnull_cnt));
if (le->multi)
    CACHE_COUNT(l_cmulti, hist_accumv(&leftcount, d->cost, count),
                do_count(ctxt, lw, w, le, d->left->next, lnull_cnt));
if (d->left->multi)
    CACHE_COUNT(l_dmulti, hist_accumv(&leftcount, d->cost, count),
                do_count(ctxt, lw, w, le->next, d->left, lnull_cnt));
if (d->left->multi && le->multi)
    CACHE_COUNT(l_dcmulti, hist_accumv(&leftcount, d->cost, count),
                do_count(ctxt, lw, w, le, d->left, lnull_cnt));

Each of the 4 do_count() calls may return INT_MAX, and hence leftcount (and similarly rightcount) can be up to 4*(2^31-1) = 2^33-4.

Then we have this code:

CACHE_COUNT(l_bnr, hist_muladdv(&total, &leftcount, d->cost, count),

We may get here up to (total + (2^33-4) * (2^31-1)), which may be > 2^63.

And we also have this:

hist_muladd(&total, &leftcount, 0.0, &rightcount);

So we may get here (total + (2^33-4) * (2^33-4)), which is much more than 2^63 (and total here may already be near or above 2^63).

Possible solutions:

  1. Do nothing, as this has empirically proved not to be a problem with "real" dicts.
  2. Clamp leftcount and rightcount before using them in the multiplications.
  3. In parse_count_clamp(), clamp to 2^29-1.

(2) and (3) will add a slight overhead, but by analyzing the overflow possibilities I found some places in which the efficiency can be improved:

  1. In the macro CACHE_COUNT(), c is unnecessarily wide, and count_t can be used instead. However (I didn't check), maybe the compiler already does such an optimization.
  2. The parse_count_clamp() call that has the debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than (2^8 * 2) times and accumulates no more than (2^31-1) per iteration (max. total ~2^40). The final parse_count_clamp() ("OVERFLOW2"), after a potential accumulation of up to an additional 2^31-1, is enough.

For now, I would choose option (2).

@linas,
Please check if my analysis is correct, and I will send a PR (if needed).

@ampli ampli added the bug label Jul 12, 2022
@linas
Member

linas commented Jul 12, 2022

I spent the last 15 minutes looking at this, but don't see a clear answer. I see this:

  • (a) The safest solution is to modify all of the hist_* defines to check for overflow. That has a performance impact.
  • (b) Comment out all uses of parse_count_clamp(), review each use of hist_* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp()
  • (c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.
  • (d) changing INT_MAX to 2^29-1, crossing one's fingers, and hoping that's enough.

I think your proposal (2) is my (a), and your proposal (3) is my (d).

  • (d) is the easiest.
  • (a) is second-easiest, but it leads to a muddy and hard-to-read parse/histogram.h. I don't like the muddle.
  • (b) is the best for performance, but also intellectually the hardest.

The question is really "do I feel like a perfectionist today?" and "Do I have the mental energy to do this?" Maybe I should try, and you can review my work?

@ampli
Member Author

ampli commented Jul 12, 2022

(b) Comment out all uses of parse_count_clamp(), review each use of hist_* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp()
(c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.

I already mostly analyzed all of that in my post above, and my conclusion was that the changes I proposed in (2) will fix the problem:

  • Remove the parse_count_clamp() call marked as "OVERFLOW1".
  • Add 2 parse_count_clamp() calls: Just before using leftcount and rightcount in a multiplication
    (just before the lines if (0 < hist_total(&leftcount)) and if (0 < hist_total(&rightcount))).
  • Leave the other 2 existing parse_count_clamp() intact.

I forgot to analyze the impact of line 1390 after applying my proposed fix.

CACHE_COUNT(l_bnr, hist_muladdv(&total, &leftcount, d->cost, count),

In that case, leftcount is clipped. total is already clipped from the existing call to parse_count_clamp() at the end of the match loop, and count is read from a clipped table entry (or do_count() result, which is clipped). So the max. total is (2^31-1) + ((2^31-1) * (2^31-1)) < 2^62.
Then the multiplication
hist_muladd(&total, &leftcount, 0.0, &rightcount);

can add to it at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63.
(It's very subtle; I hope I didn't miss something...)

(d) changing INT_MAX to 2^29-1 and crossing one's fingers and hoping that's enough.
It seems to me it wouldn't hurt to do that too.

In any case, I can send a PR for my proposal if it seems fine to you.
In case you choose to do it yourself, I can review it of course.

@linas
Member

linas commented Jul 13, 2022

Yes, send a pull req. At each location, add comments such as /* can add at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63 */ ... because it is subtle, and hard to review.

@ampli
Member Author

ampli commented Jul 14, 2022

I said:

  1. The parse_count_clamp() call that has a debug printout "OVERFLOW1" is not needed since the loop can be performed no more than (2^8 * 2) times

The number (2^8 * 2) is incorrect, since the loop is over the disjuncts. But assuming fewer than 2^31 disjuncts per word, an overflow here is still impossible.
It doesn't seem reasonable that the program can ever handle billions of disjuncts per word, when a large part of them leads to a very high count number (overflow or near overflow). So at most, a sanity check can be done (elsewhere, when the disjuncts are counted for creating the disjunct memory block) that the number of disjuncts is less than 2^31.

@ampli
Member Author

ampli commented Jul 14, 2022

a sanity check can be done

I didn't implement that; it seems like just unneeded overhead.

@linas
Member

linas commented Jul 14, 2022

billions of disjuncts per word,

That won't happen. There won't even be a million.

However, in the earlier days, I had SQL dicts that had a hundred UNKNOWN-WORD entries, each with maybe 10K disjuncts. These caused combinatoric overflows and parsing timeouts, even with parse-timeout=600.

@ampli
Member Author

ampli commented Aug 12, 2022

There won't even be a million

In generation mode, we have ~3M disjuncts per word (in the middle of sentences). However, as we know, this causes severe slowness. The speed can be improved (a WIP), but the real solution is to implement the disjunct sampling method you have suggested. That, however, will be for a future version (i.e., not for 5.11.0).
