
Potentially undetected count overflow in do_count() #1325

Open
ampli opened this issue Jul 12, 2022 · 7 comments
@ampli
Member

ampli commented Jul 12, 2022

While running tests on "amy" with ASAN/UBSAN, I got this:

... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: 2147483647 * 5569944920 cannot be represented in type 'long int'
... amy/link-grammar/parse/count.c:1390:7: runtime error: signed integer overflow: -7518702831433443131 + -6485378441171344729 cannot be represented in type 'long int'

This just happens to never occur with the other languages and the various corpus-* files, but theoretically, it could happen due to the current code.

All the calculations are done in signed 64-bit (the storage is in signed 32-bit due to a recent change, but this has no implications here).
It is signed 64-bit due to historical reasons, but it can be a good idea to keep it signed because overflows could then be detected by the compiler's UBSAN checks.

The problem arises from the clipping value. It is INT_MAX, which is 2^31-1.
Observe the following code:

CACHE_COUNT(l_any, leftcount = count,
            do_count(ctxt, lw, w, le->next, d->left->next, lnull_cnt));
if (le->multi)
    CACHE_COUNT(l_cmulti, hist_accumv(&leftcount, d->cost, count),
                do_count(ctxt, lw, w, le, d->left->next, lnull_cnt));
if (d->left->multi)
    CACHE_COUNT(l_dmulti, hist_accumv(&leftcount, d->cost, count),
                do_count(ctxt, lw, w, le->next, d->left, lnull_cnt));
if (d->left->multi && le->multi)
    CACHE_COUNT(l_dcmulti, hist_accumv(&leftcount, d->cost, count),
                do_count(ctxt, lw, w, le, d->left, lnull_cnt));

Each of the 4 do_count() calls may return INT_MAX, and hence leftcount (and similarly rightcount) can be up to 4*(2^31-1) = 2^33-4.

Then we have this code:

CACHE_COUNT(l_bnr, hist_muladdv(&total, &leftcount, d->cost, count),

We may get here up to (total + (2^33-4) * (2^31-1)), which may be > 2^63.

And we also have this:

hist_muladd(&total, &leftcount, 0.0, &rightcount);

So we may get here (total + (2^33-4) * (2^33-4)), which is much more than 2^63 (and total here may already be near or above 2^63).

Possible solutions:

  1. Do nothing, as this has empirically proved not to be a problem with "real" dicts.
  2. Clamp leftcount and rightcount before using them in the multiplications.
  3. In parse_count_clamp(), clamp to 2^29-1.

(2) and (3) will add a slight overhead, but by analyzing the overflow possibilities I found some places in which the efficiency can be improved:

  1. In the macro CACHE_COUNT(), c is unnecessarily wide, and count_t can be used instead. However (I didn't check), maybe the compiler already does such an optimization.
  2. The parse_count_clamp() call that has the debug printout "OVERFLOW1" is not needed, since the loop can be performed no more than (2^8 * 2) times and accumulates no more than (2^31-1) per iteration (max. total ~2^40). The final parse_count_clamp() ("OVERFLOW2"), after a potential accumulation of up to an additional 2^31-1, is enough.

For now, I would choose option (2).

@linas,
Please check if my analysis is correct, and I will send a PR (if needed).

@ampli ampli added the bug label Jul 12, 2022
@linas
Member

linas commented Jul 12, 2022

I spent the last 15 minutes looking at this, but don't see a clear answer. I see this:

  • (a) The safest solution is to modify all of the hist_* defines to check for overflow. That has a performance impact.
  • (b) Comment out all uses of parse_count_clamp(), review each use of hist_* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp()
  • (c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.
  • (d) changing INT_MAX to 2^29-1, crossing one's fingers, and hoping that's enough.

I think your proposal (2) is my (a), and your proposal (3) is my (d).

  • (d) is the easiest.
  • (a) is second-easiest, but it leads to a muddy and hard-to-read parse/histogram.h. I don't like the muddle.
  • (b) is the best for performance, but also intellectually the hardest.

The question is really "do I feel like a perfectionist today?" and "Do I have the mental energy to do this?" Maybe I should try, and you can review my work?

@ampli
Member Author

ampli commented Jul 12, 2022

(b) Comment out all uses of parse_count_clamp(), review each use of hist_* and determine which ones need to be checked, and add parse_count_clamp() back in, as needed; remove all other uses of parse_count_clamp()
(c) Some muddled alternative to (b), involving reviewing each use of parse_count_clamp() to see if it's actually needed. This risks missing a location that should have been checked.

I already mostly analyzed all of that in my post above, and my conclusion was that the changes I proposed in (2) will fix the problem:

  • Remove the parse_count_clamp() call marked as "OVERFLOW1".
  • Add 2 parse_count_clamp() calls: Just before using leftcount and rightcount in a multiplication
    (just before the lines if (0 < hist_total(&leftcount)) and if (0 < hist_total(&rightcount))).
  • Leave the other 2 existing parse_count_clamp() intact.

I forgot to analyze the impact of line 1390 after applying my proposed fix.

CACHE_COUNT(l_bnr, hist_muladdv(&total, &leftcount, d->cost, count),

In that case, leftcount is clipped. total is already clipped from the existing call to parse_count_clamp() at the end of the match loop, and count is read from a clipped table entry (or do_count() result, which is clipped). So the max. total is (2^31-1) + ((2^31-1) * (2^31-1)) < 2^62.
Then the multiplication
hist_muladd(&total, &leftcount, 0.0, &rightcount);

can add to it at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63.
(It's very subtle; I hope I didn't miss something...)

(d) changing INT_MAX to 2^29-1 and crossing one's fingers and hoping that's enough.
It seems to me it wouldn't hurt to do that too.

In any case, I can send a PR for my proposal if it seems fine to you.
In case you choose to do it yourself, I can review it of course.

@linas
Member

linas commented Jul 13, 2022

Yes, send a pull req. At each location, add comments such as /* can add at most ((2^31-1) * (2^31-1)) < 2^62, so the result is still less than 2^63 */ ... because it is subtle, and hard to review.

@ampli
Member Author

ampli commented Jul 14, 2022

I said:

  1. The parse_count_clamp() call that has a debug printout "OVERFLOW1" is not needed since the loop can be performed no more than (2^8 * 2) times

The number (2^8 * 2) is incorrect, since the loop is over the disjuncts. But assuming fewer than 2^31 disjuncts per word, an overflow here is still impossible.
It doesn't seem reasonable that the program can ever handle billions of disjuncts per word, when a large part of them leads to a very high count number (overflow or near overflow). So at most, a sanity check can be done (elsewhere, when the disjuncts are counted for creating the disjunct memory block) that the number of disjuncts is less than 2^31.

@ampli
Member Author

ampli commented Jul 14, 2022

a sanity check can be done

I didn't implement that; it seems like just unneeded overhead.

@linas
Member

linas commented Jul 14, 2022

billions of disjuncts per word,

That won't happen. There won't even be a million.

However, in the earlier days, I had SQL dicts that had a hundred UNKNOWN-WORD entries, each with maybe 10K disjuncts. These caused combinatoric overflows and parsing timeouts, even with parse-timeout=600.

@ampli
Member Author

ampli commented Aug 12, 2022

There won't even be a million

In generation mode, we have ~3M disjuncts per word (in the middle of sentences). However, as we know, this causes severe slowness. The speed can be improved (a WIP), but the real solution is to implement the disjunct sampling method you have suggested. That, however, will be for a future version (i.e., not for 5.11.0).
