Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behaviour with nested patterns #22

Open
BabeleDunnit opened this issue Mar 3, 2017 · 21 comments
Open

Strange behaviour with nested patterns #22

BabeleDunnit opened this issue Mar 3, 2017 · 21 comments

Comments

@BabeleDunnit
Copy link

BabeleDunnit commented Mar 3, 2017

Hi Mischa, me again :)

While writing some tests, I found a strange behaviour and I was able to reproduce it in a very tight way.

I will attach directly the source hereunder, just copy and compile it.

Briefly, there is a "#define STRANGE_BEHAVIOUR" in the code which controls the creation of AC with a slightly different set of patterns. While one works as expected, the other seems to act a bit strange IMHO. Also, there are some comments. It seems that the sub-matching "ana" pattern breaks something somewhere. Have a look.

Just comment "#define STRANGE_BEHAVIOUR" to get an AC behaving as expected.

Maybe this is related to ticket #20?

Thank you again for your excellent work.

Later, Aaron



#include "msutil.h"
#include <assert.h>
#include "acism.h"

static int actual = 0;
static int
on_match(int strnum, int textpos, MEMREF const *pattv)
{
    (void)strnum, (void)textpos, (void)pattv;
    ++actual;
    printf("%9d %7d '%.*s'\n", textpos, strnum, (int)pattv[strnum].len, pattv[strnum].ptr);
    return 0;
}

int
main(int argc, char **argv)
{

ACISM *psp = NULL;

#define STRANGE_BEHAVIOUR
#ifdef STRANGE_BEHAVIOUR

// this is strange.
// run on string "bananas", will find 1 "ana", 2 "na", 1 "banana" and NO "nana".
// I would expect 1 "banana", 1 "nana", 2 "ana" and 2 "na".
// is this a bug or it's just me which do not know Aho-Corasick well enough?

    MEMREF patterns[4] = {
        { "banana", strlen("banana")},
        { "nana", strlen("nana")},
        { "ana", strlen("ana")},
        { "na", strlen("na")},
    };

    psp = acism_create(patterns, 4);
#else

// this works as expected.
// run on string "bananas", will find 2 "na", 1 "banana" and 1 "nana".

    MEMREF patterns[3] = {
        { "banana", strlen("banana")},
        { "nana", strlen("nana")},
        { "na", strlen("na")},
    };

    psp = acism_create(patterns, 3);
#endif

    assert(psp);

    acism_dump(psp, PS_STATS, stderr, patterns);

    const char* t = "bananas";

    MEMREF text = {
        t, strlen(t)
    };

    int state = 0;
    while(acism_more(psp, text, (ACISM_ACTION*)on_match, patterns, &state));

    acism_destroy(psp);

    return 0;
}


@mischasan
Copy link
Owner

mischasan commented Mar 3, 2017 via email

@mischasan
Copy link
Owner

prune_backlinks now removed completely, and it ain't coming back.

@BabeleDunnit
Copy link
Author

Hi Mischa, yes, I did see the "prune_backlinks" stuff and did try it out, but unfortunately does not fix this problem.. :(

later,
Aaron

@mischasan
Copy link
Owner

mischasan commented Mar 6, 2017 via email

@mischasan
Copy link
Owner

Okay, thank you so much for that really simple test case.
The fix was to acism_create. As soon as I have fixed my certificates, I will upload the unit test babele_t.

@BabeleDunnit
Copy link
Author

HI Mischa,

if you are using the command-line git, you could just re-clone the repo using HTTPS instead of SSL, do the fix, commit and and push. Git will ask you just your github login/pwd, no need of SSL certificates at all.

If you are instead using some GUI please forget my suggestion :)

@mischasan
Copy link
Owner

mischasan commented Mar 17, 2017 via email

@mischasan
Copy link
Owner

mischasan commented Mar 17, 2017 via email

@BabeleDunnit
Copy link
Author

Hi Mischa,

Some good and bad news... :)
Please don't send me ninjas.

I confirm that the fix hereabove works for the test case, but now something else has broken... and this is really quite strange:

with patterns

MEMREF patterns[2] = {
    {"/2.", strlen("/2.")},
    {"Firefox/2.0", strlen("Firefox/2.0")},
};

and string

const char* t = "***/2.----Firefox/2.0";

you are not going to catch the second "/2."... and this is strange because all other tests on "banana" submatching are passing.

here comes the code:


#include "msutil.h"
#include <assert.h>
#include "acism.h"

static int actual = 0;
static int
on_match(int strnum, int textpos, MEMREF const *pattv)
{
    (void)strnum, (void)textpos, (void)pattv;
    ++actual;
    printf("%9d %7d '%.*s'\n", textpos, strnum, (int)pattv[strnum].len, pattv[strnum].ptr);
    return 0;
}

int
main(int argc, char **argv)
{

    ACISM *psp = NULL;

    MEMREF patterns[2] = {
        {"/2.", strlen("/2.")},
        {"Firefox/2.0", strlen("Firefox/2.0")},
    };

    psp = acism_create(patterns, 2);

    assert(psp);

    acism_dump(psp, PS_STATS, stderr, patterns);

    const char* t = "***/2.----Firefox/2.0";

    MEMREF text = {
        t, strlen(t)
    };

    int state = 0;
    while(acism_more(psp, text, (ACISM_ACTION*)on_match, patterns, &state));

    acism_destroy(psp);

    return 0;
}


@mischasan
Copy link
Owner

mischasan commented Mar 20, 2017 via email

@BabeleDunnit
Copy link
Author

BabeleDunnit commented Mar 20, 2017

You mean it works for you? no bugs?

With your previous bugfix applied? The one on prune_backlinks() being completely removed?

Even more strange. I am using your codebase for this test, and I definitely have the bug. I will try to re-clone and check on a fresh codebase, maybe I have modified something somewhere.

----------EDIT-----------

Aha! You did some more small changes, I see now! Gonna apply them and let you know.
Later,
Aaron

@mischasan
Copy link
Owner

mischasan commented Mar 20, 2017 via email

@mischasan
Copy link
Owner

mischasan commented Mar 20, 2017 via email

@BabeleDunnit
Copy link
Author

Hi Mischa,

I did everything from scratch to be sure.

re-cloned your repo in a fresh location. Issuing "make" to build the lib, I have a compilation problem:

https://github.com/mischasan/aho-corasick/blob/master/acism_create.c#L88 :
tp is already defined at line 74.

Which compiler/platform are you using? You should get a compilation error too IMHO... (I am on Ubuntu GCC 5.2.1)

I commented /* TNODE* */ at line 88. Issuing "make", it builds.

then, again, compiling my "acism_pl_test.c", the last test code I already sent you:

gcc -o plbug acism_pl_test.c msutil.c tap.c -L. -lacism

and then

./plbug

I get

strs:2 syms:12 chars:14 trans:23 empty:6 mod:0 hash:0 size:660
6 0 '/2.'
21 1 'Firefox/2.0'

so, no match on the second "/2."

Could you please try to reproduce in a freshly cloned repo? I suspect there is something strange going on here... Why you do not have the compilation error? Am I compiling OK? Am I missing something in my compilation line?

this drives me nuts :)

Thanks,
Aaron

@mischasan
Copy link
Owner

mischasan commented Mar 21, 2017 via email

@mischasan
Copy link
Owner

mischasan commented Mar 21, 2017 via email

@mischasan
Copy link
Owner

mischasan commented Mar 21, 2017 via email

@BabeleDunnit
Copy link
Author

Hi Mischa, I forked and will try to fix it... could you please point me in the right direction?

small recap:

  1. I did found the "banana bug"
  2. you removed prune_backlinks() and "...&& bp->child" in add_backlinks(), which fixed the "banana" bug
  3. but unfortunately this change introduced another bug which is even worse, which is the "firefox" bug above.

Gimme some hint (if possible at all... I know I will need to dvelve in your code and this is not going to be an easy walk...) :)

thanks again and later,
Aaron

@mischasan
Copy link
Owner

mischasan commented Jul 3, 2017 via email

@BabeleDunnit
Copy link
Author

good. thank you. BTW, I spotted a quite nasty bug, but I am going to open a new issue on that... klingon coders never sleep :)

@mischasan
Copy link
Owner

The fix is in for the nonmatching suffix error. Apologies for the time it took.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants