Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix search issue with special characters, and escape them in SQL. #844 #846

Merged
merged 8 commits into from
Aug 15, 2024

Conversation

GuillaumeValadas
Copy link
Contributor

I write a fix to escape specials char and escape them in SQL to allow research without 500 server error.

Its related to the issue #844

Hope it will help.

Copy link
Contributor

@fe-hicking fe-hicking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks promising! @hannob @onli -- thoughts?

Thanks so much for providing a patch! 👏

include/functions_entries.inc.php Outdated Show resolved Hide resolved
include/functions_entries.inc.php Outdated Show resolved Hide resolved
garvinhicking and others added 2 commits July 17, 2024 07:13
Co-authored-by: Garvin Hicking <38074677+fe-hicking@users.noreply.github.com>
Co-authored-by: Garvin Hicking <38074677+fe-hicking@users.noreply.github.com>
@@ -883,7 +883,8 @@ function &serendipity_searchEntries($term, $limit = '', $searchresults = '') {
$term = str_replace('&quot;', '"', $term);
$relevance_enabled = true;
if (preg_match('@["\+\-\*~<>\(\)]+@', $term)) {
$cond['find_part'] = "MATCH(title,body,extended) AGAINST('$term' IN BOOLEAN MODE)";
$term = serendipity_db_escape_string($term);
$cond['find_part'] = "MATCH(title,body,extended) AGAINST('\"$term\"' IN BOOLEAN MODE)";
} else {
$cond['find_part'] = "MATCH(title,body,extended) AGAINST('$term')";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, we should use db_escape_string here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that mean you can remove if cond if you apply it anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, we should use db_escape_string here too?

I will also fix db_escape_string for mysqli because it use deprecated mysqli alias.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @@[\+\-\*~<>\(\)"].[\S]*@ is currently wrong there, maybe a copy+paste mistake?

Anyhow... I'm afraid we regular-expression based evaluation of this might not bring us further. Because the regexp would match true on a string like: "+valid +valid +".

So I think we have two options here:

A: Introduce a tokenizer that analyzes the whole string and ALL parts in it, and performs checks on it
B: Execute the query and check if it works or not

Approach A will be quite complex and way too much effort to do this reliably. Not sure there is a parser/tokenizer for this out there, maybe part of Doctrine DBAL.

Approach B is a bit nasty but might be little effort. We could execute the query, use the "expect error" flag, and check if there is a result, something like:

$searchMode = 'non-boolean';
if (preg_match('@["\+\-\*~<>\(\)]+@', $term)) {
$termRows = serendipity_db_query("SELECT * FROM serendipity_entries WHERE $term LIMIT 1", true, 'both', false, false, false, true); // Last "true" is important here
  if ($termRows !== false) {
    $searchMode = 'boolean'; // Terms works as boolean expression
  }  
}
if ($searchMode === 'bolean') {
    $cond['find_part'] = "MATCH(title,body,extended) AGAINST('$term' IN BOOLEAN MODE)";
} else {
    $cond['find_part'] = "MATCH(title,body,extended) AGAINST('$term')";
}

(untested, just a thought)

…arch anyway + change alias to mysqli_real_escape_string
@garvinhicking
Copy link
Member

Thanks looks good to me!waiting for @hannob and @onli a bit to see if they approve too :)

@onli
Copy link
Member

onli commented Jul 17, 2024

I'm not sure the approach will work, but I might have misunderstood the issue. The example string failing was with a - at the end, and I thought this fails because the boolean mode expects a negated search term after that sign. I don't think that mysqli_escape_string (what serendipity_db_escape_string would call here) changes the -, so how does escaping the search term fixes the problem?

Also, the old code had a switch there between boolean and regular mode, won't that still be needed? Wasn't the regular against a natural language search with theoretically better results?

I absolutely might miss something here :) But I'd suggest to check whether the search still fully works with this patch applied.

@garvinhicking
Copy link
Member

I'm not sure the approach will work, but I might have misunderstood the issue. The example string failing was with a - at the end, and I thought this fails because the boolean mode expects a negated search term after that sign. I don't think that mysqli_escape_string (what serendipity_db_escape_string would call here) changes the -, so how does escaping the search term fixes the problem?

Also, the old code had a switch there between boolean and regular mode, won't that still be needed? Wasn't the regular against a natural language search with theoretically better results?

I absolutely might miss something here :) But I'd suggest to check whether the search still fully works with this patch applied.

Oohh. Right. I see it now - the patch uses "" quotes around the term. So it actually removes the "BOOLEAN MODE" functionality for exact matching.

So we need something like:

Search input A: This is a long search query
Search input B: Some +keyword -badkeyword search
Search input C: Some malformed query +++++-

Input A should match "This is a long search query" without boolean mode.
Input B should match the term in boolean mode: "Some +keyword -badkeyword search" (so not for '"Some +keyword -badkeyword search"')
Input C should be caught. Probably we would need some kind of try/catch surrounding the query to catch a malformed query, and then probably remove all special characters from the term (the ones matched in preg_match()) and then re-try the query...

(Disclaimer: I haven't checked the initial bugreport thoroughly)

@GuillaumeValadas
Copy link
Contributor Author

I'm not sure the approach will work, but I might have misunderstood the issue. The example string failing was with a - at the end, and I thought this fails because the boolean mode expects a negated search term after that sign. I don't think that mysqli_escape_string (what serendipity_db_escape_string would call here) changes the -, so how does escaping the search term fixes the problem?

Also, the old code had a switch there between boolean and regular mode, won't that still be needed? Wasn't the regular against a natural language search with theoretically better results?

I absolutely might miss something here :) But I'd suggest to check whether the search still fully works with this patch applied.

Oohh. Right. I see it now - the patch uses "" quotes around the term. So it actually removes the "BOOLEAN MODE" functionality for exact matching.

So we need something like:

Search input A: This is a long search query
Search input B: Some +keyword -badkeyword search
Search input C: Some malformed query +++++-

Input A should match "This is a long search query" without boolean mode.
Input B should match the term in boolean mode: "Some +keyword -badkeyword search" (so not for '"Some +keyword -badkeyword search"')
Input C should be caught. Probably we would need some kind of try/catch surrounding the query to catch a malformed query, and then probably remove all special characters from the term (the ones matched in preg_match()) and then re-try the query...

(Disclaimer: I haven't checked the initial bugreport thoroughly)

So just bring back the IF like the commit before was ?

…not + change regex to only catch BOOLEAN Operator that prefix a word
@garvinhicking
Copy link
Member

Thanks for the further refinement - that looks promising, I couldn't test various scenarious myself though. Can you tell which examples you used where it works and where it broke before?

I still feel a bit like this can still lead to cases where "BOOLEAN MODE" would be invalid, but I doubt we can hack together a more solid parser on our own...

@GuillaumeValadas
Copy link
Contributor Author

Ok I misunderstood the boolean operator systems.

So I bring back the If statement and boolean mode or not.

I think problem was the regex who match if you put a single operator like "test -".
This one only match when operator are suffix of a word like -test or +chocolate.

@GuillaumeValadas
Copy link
Contributor Author

So now you enter boolean mode only with {operator}whatever.

no more "test -" goes to boolean mode anymore it goes on "normal" mode that cause no problems.

Feel free to test or comment if its not enough.

@onli
Copy link
Member

onli commented Jul 20, 2024

Oh, that's a good idea!

@GuillaumeValadas
Copy link
Contributor Author

image

I add another regular expression check to not enter the boolean mode if any operator live alone at the end of the string search.

Any blue match in the image don't go in boolean mode.

I think now all cases are cover.
Maybe you can test this on your side.

@onli
Copy link
Member

onli commented Aug 15, 2024

Thank @GuillaumeValadas. In my test this works. In the pathological case with the - at the end the logic mode was disabled, and with proper search terms it worked, also other logic mode searches worked fine. I think this is definitely an improvement.

@garvinhicking Let's merge this! :) Or is there still a blocker from your side?

@garvinhicking
Copy link
Member

@onli I trust your decission, go ahead :)

@onli onli merged commit 61c420a into s9y:master Aug 15, 2024
onli added a commit that referenced this pull request Aug 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants