the lsi meets `sqrt': Numerical argument is out of domain - "sqrt" (Math::DomainError) #153

Closed
hakehuang opened this issue Feb 28, 2017 · 24 comments

Comments

@hakehuang

hakehuang commented Feb 28, 2017

Below is my script:

lsi = ClassifierReborn::LSI.new
lsi.add_item 'log message Error: 1', :Error
lsi.add_item 'log message Error: 0', :Pass

result  = lsi.classify 'log message Error: 1'

Trace log:


D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58:in `sqrt': Numerical argument is out of domain - "sqrt" (Math::DomainError)
	from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58:in `block in SV_decomp'
	from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:57:in `times'
	from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:57:in `SV_decomp'
	from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/lsi.rb:311:in `build_reduced_matrix'
	from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/lsi.rb:143:in `build_index'
	from D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/lsi.rb:77:in `add_item'
	from D:/projects/P_hobbit/AI/log_classifier/pass_fail.rb:34:in `<main>'

I found that the issue can be fixed with the change below; please help review it.

#154

@tra38
Contributor

tra38 commented Feb 28, 2017

The code could fix this specific issue (I haven't checked to be sure), but it would break other code. That line filters out words that have 2 or fewer characters. I'm not quite sure why it does this filtering, but I'm afraid that the LSI might fail horribly when handling very small words. The current automated tests are failing because they depend on the current filtering behavior. If we can figure out why the previous programmers filtered out the "small words", we can then decide whether it is possible to add an exception that lets us accept digits.

In any event, I would suggest writing a new automated test for handling edge cases where numerical digits matter, so that we don't accidentally reintroduce the same behavior in the future...while also making sure all previous automated tests pass as well.

@Ch4s3
Member

Ch4s3 commented Feb 28, 2017

The LSI will in fact fail horribly with a NaN/NaN error if you remove this filter.

@hakehuang
Author

Can you give me some test examples, @Ch4s3? I have fixed the unit test issues. Classifying on a single character is a real use case in my application, and I believe this requirement is universal.

@Ch4s3
Member

Ch4s3 commented Mar 1, 2017

let me take a look tonight

@Ch4s3
Member

Ch4s3 commented Mar 2, 2017

could you better describe your use case @hakehuang?

@hakehuang
Author

hakehuang commented Mar 2, 2017 via email

@Ch4s3
Member

Ch4s3 commented Mar 2, 2017

It seems like you could do that more reliably with a regex or simple string match.

@hakehuang
Author

hakehuang commented Mar 3, 2017 via email

@Ch4s3
Member

Ch4s3 commented Mar 3, 2017

I'm still not sure LSI is correct. Have you tried the Bayesian classifier? You can set it up not to use stop words. However if I were you, I would just write a simple parser and match on the number.

You could also use scan

foo = "Errors is: 0"
bar = "Errors for this is : 3"
foo_num = foo.scan(/\d/)
bar_num = bar.scan(/\d/)
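
A minimal sketch of the "simple parser" idea, assuming the count always follows the literal text `Error:` as in the original report. `label_for` is a hypothetical helper name, not part of classifier-reborn:

```ruby
# Hypothetical helper: label a log line by the number that follows "Error:".
# Assumes the 'log message Error: N' shape from the original report.
def label_for(line)
  match = line.match(/Error:\s*(\d+)/i)
  return :Unknown unless match        # no count found on this line

  match[1].to_i.zero? ? :Pass : :Error
end

puts label_for('log message Error: 1')
puts label_for('log message Error: 0')
```

Lines without a recognizable count come back as `:Unknown`, so they can be handed to a classifier separately instead of being mislabeled.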

@hakehuang
Author

There are many such patterns; below I list just a few. All of these errors are mixed together in one log, along with a lot of human-readable context for debugging purposes. My idea is to have a log parser that can classify all the error types and give me a summary of them. I tried Bayesian and Naive Bayes, which work, but only LSI can give me a search function.

undefined symbol
undefined reference to
not defined
not define
java.lang.Exception: java.lang.InterruptedException
no definition for
enumeration value is out of
identifier is undefined
defined but not used
not fit in region
invalid operands to binary | (have 'int' and 'void *')
unable to allocate space for sections/blocks with a total estimated minimum size
with offset out of bounds
error loading bundle activator
no such file or directory
cannot be found
passing arg n of makes pointer from integer without a cast
was unable to load
exceeds the maximum allowed for
cannot open source file
cannot find source file
cannot fit into
not allowed
not facet-valid with respect to pattern
can not open
pointless integer comparison, the result is always false
cannot call
cannot be assigned to
cannot call intrinsic function
a function call cannot appear in a constant-expression
too few arguments in function call
was not declared in this scope
may be used uninitialized in this function
interact script return value
first use in this function
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
clock_config.h(34) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
board.c(60) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"
MKV58F24.h(326) : Fatal Error[Pe1696]: cannot open source file "MKV58F24.h(326) : Fatal Error[Pe1696]: cannot open source file "FreeRTOS.h(98) : Fatal Error[Pe1696]: cannot open source file "FreeRTOSConfig.h"
fsl_flash.h(68) : Fatal Error[Pe1696]: cannot open source file "fsl_common.h"

@tra38
Contributor

tra38 commented Mar 5, 2017

Since I found your use case interesting, I decided to try to replicate the original case, except that it...er...works.

lsi = ClassifierReborn::LSI.new
lsi.add_item 'log message Error: 1', :Error
lsi.add_item 'log message Error: 0', :Pass

lsi.classify 'log message Error: 1'
#=> :Pass

Obviously, it's giving us the wrong answer, and looking at the LSI object suggests that this is because the program ignores one-character tokens (the digits) and leaves them out of each item's word_hash:

=> #<ClassifierReborn::LSI:0x007f7f79980828
 @auto_rebuild=true,
 @built_at_version=2,
 @cache_node_vectors=nil,
 @items=
  {"log message Error: 1"=>
    #<ClassifierReborn::ContentNode:0x007f7f79972200
     @categories=[:Error],
     @lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @word_hash={:log=>1, :messag=>1, :error=>1}>,
   "log message Error: 0"=>
    #<ClassifierReborn::ContentNode:0x007f7f799713c8
     @categories=[:Pass],
     @lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
     @raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
     @word_hash={:log=>1, :messag=>1, :error=>1}>},
 @language="en",
 @version=2,
 @word_list=
  #<ClassifierReborn::WordList:0x007f7f799711c0
   @location_table={:log=>0, :messag=>1, :error=>2}>>

So there's still that issue to deal with.
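
The collision can be reproduced without the library. This is only a rough sketch, assuming a tokenizer that lowercases, splits on non-alphanumeric characters, and drops tokens of two characters or fewer. `toy_word_hash` is a hypothetical stand-in, not classifier-reborn's real hasher (which also stems words, e.g. "message" to "messag"):

```ruby
# Hypothetical stand-in for the library's word hashing: lowercase the text,
# split on non-alphanumeric characters, drop tokens shorter than min_length,
# and count what remains.
def toy_word_hash(text, min_length: 3)
  text.downcase
      .scan(/[a-z0-9]+/)
      .select { |token| token.length >= min_length }
      .each_with_object(Hash.new(0)) { |token, counts| counts[token.to_sym] += 1 }
end

error_hash = toy_word_hash('log message Error: 1')
pass_hash  = toy_word_hash('log message Error: 0')

# With the short-token filter, both items collapse to the same bag of words.
puts error_hash == pass_hash

# With the filter relaxed, the digit survives and the two items differ.
puts toy_word_hash('log message Error: 1', min_length: 1) ==
     toy_word_hash('log message Error: 0', min_length: 1)
```

Once two items have identical word hashes, nothing downstream can tell them apart, which fits the identical vectors in the dump above.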

But we also have another issue at play. It's working fine on my machine while it's crashing on yours. My hypothesis for why it's crashing is based on this specific line of the error message:

D:/projects/P_hobbit/AI/log_classifier/lib/classifier-reborn/extensions/vector.rb:58

You are using vector.rb because you do not have the GSL and the GSL Ruby gem (which interfaces with the GSL) installed. Basically, if you don't have GSL on your computer, we load our own (slower) scientific calculation library instead, which includes the file "vector.rb". So there must be a bug within classifier-reborn's vector.rb file that is causing this specific error message. According to the docs, though, installing GSL is recommended, since it will make LSI "at least 10x" faster, so if you plan on using LSI, I would suggest you set up GSL on your local machine.
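
The fallback described here follows a common Ruby pattern: try to require the native library and switch to another code path on `LoadError`. A rough sketch of that pattern, not classifier-reborn's actual loading code; `pick_backend` is a hypothetical name:

```ruby
# Hypothetical illustration of "use the native library if present,
# otherwise fall back to pure Ruby".
def pick_backend(native_lib)
  require native_lib
  :native
rescue LoadError
  # The library (or the gem wrapping it) is not installed.
  :pure_ruby
end

# Prints "native" only when the gsl gem and the GSL itself are installed.
puts pick_backend('gsl')
```

Running this on the affected machine is a quick way to check which path is being taken before digging into vector.rb.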

If you plan on not installing GSL, well...Unfortunately, I don't know enough about SVD to feel confident about debugging it. @Ch4s3, do you feel confident?

@Ch4s3
Member

Ch4s3 commented Mar 6, 2017

@tra38, no. Unfortunately, our SVD function was not very well implemented, and fixing it is a bit beyond my ability with linear algebra. I intend to replace it with a native extension at some point.
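
For anyone hitting the same trace: the thread does not pin down the root cause, but `Math.sqrt` raises `Math::DomainError` for any negative argument, however small. One plausible trigger (an assumption, not something confirmed here) is a value that should be zero or positive coming out slightly negative. The sketch below shows that behavior and a hypothetical guard, `clamped_sqrt`, which is not part of the library:

```ruby
# Math.sqrt raises for any negative input, no matter how close to zero.
def sqrt_raises?(value)
  Math.sqrt(value)
  false
rescue Math::DomainError
  true
end

# Hypothetical guard: treat negatives within a small tolerance as zero,
# and still raise for values that are clearly negative.
def clamped_sqrt(value, tolerance: 1e-9)
  raise Math::DomainError, "value #{value} is below -#{tolerance}" if value < -tolerance

  Math.sqrt([value, 0.0].max)
end

puts sqrt_raises?(-1e-17)
puts clamped_sqrt(-1e-17)
```

Clamping only hides the symptom; whether it gives a meaningful decomposition depends on why the value went negative in the first place.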

@hakehuang
Author

hakehuang commented Mar 6, 2017 via email

@hakehuang
Author

hakehuang commented Mar 6, 2017 via email

@Ch4s3
Member

Ch4s3 commented Mar 6, 2017

There aren't any good pure Ruby implementations that I'm aware of.

@mach-kernel
Contributor

I am also having issues using LSI on small words, with Math::DomainError being raised. My current solution is to skip training on those words. For background, the corpus I am using is pulled directly from credit card compliance information (e.g. it has dollar amounts, stray "-" characters, etc.).

@lessaworld

Just came across this same issue... I know it's not a long-term solution, but since I'm just evaluating this project, instead of skipping the small words, I created a hack function to work around the problem for now.

def fix_hack(text)
  # Append "_" to every word shorter than 3 characters.
  text.split(" ").map { |w| w.size < 3 ? w + "_" : w }.join(" ")
end

Then I just wrap the content in it wherever it is passed in for training or classification, e.g.:

lsi = ClassifierReborn::LSI.new
lsi.add_item fix_hack("This is a test"), "test"
...
c, s = lsi.classify_with_score fix_hack("It is a test")

@epugh

epugh commented Aug 2, 2018

For me, brew install gsl and adding the GSL dependency:

gem 'classifier-reborn'  # lets get machine learning!
gem 'gsl', '~> 2.1', '>= 2.1.0.3'

has solved the sqrt issue and the other NaN issue, I think!

@Ch4s3
Member

Ch4s3 commented Aug 2, 2018

@epugh Have you tried with small words ~3-4 chars in length?

@epugh

epugh commented Aug 2, 2018

Yep, and with those I just get a warning message; the code still runs.

Here is my test set:

    strings = [["This text deals with dogs. Dogs.", :dog],
               ["This text involves dogs too. Dogs!", :dog],
               ["LOOKING FOR SPEAKER", :missing],
               ["Need speaker!", :missing],
               ["Need speakers!", :missing],
               ["n/a OSC Retreat.", :missing],
               ["na", :missing],
               ["spearks are needed", :missing],
               ["Matt Datastax.", :present]]
    strings.each { |x| classifier.add_item x.first, x.last }

    assert_same :missing, classifier.classify("speaker needed")
    assert_not_same :missing, classifier.classify("Matt Overstreet Solr Stemmers")
    assert_same :present, classifier.classify("Matt Overstreet Solr Stemmers")

@epugh

epugh commented Aug 2, 2018

So the "na" gives an error, and before I installed gsl, the "n/a" blew up!

@Ch4s3
Member

Ch4s3 commented Aug 3, 2018

Unfortunately that's the expected behavior, but not the desired behavior. Our plain Ruby LSI implementation is pretty broken, and I lack the math background necessary to fix it.

@epugh

epugh commented Aug 3, 2018

I wonder if the best path is to say "You must have GSL installed"? I.e., accept the plain Ruby issues...

@Ch4s3
Member

Ch4s3 commented Aug 3, 2018

@epugh unfortunately we're a dependency of Jekyll, so we want to have a Ruby-only option to make it more accessible. However, for any sort of prod use beyond that, we strongly endorse GSL.
