the lsi meets `sqrt': Numerical argument is out of domain - "sqrt" (Math::DomainError) #153
Comments
The code could fix this specific issue (I haven't checked to be sure), but would break other code. That line was used to filter out words that have 2 or fewer characters...and while I'm not quite sure why it does this filtering, I'm afraid that the LSI might fail horribly when handling very small words. Current automated tests are failing since they are dependent on the current filtering behavior. If we can figure out why the previous programmers caused the "small words" to be filtered, we then can decide whether it is possible to add an exception that will allow us to accept digits. In any event, I would suggest writing a new automated test for handling edge cases where numerical digits matter, so that we don't accidentally reintroduce the same behavior in the future...while also making sure all previous automated tests pass as well. |
The LSI will in fact fail horribly with a NaN/NaN error if you remove this filter. |
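The effect of a minimum-length filter on digit tokens can be sketched as follows. This is an illustration only: `word_hash` and its `min_length` parameter are hypothetical names made up for this example, not classifier-reborn's actual tokenizer, and the real code also stems words and removes stop words.

```ruby
# Hypothetical tokenizer for illustration; NOT classifier-reborn's real code.
# It lowercases the text, splits it into alphanumeric tokens, and counts
# only the tokens that are at least `min_length` characters long.
def word_hash(text, min_length: 3)
  text.downcase.scan(/[[:alnum:]]+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1 if word.length >= min_length
  end
end

failing = 'log message Error: 1'
passing = 'log message Error: 0'

# With the filter, the digits are dropped and both lines look identical.
puts word_hash(failing) == word_hash(passing)

# Without the filter, the digit is the only token that tells them apart.
puts word_hash(failing, min_length: 1) == word_hash(passing, min_length: 1)
```

Under this assumption, any two log lines that differ only in a one- or two-character token are indistinguishable to the classifier, whatever algorithm runs afterwards.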
Can you give me some test examples, @Ch4s3? I have fixed the unit test issues. The 1-byte case is a real use case in my application, and I believe this requirement is universal. |
let me take a look tonight |
could you better describe your use case @hakehuang? |
I want to classify my build log, which usually looks like this:
Error: 0 means there are no errors
Error: <other number> means there are errors.
|
It seems like you could do that more reliably with a regex or simple string match. |
Yes and no.
Sometimes the string goes this way:
Errors is: 0
Errors for this is : 3
It is very difficult to use a regex to match the differences.
|
I'm still not sure LSI is correct. Have you tried the Bayesian classifier? You can set it up not to use stop words. However, if I were you, I would just write a simple parser and match on the number. You could also use scan:
foo = "Errors is: 0"
bar = "Errors for this is : 3"
foo_num = foo.scan(/\d/)
bar_num = bar.scan(/\d/) |
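The "simple parser" idea could look something like the sketch below. The pattern and the helper names `error_count` and `status` are assumptions made for this example; they only cover the log shapes quoted in this thread and would need extending for other formats.

```ruby
# Hypothetical helpers for illustration; the pattern only covers lines such as
# "Error: 0", "Errors is: 0" and "Errors for this is : 3".
ERROR_COUNT = /\berrors?\b[^\d\n]*(\d+)/i

# Returns the number that follows "Error"/"Errors", or nil when the line
# does not look like an error summary at all.
def error_count(line)
  match = line.match(ERROR_COUNT)
  match && match[1].to_i
end

# Maps a line to :Pass, :Error, or nil (no error count found on the line).
def status(line)
  count = error_count(line)
  return nil if count.nil?

  count.zero? ? :Pass : :Error
end

['Errors is: 0', 'Errors for this is : 3', 'Error: 12', 'build started'].each do |line|
  puts "#{line.inspect} -> #{status(line).inspect}"
end
```

Unlike `/\d/`, which returns single digits, the `(\d+)` group keeps multi-digit counts such as `12` together.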
There are many such patterns; below I list just a few. All those errors are mixed together in a log, with a lot of human-readable context for debugging purposes. My idea is to have a log parser that can classify all the error types and give me a summary of them all. I tried Bayesian and Naive Bayes, which work, but only LSI can give me a search function.
|
Since I found your use case interesting, I decided to try to replicate the original case, except that it...er...works.
lsi = ClassifierReborn::LSI.new
lsi.add_item 'log message Error: 1', :Error
lsi.add_item 'log message Error: 0', :Pass
lsi.classify 'log message Error: 1'
#=> :Pass
Obviously, it's giving us the wrong answer, and looking at the LSI object suggests that it is due to the program ignoring one-character objects (digits) and not including them in the
=> #<ClassifierReborn::LSI:0x007f7f79980828
@auto_rebuild=true,
@built_at_version=2,
@cache_node_vectors=nil,
@items=
{"log message Error: 1"=>
#<ClassifierReborn::ContentNode:0x007f7f79972200
@categories=[:Error],
@lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@word_hash={:log=>1, :messag=>1, :error=>1}>,
"log message Error: 0"=>
#<ClassifierReborn::ContentNode:0x007f7f799713c8
@categories=[:Pass],
@lsi_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@lsi_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@raw_norm=GSL::Vector
[ 5.774e-01 5.774e-01 5.774e-01 ],
@raw_vector=GSL::Vector
[ 6.309e-01 6.309e-01 6.309e-01 ],
@word_hash={:log=>1, :messag=>1, :error=>1}>},
@language="en",
@version=2,
@word_list=
#<ClassifierReborn::WordList:0x007f7f799711c0
@location_table={:log=>0, :messag=>1, :error=>2}>>
So there's still that issue to deal with. But we also have another issue at play: it works fine on my machine while it crashes on yours. My hypothesis for why it's crashing is based on the specific error message
You are using vector.rb because you do not have the GSL and the GSL Ruby gem (which interfaces with the GSL) installed. Basically, if you don't have GSL on your computer, we load our own (slower) scientific calculation library instead, which includes the file "vector.rb". So there must be a bug within classifier-reborn's vector.rb file that is causing this specific error message to occur. According to the docs, though, it is recommended that you install GSL, since it will make LSI "at least 10x" faster, so if you plan on using LSI, I would suggest you set up GSL on your local machine. If you plan on not installing GSL, well... Unfortunately, I don't know enough about SVD to feel confident about debugging it. @Ch4s3, do you feel confident? |
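For readers unfamiliar with the error in the issue title: Ruby's `Math.sqrt` raises `Math::DomainError` for any negative argument, including a tiny negative value produced by floating-point rounding. The sketch below shows that general mechanism and one possible guard. Whether this is what actually happens inside vector.rb is not established in this thread, and `safe_sqrt` and its `tolerance` parameter are hypothetical names, not part of classifier-reborn.

```ruby
# Illustration only; not taken from classifier-reborn's vector.rb.

# Clamps tiny negative inputs (assumed to be rounding noise) to zero before
# taking the square root. Clearly negative inputs still raise, because they
# would point at a real bug rather than at rounding.
def safe_sqrt(value, tolerance: 1e-12)
  return 0.0 if value.negative? && value > -tolerance

  Math.sqrt(value)
end

# The unguarded call raises for a negative argument, however small.
begin
  Math.sqrt(-1.0e-17)
rescue Math::DomainError => e
  puts "unguarded call raised #{e.class}"
end

puts safe_sqrt(-1.0e-17)
puts safe_sqrt(9.0)
```

A guard like this only hides rounding noise; it would not repair an SVD routine that produces wrong values in the first place.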
@tra38, no unfortunately our SVD function was not super well implemented, and is a bit beyond my ability with linear algebra to fix. I intend to replace it with a native ext at some point. |
The Bayesian classifier has some other issues for my cases, which I am
trying to debug now. Here are some hot fixes of mine. With this fix, the
Bayes classifier seems to work fine for my cases.
diff --git a/lib/classifier-reborn/bayes.rb b/lib/classifier-reborn/bayes.rb
index 3d5bbf1..d658856 100644
--- a/lib/classifier-reborn/bayes.rb
+++ b/lib/classifier-reborn/bayes.rb
@@ -126,16 +126,23 @@ module ClassifierReborn
end
return score
end
+ # if the word is not in the list just omit it
category_keys.each do |category|
score[category.to_s] = 0
+ temp_s = 0
total = (@backend.category_word_count(category) || 1).to_f
word_hash.each do |word, _count|
- s = @backend.word_in_category?(category, word) ? @backend.category_word_frequency(category, word) : 0.1
- score[category.to_s] += Math.log(s / total)
+ temp_s += @backend.word_in_category?(category, word) ? @backend.category_word_frequency(category, word) : 0
+ end
+ if temp_s == 0
+ score[category.to_s] = Float::INFINITY
+ else
+ score[category.to_s] = Math.log(temp_s / total)
end
# now add prior probability for the category
- s = @backend.category_has_trainings?(category) ? @backend.category_training_count(category) : 0.1
- score[category.to_s] += Math.log(s / @backend.total_trainings.to_f)
+ #s = @backend.category_has_trainings?(category) ? @backend.category_training_count(category) : @backend.total_trainings.to_f
+ #score[category.to_s] += -1.0 * Math.log(s / @backend.total_trainings.to_f)
+ #puts "#{category.to_s} scores #{score[category.to_s]}"
end
score
end
|
SVD seems to be a big challenge for all AI users. Do you know of any Ruby
solutions for this? Using a LAPACK backend does not seem well suited to cloud
deployment.
|
There aren't any good pure Ruby implementations that I'm aware of. |
I am also having issues using LSI on small words, with |
Just came across this same issue... I know it's not a long term solution, but since I'm just evaluating this project, instead of skipping the small words, I created a hack function to just go around the problem, for now. def fix_hack (text) and then, I just wrap every mention of the content during training and classification. e.g. lsi = ClassifierReborn::LSI.new |
For me,
has solved the sqrt issue and the other NaN issue, I think! |
@epugh Have you tried with small words ~3-4 chars in length? |
Yep, and with those, I just get a warning message, the code runs. Here is my test set:
|
So the "na" gives an error, and previously before I installed gsl, the "n/a" blew up! |
Unfortunately that's expected behavior, but not the desired behavior. Our plain Ruby LSI implementation is pretty broken, and I lack the math background necessary to fix it. |
I wonder if the best path is to say "You must have GSL installed"? I.e., accept the plain Ruby issues... |
@epugh unfortunately we're a dependency of Jekyll, so we want to have a ruby only option to make it more accessible. However, for any sort of prod use beyond that, we strongly endorse GSL. |
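A Ruby-only fallback of the kind described here typically relies on an optional `require`. The sketch below shows that general pattern, assuming the GSL bindings are loaded with `require 'gsl'`; the constant name `GSL_LOADED` and the warning text are made up for this example and are not classifier-reborn's actual loading code.

```ruby
# Sketch of an optional-dependency check; GSL_LOADED is a hypothetical name.
begin
  # Succeeds only when the GSL Ruby gem and the native GSL library are installed.
  require 'gsl'
  GSL_LOADED = true
rescue LoadError
  # No GSL available: a library would switch to its pure-Ruby code path here.
  GSL_LOADED = false
end

warn 'GSL not found; falling back to the slower pure-Ruby implementation' unless GSL_LOADED
puts "GSL loaded: #{GSL_LOADED}"
```

Running a check like this before training makes it obvious which code path a given machine will use, which matters here because the reports in this thread differ between GSL and non-GSL setups.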
Below are my scripts
trace log
I find the issue can be fixed with the change below; please help review it
#154