<!DOCTYPE html>
<html>
<head>
<title>
DecisionTree-3.4.3.html
</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">
p.morelinespace {
line-height: 130%;
font-weight: bold;
}
body {
background-color: #f0f0f8;
}
hr.myhr1 {
width:100%;
height:8px;
border:4px solid red;
}
</style>
</head>
<body>
<hr class="myhr1">
<div style="color:blue; font-size:300%">
<strong>DecisionTree</strong></div>
<div style="color:blue; font-size:150%"> Version 3.4.3, 2016-May-14
</div>
<hr class="myhr1">
<br>
<div style="font-size:125%; line-height:130%; font-weight: bold">
DecisionTree.py<br>
Version: 3.4.3<br>
Author: Avinash Kak (kak@purdue.edu)<br>
Date: 2016-May-14<br>
</div>
<br>
<table>
<tr>
<th style="text-align:left vertical-align:top">
<div style="font-size:125%">
<b>Download Version 3.4.3:</b>
<a HREF="https://engineering.purdue.edu/kak/distDT/DecisionTree-3.4.3.tar.gz?download">gztar</a>
<a HREF="https://engineering.purdue.edu/kak/distDT/DecisionTree-3.4.3.tar.bz2?download">bztar</a>
</div>
<br>
<br>
<br>
</th>
<td style="text-align:center vertical-align:top padding:0">
<div style="text-align:center">
Total number of downloads (all versions) from this website:
<?php
$file = fopen("HowManyCounts.txt", "r") or exit("Unable to open file!");
echo fgets($file);
fclose($file);
?>
<div style="color:red; font-size:80%">
This count is automatically updated at every rotation of
<br>
the weblogs (normally once every two to four days)
<br>
Last updated:
<?php
$file = fopen("LastUpdated.txt", "r") or exit("Unable to open file!");
echo fgets($file);
fclose($file);
?>
</div>
</div>
</td>
</tr>
<tr>
<td>
<div style="color:red">
<a HREF="DecisionTree-3.4.3_CodeOnly.html">View the main module file in your browser</a>
<br>
<a HREF="DecisionTreeWithBagging-3.4.3_CodeOnly.html">View the bagging code in your browser</a>
<br>
<a HREF="BoostedDecisionTree-3.4.3_CodeOnly.html">View the boosting code in your browser</a>
<br>
<a HREF="RandomizedTreesForBigData-3.4.3_CodeOnly.html">View the randomized trees code in your browser</a>
<br>
<a HREF="RegressionTree-3.4.3_CodeOnly.html">View the regression tree code in your browser</a>
<br>
</div>
</td>
<td>
</td>
</tr>
</table>
<br>
<br>
<span style="color:red; font-size:150%"><strong>CONTENTS:</strong></span>
<br>
<br>
<div style="font-size:100%; line-height:180%; font-weight: bold">
<a href="#1">CHANGES</a><br>
<a href="#2">USAGE</a><br>
<a href="#3">INTRODUCTION</a><br>
<a href="#4">WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE?</a><br>
<a href="#5">SYMBOLIC FEATURES VERSUS NUMERIC FEATURES<br></a>
<a href="#6">FEATURES WITH NOT SO "NICE" STATISTICAL PROPERTIES<br></a>
<a href="#7">TESTING THE QUALITY OF YOUR TRAINING DATA<br></a>
<a href="#8">HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS<br></a>
<a href="#25">SOLVING NEEDLE-IN-A-HAYSTACK AND BIG DATA PROBLEMS<br></a>
<a href="#9">DECISION TREE INTROSPECTION<br></a>
<a href="#10">METHODS</a><br>
<a href="#11">THE INTROSPECTION API<br></a>
<a href="#12">BULK CLASSIFICATION OF DATA RECORDS<br></a>
<a href="#13">HOW THE CLASSIFICATION RESULTS ARE DISPLAYED<br></a>
<a href="#14">USING BAGGING<br></a>
<a href="#24">USING BOOSTING<br></a>
<a href="#26">USING RANDOMIZED DECISION TREES<br></a>
<a href="#28">CONSTRUCTING REGRESSION TREES<br></a>
<a href="#22">GENERATING SYNTHETIC TRAINING DATA<br></a>
<a href="#15">THE Examples DIRECTORY<br></a>
<a href="#16">THE ExamplesBagging DIRECTORY<br></a>
<a href="#23">THE ExamplesBoosting DIRECTORY<br></a>
<a href="#27">THE ExamplesRandomizedTrees DIRECTORY<br></a>
<a href="#29">THE ExamplesRegression DIRECTORY<br></a>
<a href="#17">INSTALLATION</a><br>
<a href="#18">BUGS</a><br>
<a href="#19">ACKNOWLEDGMENTS</a><br>
<a href="#20">ABOUT THE AUTHOR</a><br>
<a href="#21">COPYRIGHT</a></div>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="1">CHANGES:</a><br>
</strong></span> <br>
Version 3.4.3<br>
<br>
The 'csv_cleanup_needed' parameter in the module constructor is now<br>
optional in response to several requests received from the user<br>
community. The main point made by the users was that setting<br>
'csv_cleanup_needed' when there was no need for CSV clean-up exacted<br>
a performance penalty when ingesting large database files with tens of<br>
thousands of line records. In addition to making 'csv_cleanup_needed'<br>
optional, I have also tweaked the code in the '<a href="#-cleanup_csv">cleanup_csv</a>()' method<br>
in order to extract data from a larger range of messy CSV files.<br>
<br>
Version 3.4.2<br>
<br>
As with the previous version, all the changes made are confined to the<br>
part of the code that deals with the calculation of the regression<br>
coefficients. Besides general code clean up, I have fixed a couple of<br>
bugs in this part of the code. With regard to the new constructor<br>
parameter 'jacobian_choice' introduced in the previous version, setting<br>
it to 0, which causes the regression coefficients to be estimated<br>
through linear least-squares minimization, is the best choice for most cases.<br>
<br>
Version 3.4.1<br>
<br>
All the changes made in this version relate to the construction of<br>
regression trees. I have improved the code that estimates the<br>
regression coefficients using a combination of linear least-squares and<br>
gradient descent. The RegressionTree class now has a new constructor<br>
parameter called 'jacobian_choice'.<br>
<br>
Version 3.4.0<br>
<br>
In addition to constructing decision trees, this version of the module<br>
also allows you to construct regression trees. The regression tree<br>
capability has been packed into a separate subclass, named<br>
RegressionTree, of the main <a href="#DecisionTree">DecisionTree</a> class. The subdirectory<br>
ExamplesRegression in the main installation directory illustrates how<br>
you can use this new functionality of the module.<br>
<br>
Version 3.3.2:<br>
<br>
This version fixes the problem encountered by people when using pickle<br>
to save a decision tree to a disk file. The problem was solved by<br>
moving the nested class <a href="#DTNode">DTNode</a> to the module level.<br>
<br>
Version 3.3.1:<br>
<br>
This version removes certain inefficiencies that had crept into file<br>
I/O operations in Version 3.3.0.<br>
<br>
Version 3.3.0<br>
<br>
This version incorporates four very significant upgrades/changes to the<br>
<a href="#DecisionTree">DecisionTree</a> module: (1) The CSV cleanup is now the default. So you do<br>
not have to set any special parameters in the constructor calls to<br>
initiate CSV cleanup. (2) In the form of a new Python class named<br>
RandomizedTreesForBigData, this module provides you with an easy-to-use<br>
programming interface for attempting needle-in-a-haystack solutions for<br>
the case when your training data is overwhelmingly dominated by a<br>
single class. You need to set the constructor parameter<br>
'looking_for_needles_in_haystack' to invoke the logic that constructs<br>
multiple decision trees, each using the minority class samples along<br>
with samples drawn randomly from the majority class. The final<br>
classification is made through a majority vote from all the decision<br>
trees. (3) Assuming you are faced with a big-data problem --- in the<br>
sense that you have been given a training database with a very large<br>
number of training records --- the class RandomizedTreesForBigData will<br>
also let you construct multiple decision trees by pulling training data<br>
randomly from your training database (without paying attention to the<br>
relative populations of the classes). The final classification<br>
decision for a test sample is based on a majority vote from all the<br>
decision trees thus constructed. See the ExamplesRandomizedTrees<br>
directory for how to use these new features of the module. And,<br>
finally, (4) Support for the old-style '.dat' training files has been<br>
dropped in this version.<br>
<br>
Version 3.2.4<br>
<br>
This version makes it easier to use a CSV training file that violates<br>
the assumption that a comma be used only to separate the different<br>
field values in a line record. Some large econometrics databases use<br>
double-quoted values for fields, and these values may contain commas<br>
(presumably for better readability). This version also allows you to<br>
specify the leftmost entry in the first CSV record that names all the<br>
fields. Previously, this entry was required to be an empty<br>
double-quoted string. I have also made some minor changes to the<br>
'get_training_data_from_csv()' method to make it more user friendly for<br>
large training files that may contain tens of thousands of records.<br>
When pulling training data from such files, this method prints out a<br>
dot on the terminal screen for every 10000 records it has processed.<br>
See the script 'dt_example_involving_csv_cleanup.py' in the Examples<br>
directory for how to invoke the added functionality for CSV files.<br>
<br>
Version 3.2.3<br>
<br>
Cleanup of the HTML page for the module. The implementation code<br>
and the API remain unchanged.<br>
<br>
Version 3.2.2<br>
<br>
Incorporates a fix for a minor bug related to exception handling for<br>
the case when the training data file named in the constructor has an<br>
incorrect suffix. Also incorporated in this version is a general<br>
cleanup of the exception handling code.<br>
<br>
Version 3.2.1<br>
<br>
Contains a bugfix for a probability calculator function. The bug was<br>
triggered when a feature did not take on any values whatsoever in all<br>
of the training data samples for a class --- a condition likely to be<br>
encountered only rarely, but nonetheless important.<br>
<br>
Version 3.2.0<br>
<br>
This version brings the boosting capability to the <a href="#DecisionTree">DecisionTree</a> module.<br>
<br>
Version 3.0.1<br>
<br>
This is a minor revision that smooths out the documentation at a couple<br>
of important places. I have also fixed the typos that I discovered<br>
after the previous version was released.<br>
<br>
Version 3.0<br>
<br>
Version 3.0 adds bagging to the <a href="#DecisionTree">DecisionTree</a> module. If your training<br>
dataset is large enough, you can ask the module to construct multiple<br>
decision trees using data bags extracted from your dataset. The module<br>
can show you the results returned by the individual decision trees and<br>
also the results obtained by taking a majority vote of the<br>
classification decisions made by the trees. You can specify any<br>
arbitrary extent of overlap between the data bags.<br>
<br>
Version 2.3.4<br>
<br>
There was an error in the packaging of version 2.3.3 of this module.<br>
The error related to how the `packages' keyword was specified in<br>
setup.py. This version fixes that error.<br>
<br>
Version 2.3.3:<br>
<br>
The purpose of this version is merely to mention that you do NOT need<br>
to double-quote the entries in your CSV training files. The older<br>
versions of this module required every entry in the rows of a CSV file<br>
to be enclosed in double quotes.<br>
<br>
Version 2.3.2:<br>
<br>
The introspection capability in this version packs more of a punch.<br>
For each training data sample, you can now figure out not only the<br>
decision-tree nodes that are affected directly by that sample, but also<br>
those nodes that are affected indirectly through the generalization<br>
achieved by the probabilistic modeling of the data. The Examples<br>
directory of this version includes additional scripts that illustrate<br>
these enhancements to the introspection capability. See the section<br>
"The Introspection API" for a declaration of the introspection related<br>
methods, old and new.<br>
<br>
Version 2.3.1:<br>
<br>
This version has a bug fix for the decision-tree introspection<br>
capability that was added to the module in Version 2.3. Also changed in<br>
this version are the names of the two scripts in the Examples directory<br>
that illustrate introspection. The new names are<br>
"introspection_at_one_node.py" and "introspection_in_a_loop.py".<br>
<br>
Version 2.3:<br>
<br>
In response to requests from several users, this version includes a new<br>
capability: You can now ask the module to introspect about the<br>
classification decisions returned by the decision tree. Toward that<br>
end, the module includes a new class named <a href="#DTIntrospection">DTIntrospection</a>. Perhaps<br>
the most important bit of information you are likely to seek through DT<br>
introspection is the list of the training samples that fall directly in<br>
the portion of the feature space that is assigned to a node. CAVEAT:<br>
When training samples are non-uniformly distributed in the underlying<br>
feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN <br>
WHEN NO TRAINING SAMPLES FALL IN THE PORTION OF THE <br>
FEATURE SPACE ASSIGNED TO THE NODE. (This is the entire <br>
point of the generalization achieved by probabilistic modeling of the <br>
training data.) For additional information related to DT introspection, <br>
see the section titled "DECISION TREE INTROSPECTION" in this <br>
documentation page.<br>
<br>
Version 2.2.6:<br>
<br>
Removed a serious bug in the method get_training_data_from_dat() that<br>
is used to read the training data from `.dat' files. The class naming<br>
convention internally in the code is that it is a concatenation of the<br>
header of the column that contains the class labels and the actual<br>
class name as mentioned in each row of the training data. There was a<br>
problem with this concatenation. Another very important bug fix was in<br>
the method probability_of_feature_value(). The probability calculation<br>
in the previous versions was erroneous for those features that acquired<br>
zero values. The users of this module may not have noticed this error<br>
in the past if the zero values for the features occurred relatively<br>
infrequently. This error has now been fixed.<br>
<br>
Version 2.2.5:<br>
<br>
If you are not using the synthetic data generation feature of this<br>
module, the changes made in this version do not affect you. The code<br>
that was changed is all in the class <a href="#TrainingDataGeneratorNumeric">TrainingDataGeneratorNumeric</a>. The<br>
changes to this class remove an important bug related to the ordering<br>
of the feature names that are read from a user-supplied parameter<br>
file. The basic Decision Tree construction and classification code<br>
remains unchanged.<br>
<br>
Version 2.2.4:<br>
<br>
This version should prove more robust when the probability distribution<br>
for the values of a feature is expected to be heavy-tailed; that is,<br>
when the supposedly rare observations can occur with significant<br>
probabilities. A new option in the <a href="#DecisionTree">DecisionTree</a> constructor lets the<br>
user specify the precision with which the probability distributions<br>
are estimated for such features.<br>
<br>
Version 2.2.3:<br>
<br>
This version fixes a bug that was caused by the explicitly set zero<br>
values for numerical features being misconstrued as "false" in the<br>
conditional statements in some of the method definitions.<br>
<br>
Version 2.2.2:<br>
<br>
In response to requests from users, this version includes scripts in<br>
the Examples directory that demonstrate how to carry out bulk<br>
classification of all your test data records placed in a CSV file in<br>
one fell swoop. Also included are scripts that demonstrate the same<br>
for the data records placed in the old-style `.dat' files. The main<br>
module code remains unchanged.<br>
<br>
Version 2.2.1:<br>
<br>
The changes made are all in the part of the module that is used for<br>
evaluating the quality of training data through a 10-fold cross<br>
validation test. The previous version used the default values for the<br>
constructor parameters when constructing the decision trees in each<br>
iteration of the test. The new version correctly uses the user-supplied<br>
values.<br>
<br>
Version 2.2:<br>
<br>
This version fixes a bug discovered in the best feature calculator<br>
function. This bug was triggered by certain conditions related to the<br>
distribution of values for the features in a training data file.<br>
Additionally, and VERY IMPORTANTLY, Version 2.2 allows you <br>
to test the quality of your training data by running a 10-fold<br>
cross-validation test on the data. This test divides all of the<br>
training data into ten parts, with nine parts used for training a<br>
decision tree and one part used for testing its ability to classify<br>
correctly. This selection of nine parts for training and one part for<br>
testing is carried out in all of the ten different possible ways. This<br>
testing functionality in Version 2.2 can also be used to find the best<br>
values to use for the constructor parameters entropy_threshold,<br>
max_depth_desired, and symbolic_to_numeric_cardinality_threshold.<br>
<br>
Version 2.1:<br>
<br>
This is a cleaned up version of v. 2.0 of the module. Should run more<br>
efficiently for large training data files that contain both numeric and<br>
symbolic features.<br>
<br>
Version 2.0:<br>
<br>
This was a major rewrite of the <a href="#DecisionTree">DecisionTree</a> module. This revision was<br>
prompted by a number of users wanting to see numeric features<br>
incorporated in the construction of decision trees. So here it is!<br>
This version allows you to use either purely symbolic features, or<br>
purely numeric features, or a mixture of the two. (A feature is numeric<br>
if it can take any floating-point value over an interval.)<br>
<br>
Version 1.7.1:<br>
<br>
This version includes a fix for a bug that was triggered by certain<br>
comment words in a training data file. This version also includes<br>
additional safety checks that are useful for catching errors and<br>
inconsistencies in large training data files that do not lend<br>
themselves to manual checking for correctness. As an example, the new<br>
version makes sure that the number of values you declare in each sample<br>
record matches the number of features declared at the beginning of the<br>
training data file.<br>
<br>
Version 1.7:<br>
<br>
This version includes safety checks on the consistency of the data you<br>
place in your training data file. When a training data file contains<br>
thousands of records, it is difficult to manually check that you used<br>
the same class names in your sample records that you declared at the<br>
top of your training file or that the values you have for your features<br>
are legal vis-a-vis the earlier declarations regarding such values in<br>
the training file. Another safety feature incorporated in this version<br>
is the non-consideration of classes that are declared at the top of the<br>
training file but that have no sample records in the file.<br>
<br>
Version 1.6.1:<br>
<br>
Fixed a bug in the method that generates synthetic test data.<br>
<br>
Version 1.6:<br>
<br>
This version includes several upgrades: The module now includes code<br>
for generating synthetic training and test data for experimenting with<br>
the <a href="#DecisionTree">DecisionTree</a> classifier. Another upgrade in the new version is<br>
that, after training, a decision tree can now be used in an interactive<br>
mode in which the user is asked to supply answers for the feature tests<br>
at the nodes as the classification process descends down the tree.<br>
<br>
Version 1.5:<br>
<br>
This is a Python 3.x compliant version of the <a href="#DecisionTree">DecisionTree</a> module.<br>
This version should work with both Python 2.x and Python 3.x.<br>
<br>
Version 1.0:<br>
<br>
This is a Python implementation of the author's Perl module<br>
Algorithm::<a href="#DecisionTree">DecisionTree</a>, Version 1.41. The Python version should work<br>
faster for large decision trees since it uses probability and entropy<br>
caching much more extensively than Version 1.41 of the Perl module.<br>
(Note: I expect my next release of the Perl module to catch up with<br>
this Python version in terms of performance.)<br>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="2">USAGE:</a><br>
</strong></span> <br>
If your training data includes numeric features (a feature is numeric<br>
if it can take any floating point value over an interval), you are<br>
expected to supply your training data through a CSV file and your call<br>
for constructing an instance of the <a href="#DecisionTree">DecisionTree</a> class will look like:<br>
<br>
<div style="font-family:Courier; font-size:80%"> training_datafile = "stage3cancer.csv"<br>
<br>
dt = <a href="#DecisionTree">DecisionTree</a>.<a href="#DecisionTree">DecisionTree</a>( <br>
training_datafile = training_datafile,<br>
csv_class_column_index = 2,<br>
csv_columns_for_features = [3,4,5,6,7,8],<br>
entropy_threshold = 0.01,<br>
max_depth_desired = 8,<br>
symbolic_to_numeric_cardinality_threshold = 10,<br>
csv_cleanup_needed = 1,<br>
)<br>
</div> <br>
The constructor option `csv_class_column_index' informs the module as<br>
to which column of your CSV file contains the class label. THE <br>
COLUMN INDEXING IS ZERO BASED. The constructor option<br>
`csv_columns_for_features' specifies which columns are to be used for<br>
feature values. The first row of the CSV file must specify the names<br>
of the features. See examples of CSV files in the `Examples'<br>
subdirectory.<br>
<br>
The option `symbolic_to_numeric_cardinality_threshold' is also<br>
important. For the example shown above, if an ostensibly numeric<br>
feature takes on only 10 or fewer different values in your training<br>
datafile, it will be treated like a symbolic feature. The option<br>
`entropy_threshold' determines the granularity with which the entropies<br>
are sampled for the purpose of calculating entropy gain with a<br>
particular choice of decision threshold for a numeric feature or a<br>
feature value for a symbolic feature. <br>
<br>
The option `csv_cleanup_needed' is by default set to 0. If you set it<br>
to 1, that would cause all line records in your CSV file to be<br>
"sanitized" before they are used for constructing a decision tree. You<br>
need this option if your CSV file uses double-quoted field names and<br>
field values in the line records and if such double-quoted strings are<br>
allowed to include commas for, presumably, better readability.<br>
<br>
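As an illustration, a line record of the following sort (this is a<br>
made-up record, not one from the `Examples' directory) would need<br>
`csv_cleanup_needed' set to 1 because one of the double-quoted field<br>
values contains a comma:<br>
<br>
<div style="font-family:Courier; font-size:80%"> "1","low","Smith, John","medium","high"<br>
</div> <br>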
After you have constructed an instance of the <a href="#DecisionTree">DecisionTree</a> class as<br>
shown above, you read in the training data file and initialize the<br>
probability cache by calling:<br>
<br>
<div style="font-family:Courier; font-size:80%"> dt.get_training_data()<br>
dt.calculate_first_order_probabilities()<br>
dt.calculate_class_priors()<br>
</div> <br>
Next you construct a decision tree for your training data by calling:<br>
<br>
<div style="font-family:Courier; font-size:80%"> root_node = dt.construct_decision_tree_classifier()<br>
</div> <br>
where root_node is an instance of the <a href="#DTNode">DTNode</a> class that is also defined<br>
in the module file. Now you are ready to classify a new data record.<br>
Let's say that your data record looks like:<br>
<br>
<div style="font-family:Courier; font-size:80%"> test_sample = ['g2 = 4.2',<br>
'grade = 2.3',<br>
'gleason = 4',<br>
'eet = 1.7',<br>
'age = 55.0',<br>
'ploidy = diploid']<br>
</div> <br>
You can classify it by calling:<br>
<br>
<div style="font-family:Courier; font-size:80%"> classification = dt.classify(root_node, test_sample)<br>
</div> <br>
The call to `classify()' returns a dictionary whose keys are the class<br>
names and whose values are the associated classification probabilities.<br>
This dictionary also includes another key-value pair for the solution<br>
path from the root node to the leaf node at which the final<br>
classification was carried out.<br>
<br>
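As a quick sketch of how you might use the dictionary returned by<br>
`classify()' (the sketch assumes, as flagged in the comments, that the<br>
class probabilities are stored as floats, so the solution-path entry<br>
can be skipped by keeping only the numeric values):<br>
<br>
<div style="font-family:Courier; font-size:80%"> # 'classification' is the dictionary returned by dt.classify() above<br>
# keep only the entries whose values are numeric class probabilities;<br>
# the solution-path entry is thereby skipped<br>
class_probs = {cls: prob for cls, prob in classification.items() if isinstance(prob, float)}<br>
<br>
# print the class labels in decreasing order of probability<br>
for cls in sorted(class_probs, key=class_probs.get, reverse=True):<br>
    print(cls, class_probs[cls])<br>
</div> <br>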
A decision tree can quickly become much too large (and much too slow to<br>
construct and to yield classification results) if the total number of<br>
features is large and/or if the number of different possible values for<br>
the symbolic features is large. You can control the size of the tree<br>
through the constructor options `entropy_threshold' and<br>
`max_depth_desired'. The latter option sets the maximum depth of your<br>
decision tree to the value you supply for max_depth_desired. The parameter<br>
`entropy_threshold' sets the granularity with which the entropies are<br>
sampled. Its default value is 0.001. The larger the value you choose<br>
for entropy_threshold, the smaller the tree.<br>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="3">INTRODUCTION:</a><br>
</strong></span> <br>
<a href="#DecisionTree">DecisionTree</a> is a Python module for constructing a decision tree from a<br>
training data file containing multidimensional data in the form of a<br>
table. In one form or another, decision trees have been around for over<br>
fifty years. From a statistical perspective, they are closely related<br>
to classification and regression by recursive partitioning of<br>
multidimensional data. Early work that demonstrated the usefulness of<br>
such partitioning of data for classification and regression can be<br>
traced to the work of Terry Therneau in the early 1980's in the<br>
statistics community, and to the work of Ross Quinlan in the mid 1990's<br>
in the machine learning community.<br>
<br>
For those not familiar with decision tree ideas, the traditional way to<br>
classify multidimensional data is to start with a feature space whose<br>
dimensionality is the same as that of the data. Each feature measures<br>
a specific attribute of an entity. You use the training data to carve<br>
up the feature space into different regions, each corresponding to a<br>
different class. Subsequently, when you try to classify a new data<br>
sample, you locate it in the feature space and find the class label of<br>
the region to which it belongs. One can also give the data point the<br>
same class label as that of the nearest training sample. This is<br>
referred to as the nearest neighbor classification. There exist<br>
hundreds of variations of varying power on these two basic approaches<br>
to the classification of multidimensional data.<br>
<br>
A decision tree classifier works differently. When you construct a<br>
decision tree, you select for the root node a feature test that<br>
partitions the training data in a way that causes maximal<br>
disambiguation of the class labels associated with the data. In terms<br>
of information content as measured by entropy, such a feature test<br>
would cause maximum reduction in class entropy in going from all of the<br>
training data taken together to the data as partitioned by the feature<br>
test. You then drop from the root node a set of child nodes, one for<br>
each partition of the training data created by the feature test at the<br>
root node. When your features are purely symbolic, you'll have one<br>
child node for each value of the feature chosen for the feature test at<br>
the root. When the test at the root involves a numeric feature, you<br>
find the decision threshold for the feature that best bipartitions the<br>
data and you drop from the root node two child nodes, one for each<br>
partition. Now at each child node you pose the same question that you<br>
posed when you found the best feature to use at the root: Which feature<br>
at the child node in question would maximally disambiguate the class<br>
labels associated with the training data corresponding to that child<br>
node?<br>
<br>
As the reader would expect, the two key steps in any approach to<br>
decision-tree based classification are the construction of the decision<br>
tree itself from a file containing the training data, and then using<br>
the decision tree thus obtained for classifying new data.<br>
<br>
What is cool about decision tree classification is that it gives you<br>
soft classification, meaning it may associate more than one class label<br>
with a given data record. When this happens, it may mean that your<br>
classes are indeed overlapping in the underlying feature space. It<br>
could also mean that you simply have not supplied sufficient training<br>
data to the decision tree classifier. For a tutorial introduction to<br>
how a decision tree is constructed and used, see<br>
<br>
<a href="https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf">https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf</a><br>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="4">WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE?</a><br>
</strong></span> <br>
If you are new to the concept of a decision tree, their practical<br>
utility is best understood with an example that only involves symbolic<br>
features. However, as mentioned earlier, versions 2.0 and higher of<br>
this module handle both symbolic and numeric features.<br>
<br>
Consider the following scenario: Let's say you are running a small<br>
investment company that employs a team of stockbrokers who make<br>
buy/sell decisions for the customers of your company. Assume that your<br>
company has asked the traders to make each investment decision on the<br>
basis of the following five criteria:<br>
<br>
<div style="font-family:Courier; font-size:80%"> price_to_earnings_ratio (P_to_E)<br>
<br>
price_to_sales_ratio (P_to_S)<br>
<br>
return_on_equity (R_on_E)<br>
<br>
market_share (M_S)<br>
<br>
sentiment (S)<br>
</div> <br>
Since you are the boss, you keep track of the buy/sell decisions made<br>
by the individual traders. But one unfortunate day, all of your<br>
traders decide to quit because you did not pay them enough. So what<br>
are you to do? If you had a module like the one here, you could still<br>
run your company and do so in such a way that your company would, on<br>
the average, perform better than any of the individual traders who<br>
worked for you previously. This is what you would need to do: You<br>
would pool together the individual trader buy/sell decisions you<br>
accumulated over the past year. This pooled information is<br>
likely to look like:<br>
<br>
<div style="font-family:Courier; font-size:80%"> example buy/sell P_to_E P_to_S R_on_E M_S S<br>
====================================================================<br>
<br>
example_1 buy high low medium low high<br>
example_2 buy medium medium low low medium<br>
example_3 sell low medium low high low<br>
....<br>
....<br>
</div> <br>
This would constitute your training data. Assuming CSV formatting for<br>
the data in a file called 'training.csv', you would need to feed this<br>
file into the module by calling:<br>
<br>
<div style="font-family:Courier; font-size:80%"> dt = <a href="#DecisionTree">DecisionTree</a>( training_datafile = "training.csv", .... )<br>
<br>
dt.get_training_data()<br>
<br>
dt.calculate_first_order_probabilities()<br>
<br>
dt.calculate_class_priors()<br>
</div> <br>
Subsequently, you would construct a decision tree by calling:<br>
<br>
<div style="font-family:Courier; font-size:80%"> root_node = dt.construct_decision_tree_classifier()<br>
</div> <br>
Now you and your company (with practically no employees) are ready to<br>
service the customers again. Suppose your computer needs to make a<br>
buy/sell decision about an investment prospect that is best described<br>
by:<br>
<br>
<div style="font-family:Courier; font-size:80%"> price_to_earnings_ratio = low<br>
price_to_sales_ratio = very_low<br>
return_on_equity = none<br>
market_share = medium <br>
sentiment = low<br>
</div> <br>
All that your computer would need to do would be to construct a data<br>
record like<br>
<br>
<div style="font-family:Courier; font-size:80%"> test_case = [ 'P_to_E=low', <br>
'P_to_S=very_low', <br>
'R_on_E=none',<br>
'M_S=medium',<br>
'S=low' ]<br>
</div> <br>
and call the decision tree classifier you just constructed by<br>
<br>
<div style="font-family:Courier; font-size:80%"> classification = dt.classify(root_node, test_case)<br>
<br>
print "Classification: ", classification<br>
</div> <br>
The answer returned will be 'buy' and 'sell', along with the associated<br>
probabilities. So if the probability of 'buy' is considerably greater<br>
than the probability of 'sell', that's what you should instruct your<br>
computer to do.<br>
<br>
The chances are that, on the average, this approach would beat the<br>
performance of any of your individual traders who worked for you<br>
previously since the buy/sell decisions made by the computer would be<br>
based on the collective wisdom of all your previous traders.<br>
DISCLAIMER: There is obviously a lot more to good investing than what<br>
is captured by the silly little example here. However, it does convey<br>
the sense in which the current module can be used.<br>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="5">SYMBOLIC FEATURES VERSUS NUMERIC FEATURES:</a><br>
</strong></span> <br>
A feature is symbolic when its values are compared using string<br>
comparison operators. By the same token, a feature is numeric when its<br>
values are compared using numeric comparison operators. Having said<br>
that, features that take only a small number of numeric values in the<br>
training data can be treated symbolically provided you are careful<br>
about handling their values in the test data. At the least, you have<br>
to set the test data value for such a feature to its closest value in<br>
the training data. For those numeric features that the module is<br>
allowed to treat symbolically, this snapping of the values of the<br>
features in the test data to the small set of values in the training<br>
data is carried out automatically by the module. That is, after a user<br>
has told the module which numeric features to treat symbolically, the<br>
user need not worry about how the feature values appear in the test<br>
data.<br>
<br>
The constructor parameter symbolic_to_numeric_cardinality_threshold<br>
lets you tell the module when to consider an otherwise numeric feature<br>
symbolically. If you set this parameter to 10, then all numeric-looking<br>
features that take 10 or fewer different values in the training datafile<br>
will be treated as symbolic features.<br>
<br>
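If you are unsure how many distinct values a numeric-looking feature<br>
takes in your training file, a quick check with Python's csv module can<br>
help you choose this threshold. (The following is only a sketch that is<br>
not part of the module; the column index shown is a placeholder.)<br>
<br>
<div style="font-family:Courier; font-size:80%"> import csv<br>
<br>
with open("stage3cancer.csv") as f:<br>
    rows = list(csv.reader(f))<br>
<br>
column_index = 5        # placeholder: the feature column you want to examine<br>
values = set(row[column_index] for row in rows[1:])   # skip the header row<br>
print("number of distinct values:", len(values))<br>
</div> <br>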
See the tutorial at<br>
<br>
<a href="https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf">https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf</a><br>
<br>
for further information on the implementation issues related to the<br>
symbolic and numeric features.<br>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="6">FEATURES WITH NOT SO "NICE" STATISTICAL PROPERTIES:</a><br>
</strong></span> <br>
For the purpose of estimating the probabilities, it is necessary to<br>
sample the range of values taken on by a numerical feature. For<br>
features with "nice" statistical properties, this sampling interval is<br>
set to the median of the differences between the successive feature<br>
values in the training data. (Obviously, as you would expect, you<br>
first sort all the values for a feature before computing the successive<br>
differences.) This logic will not work for the sort of feature<br>
described below.<br>
<br>
Consider a feature whose values are heavy-tailed, and, at the same<br>
time, the values span a million to one range. What I mean by<br>
heavy-tailed is that rare values can occur with significant<br>
probabilities. It could happen that most of the values for such a<br>
feature are clustered at one of the two ends of the range. At the same<br>
time, there may exist a significant number of values near the end of<br>
the range that is less populated. (Typically, features related to<br>
human economic activities --- such as wealth, incomes, etc. --- are of<br>
this type.) With the logic described in the previous paragraph, you<br>
could end up with a sampling interval that is much too small, which<br>
could result in millions of sampling points for the feature if you are<br>
not careful.<br>
<br>
Beginning with Version 2.2.4, you have two options in dealing with such<br>
features. You can choose to go with the default behavior of the<br>
module, which is to sample the value range for such a feature over a<br>
maximum of 500 points. Or, you can supply an additional option to the<br>
constructor that sets a user-defined value for the number of points to<br>
use. The name of the option is "number_of_histogram_bins". The<br>
following script<br>
<br>
<div style="font-family:Courier; font-size:80%"> construct_dt_for_heavytailed.py<br>
</div> <br>
in the Examples directory shows an example of how to call the<br>
constructor of the module with the "number_of_histogram_bins" option.<br>
<br>
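A constructor call that includes this option might look like the<br>
following sketch (the file name and the other parameter values are<br>
merely illustrative; see the script named above for a working example):<br>
<br>
<div style="font-family:Courier; font-size:80%"> dt = <a href="#DecisionTree">DecisionTree</a>.<a href="#DecisionTree">DecisionTree</a>(<br>
                training_datafile = "heavytailed_data.csv",<br>
                csv_class_column_index = 1,<br>
                csv_columns_for_features = [2,3],<br>
                number_of_histogram_bins = 100,<br>
      )<br>
</div> <br>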
<br>
<span style="color:red; font-size:150%"><strong><a id="7">TESTING THE QUALITY OF YOUR TRAINING DATA:</a><br>
</strong></span> <br>
Starting with version 2.2, the module includes a new class named<br>
<a href="#EvalTrainingData">EvalTrainingData</a>, derived from the main class <a href="#DecisionTree">DecisionTree</a>, that runs a<br>
10-fold cross-validation test on your training data to test its ability<br>
to discriminate between the classes mentioned in the training file.<br>
<br>
The 10-fold cross-validation test divides all of the training data into<br>
ten parts, with nine parts used for training a decision tree and one<br>
part used for testing its ability to classify correctly. This selection<br>
of nine parts for training and one part for testing is carried out in<br>
all of the ten different possible ways. <br>
<br>
The following code fragment illustrates how you invoke the testing<br>
function of the <a href="#EvalTrainingData">EvalTrainingData</a> class:<br>
<br>
<div style="font-family:Courier; font-size:80%"> training_datafile = "training3.csv"<br>
eval_data = <a href="#DecisionTree">DecisionTree</a>.<a href="#EvalTrainingData">EvalTrainingData</a>(<br>
training_datafile = training_datafile,<br>
csv_class_column_index = 1,<br>
csv_columns_for_features = [2,3],<br>
entropy_threshold = 0.01,<br>
max_depth_desired = 3,<br>
symbolic_to_numeric_cardinality_threshold = 10,<br>
csv_cleanup_needed = 1,<br>
)<br>
<br>
eval_data.get_training_data()<br>
eval_data.evaluate_training_data()<br>
</div> <br>
The last statement above prints out a Confusion Matrix and the value of<br>
Training Data Quality Index on a scale of 100, with 100 designating<br>
perfect training data. The Confusion Matrix shows how the different<br>
classes were misidentified in the 10-fold cross-validation test.<br>
<br>
This testing functionality can also be used to find the best values to<br>
use for the constructor parameters entropy_threshold,<br>
max_depth_desired, and symbolic_to_numeric_cardinality_threshold.<br>
<br>
The following two scripts in the Examples directory illustrate the use<br>
of the <a href="#EvalTrainingData">EvalTrainingData</a> class for testing the quality of your data:<br>
<br>
<div style="font-family:Courier; font-size:80%"> evaluate_training_data1.py<br>
<br>
evaluate_training_data2.py<br>
</div> <br>
IMPORTANT: These data evaluation scripts produce reliable indicators of<br>
the quality of your data only when there is a rough parity between the<br>
number of training samples available for the different data classes.<br>
<br>
<br>
<span style="color:red; font-size:150%"><strong><a id="8">HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS:</a><br>
</strong></span> <br>
Assuming your training data is good, the quality of the results you get<br>
from a decision tree would depend on the choices you make for the<br>
constructor parameters entropy_threshold, max_depth_desired, and<br>
symbolic_to_numeric_cardinality_threshold. You can optimize your<br>
choices for these parameters by running the 10-fold cross-validation<br>
test that is made available in Versions 2.2 and higher through the new<br>
class <a href="#EvalTrainingData">EvalTrainingData</a> that is included in the module file. A<br>
description of how to run this test is in the section titled "TESTING<br>
THE QUALITY OF YOUR TRAINING DATA" of this document.<br>
<br>
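One simple way to carry out such a search is to loop over a few<br>
candidate values and run the evaluation for each. The following is only<br>
a sketch built from the <a href="#EvalTrainingData">EvalTrainingData</a> calls shown earlier; the<br>
candidate values are placeholders, and the Confusion Matrix and the<br>
quality index for each run are printed by evaluate_training_data() itself:<br>
<br>
<div style="font-family:Courier; font-size:80%"> for entropy_threshold in [0.001, 0.01, 0.1]:<br>
    for max_depth_desired in [3, 5, 8]:<br>
        print("trying entropy_threshold =", entropy_threshold, " max_depth_desired =", max_depth_desired)<br>
        eval_data = <a href="#DecisionTree">DecisionTree</a>.<a href="#EvalTrainingData">EvalTrainingData</a>(<br>
                        training_datafile = "training3.csv",<br>
                        csv_class_column_index = 1,<br>
                        csv_columns_for_features = [2,3],<br>
                        entropy_threshold = entropy_threshold,<br>
                        max_depth_desired = max_depth_desired,<br>
                        symbolic_to_numeric_cardinality_threshold = 10,<br>
                        csv_cleanup_needed = 1,<br>
                    )<br>
        eval_data.get_training_data()<br>
        eval_data.evaluate_training_data()<br>
</div> <br>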
<br>
<span style="color:red; font-size:150%"><strong><a id="25">SOLVING NEEDLE-IN-A-HAYSTACK AND BIG DATA PROBLEMS:</a><br>
</strong></span> <br>
Machine learning algorithms, in general, run into difficulties when<br>
there is gross imbalance in how many training samples are available for<br>
the different data classes --- especially when the different data<br>
classes are not linearly separable in the underlying feature<br>
space. Starting with Version 3.3.0 of this module, you can try to use<br>
the new Python class RandomizedTreesForBigData for solving such<br>
problems provided you set the following two constructor parameters:<br>
<br>
<div style="font-family:Courier; font-size:80%"> looking_for_needles_in_haystack<br>
<br>
how_many_trees<br>
</div> <br>
At the moment, RandomizedTreesForBigData is programmed for just binary<br>
classification. It first figures out the population imbalance between<br>
the majority data class and the minority data class in your dataset.<br>
Subsequently, by mixing the minority training samples with randomly<br>
drawn samples from the majority data class, it constructs a collection<br>
of training datasets for 'how_many_trees' number of decision trees. A<br>
new data sample to be classified is fed to each of these decision trees<br>
and the final classification is based on majority voting by all the trees.<br>
You must set 'looking_for_needles_in_haystack = 1' in the constructor<br>
call for this logic to work. You'd also want to experiment with<br>
the option 'how_many_trees' to figure out the best value for this<br>
parameter. Searching for a needle in a haystack is obviously a good<br>
metaphor for these types of data classification problems.<br>
<br>
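A constructor call for this needle-in-a-haystack mode might look like<br>
the following sketch. (The sketch assumes that RandomizedTreesForBigData<br>
accepts the same file-related constructor parameters as the main<br>
<a href="#DecisionTree">DecisionTree</a> class, and the file name and the numbers shown are only<br>
placeholders; see the ExamplesRandomizedTrees directory for working scripts.)<br>
<br>
<div style="font-family:Courier; font-size:80%"> rt = <a href="#DecisionTree">DecisionTree</a>.RandomizedTreesForBigData(<br>
                training_datafile = "my_imbalanced_data.csv",<br>
                csv_class_column_index = 1,<br>
                csv_columns_for_features = [2,3,4,5],<br>
                looking_for_needles_in_haystack = 1,<br>
                how_many_trees = 5,<br>
      )<br>
</div> <br>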
The RandomizedTreesForBigData class can also be used for solving big<br>
data problems if you have access to a very large training database.<br>
You can solve such problems by constructing multiple decision trees,<br>
each based on a training dataset drawn randomly from the large training<br>
database (without paying attention to population imbalances).<br>
Subsequently, the final classification for a new data sample can be<br>
based on majority voting by all the trees. In order to use this<br>
functionality, you need to set the following two constructor parameters<br>
of this class:<br>
<br>
<div style="font-family:Courier; font-size:80%"> how_many_training_samples_per_tree<br>
<br>
how_many_trees<br>
</div> <br>
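For this big-data mode, the constructor call would instead set those two<br>
parameters, as in the following sketch (again, the file name and the<br>
numbers shown are only placeholders):<br>
<br>
<div style="font-family:Courier; font-size:80%"> rt = <a href="#DecisionTree">DecisionTree</a>.RandomizedTreesForBigData(<br>
                training_datafile = "my_large_training_database.csv",<br>
                csv_class_column_index = 1,<br>
                csv_columns_for_features = [2,3,4,5],<br>
                how_many_training_samples_per_tree = 5000,<br>
                how_many_trees = 10,<br>
      )<br>
</div> <br>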
<br>
<span style="color:red; font-size:150%"><strong><a id="9">DECISION TREE INTROSPECTION:</a><br>
</strong></span> <br>
Starting with Version 2.3, you can ask the <a href="#DTIntrospection">DTIntrospection</a> class of the<br>
module to explain the classification decisions made at the different<br>
nodes of the decision tree.<br>
<br>
Perhaps the most important bit of information you are likely to seek<br>
through DT introspection is the list of the training samples that fall<br>
directly in the portion of the feature space that is assigned to a<br>
node.<br>
<br>
However, note that, when training samples are non-uniformly distributed<br>
in the underlying feature space, it is possible for a node to exist<br>
even when there are no training samples in the portion of the feature<br>
space assigned to the node. That is because the decision tree is<br>
constructed from the probability densities estimated from the training<br>
data. When the training samples are non-uniformly distributed, it is<br>
entirely possible for the estimated probability densities to be<br>
non-zero in a small region around a point even when there are no<br>
training samples specifically in that region. (After you have created<br>
a statistical model for, say, the height distribution of people in a<br>
community, the model may return a non-zero probability for the height<br>
values in a small interval even if the community does not include a<br>
single individual whose height falls in that interval.)<br>
<br>
That a decision-tree node can exist even where there are no training samples in<br>
the portion of the feature space that belongs to that node is an important<br>
indication of the generalization ability of a decision-tree based classifier.<br>
<br>
In light of the explanation provided above, before the <a href="#DTIntrospection">DTIntrospection</a><br>
class supplies any answers at all, it asks you to accept the fact that<br>
features can take on non-zero probabilities at a point in the feature<br>
space even though there are zero training samples at that point (or in<br>
a small region around that point). If you do not accept this<br>
rudimentary fact, the introspection class will not yield any answers<br>
(since you are not going to believe the answers anyway).<br>
<br>
The point made above implies that the path leading to a node in the<br>
decision tree may test a feature for a certain value or threshold<br>
despite the fact that the portion of the feature space assigned to that<br>
node is devoid of any training data.<br>
<br>
See the following three scripts in the Examples directory for how to<br>
carry out DT introspection:<br>
<br>
<div style="font-family:Courier; font-size:80%"> introspection_in_a_loop_interactive.py<br>
<br>
introspection_show_training_samples_at_all_nodes_direct_influence.py<br>
<br>
introspection_show_training_samples_to_nodes_influence_propagation.py<br>
</div> <br>
The first script places you in an interactive session in which you will<br>
first be asked for the node number you are interested in.<br>
Subsequently, you will be asked for whether or not you are interested<br>
in specific questions that the introspection can provide answers<br>
for. The second script descends down the decision tree and shows for<br>
each node the training samples that fall directly in the portion of the<br>
feature space assigned to that node. The third script shows for each<br>
training sample how it affects the decision-tree nodes either directly<br>
or indirectly through the generalization achieved by the probabilistic<br>
modeling of the data.<br>
<br>
The output of the script<br>
introspection_show_training_samples_at_all_nodes_direct_influence.py<br>
looks like:<br>
<br>
<div style="font-family:Courier; font-size:80%"> Node 0: the samples are: None<br>
Node 1: the samples are: ['sample_46', 'sample_58']<br>
Node 2: the samples are: ['sample_1', 'sample_4', 'sample_7', .....]<br>
Node 3: the samples are: []<br>
Node 4: the samples are: []<br>
...<br>
... <br>
</div> <br>
The nodes for which no samples are listed come into existence through<br>
the generalization achieved by the probabilistic modeling of the data.<br>
<br>
The output produced by the script<br>
introspection_show_training_samples_to_nodes_influence_propagation.py<br>
looks like:<br>
<br>
<div style="font-family:Courier; font-size:80%"> sample_1: <br>
nodes affected directly: [2, 5, 19, 23] <br>
nodes affected through probabilistic generalization: <br>
2=> [3, 4, 25] <br>
25=> [26] <br>
5=> [6] <br>
6=> [7, 13] <br>
7=> [8, 11] <br>
8=> [9, 10] <br>
11=> [12] <br>
13=> [14, 18] <br>
14=> [15, 16] <br>
16=> [17] <br>
19=> [20] <br>
20=> [21, 22] <br>
23=> [24] <br>
<br>
sample_4: <br>
nodes affected directly: [2, 5, 6, 7, 11] <br>
nodes affected through probabilistic generalization: <br>
2=> [3, 4, 25] <br>
25=> [26] <br>
5=> [19] <br>
19=> [20, 23] <br>
20=> [21, 22] <br>
23=> [24] <br>
6=> [13] <br>
13=> [14, 18] <br>
14=> [15, 16] <br>
16=> [17] <br>
7=> [8] <br>
8=> [9, 10] <br>
11=> [12] <br>
<br>
... <br>
... <br>
...<br>
</div> <br>