Nutch Change Log
Nutch 1.9 Release Change Log - 12/08/2014 (dd/mm/yyyy)
Release Report - http://s.apache.org/1.9-release
* NUTCH-1561 improve usability of parse-metatags and index-metadata (snagel)
* NUTCH-1708 use same id when indexing and deleting redirects (snagel)
* NUTCH-1818 Add deps-test-compile task for building plugins (jnioche)
* NUTCH-1817 Remove pom.xml from source (jnioche)
* NUTCH-926 Redirections from META tag don't get filtered (snagel)
* NUTCH-1422 Bypass signature comparison when a document is redirected (snagel)
* NUTCH-1502 Test for CrawlDatum state transitions (snagel)
* NUTCH-1804 Move JUnit dependency to test scope (jnioche)
* NUTCH-1811 bin/nutch junit to use junit 4 test runner (snagel)
* NUTCH-1799 ANT Eclipse task discovers all plugin jars automatically (jnioche)
* NUTCH-578 URL fetched with 403 is generated over and over again (snagel)
* NUTCH-1776 Log incorrect plugin.folder file path (Diaa via snagel)
* NUTCH-1566 bin/nutch to allow whitespace in paths (tejasp, snagel)
* NUTCH-1605 MIME type detector recognizes xlsx as zip file (snagel)
* NUTCH-1802 Move TestbedProxy to test environment (jnioche)
* NUTCH-1803 Put test dependencies in a separate lib dir (jnioche)
* NUTCH-385 Improve description of thread related configuration for Fetcher (jnioche, lufeng)
* NUTCH-1633 slf4j is provided by hadoop and should not be included in the job file (kaveh minooie via jnioche)
* NUTCH-1787 update and complete API doc overview page (snagel)
* NUTCH-1767 remove special treatment of "params" in relative links (snagel)
* NUTCH-1718 redefine http.robots.agent as "additional agent names" (snagel, Tejas Patil, Daniel Kugel)
* NUTCH-1794 IndexingFilterChecker to optionally dumpText (markus)
* NUTCH-1590 [SECURITY] Frame injection vulnerability in published Javadoc (jnioche)
* NUTCH-1793 HttpRobotRulesParser not configured properly (jnioche)
* NUTCH-1647 protocol-http throws 'unzipBestEffort returned null' for redirected pages (jnioche)
* NUTCH-1736 Can't fetch page if http response header contains Transfer-Encoding:chunked (ysc via jnioche)
* NUTCH-1782 NodeWalker to return current node (markus)
* NUTCH-1758 IndexChecker to send document to IndexWriters (jnioche)
* NUTCH-1786 CrawlDb should follow db.url.normalizers and db.url.filters (Diaa via markus)
* NUTCH-1757 ParserChecker to take custom metadata as input (jnioche)
* NUTCH-1676 Add rudimentary SSL support to protocol-http (jnioche, markus)
* NUTCH-1772 Injector does not need merging if no pre-existing crawldb (jnioche)
* NUTCH-1752 Cache robots.txt rules per protocol:host:port (snagel)
* NUTCH-1613 Timeouts in protocol-httpclient when crawling same host with >2 threads (brian44 via jnioche)
* NUTCH-1766 Generator to unlock crawldb and remove tempdir if generate job fails (Diaa via jnioche)
* NUTCH-207 Bandwidth target for fetcher rather than a thread count (jnioche)
* NUTCH-1182 fetcher to log hung threads (snagel)
* NUTCH-1759 Upgrade to Crawler Commons 0.4 (jnioche)
* NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, etc.) given (Diaa via snagel)
* NUTCH-1700 Remove deprecated code from creativecommons plugin (lewismc)
* NUTCH-1761 Crawl script fails to find job file if not started from inside bin dir (David Hosking, jnioche)
* NUTCH-1603 ZIP parser complains about truncated PDF file (snagel)
* NUTCH-1720 Duplicate lines in HttpBase.java (Walter Tietze via jnioche)
* NUTCH-1750 Improvement of Fetcher's reportStatus (jnioche)
* NUTCH-1747 Use AtomicInteger as semaphore in Fetcher (jnioche)
* NUTCH-1735 code dedup fetcher queue redirects (snagel)
* NUTCH-1745 Upgrade to ElasticSearch 1.1.0 (jnioche)
* NUTCH-1645 Junit Test Case for Adaptive Fetch Schedule class (Yasin Kılınç, lufeng, Sertac TURKEL via snagel)
* NUTCH-1737 Upgrade to recent JUnit 4.x (lewismc)
* NUTCH-1733 parse-html to support HTML5 charset definitions (snagel)
* NUTCH-1671 indexchecker to add digest field (snagel, lufeng)
Nutch 1.8 - 11/03/2014 (dd/mm/yyyy)
Release Report - http://s.apache.org/oHY
* NUTCH-1706 IndexerMapReduce does not remove db_redir_temp (markus, snagel)
* NUTCH-1113 SegmentMerger can now be safely used to merge segments (Edward Drapkin, markus, snagel)
* NUTCH-1729 Upgrade to Tika 1.5 (jnioche)
* NUTCH-1707 DummyIndexingWriter (markus)
* NUTCH-1721 Upgrade to Crawler commons 0.3 (tejasp)
* NUTCH-1253 Incompatible neko and xerces versions (snagel, lewismc)
* NUTCH-1715 RobotRulesParser adds additional '*' to the robots name (tejasp)
* NUTCH-356 Plugin repository cache can lead to memory leak (Enrico Triolo, Doğacan Güney via markus)
* NUTCH-1413 Record response time (Yasin Kılınç, Talat Uyarer, snagel)
* NUTCH-1680 CrawlDbReader to dump minRetry value (markus)
* NUTCH-1699 Tika Parser - Image Parse Bug (Mehmet Zahid Yüzügüldü, snagel via lewismc)
* NUTCH-1695 Add NutchDocument.toString() to ease debugging (markus)
* NUTCH-1675 NutchField to support long (markus)
* NUTCH-1670 set same crawldb directory in mergedb parameter (lufeng via tejasp)
* NUTCH-1080 Type safe members, arguments for better readability (tejasp)
* NUTCH-1360 Support the storing of IP address connected to when web crawling (lewismc, ferdy and Yasin Kılınç)
* NUTCH-1681 In URLUtil.java, toUNICODE method does not work correctly (İlhami KALKAN, snagel via markus)
* NUTCH-1668 Remove package org.apache.nutch.indexer.solr (jnioche)
* NUTCH-1621 Remove deprecated class o.a.n.crawl.Crawler (Rui Gao via jnioche)
* NUTCH-656 Generic Deduplicator (jnioche, snagel)
* NUTCH-1100 Avoid NPE in SOLRDedup (markus)
* NUTCH-1666 Optimisation for BasicURLNormalizer (jnioche)
* NUTCH-1656 ParseMeta not passed to CrawlDatum for not_modified (markus)
* NUTCH-1606 Check that Factory classes use the cache in a thread safe way (jnioche)
* NUTCH-1653 AbstractScoringFilter (jnioche)
* NUTCH-1562 Order of execution for scoring filters (jnioche, snagel)
* NUTCH-1640 Reuse ParseUtil instance in ParseSegment (Mitesh Singh Jat via jnioche)
* NUTCH-1639 bin/crawl fails on mac os (various contributors via snagel)
* NUTCH-1646 IndexerMapReduce to consider DB status (markus)
* NUTCH-1636 Indexer to normalize and filter repr URL (Iain Lopata via snagel)
* NUTCH-1637 URLUtil is missing getProtocol (markus)
* NUTCH-1622 Create Outlinks with metadata (jnioche)
* NUTCH-1629 Injector skips empty lines in seed files (kaveh minooie via jnioche)
* NUTCH-911 protocol-file to return proper protocol status (Peter Lundberg via snagel)
* NUTCH-806 Merge CrawlDBScanner with CrawlDBReader (jnioche)
* NUTCH-1587 misspelled property "threshold" in conf/log4j.properties (snagel)
* NUTCH-1604 ProtocolFactory not thread-safe (jnioche)
* NUTCH-1595 Upgrade to Tika 1.4 (jnioche, markus)
* NUTCH-1598 ElasticSearchIndexer to read ImmutableSettings from config (markus)
* NUTCH-1520 SegmentMerger loses records (markus)
* NUTCH-1602 improve the readability of metadata in readdb dump normal (lufeng)
* NUTCH-1596 HeadingsParseFilter not thread safe (snagel via markus)
* NUTCH-1597 HeadingsParseFilter to trim and remove excess whitespace (markus)
* NUTCH-1601 ElasticSearchIndexer fails to properly delete documents (markus)
* NUTCH-1600 Injector overwrite does not always work properly (markus)
* NUTCH-1581 CrawlDB csv output to include metadata (markus)
* NUTCH-1327 QueryStringNormalizer (markus)
* NUTCH-1593 Normalize option missing in SegmentMerger's usage (markus)
* NUTCH-1580 index-static returns object instead of value for index.static (Antoinette, lewismc, snagel)
* NUTCH-1126 JUnit test for urlfilter-prefix (Talat UYARER via markus)
Apache Nutch 1.7 Release - 06/20/2013 (mm/dd/yyyy)
Release report - http://s.apache.org/1zE
* NUTCH-1585 Ensure duplicate tags do not exist in microformat-reltag tag set. (lewismc)
* NUTCH-1583 Headings plugin to support multivalued headings (markus)
* NUTCH-1245 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb (snagel)
* NUTCH-1527 Elasticsearch indexer (lufeng + markus)
* NUTCH-1475 Index-More Plugin -- A better fall back value for date field (James Sullivan, snagel via lewismc)
* NUTCH-1560 index-metadata to add all values of multivalued metadata (snagel)
* NUTCH-1467 Not able to parse multiValued metatags (kiran via snagel)
* NUTCH-1430 Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule (markus)
* NUTCH-1522 Upgrade to Tika 1.3 (jnioche)
* NUTCH-1578 Upgrade to Hadoop 1.2.0 (markus)
* NUTCH-1577 Add target for creating eclipse project (tejasp)
* NUTCH-1513 Support Robots.txt for Ftp urls (tejasp)
* NUTCH-1249 and NUTCH-1275 : Resolve all issues flagged up by adding javac -Xlint argument (tejasp)
* NUTCH-1053 Parsing of RSS feeds fails (tejasp)
* NUTCH-956 solrindex issues: add field tld to Solr schema (Alexis via lewismc, snagel)
* NUTCH-1277 Fix [fallthrough] javac warnings (tejasp)
* NUTCH-1514 Phase out the deprecated configuration properties (if possible) (tejasp)
* NUTCH-1334 NPE in FetcherOutputFormat (jnioche via tejasp)
* NUTCH-1549 Fix deprecated use of Tika MimeType API in o.a.n.util.MimeUtil (tejasp)
* NUTCH-346 Improve readability of logs/hadoop.log (Renaud Richardet via tejasp)
* NUTCH-829 duplicate hadoop temp files (Mike Baranczak, lewismc, tejasp)
* NUTCH-1501 Harmonize behavior of parsechecker and indexchecker (snagel + lewismc)
* NUTCH-1031 Delegate parsing of robots.txt to crawler-commons (tejasp)
* NUTCH-1547 BasicIndexingFilter - Problem to index full title (Feng)
* NUTCH-1389 parsechecker and indexchecker to report truncated content (snagel)
* NUTCH-1419 parsechecker and indexchecker to report protocol status (snagel + lewismc)
* NUTCH-1047 Pluggable indexing backends (jnioche)
* NUTCH-1536 Ant build file has hardcoded conf dir location (zm via lewismc)
* NUTCH-1420 Get rid of the dreaded � (markus via lewismc)
* NUTCH-1521 CrawlDbFilter pass null url to urlNormalizers (Lufeng via lewismc)
* NUTCH-1284 Add site fetcher.max.crawl.delay as log output by default (tejasp)
* NUTCH-1453 Substantiate tests for IndexingFilters (lufeng via lewismc)
* NUTCH-840 Port tests from parse-html to parse-tika (lewismc, jnioche)
* NUTCH-1509 Implement read/write in NutchField (markus)
* NUTCH-1507 Remove FetcherOutput (markus)
* NUTCH-1506 Add UPDATE action to NutchIndexAction (markus)
* NUTCH-1500 bin/crawl fails on step solrindex with wrong path to segment (Tristan Buckner, snagel)
* NUTCH-1274 Fix [cast] javac warnings (tejasp via lewismc)
* NUTCH-1494 RSS feed plugin seems broken (Sourajit Basak, tejasp and lewismc)
* NUTCH-1127 JUnit test for urlfilter-validator (tejasp via lewismc)
* NUTCH-1119 JUnit test for index-static (tejasp via lewismc)
* NUTCH-1510 Upgrade to Hadoop 1.1.1 (markus)
* NUTCH-1118 JUnit test for index-basic (tejasp via lewismc)
* NUTCH-1331 limit crawler to defined depth (jnioche)
Release 1.6 - 23/11/2012
* NUTCH-1370 Expose exact number of urls injected @runtime (snagel via lewismc)
* NUTCH-1117 JUnit test for index-anchor (lewismc)
* NUTCH-1451 Upgrade automaton jar to 1.11-8 (lewismc)
* NUTCH-1488 bin/nutch to run junit from any directory (snagel via lewismc)
* NUTCH-1493 Error adding field 'contentLength'='' during solrindex using index-more (Nathan Gass via lewismc)
* NUTCH-1491 Strip UTF-8 non-character codepoints in title (Nathan Gass via markus)
* NUTCH-1421 RegexURLNormalizer to only skip rules with invalid patterns (snagel)
* NUTCH-1341 NotModified time set to now but page not modified (markus)
* NUTCH-1215 UpdateDB should not require segment as input (markus)
* NUTCH-1383 IndexingFiltersChecker to show error message instead of null pointer exception (snagel)
* NUTCH-1476 SegmentReader getStats should set parsed = -1 if no parsing took place (snagel)
* NUTCH-1252 SegmentReader -get shows wrong data (snagel)
* NUTCH-1344 BasicURLNormalizer to normalize https same as http (snagel)
* NUTCH-706 Url regex normalizer: pattern for session id removal not to match "newsId" (Meghna Kukreja via snagel)
* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x (snagel)
* NUTCH-1441 AnchorIndexingFilter should use plain HashSet (ferdy via lewismc)
* NUTCH-1470 Ensure test files are included for runtime testing (lewismc)
* NUTCH-1434 Indexer to delete robots noindex (markus)
* NUTCH-1443 Solr schema version is invalid (markus)
* NUTCH-1417 Remove o.a.n.metadata.Office (lewismc)
* NUTCH-1376 Add description parameter to every ant task (lewismc)
* NUTCH-1440 reconfigure non-existent stopwords_en.txt in schema-solr4.xml (shekhar sharma via lewismc)
* NUTCH-1439 Define boost field as type float in schema-solr4.xml (shekhar sharma via lewismc)
* NUTCH-1433 Upgrade to Tika 1.2 (jnioche)
* NUTCH-1388 Optionally maintain custom fetch interval despite AdaptiveFetchSchedule (markus)
* NUTCH-1430 Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule (markus)
* NUTCH-1087 Deprecate crawl command and replace with example script (jnioche)
* NUTCH-1306 Add option to not commit and clarify existing solr.commit.size (ferdy)
* NUTCH-1405 Allow to overwrite CrawlDatum's with injected entries (markus)
* NUTCH-1412 Upgrade commons lang (markus)
* NUTCH-1251 SolrDedup to use proper Lucene catch-all query (Arkadi Kosmynin via markus)
* NUTCH-1407 BasicIndexingFilter to optionally add domain field (markus)
* NUTCH-1408 RobotRulesParser main doesn't take URL's (markus)
* NUTCH-1300 Indexer to filter normalize URL's (markus)
* NUTCH-1330 WebGraph OutlinkDB to preserve back up (markus)
* NUTCH-1319 HostNormalizer plugin (markus)
* NUTCH-1386 Headings filter not to add empty values (markus)
* NUTCH-1356 ParseUtil use ExecutorService instead of manual thread handling (ferdy via markus)
* NUTCH-1352 Improve regex urlfilters/normalizers synchronization (ferdy via markus)
* NUTCH-1024 Dynamically set fetchInterval by MIME-type (markus)
* NUTCH-1364 Add a counter in Generator for malformed urls (lewismc)
* NUTCH-1262 Map `duplicating` content-types to a single type (markus)
* NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Andy Xue via markus)
* NUTCH-1336 Optionally not index db_notmodified pages (markus)
* NUTCH-1346 Follow outlinks to ignore external (markus)
* NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (markus)
* NUTCH-1351 DomainStatistics to aggregate by TLD (markus)
* NUTCH-1381 Allow to override default subcollection field name (markus)
* NUTCH-XX Commit to add configuration for separation of ant distribution targets (lewismc + jnioche)
Release 1.5.1 - 07/10/2012
* NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, jnioche)
* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x (snagel via lewismc)
* NUTCH-1400 Remove developer -core option for bin/nutch (jnioche)
* NUTCH-1384 Typo in ParseSegment's run-method (Matthias Agethle via markus)
* NUTCH-1398 Upgrade to Hadoop 1.0.3 (jnioche)
Release 1.5 - 04/15/2012
* NUTCH-1208 Don't include KEYS file in bin distribution (jnioche)
* NUTCH-1234 Upgrade to Tika 1.1 (jnioche, markus)
* NUTCH-809 Parse-metatags plugin (jnioche)
* NUTCH-1310 Nutch to send HTTP-accept header (markus)
* NUTCH-1305 Domain(blacklist)URLFilter to trim entries (markus)
* NUTCH-1307 Improve formatting of ant targets for clearer project help (lewismc)
* NUTCH-1299 LinkRank inverter to ignore records without Node (markus)
* NUTCH-1258 MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata (jnioche, markus)
* NUTCH-1293 IndexingFiltersChecker to store detected content type in crawldatum metadata (markus)
* NUTCH-1291 Fetcher to stringify exception on // unexpected exception (markus)
* NUTCH-965 Skip parsing for truncated documents (alexis, lewismc, ferdy)
* NUTCH-1210 DomainBlacklistFilter (markus)
* NUTCH-1193 Incorrect url transform to lowercase: parameter solr (Eduardo dos Santos Leggiero via lewismc)
* NUTCH-1272 Wrong property name for index-static in nutch-default.xml (Daniel Baur via jnioche)
* NUTCH-1259 Store detected content-type in crawldatum metadata (jnioche, markus)
* NUTCH-1266 Subcollection to optionally write to configured fields (markus)
* NUTCH-1005 Parse headings plugin (markus)
* NUTCH-1264 Configurable indexing plugin index-metadata (jnioche)
* NUTCH-1242 Allow disabling of URL Filters in ParseSegment (Edward Drapkin via markus)
* NUTCH-1256 WebGraph to dump host + score (markus)
* NUTCH-1260 Fetcher should log fetching of redirects (Sebastian Nagel via markus)
* NUTCH-1255 Change ivy.xml of all plugins to remove "nutch.root" property (ferdy)
* NUTCH-1248 Generator to select on status (markus)
* NUTCH-1177 Generator to select on retry interval (markus)
* NUTCH-1246 Upgrade to Hadoop 1.0.0 (jnioche)
* NUTCH-1139 Indexer to delete gone documents (markus)
* NUTCH-1244 CrawlDBDumper to filter by regex (markus)
* NUTCH-1237 Improve javac arguments for more verbose output (lewismc)
* NUTCH-1236 Add link to site documentation to download older versions of Nutch (lewismc)
* NUTCH-1146 Prevent generation of _SUCCESS files in output (jnioche)
* NUTCH-1232 Remove site field from index-basic (markus)
* NUTCH-1239 Webgraph should remove deleted pages from segment input (markus)
* NUTCH-1238 Fetcher throughput threshold must start before feeder finished (markus)
* NUTCH-1138 remove LogUtil from trunk and nutch gora (lewismc)
* NUTCH-1231 Upgrade to Tika 1.0 (markus)
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0 (markus)
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0 (markus)
* NUTCH-1217 Update NOTICE.txt to drop some copyrights (lewismc)
* NUTCH-1129 Add freegenerator, domainstats and crawldbscanner to log4j (markus)
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks (markus)
* NUTCH-1221 Migrate DomainStatistics to MapReduce API (markus)
* NUTCH-1216 Add trivial comment to lib/native/README.txt (lewismc)
* NUTCH-1214 DomainStats tool should be named for what it's doing (markus)
* NUTCH-1213 Pass additional SolrParams when indexing to Solr (ab)
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements (mattmann)
* NUTCH-1209 Output from ParserChecker Url missing a newline (mattmann)
* NUTCH-1207 ParserChecker to output signature (markus)
* NUTCH-1090 InvertLinks should inform when ignoring internal links (Marek Backmann via markus)
* NUTCH-1174 Outlinks are not properly normalized (markus)
* NUTCH-1203 ParseSegment to show number of milliseconds per parse (markus)
* NUTCH-1185 Decrease solr.commit.size to 250 (markus)
* NUTCH-1180 UpdateDB to backup previous CrawlDB (markus)
* NUTCH-1173 DomainStats doesn't count db_not_modified (markus)
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1 (markus)
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex (markus)
* NUTCH-1178 Incorrect CSV header CrawlDatumCsvOutputFormat (markus)
* NUTCH-1142 Normalization and filtering in WebGraph (markus)
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file (markus)
Release 1.4 - 11/4/2011
* NUTCH-1195 Add Solr 4x (trunk) example schema (ab)
* NUTCH-1192 Add '/runtime' to svn ignore (ferdy)
* NUTCH-1097 application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml (Ferdy via lewismc)
* NUTCH-797 Fix parse-tika and parse-html to use relative URL resolution per RFC-3986 (Robert Hohman, ab)
* NUTCH-1154 Upgrade to Tika 0.10. NOTE: Tika's new RTF parser may ignore more text in malformed documents than previously - see TIKA-748 for details. (ab)
* NUTCH-1109 Add Sonar targets to Ant build.xml (lewismc)
* NUTCH-1152 Upgrade SolrJ to version 3.4.0 (ab)
* NUTCH-1136 Ant pmd target is broken (lewismc)
* NUTCH-1058 Upgrade Solr schema to version 1.4 (markus)
* NUTCH-1137 LinkDB invertlinks other options ignored when using -dir option (Sebastian Nagel, markus)
* NUTCH-1141 Configurable Fetcher queue depth (jnioche)
* NUTCH-1091 Remove commons logging dependency from Nutch branch and trunk (lewismc)
* NUTCH-672 allow unit tests to be run from bin/nutch (Todd Lipton via lewismc)
* NUTCH-937 Put plugins in classes/plugins in job file (Claudio Martella, Ferdy Galema, jnioche)
* NUTCH-623 Change plugin source directory "languageidentifier" to "language-identifier" (lewismc)
* NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count (Robert Thomson via markus)
* NUTCH-1078 Upgrade all instances of commons logging to slf4j (with log4j backend) (lewismc)
* NUTCH-1115 Option to disable fixing embedded URL parameters in DomContentUtils (markus)
* NUTCH-1114 Attr file missing in domain filter (markus)
* NUTCH-1067 Configure minimum throughput for fetcher (markus)
* NUTCH-1102 Fetcher to rely on fetcher.parse directive (markus)
* NUTCH-1110 UpdateDB must not write _success file (markus)
* NUTCH-1105 Max content length option for index-basic (markus)
* NUTCH-940 static field plugin (Claudio Martella via lewismc)
* NUTCH-914 Implement Apache Project Branding Requirements (lewismc)
* NUTCH-1095 remove i18n from Nutch site to archive and legacy section of wiki (lewismc)
* NUTCH-1101 Option to purge db_gone records with updatedb (markus)
* NUTCH-1096 Empty (not null) ContentLength results in failure of fetch (Ferdy Galema via jnioche)
* NUTCH-1073 Rename parameters 'fetcher.threads.per.host.by.ip' and 'fetcher.threads.per.host' (jnioche)
* NUTCH-1089 Short compressed pages caused exception in protocol-httpclient (Simone Frenzel via jnioche)
* NUTCH-1085 Nutch script does not require HADOOP_HOME (jnioche)
* NUTCH-1075 Delegate language identification to Tika (jnioche)
* NUTCH-1049 Add classes to bin/nutch script (markus)
* NUTCH-1051 Export WebGraph node scores for Solr.ExternalFileField (markus)
* NUTCH-1083 ParserChecker implements Tools (jnioche)
* NUTCH-1082 IndexingFiltersChecker utility does not list multi valued fields (markus)
* NUTCH-1004 Do not index empty values for title field (markus)
* NUTCH-914 Implement Apache Project Branding Requirements (lewismc via jnioche)
* NUTCH-1069 Readlinkdb broken on Hadoop > 0.20 (markus)
* NUTCH-1044 Redirected URLs and possibly all of their outlinked URLs have invalid scores (jnioche)
* NUTCH-1028 Log urls when parsing (markus)
* NUTCH-1065 New mvn.template (lewismc)
* NUTCH-1072 Display number and size of queues in Fetcher status (jnioche)
* NUTCH-1071 Crawldb update displays total number of URLs per status (jnioche)
* NUTCH-1045 MimeUtil to rely on default config provided by Tika (jnioche)
* NUTCH-1057 Fetcher thread time out configurable (markus)
* NUTCH-1037 Option to deduplicate anchors prior to indexing (markus)
* NUTCH-1050 Add segmentDir option to WebGraph (markus)
* NUTCH-1055 upgrade package.html file in language identifier plugin (lewismc)
* NUTCH-1059 Remove convdb command from /bin/nutch (lewismc)
* NUTCH-1019 Edit comment in org.apache.nutch.crawl.Crawl to reflect removal of legacy (lewismc)
* NUTCH-1023 Trivial error in error message for org.apache.nutch.crawl.LinkDbReader (lewismc)
* NUTCH-1043 Add pattern for filtering .js in default url filters (jnioche)
* NUTCH-1054 LinkDB optional during indexing (jnioche)
* NUTCH-1029 Readdb throws EOFException (markus)
* NUTCH-1036 Solr jobs should increment counters in Reporter (markus)
* NUTCH-987 Support HTTP auth for Solr communication (markus)
* NUTCH-1027 Degrade log level of `can't find rules for scope` (markus)
* NUTCH-783 IndexingFiltersChecker utility (jnioche via markus)
* NUTCH-1030 WebgraphDB program requires manually added directories (markus)
* NUTCH-1011 Normalize duplicate slashes in URL's (markus)
* NUTCH-993 NullPointerException at FetcherOutputFormat.checkOutputSpecs (Christian Guegi via jnioche)
* NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex (markus)
* NUTCH-1016 Strip UTF-8 non-character codepoints and add logging for SolrWriter (markus)
* NUTCH-1012 Cannot handle illegal charset $charset (markus)
* NUTCH-1022 Upgrade version number of Nutch agent in conf (markus)
* NUTCH-295 Description for fetcher.threads.fetch property (kubes via markus)
* NUTCH-1000 Add option not to commit to Solr (markus)
* NUTCH-1006 MetaEquiv with single quotes not accepted (markus)
* NUTCH-1010 ContentLength not trimmed (markus)
Release 1.3 - 6/4/2011
* NUTCH-995 Generate POM file using the Ivy makepom task (mattmann, jnioche, Gabriele Kahlout)
* NUTCH-1003 task 'package' does not reflect the new organisation of the code (jnioche)
* NUTCH-994 Fine tune Solr schema (markus)
* NUTCH-997 IndexingFilters to store Date objects instead of Strings (jnioche)
* NUTCH-996 Indexer adds solr.commit.size+1 docs (markus)
* NUTCH-983 Upgrade SolrJ to 3.1 (markus, jnioche)
* NUTCH-989 Index-basic plugin and Solr schema now use date fieldType for tstamp field (markus)
* NUTCH-888 Remove parse-rss and add tests for rss to parse-tika (jnioche)
* NUTCH-991 SolrDedup must issue a commit (markus)
* NUTCH-986 SolrDedup fails due to incorrect date format (markus)
* NUTCH-977 SolrMappingReader uses hardcoded configuration parameter name for mapping file (markus)
* NUTCH-976 Rename properties solrindex.* to solr.* (markus)
* NUTCH-890 Fix IllegalAccessError with slf4j used in Solrj (markus)
* NUTCH-891 Subcollection plugin won't require blacklist any more (markus)
* NUTCH-972 CrawlDbMerger doesn't break on non-existent input (Gabriele Kahlout via jnioche)
* NUTCH-967 Upgrade to Tika 0.9 (jnioche)
* NUTCH-975 Fix missing/wrong headers in source files (markus)
* NUTCH-963 Add support for deleting Solr documents with STATUS_DB_GONE in CrawlDB (Claudio Martella, markus)
* NUTCH-825 Publish nutch artifacts to central maven repository (mattmann, jnioche)
* NUTCH-962 max. redirects not handled correctly: fetcher stops at max-1 redirects (Sebastian Nagel via ab)
* NUTCH-921 Reduce dependency of Nutch on config files (ab)
* NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
* NUTCH-872 Change the default fetcher.parse to FALSE (ab)
* NUTCH-564 External parser supports encoding attribute (Antony Bowesman, mattmann)
* NUTCH-964 Upgraded Xerces to 2.91, ERROR conf.Configuration - Failed to set setXIncludeAware (markus)
* NUTCH-927 Fetcher.timelimit.mins is invalid when depth is greater than 1 (Wade Lau via jnioche)
* NUTCH-824 Crawling - File Error 404 when fetching file with a hexadecimal character in the file name (Michela Becchi via jnioche)
* NUTCH-954 Strict application of Content-Length limit for http protocols (Alexis Detreglode via jnioche)
* NUTCH-950 DomainURLFilter throws NPE on bogus urls (Alexis Detreglode via jnioche)
* NUTCH-935 basicurlnormalizer removes unnecessary /./ in URLs (Stondet via markus)
* NUTCH-912 MoreIndexingFilter does not parse docx and xlsx date formats (Markus Jelsma, jnioche)
* NUTCH-886 A .gitignore file for Nutch (dogacan)
* NUTCH-930 Remove remaining dependencies on Lucene API (ab)
* NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
* NUTCH-936 LanguageIdentifier should not set empty lang field on NutchDocument (Markus Jelsma via jnioche)
* NUTCH-787 ScoringFilters should not override the injected score (jnioche)
* NUTCH-949 Conflicting ANT jars in classpath (jnioche)
* NUTCH-863 Benchmark and a testbed proxy server (ab)
* NUTCH-844 Improve NutchConfiguration (ab)
* NUTCH-845 Native hadoop libs not available through maven (ab)
* NUTCH-843 Separate the build and runtime environments (ab)
* NUTCH-821 Use ivy in nutch builds (Enis Soztutar, jnioche)
* NUTCH-837 Remove search servers and Lucene dependencies (ab)
* NUTCH-836 Remove deprecated parse plugins (jnioche)
* NUTCH-939 Added -dir command line option to SolrIndexer (Claudio Martella via ab)
* NUTCH-948 Remove Lucene dependencies (ab)
Release 1.2 - 09/18/2010
* NUTCH-901 Make index-more plug-in configurable (Markus Jelsma via mattmann)
* NUTCH-908 Infinite Loop and Null Pointer Bugs in Searching (kubes via mattmann)
* NUTCH-906 Nutch OpenSearch sometimes raises DOMExceptions (Asheesh Laroia via ab)
* NUTCH-862 HttpClient null pointer exception (Sebastian Nagel via ab)
* NUTCH-905 Configurable file protocol parent directory crawling (Thorsten Scherler, mattmann, ab)
* NUTCH-877 Allow setting of slop values for non-quote phrase queries on query-basic plugin (kubes via jnioche)
* NUTCH-716 Make subcollection index field multivalued (Dmitry Lihachev via jnioche)
* NUTCH-878 ScoringFilters should not override the injected score
* NUTCH-870 Injector should add the metadata before calling injectedScore (jnioche via mattmann)
* NUTCH-858 No longer able to set per-field boosts on lucene documents (ab)
* NUTCH-869 Add parse-html back (jnioche)
* NUTCH-871 MoreIndexingFilter missing date format (Max Lynch via mattmann)
* NUTCH-696 Timeout for Parser (ab, jnioche)
* NUTCH-857 DistributedBeans should not close their RPC counterparts (kubes)
* NUTCH-855 ScoringFilter and IndexingFilter: To allow for the propagation of URL Metatags
and their subsequent indexing (Scott Gonyea via mattmann)
* NUTCH-677 Segment merge filtering based on segment content (Marcin Okraszewski via mattmann)
* NUTCH-774 Retry interval in crawl date is set to 0 (Reinhard Schwab via mattmann)
* NUTCH-697 Generate log output for solr indexer and dedup (Dmitry Lihachev, Jeroen van Vianen via mattmann)
* NUTCH-850 SolrDeleteDuplicates needs to clone the SolrRecord objects (jnioche)
* NUTCH-838 Add timing information to all Tool classes (Jeroen van Vianen, mattmann)
* NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)
* NUTCH-831 Allow configuration of how fields crawled by Nutch are stored / indexed /
tokenized (Jeroen van Vianen via mattmann)
* NUTCH-278 Fetcher-status might need clarification: kbit/s instead of kb/s shown (Alex McLintock via mattmann)
* NUTCH-833 Website is still Lucene branded (mattmann, Alex McLintock)
* NUTCH-832 Website menu has lots of broken links - in particular the API docs (Alex McLintock via mattmann)
Release 1.1 - 2010-06-06
* NUTCH-819 Included Solr schema.xml and solrindex-mapping.xml don't play together (ab)
* NUTCH-818 Bugfix: Parse-tika uses minorCodes instead of majorCodes in ParseStatus (jnioche)
* NUTCH-816 Add zip target to build.xml (mattmann)
* NUTCH-732 Subcollection plugin not working (Filipe Antunes, ab)
* NUTCH-815 Invalid blank line before If-Modified-Since header (Pascal Dimassimo via ab)
* NUTCH-814 SegmentMerger bug (Rob Bradshaw, ab)
* NUTCH-812 Crawl.java incorrectly uses the Generator API resulting in NPE (Phil Barnett via mattmann and ab)
* NUTCH-810 Upgrade to Tika 0.7 (jnioche)
* NUTCH-785 Copy metadata from origin URL when redirecting in Fetcher + call scfilters.initialScore on newly created URL (jnioche)
* NUTCH-779 Mechanism for passing metadata from parse to crawldb (jnioche)
* NUTCH-784 CrawlDBScanner (jnioche)
* NUTCH-762 Generator can generate several segments in one parse of the crawlDB (jnioche)
* NUTCH-740 Configuration option to override default language for fetched pages (Marcin Okraszewski via jnioche)
* NUTCH-803 Upgrade to Hadoop 0.20.2 (ab)
* NUTCH-787 Upgrade Lucene to 3.0.1. (Dawid Weiss via ab)
* NUTCH-796 Zero results problems difficult to troubleshoot due to lack of logging (ab)
* NUTCH-801 Remove RTF and MP3 parse plugins (jnioche)
* NUTCH-798 Upgrade to Solr 1.4 and its dependencies (jnioche)
* NUTCH-799 SOLRIndexer to commit once all reducers have finished (jnioche)
* NUTCH-782 Ability to order htmlparsefilters (jnioche)
* NUTCH-719 fetchQueues.totalSize incorrect in Fetcher (Steven Denny via jnioche)
* NUTCH-790 Some external javadoc links are broken (siren)
* NUTCH-766 Tika parser (jnioche via mattmann)
* NUTCH-786 Improvement to the list of suffix domains (jnioche)
* NUTCH-775 Enhance searcher interface (siren)
* NUTCH-781 Update Tika to v0.6 (jnioche)
* NUTCH-269 CrawlDbReducer: OOME because no upper-bound on inlinks count (stack + jnioche)
* NUTCH-655 Injecting Crawl metadata (jnioche)
* NUTCH-658 Use counters to report fetching and parsing status (jnioche)
* NUTCH-777 Upgrading to jetty6 broke unit tests (mattmann)
* NUTCH-767 Update Tika to v0.5 for the MimeType detection (Julien Nioche via ab)
* NUTCH-769 Fetcher to skip queues for URLS getting repeated exceptions
(Julien Nioche via ab)
* NUTCH-768 - Upgrade Nutch 1.0 to use Hadoop 0.20.1, also upgrades Xerces to
version 2.9.1. (kubes)
* NUTCH-712 ParseOutputFormat should catch java.net.MalformedURLException
coming from normalizers (Julien Nioche via ab)
* NUTCH-741 Job file includes multiple copies of nutch config files
(Kirby Bohling via ab)
* NUTCH-739 SolrDeleteDuplicates too slow when using hadoop (Dmitry Lihachev via ab)
* NUTCH-738 Close SegmentUpdater when FetchedSegments is closed
(Martina Koch, Kirby Bohling via ab)
* NUTCH-746 NutchBeanConstructor does not close NutchBean upon contextDestroyed,
causing resource leak in the container. (Kirby Bohling via ab)
* NUTCH-772 Upgrade Nutch to use Lucene 2.9.1 (ab)
* NUTCH-760 Allow field mapping from Nutch to Solr index (David Stuart, ab)
* NUTCH-761 Avoid cloning CrawlDatum in CrawlDbReducer (Julien Nioche, ab)
* NUTCH-753 Prevent new Fetcher from retrieving the robots twice (Julien Nioche via ab)
* NUTCH-773 - Some minor bugs in AbstractFetchSchedule (Reinhard Schwab via ab)
* NUTCH-765 - Allow Crawl class to call either Solr or Lucene Indexer (kubes)
* NUTCH-735 - crawl-tool.xml must be read before nutch-site.xml when
invoked using crawl command (Susam Pal via dogacan)
* NUTCH-721 - Fetcher2 Slow (Julien Nioche via dogacan)
* NUTCH-702 - Lazy Instantiation of Metadata in CrawlDatum (Julien Nioche via dogacan)
* NUTCH-707 - Generation of multiple segments in multiple runs returns only 1 segment
(Michael Chen, ab)
* NUTCH-730 - NPE in LinkRank if no nodes with which to create the WebGraph
(Dennis Kubes via ab)
* NUTCH-731 - Redirection of robots.txt in RobotRulesParser (Julien Nioche via ab)
* NUTCH-757 - RequestUtils getBooleanParameter() always returns false
(Niall Pemberton via ab)
* NUTCH-754 - Use GenericOptionsParser instead of FileSystem.parseArgs() (Julien
Nioche via ab)
* NUTCH-756 - CrawlDatum.set() does not reset Metadata if it is null (Julien Nioche
via ab)
* NUTCH-679 - Fetcher2 implementing Tool (Julien Nioche via ab)
* NUTCH-758 - Set subversion eol-style to "native" (Niall Pemberton via ab)
Release 1.0 - 2009-03-23
1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
2. NUTCH-443 - Allow parsers to return multiple Parse objects.
(Dogacan Guney et al, via ab)
3. NUTCH-393 - Indexer should handle null documents returned by filters.
(Eelco Lempsink via ab)
4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
bots in robots.txt (Dogacan Guney via siren)
6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
(siren)
8. NUTCH-161 - Change Plain text parser to
use parser.character.encoding.default property for fall back encoding
(KuroSaka TeruHiko, siren)
9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
unmodified content. (ab)
10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
(cutting via ab)
11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
12. NUTCH-443 - allow parsers to return multiple Parse objects, this will speed
up the rss parser (dogacan via mattmann). This update is a fix and semantics
change from the original patch for NUTCH-443. The original patch did not tell
the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
datums. This patch addresses that issue. Now, if Fetcher gets a null content,
instead of pushing an empty content, it filters the null content.
13. NUTCH-485 - Change HtmlParseFilters to return ParseResult object instead of
Parse object. (Gal Nitzan via dogacan)
14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
some query parameters. (Emmanuel Joke via dogacan)
15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
(Ilya Vishnevsky via dogacan)
16. NUTCH-444 Possibly use a different library to parse RSS feed for improved
performance and compatibility. This patch introduced a new plugin, feed,
that includes an index filter and a parse plugin for feeds that uses ROME.
There was discussion to remove parse-rss, in light of the feed plugin,