- I've fixed the CRAN warning "format specifies type 'int' but the argument has type 'long long'" in the following files & lines by replacing the `%3d` expression with `%3lld`:
  - ./token_big_files.h:862:60
  - ./term_matrix.h:456:75 and 647:75
  - word_vecs_pointer_embedding.cpp:333:67 and 240:68
- I removed the "CXX_STD = CXX11" from the "Makevars" files, and the "[[Rcpp::plugins(cpp11)]]" from the ".cpp" files due to the following NOTE from CRAN, "NOTE Specified C++11: please drop specification unless essential" (see also: https://www.tidyverse.org/blog/2023/03/cran-checks-compiled-code/#note-regarding-systemrequirements-c11)
- I exported the batch_calculation() Rcpp function and created the batch_compute() R function
- I removed the `-mthreads` compilation option from the "Makevars.win" file
- I've included a function to omit a test for the Solaris OS during checking because I cannot reproduce the error with the rhub Solaris patched image
- I've included a function to omit a test for the Solaris OS during checking
- I modified the functionality_of_textTinyR_package.Rmd vignette in lines 732-746 based on a notification of the knitr package maintainer. I mistakenly had 4 chunk delimiters rather than 3.
- I modified the inner_cm() function to return a correlation of 0.0 when the output is NA or +/- Inf
- I've added the CITATION file in the inst directory
- Exception that applies to the tokenize_transform_text() and tokenize_transform_vec_docs() functions on all operating systems (Linux, Macintosh, Windows) when parallelization ( OpenMP ) is combined with writing data to a folder or file ( 'path_2folder' or 'vocabulary_path_file' ). The Rcpp functions behind 'tokenize_transform_text()' and 'tokenize_transform_vec_docs()' both have an OpenMP critical clause, which ensures that data appended to a variable are protected ( only one thread at a time enters the section ). See lines 258 and 312 of the 'export_all_funcs.cpp' file. However, parallelization must not be used when 'path_2folder' or 'vocabulary_path_file' is not equal to "" (empty string). Because writing to the file takes place internally, I cannot enclose the 'save' functions in an OpenMP critical clause. Therefore, whenever output is saved to a file, I set the number of threads to 1 and print a warning so that the user knows that parallelization is disabled [ see issue : '#8' ]
- Stop the execution of the tokenize_transform_text() and tokenize_transform_vec_docs() functions whenever the user specifies the path_2folder parameter (valid path to a folder) and the 'output_token_single_file.txt' file already exists ( otherwise new data will be appended at the end of the file ) [ see issue : '#8' ]
- I added a note in the vignettes about the new version of the fastText R package ( the old version is archived )
- I commented out 14 tests in the test-utf_locale.R file for the Debian distribution (Linux) due to an error specific to 'Latin-1 locale'. See also my comments in the helper-function_for_tests.R file.
- I modified the porter2_stemmer.cpp file, in particular the Porter2Stemmer::stem() function, because the earlier modification was incorrect
- I attempted to fix the clang-UBSAN error; however, it is not reproducible with the latest install of clang==6.0, llvm==6.0 and the CRAN configuration of ASAN, UBSAN. I had to comment out this particular test case ( test-tokenization_transformation.R, lines 938-963 ). The clang-UBSAN error was the following :
test-tokenization_transformation.R : test id 329
/usr/local/bin/../include/c++/v1/string:2992:30: runtime error: addition of unsigned offset to 0x62500f06b1b9 overflowed to 0x62500f06b1b8 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/local/bin/../include/c++/v1/string:2992:30 in
- I removed the -lboost_system flag from the Makevars file
- I modified the text_file_parser function (the documentation and examples too) due to an error
- boost-locale is no longer a system requirement for the textTinyR package
- The text_file_parser function now accepts either a valid path to a file or a vector of character strings. Besides writing to a file, it can also return a vector of character strings. Furthermore, the start_query and end_query parameters can take more than one query term as input.
- I added utility functions for Word Vector Representations (i.e. GloVe, fasttext), frequently referred to as doc2vec, and functions for the (pairwise) calculation of text document dissimilarities.
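
The two issue '#8' entries above (single-threaded execution whenever output is written to disk, and stopping when 'output_token_single_file.txt' already exists) can be illustrated with a minimal sketch. This is not the package's code: `effective_threads()` and `stop_if_output_exists()` are hypothetical helper names, the warning is printed with `std::cerr` rather than through R, and the folder path is assumed to end with a path separator.

```cpp
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of textTinyR): choose the number of OpenMP
// threads. Writing to disk is not protected by the critical clause, so any
// non-empty output path forces single-threaded execution.
int effective_threads(int requested,
                      const std::string& path_2folder,
                      const std::string& vocabulary_path_file) {
    const bool writes_to_disk =
        !path_2folder.empty() || !vocabulary_path_file.empty();
    if (writes_to_disk && requested > 1) {
        // Assumption: an Rcpp function would raise an R warning here instead.
        std::cerr << "warning: output is written to disk, "
                     "parallelization is disabled\n";
        return 1;
    }
    return requested < 1 ? 1 : requested;
}

// Hypothetical helper: refuse to run when the single output file already
// exists, because new tokens would otherwise be appended to the old ones.
// Assumes 'path_2folder' ends with a path separator.
void stop_if_output_exists(const std::string& path_2folder) {
    if (path_2folder.empty()) return;  // nothing is written to disk
    const std::string target = path_2folder + "output_token_single_file.txt";
    std::ifstream probe(target);
    if (probe.good()) {
        throw std::runtime_error("'" + target +
                                 "' already exists; remove it or choose another folder");
    }
}
```

A caller would run `stop_if_output_exists(path_2folder)` first and then pass the result of `effective_threads(...)` to the parallel region, so that the critical clause only has to protect in-memory appends.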
I added the global_term_weights() method in the sparse_term_matrix R6 class
I removed the threads parameter from the term_associations method of the sparse_term_matrix R6-class. I modified the OpenMP clauses of the .cpp files to address the ASAN errors.
I added the triplet_data() method in the sparse_term_matrix R6 class
I removed the ngram_sequential and ngram_overlap stemmers from the vocabulary_parser function. I fixed a bug in the char_n_grams of the token_stats.h source file.
I removed the ngram_sequential and ngram_overlap stemmers from the sparse_term_matrix and tokenize_transform_vec_docs functions. I overlooked the fact that the n-gram stemming is based on the whole corpus and not on each vector of the document(s), which is how the sparse_term_matrix and tokenize_transform_vec_docs functions process their input. I added a zzz.R file with a packageStartupMessage to inform the users about the previous change in n-gram stemming. I also updated the package documentation and Vignette. I modified the secondary_n_grams of the tokenization.h source file due to a bug. I've used the enc2utf8 function to encode (utf-8) the terms of the sparse matrix.
I modified the res_token_vector(), res_token_list() [ export_all_funcs.cpp file ] and append_2file() [ tokenization.h file ] functions, because the tokenize_transform_vec_docs() function returned an incorrect output when the path_2folder parameter was not the empty string.
I corrected the UBSAN-memory errors, which occurred in the adj_Sparsity() function of the term_matrix.h header file (the errors happen when passing empty vectors to the armadillo batch_insertion() function)
I included detailed installation instructions for the Macintosh OSx
I modified the source code to correct the boost-locale errors, which occurred during testing on Macintosh OSx
I added the following system-flag in the Makevars.in file to avoid linking errors for the Mac OS: -lboost_system
I modified the term_associations and Term_Matrix_Adjust methods to avoid indexing errors
I corrected mistakes in the Vignette
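
The kind of guard described in the adj_Sparsity() entry above (do not hand empty vectors to a batch-insertion step) can be sketched in plain C++. This is only an analogy under stated assumptions: it does not use Armadillo, `TripletDims` and `triplet_dims()` are hypothetical names, and the undefined behaviour being avoided here is the dereference of `std::max_element` on an empty range rather than the exact error reported by UBSAN.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical summary of sparse-matrix triplets; it stands in for the input
// of a batch-insertion constructor and is not the package's data structure.
struct TripletDims {
    std::size_t n_rows;
    std::size_t n_cols;
    std::size_t n_nonzero;
};

TripletDims triplet_dims(const std::vector<std::size_t>& rows,
                         const std::vector<std::size_t>& cols,
                         const std::vector<double>& values) {
    // Guard: std::max_element on an empty range returns end(), and
    // dereferencing it is undefined behaviour, so return an empty result early.
    if (rows.empty() || cols.empty() || values.empty()) {
        return TripletDims{0, 0, 0};
    }
    // Zero-based indices: the dimension is the largest index plus one.
    const std::size_t n_rows = *std::max_element(rows.begin(), rows.end()) + 1;
    const std::size_t n_cols = *std::max_element(cols.begin(), cols.end()) + 1;
    return TripletDims{n_rows, n_cols, values.size()};
}
```

The early return keeps the empty case explicit, so downstream code can decide whether to build an empty matrix or skip the step altogether.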