Commit

update

tilo committed Jul 8, 2024
1 parent 766e440 commit 62983fe
Showing 5 changed files with 85 additions and 22 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -40,7 +40,7 @@ Or install it yourself as:
* [Row and Column Separators](docs/row_col_sep.md)
* [Header Transformations](docs/header_transformations.md)
* [Header Validations](docs/header_validations.md)
- * Data Transformations
+ * [Data Transformations](docs/data_transformations.md)
* [Value Converters](docs/value_converters.md)

* [Notes](docs/notes.md) <--- this info needs to be moved to individual pages
18 changes: 18 additions & 0 deletions docs/basic_api.md
@@ -120,3 +120,21 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
* the escape character is `\`, as on UNIX and Windows systems.
* quote characters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`;
  an escaped `quote_char` does not denote the end of a field.


## NOTES about File Encodings:
* if you have a CSV file which contains unicode characters, you can process it as follows:

```ruby
File.open(filename, "r:bom|utf-8") do |f|
  data = SmarterCSV.process(f)
end
```
* if the CSV file with unicode characters is in a remote location, you similarly need to pass the encoding as an option to the `URI.open` call:
```ruby
require 'open-uri'
file_location = 'http://your.remote.org/sample.csv'
URI.open(file_location, 'r:utf-8') do |f|  # use URI.open for remote files; don't forget to specify the UTF-8 encoding!
  data = SmarterCSV.process(f)
end
```
32 changes: 32 additions & 0 deletions docs/data_transformations.md
@@ -0,0 +1,32 @@
# Data Transformations

SmarterCSV automatically transforms the values in each column in order to normalize the data.
This behavior can be customized or disabled.

## Remove Empty Values
`remove_empty_values` is enabled by default.
It removes any values which are `nil` or empty strings.
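For illustration, a minimal sketch of this behavior; the file name and contents are hypothetical:

```ruby
# Hypothetical input file 'people.csv':
#   name,comment
#   Jane,
#
# The empty :comment value is removed from the result:
data = SmarterCSV.process('people.csv')
# => [{:name=>"Jane"}]

# To keep nil / empty values, disable the transformation:
data = SmarterCSV.process('people.csv', remove_empty_values: false)
```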

## Convert Values to Numeric
`convert_values_to_numeric` is enabled by default.
SmarterCSV will convert strings that represent Integers or Floats into the corresponding numeric class.
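A minimal sketch of the conversion; the file name and contents are hypothetical:

```ruby
# Hypothetical input file 'prices.csv':
#   id,price
#   42,9.99
data = SmarterCSV.process('prices.csv')
# => [{:id=>42, :price=>9.99}]   # Integer and Float, not Strings

# To keep all values as Strings, disable the conversion:
data = SmarterCSV.process('prices.csv', convert_values_to_numeric: false)
# => [{:id=>"42", :price=>"9.99"}]
```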

## Remove Zero Values
`remove_zero_values` is disabled by default.
When enabled, it removes key/value pairs which have a numeric value equal to zero.

## Remove Values Matching
`remove_values_matching` is disabled by default.
When enabled, this can help remove key/value pairs from the result hashes which would otherwise cause problems.

e.g.
* `remove_values_matching: /^\$0\.0+$/` would remove `$0.00` values
* `remove_values_matching: /^#VALUE!$/` would remove `#VALUE!` errors from Excel spreadsheets
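Since the examples above pass a single regexp, the two patterns can be combined with alternation. A sketch, with a hypothetical file name:

```ruby
options = {
  remove_values_matching: /^\$0\.0+$|^#VALUE!$/  # drop "$0.00" and "#VALUE!" cells
}
data = SmarterCSV.process('excel_export.csv', options)
```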

## Empty Hashes

It can happen that after all transformations, a row of the CSV file would produce a completely empty hash.

By default, SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.

This can be set to `false` to keep the empty hashes in the results.
37 changes: 33 additions & 4 deletions docs/header_transformations.md
@@ -4,35 +4,37 @@ By default SmarterCSV assumes that a CSV file has headers, and it automatically

## Header Normalization

- When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`.
+ When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores, e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)

## Duplicate Headers

There can be a lot of variation in CSV files. It is possible that a CSV file contains multiple headers with the same name.

By default SmarterCSV handles duplicate headers by appending numbers 2..n to them.

Consider this example:

```
$ cat > /tmp/dupe.csv
name,name,name
Carl,Edward,Sagan
```

- when parsing these duplicate headers, it will return:
+ When parsing these duplicate headers, SmarterCSV will return:

```
data = SmarterCSV.process('/tmp/dupe.csv')
=> [{:name=>"Carl", :name2=>"Edward", :name3=>"Sagan"}]
```

- If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: ' '`.
+ If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: '_'`.

```
data = SmarterCSV.process('/tmp/dupe.csv', {duplicate_header_suffix: '_'})
=> [{:name=>"Carl", :name_2=>"Edward", :name_3=>"Sagan"}]
```

- To further disambiguate the headers, you can further use `key_mapping` to assign meaningful names, e.g.
+ To further disambiguate the headers, you can use `key_mapping` to assign meaningful names. Please note that the mapping uses the already transformed keys `name_2`, `name_3` as input.

```
options = {
  duplicate_header_suffix: '_',
  key_mapping: {
    name:   :first_name,
    name_2: :middle_name,
    name_3: :last_name
  }
}
data = SmarterCSV.process('/tmp/dupe.csv', options)
 => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
```

## Key Mapping

The above example already illustrates how intermediate keys can be mapped to something different.
This transforms some of the keys in the input, but all other keys are still present in the result.

There is an additional option `remove_unmapped_keys` which can be enabled to produce only the mapped keys in the resulting hashes, dropping all other columns.


### NOTES on Key Mapping:
* keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
* if you want to completely delete a key, map it to `nil` or to `''`; it will then be automatically deleted from any result Hash
* if you have input files with a large number of columns, and you want to ignore all columns which are not explicitly mapped with `:key_mapping`, use the option `remove_unmapped_keys: true` (see the sketch below)
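A minimal sketch of key mapping with unmapped-key removal; the file name and column names are hypothetical:

```ruby
options = {
  key_mapping: {
    first_name:  :first,  # rename :first_name to :first
    last_name:   :last,   # rename :last_name to :last
    middle_name: nil      # mapping to nil deletes :middle_name entirely
  },
  remove_unmapped_keys: true  # drop all columns not listed above
}
data = SmarterCSV.process('contacts.csv', options)
# => e.g. [{:first=>"Carl", :last=>"Sagan"}]
```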

## CSV Files without Headers

If you have CSV files without headers, it is important to set `headers_in_file: false`, otherwise you'll lose the first data line in your file.
@@ -64,3 +78,18 @@ For CSV files with headers, you can either:
* completely replace the headers using `user_provided_headers` (please be careful with this powerful option, as it is not robust against changes in the input format); see the sketch below.
* use the original unmodified headers from the CSV file, using `keep_original_headers`. This results in hash keys that are strings, and they may be padded with spaces.
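A minimal sketch for a header-less file; the file name, headers, and data are hypothetical:

```ruby
options = {
  headers_in_file: false,  # the file contains no header row
  user_provided_headers: [:first_name, :last_name, :profession]
}
data = SmarterCSV.process('people_no_header.csv', options)
# a data line such as "Carl,Sagan,astronomer" becomes:
# => [{:first_name=>"Carl", :last_name=>"Sagan", :profession=>"astronomer"}]
```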


# Notes

### NOTES about CSV Headers:
* as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
* the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
* any occurrences of `:comment_regexp` or `:row_sep` will be stripped from the first line with the CSV header
* any of the keys in the header line will be downcased, have spaces replaced by underscores, and be converted to Ruby symbols before being used as keys in the returned Hashes
* you cannot combine the `:user_provided_headers` and `:key_mapping` options
* if an incorrect number of headers is provided via `:user_provided_headers`, the exception `SmarterCSV::HeaderSizeMismatch` is raised

### NOTES on improper quotation and unwanted characters in headers:
* some CSV files use unescaped quotation characters inside fields. This can break the import. To work around it, use the option `force_simple_split: true` in combination with `strip_chars_from_headers: /[\-"]/`. This will also significantly speed up the import.
  If you instead forced a different `:quote_char` (setting it to an unused character), the import would be up to 5 times slower than using `:force_simple_split`.
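A sketch of this workaround; the file name is hypothetical:

```ruby
options = {
  force_simple_split: true,           # split rows on col_sep only, ignoring quote characters
  strip_chars_from_headers: /[\-"]/   # remove stray dashes and quotes from the headers
}
data = SmarterCSV.process('improperly_quoted.csv', options)
```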

18 changes: 1 addition & 17 deletions docs/notes.md
@@ -1,31 +1,15 @@

# Notes

## NOTES about CSV Headers:
* as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
* the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`.
  This is no longer handled automatically since 1.5.0.
* any occurrences of `:comment_regexp` or `:row_sep` will be stripped from the first line with the CSV header
* any of the keys in the header line will be downcased, have spaces replaced by underscores, and be converted to Ruby symbols before being used as keys in the returned Hashes
* you cannot combine the `:user_provided_headers` and `:key_mapping` options
* if an incorrect number of headers is provided via `:user_provided_headers`, the exception `SmarterCSV::HeaderSizeMismatch` is raised


## NOTES on Key Mapping:
* keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
* if you want to completely delete a key, map it to `nil` or to `''`; it will then be automatically deleted from any result Hash
* if you have input files with a large number of columns, and you want to ignore all columns which are not explicitly mapped with `:key_mapping`, use the option `remove_unmapped_keys: true`

## NOTES on the use of Chunking and Blocks:
* chunking can be very useful for large files, in combination with passing a block to `SmarterCSV.process`
* if you pass a block to `SmarterCSV.process`, that block will be executed and given an Array of Hashes as its parameter
* if the `chunk_size` is not set, then the array will only contain one Hash
* if the `chunk_size` is > 0, then the array may contain up to `chunk_size` Hashes
* this can be very useful when passing chunked data to a post-processing step, e.g. through Resque or Sidekiq (see the sketch below)
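A sketch of chunked processing; the worker class is made up for illustration:

```ruby
# Process a large file in chunks of up to 100 rows:
SmarterCSV.process('large_file.csv', chunk_size: 100) do |chunk|
  # chunk is an Array of up to 100 Hashes
  ImportWorker.perform_async(chunk)  # hypothetical Resque/Sidekiq worker
end
```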

## NOTES on improper quotation and unwanted characters in headers:
* some CSV files use unescaped quotation characters inside fields. This can break the import. To work around it, use the option `force_simple_split: true` in combination with `strip_chars_from_headers: /[\-"]/`. This will also significantly speed up the import.
  If you instead forced a different `:quote_char` (setting it to an unused character), the import would be up to 5 times slower than using `:force_simple_split`.

## NOTES about File Encodings:
* if you have a CSV file which contains unicode characters, you can process it as follows:
