diff --git a/README.md b/README.md
index 7efe182..1a21d78 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,7 @@ Or install it yourself as:
  * [Row and Column Separators](docs/row_col_sep.md)
  * [Header Transformations](docs/header_transformations.md)
  * [Header Validations](docs/header_validations.md)
- * Data Transformations
+ * [Data Transformations](docs/data_transformations.md)
  * [Value Converters](docs/value_converters.md)
  * [Notes](docs/notes.md)  <--- this info needs to be moved to individual pages
diff --git a/docs/basic_api.md b/docs/basic_api.md
index bac83af..d2aa7da 100644
--- a/docs/basic_api.md
+++ b/docs/basic_api.md
@@ -120,3 +120,21 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
  * the escape character is `\`, as on UNIX and Windows systems.
  * quote charcters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`
    e.g. an escaped `quote_char` does not denote the end of a field.
+
+
+## NOTES about File Encodings:
+ * if you have a CSV file which contains unicode characters, you can process it as follows:
+
+```ruby
+  File.open(filename, "r:bom|utf-8") do |f|
+    data = SmarterCSV.process(f)
+  end
+```
+* if the CSV file with unicode characters is in a remote location, you similarly need to pass the encoding as an option to the `open` call:
+```ruby
+  require 'open-uri'
+  file_location = 'http://your.remote.org/sample.csv'
+  open(file_location, 'r:utf-8') do |f|  # don't forget to specify the UTF-8 encoding!!
+    data = SmarterCSV.process(f)
+  end
+```
diff --git a/docs/data_transformations.md b/docs/data_transformations.md
new file mode 100644
index 0000000..33279a7
--- /dev/null
+++ b/docs/data_transformations.md
@@ -0,0 +1,32 @@
+# Data Transformations
+
+SmarterCSV automatically transforms the values in each column in order to normalize the data.
+This behavior can be customized or disabled.
+
+## Remove Empty Values
+`remove_empty_values` is enabled by default.
+It removes any values which are `nil` or would be empty strings.
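The `remove_empty_values` behavior described in the new data-transformations page can be sketched in plain Ruby. The helper name below is hypothetical and only mimics the documented effect; it is not SmarterCSV's internal implementation:

```ruby
# Hypothetical sketch (not SmarterCSV's actual code): key/value pairs
# whose value is nil, empty, or whitespace-only are dropped from a row hash.
def remove_empty_values(row)
  row.reject { |_key, value| value.nil? || value.to_s.strip.empty? }
end

row = { name: "Carl", middle_name: "", nickname: nil, age: "42" }
remove_empty_values(row)
# => {:name=>"Carl", :age=>"42"}
```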
+
+## Convert Values to Numeric
+`convert_values_to_numeric` is enabled by default.
+SmarterCSV will convert strings containing Integers or Floats to the appropriate class.
+
+## Remove Zero Values
+`remove_zero_values` is disabled by default.
+When enabled, it removes key/value pairs which have a numeric value equal to zero.
+
+## Remove Values Matching
+`remove_values_matching` is disabled by default.
+When enabled, this can help remove key/value pairs from the result hashes which would cause problems.
+
+e.g.
+ * `remove_values_matching: /^\$0\.0+$/` would remove $0.00
+ * `remove_values_matching: /^#VALUE!$/` would remove errors from Excel spreadsheets
+
+## Empty Hashes
+
+It can happen that after all transformations, a row of the CSV file would produce a completely empty hash.
+
+By default SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.
+
+This can be set to `false` to keep these empty hashes in the results.
diff --git a/docs/header_transformations.md b/docs/header_transformations.md
index 8feef42..0bcdb46 100644
--- a/docs/header_transformations.md
+++ b/docs/header_transformations.md
@@ -4,7 +4,7 @@ By default SmarterCSV assumes that a CSV file has headers, and it automatically

 ## Header Normalization

-When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`.
+When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)

 ## Duplicate Headers

@@ -12,27 +12,29 @@ There can be a lot of variation in CSV files. It is possible that a CSV file con

 By default SmarterCSV handles duplicate headers by appending numbers 2..n to them.
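The 2..n suffixing can be sketched in plain Ruby. The helper below is hypothetical and only illustrates the documented behavior, including the optional `duplicate_header_suffix` joiner; it is not SmarterCSV's internal code:

```ruby
# Hypothetical sketch of duplicate-header handling: the first occurrence
# keeps its name, later occurrences get a 2..n counter appended,
# optionally joined with a suffix character such as '_'.
def disambiguate_headers(headers, suffix: '')
  seen = Hash.new(0)
  headers.map do |h|
    seen[h] += 1
    seen[h] == 1 ? h.to_sym : :"#{h}#{suffix}#{seen[h]}"
  end
end

disambiguate_headers(%w[name name name])               # => [:name, :name2, :name3]
disambiguate_headers(%w[name name name], suffix: '_')  # => [:name, :name_2, :name_3]
```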
+Consider this example:
+
 ```
 $ cat > /tmp/dupe.csv
 name,name,name
 Carl,Edward,Sagan
 ```

-when parsing these duplicate headers, it will return:
+When parsing these duplicate headers, SmarterCSV will return:

 ```
 data = SmarterCSV.process('/tmp/dupe.csv')
  => [{:name=>"Carl", :name2=>"Edward", :name3=>"Sagan"}]
 ```

-If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: ' '`.
+If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: '_'`.

 ```
 data = SmarterCSV.process('/tmp/dupe.csv', {duplicate_header_suffix: '_'})
  => [{:name=>"Carl", :name_2=>"Edward", :name_3=>"Sagan"}]
 ```

- To further disambiguate the headers, you can further use `key_mapping` to assign meaningful names, e.g.
+ To further disambiguate the headers, you can use `key_mapping` to assign meaningful names. Please note that the mapping uses the already transformed keys `name_2`, `name_3` as input.

 ```
 options = {
@@ -47,7 +49,19 @@ If you want to have an underscore between the header and the number, you can set
   => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
 ```

+## Key Mapping
+
+The above example already illustrates how intermediate keys can be mapped to something different.
+This transforms some of the keys in the input, but other keys are still present.
+
+There is an additional option `remove_unmapped_keys` which can be enabled to produce only the mapped keys in the resulting hashes, dropping all other columns.
+
+### NOTES on Key Mapping:
+ * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
+ * if you want to completely delete a key, map it to nil or to '', and it will be automatically deleted from any result Hash
+ * if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use the option :remove_unmapped_keys => true
+
 ## CSV Files without Headers

 If you have CSV files without headers, it is important to set `headers_in_file: false`, otherwise you'll lose the first data line in your file.
@@ -64,3 +78,18 @@ For CSV files with headers, you can either:

 * completely replace the headers using `user_provided_headers` (please be careful with this powerful option, as it is not robust against changes in input format).
 * use the original unmodified headers from the CSV file, using `keep_original_headers`. This results in hash keys that are strings, and may be padded with spaces.
+
+# Notes
+
+### NOTES about CSV Headers:
+ * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
+ * the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
+ * any occurrences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
+ * any of the keys in the header line will be downcased, spaces replaced by underscores, and converted to Ruby symbols before being used as keys in the returned Hashes
+ * you cannot combine the :user_provided_headers and :key_mapping options
+ * if an incorrect number of headers is provided via :user_provided_headers, the exception SmarterCSV::HeaderSizeMismatch is raised
+
+### NOTES on improper quotation and unwanted characters in headers:
+ * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/`.
+   This will also significantly speed up the import.
+   If you would force a different :quote_char instead (setting it to an unused character), then the import would be up to 5 times slower than using `:force_simple_split`.
diff --git a/docs/notes.md b/docs/notes.md
index ee1ea92..798624d 100644
--- a/docs/notes.md
+++ b/docs/notes.md
@@ -1,31 +1,15 @@
 # Notes

-## NOTES about CSV Headers:
- * as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
- * the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
-   This is no longer handled automatically since 1.5.0.
- * any occurences of :comment_regexp or :row_sep will be stripped from the first line with the CSV header
- * any of the keys in the header line will be downcased, spaces replaced by underscore, and converted to Ruby symbols before being used as keys in the returned Hashes
- * you can not combine the :user_provided_headers and :key_mapping options
- * if the incorrect number of headers are provided via :user_provided_headers, exception SmarterCSV::HeaderSizeMismatch is raised
-
-## NOTES on Key Mapping:
- * keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
- * if you want to completely delete a key, then map it to nil or to '', they will be automatically deleted from any result Hash
- * if you have input files with a large number of columns, and you want to ignore all columns which are not specifically mapped with :key_mapping, then use option :remove_unmapped_keys => true
-
 ## NOTES on the use of Chunking and Blocks:
  * chunking can be VERY USEFUL if used in combination with passing a block to File.read_csv FOR LARGE FILES
  * if you pass a block to File.read_csv, that block will be executed and given an Array of Hashes as the parameter.
  * if the chunk_size is not set, then the array will only contain one Hash.
  * if the chunk_size is > 0 , then the array may contain up to chunk_size Hashes.
- * this can be very useful when passing chunked data to a post-processing step, e.g. through Resque
-
-## NOTES on improper quotation and unwanted characters in headers:
- * some CSV files use un-escaped quotation characters inside fields. This can cause the import to break. To get around this, use the `:force_simple_split => true` option in combination with `:strip_chars_from_headers => /[\-"]/` . This will also significantly speed up the import.
-   If you would force a different :quote_char instead (setting it to a non-used character), then the import would be up to 5-times slower than using `:force_simple_split`.
+ * this can be very useful when passing chunked data to a post-processing step, e.g. through Sidekiq

 ## NOTES about File Encodings:
  * if you have a CSV file which contains unicode characters, you can process it as follows:
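The chunking notes retained above can be illustrated with plain Ruby: `Enumerable#each_slice` approximates how rows are grouped into arrays of up to `chunk_size` hashes before each array is yielded to the block. This is an illustrative sketch with made-up row data, not SmarterCSV's implementation:

```ruby
# Simulate chunk_size: 2 — each chunk is an Array of up to 2 row hashes,
# which could then be handed to a post-processing step (e.g. a Sidekiq job).
rows = [
  { name: "Carl" }, { name: "Edward" }, { name: "Sagan" },
  { name: "Ada" },  { name: "Grace" }
]

chunks = rows.each_slice(2).to_a
chunks.map(&:size)
# => [2, 2, 1]
```

Note how the final chunk may be smaller than `chunk_size` when the row count is not an exact multiple.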