Commit

update

tilo committed Jul 8, 2024
1 parent 766e440 commit 62983fe
Showing 5 changed files with 85 additions and 22 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -40,7 +40,7 @@ Or install it yourself as:
* [Row and Column Separators](docs/row_col_sep.md)
* [Header Transformations](docs/header_transformations.md)
* [Header Validations](docs/header_validations.md)
- * Data Transformations
+ * [Data Transformations](docs/data_transformations.md)
* [Value Converters](docs/value_converters.md)

* [Notes](docs/notes.md) <--- this info needs to be moved to individual pages
18 changes: 18 additions & 0 deletions docs/basic_api.md
@@ -120,3 +120,21 @@ $ hexdump -C spec/fixtures/bom_test_feff.csv
* the escape character is `\`, as on UNIX and Windows systems.
* quote characters around fields are balanced, e.g. valid: `"field"`, invalid: `"field\"`;
  an escaped `quote_char` does not denote the end of a field.


## NOTES about File Encodings:
* if you have a CSV file which contains unicode characters, you can process it as follows:

```ruby
File.open(filename, "r:bom|utf-8") do |f|
  data = SmarterCSV.process(f)
end
```
* if the CSV file with unicode characters is in a remote location, you similarly need to pass the encoding as an option to the `URI.open` call:
```ruby
require 'open-uri'
file_location = 'http://your.remote.org/sample.csv'
URI.open(file_location, 'r:utf-8') do |f|  # use URI.open for remote files; don't forget to specify the UTF-8 encoding!
  data = SmarterCSV.process(f)
end
```
32 changes: 32 additions & 0 deletions docs/data_transformations.md
@@ -0,0 +1,32 @@
# Data Transformations

SmarterCSV automatically transforms the values in each column in order to normalize the data.
This behavior can be customized or disabled.

## Remove Empty Values
`remove_empty_values` is enabled by default.
It removes any values which are `nil` or empty strings.
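For illustration, a minimal sketch of this behavior; the file name and contents are hypothetical:

```ruby
# Hypothetical input file 'people.csv':
#   name,comment
#   Jane,
#
# The empty :comment value is removed from the result:
data = SmarterCSV.process('people.csv')
# => [{:name=>"Jane"}]

# To keep nil / empty values, disable the transformation:
data = SmarterCSV.process('people.csv', remove_empty_values: false)
```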

## Convert Values to Numeric
`convert_values_to_numeric` is enabled by default.
SmarterCSV will convert strings that represent Integers or Floats into the corresponding numeric class.
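A minimal sketch of the conversion; the file name and contents are hypothetical:

```ruby
# Hypothetical input file 'prices.csv':
#   id,price
#   42,9.99
data = SmarterCSV.process('prices.csv')
# => [{:id=>42, :price=>9.99}]   # Integer and Float, not Strings

# To keep all values as Strings, disable the conversion:
data = SmarterCSV.process('prices.csv', convert_values_to_numeric: false)
# => [{:id=>"42", :price=>"9.99"}]
```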

## Remove Zero Values
`remove_zero_values` is disabled by default.
When enabled, it removes key/value pairs which have a numeric value equal to zero.

## Remove Values Matching
`remove_values_matching` is disabled by default.
When enabled, this can help remove key/value pairs from the result hashes which would otherwise cause problems.

e.g.
* `remove_values_matching: /^\$0\.0+$/` would remove `$0.00` values
* `remove_values_matching: /^#VALUE!$/` would remove `#VALUE!` errors from Excel spreadsheets
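Since the examples above pass a single regexp, the two patterns can be combined with alternation. A sketch, with a hypothetical file name:

```ruby
options = {
  remove_values_matching: /^\$0\.0+$|^#VALUE!$/  # drop "$0.00" and "#VALUE!" cells
}
data = SmarterCSV.process('excel_export.csv', options)
```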

## Empty Hashes

It can happen that after all transformations, a row of the CSV file would produce a completely empty hash.

By default, SmarterCSV uses `remove_empty_hashes: true` to remove these empty hashes from the result.

This can be set to `false` to keep the empty hashes in the results.
37 changes: 33 additions & 4 deletions docs/header_transformations.md
@@ -4,35 +4,37 @@ By default SmarterCSV assumes that a CSV file has headers, and it automatically

## Header Normalization

- When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores. e.g. " \t Annual Sales " becomes `:annual_sales`.
+ When processing the headers, it transforms them into Ruby symbols, stripping extra spaces, lower-casing them and replacing spaces with underscores, e.g. " \t Annual Sales " becomes `:annual_sales`. (see Notes below)

## Duplicate Headers

There can be a lot of variation in CSV files. It is possible that a CSV file contains multiple headers with the same name.

By default SmarterCSV handles duplicate headers by appending numbers 2..n to them.

Consider this example:

```
$ cat > /tmp/dupe.csv
name,name,name
Carl,Edward,Sagan
```

- when parsing these duplicate headers, it will return:
+ When parsing these duplicate headers, SmarterCSV will return:

```
data = SmarterCSV.process('/tmp/dupe.csv')
=> [{:name=>"Carl", :name2=>"Edward", :name3=>"Sagan"}]
```

- If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: ' '`.
+ If you want to have an underscore between the header and the number, you can set `duplicate_header_suffix: '_'`.

```
data = SmarterCSV.process('/tmp/dupe.csv', {duplicate_header_suffix: '_'})
=> [{:name=>"Carl", :name_2=>"Edward", :name_3=>"Sagan"}]
```

- To further disambiguate the headers, you can further use `key_mapping` to assign meaningful names, e.g.
+ To further disambiguate the headers, you can use `key_mapping` to assign meaningful names. Please note that the mapping uses the already transformed keys `name_2`, `name_3` as input.

```
options = {
  duplicate_header_suffix: '_',
  key_mapping: {
    name:   :first_name,
    name_2: :middle_name,
    name_3: :last_name
  }
}
data = SmarterCSV.process('/tmp/dupe.csv', options)
 => [{:first_name=>"Carl", :middle_name=>"Edward", :last_name=>"Sagan"}]
```

## Key Mapping

The above example already illustrates how intermediate keys can be mapped to something different.
This transforms some of the keys in the input, but all other keys are still present in the result.

There is an additional option `remove_unmapped_keys` which can be enabled to produce only the mapped keys in the resulting hashes, dropping all other columns.


### NOTES on Key Mapping:
* keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
* if you want to completely delete a key, map it to `nil` or to `''`; it will then be automatically deleted from any result Hash
* if you have input files with a large number of columns, and you want to ignore all columns which are not explicitly mapped with `:key_mapping`, use the option `remove_unmapped_keys: true` (see the sketch below)
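A minimal sketch of key mapping with unmapped-key removal; the file name and column names are hypothetical:

```ruby
options = {
  key_mapping: {
    first_name:  :first,  # rename :first_name to :first
    last_name:   :last,   # rename :last_name to :last
    middle_name: nil      # mapping to nil deletes :middle_name entirely
  },
  remove_unmapped_keys: true  # drop all columns not listed above
}
data = SmarterCSV.process('contacts.csv', options)
# => e.g. [{:first=>"Carl", :last=>"Sagan"}]
```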

## CSV Files without Headers

If you have CSV files without headers, it is important to set `headers_in_file: false`, otherwise you'll lose the first data line in your file.
@@ -64,3 +78,18 @@ For CSV files with headers, you can either:
* completely replace the headers using `user_provided_headers` (please be careful with this powerful option, as it is not robust against changes in the input format); see the sketch below.
* use the original unmodified headers from the CSV file, using `keep_original_headers`. This results in hash keys that are strings, and they may be padded with spaces.
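A minimal sketch for a header-less file; the file name, headers, and data are hypothetical:

```ruby
options = {
  headers_in_file: false,  # the file contains no header row
  user_provided_headers: [:first_name, :last_name, :profession]
}
data = SmarterCSV.process('people_no_header.csv', options)
# a data line such as "Carl,Sagan,astronomer" becomes:
# => [{:first_name=>"Carl", :last_name=>"Sagan", :profession=>"astronomer"}]
```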


# Notes

### NOTES about CSV Headers:
* as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
* the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`
* any occurrences of `:comment_regexp` or `:row_sep` will be stripped from the first line with the CSV header
* any of the keys in the header line will be downcased, have spaces replaced by underscores, and be converted to Ruby symbols before being used as keys in the returned Hashes
* you cannot combine the `:user_provided_headers` and `:key_mapping` options
* if an incorrect number of headers is provided via `:user_provided_headers`, the exception `SmarterCSV::HeaderSizeMismatch` is raised

### NOTES on improper quotation and unwanted characters in headers:
* some CSV files use unescaped quotation characters inside fields. This can break the import. To work around it, use the option `force_simple_split: true` in combination with `strip_chars_from_headers: /[\-"]/`. This will also significantly speed up the import.
  If you instead forced a different `:quote_char` (setting it to an unused character), the import would be up to 5 times slower than using `:force_simple_split`.
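A sketch of this workaround; the file name is hypothetical:

```ruby
options = {
  force_simple_split: true,           # split rows on col_sep only, ignoring quote characters
  strip_chars_from_headers: /[\-"]/   # remove stray dashes and quotes from the headers
}
data = SmarterCSV.process('improperly_quoted.csv', options)
```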

18 changes: 1 addition & 17 deletions docs/notes.md
@@ -1,31 +1,15 @@

# Notes

## NOTES about CSV Headers:
* as this method parses CSV files, it is assumed that the first line of any file will contain a valid header
* the first line with the header might be commented out, in which case you will need to set `comment_regexp: /\A#/`.
  This is no longer handled automatically since 1.5.0.
* any occurrences of `:comment_regexp` or `:row_sep` will be stripped from the first line with the CSV header
* any of the keys in the header line will be downcased, have spaces replaced by underscores, and be converted to Ruby symbols before being used as keys in the returned Hashes
* you cannot combine the `:user_provided_headers` and `:key_mapping` options
* if an incorrect number of headers is provided via `:user_provided_headers`, the exception `SmarterCSV::HeaderSizeMismatch` is raised


## NOTES on Key Mapping:
* keys in the header line of the file can be re-mapped to a chosen set of symbols, so the resulting Hashes can be better used internally in your application (e.g. when directly creating MongoDB entries with them)
* if you want to completely delete a key, map it to `nil` or to `''`; it will then be automatically deleted from any result Hash
* if you have input files with a large number of columns, and you want to ignore all columns which are not explicitly mapped with `:key_mapping`, use the option `remove_unmapped_keys: true`

## NOTES on the use of Chunking and Blocks:
* chunking can be very useful for large files, in combination with passing a block to `SmarterCSV.process`
* if you pass a block to `SmarterCSV.process`, that block will be executed and given an Array of Hashes as its parameter
* if the `chunk_size` is not set, then the array will only contain one Hash
* if the `chunk_size` is > 0, then the array may contain up to `chunk_size` Hashes
* this can be very useful when passing chunked data to a post-processing step, e.g. through Resque or Sidekiq (see the sketch below)
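A sketch of chunked processing; the worker class is made up for illustration:

```ruby
# Process a large file in chunks of up to 100 rows:
SmarterCSV.process('large_file.csv', chunk_size: 100) do |chunk|
  # chunk is an Array of up to 100 Hashes
  ImportWorker.perform_async(chunk)  # hypothetical Resque/Sidekiq worker
end
```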

## NOTES on improper quotation and unwanted characters in headers:
* some CSV files use unescaped quotation characters inside fields. This can break the import. To work around it, use the option `force_simple_split: true` in combination with `strip_chars_from_headers: /[\-"]/`. This will also significantly speed up the import.
  If you instead forced a different `:quote_char` (setting it to an unused character), the import would be up to 5 times slower than using `:force_simple_split`.

## NOTES about File Encodings:
* if you have a CSV file which contains unicode characters, you can process it as follows:
