### Examples

Here are some examples to demonstrate the versatility of SmarterCSV.

**It is generally recommended to rescue `SmarterCSV::Error` or its sub-classes.**

By default, SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception is raised, e.g. `NoColSepDetected`. Rescuing from these exceptions ensures that you don't miss processing CSV files when users upload files with unexpected formats.

In rare cases you may have to set these values manually, after going through the troubleshooting procedure described above.
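
For instance, a minimal sketch of such a rescue (the file path, the fallback separator, and the exact exception class name `SmarterCSV::NoColSepDetected` are assumptions for illustration):

```ruby
require 'smarter_csv'

begin
  data = SmarterCSV.process('/tmp/upload.csv')
rescue SmarterCSV::NoColSepDetected
  # automatic detection failed; retry with an explicitly set column separator
  data = SmarterCSV.process('/tmp/upload.csv', {:col_sep => ';'})
rescue SmarterCSV::Error => e
  # catch-all for other SmarterCSV errors
  warn "CSV import failed: #{e.message}"
end
```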

#### Example 1a: How SmarterCSV processes CSV files as an array of hashes:

Please note how each hash contains only the keys for columns with non-null values.

```ruby
$ cat pets.csv
first name,last name,dogs,cats,birds,fish
Dan,McAllister,2,,,
Lucy,Laweless,,5,,
Miles,O'Brian,,,,21
Nancy,Homes,2,,1,
$ irb
> require 'smarter_csv'
 => true
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
 => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
      {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
      {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
      {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
    ]
```

#### Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:

```ruby
# without using chunks:
filename = '/tmp/some.csv'
options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |array|
  # we're passing a block in, to process each resulting hash / row (the block receives an array of hashes)
  # when chunking is not enabled, there is only one hash in each array
  MyModel.create( array.first )
end
# => returns the number of chunks / rows processed
```

#### Example 4: Processing a CSV File, and inserting batch jobs in Sidekiq:

```ruby
filename = '/tmp/input.csv' # CSV file containing ids or data to process
options = { :chunk_size => 100 }
n = SmarterCSV.process(filename, options) do |chunk|
  Sidekiq::Client.push_bulk(
    'class' => SidekiqIndividualWorkerClass,
    'args' => chunk.map { |hash| [hash] }, # push_bulk expects an array of argument-arrays, one per job
  )
  # OR:
  # SidekiqBatchWorkerClass.process_async(chunk) # pass an array of hashes to a Sidekiq worker for parallel processing
end
# => returns the number of chunks
```

# Batch Processing

Processing CSV data in batches (chunks) allows you to parallelize the workload of importing data.
This can come in handy when you don't want to slow down the CSV import of large files.

Setting the option `chunk_size` sets the maximum batch size.

# Example 1: How SmarterCSV processes CSV files as chunks, returning arrays of hashes:

Please note how the returned array contains two sub-arrays, one per chunk that was read, each containing 2 hashes.
In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.

```ruby
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
 => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
      [ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
    ]
```
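
For illustration, a sketch of the expected output for the same 4-row `pets.csv` with `:chunk_size => 3`: the last chunk holds the single remaining row.

```ruby
> SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 3})
 => [ [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
       {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
       {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"} ],
     [ {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"} ] ]
```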

# Example 2: How SmarterCSV processes CSV files as chunks, and passes arrays of hashes to a given block:

Please note how the given block is passed the data for each chunk as its parameter (an array of hashes),
and how the `process` method returns the number of chunks when called with a block.

```ruby
> total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
    chunk.each do |h|  # you can post-process the data from each row to your heart's content, and also create virtual attributes:
      h[:full_name] = [h[:first], h[:last]].join(' ')  # create a virtual attribute
      h.delete(:first) ; h.delete(:last)  # remove two keys
    end
    puts chunk.inspect  # we could at this point pass the chunk to a Resque worker.
  end

[{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
[{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
 => 2
```

# Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:

```ruby
# using chunks:
filename = '/tmp/some.csv'
options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |chunk|
  # we're passing a block in, to process each resulting hash / row (the block receives an array of hashes)
  # when chunking is enabled, there are up to :chunk_size hashes in each chunk
  MyModel.collection.insert( chunk ) # insert up to 100 records at a time
end
# => returns the number of chunks processed
```

# Custom / Non-Standard CSV Formats

Besides custom values for `col_sep` and `row_sep`, other common deviations from the standard CSV format are:
* a number of leading lines before the header or data section starts
* comment lines, e.g. lines starting with `#`

To handle these special cases, please use the following options.

# Example 1:

In this example, we use `skip_lines: 3` to skip and ignore the first 3 lines in the input.
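
A minimal sketch of what that could look like (the file name and its three leading junk lines are hypothetical):

```ruby
# /tmp/report.csv -- a hypothetical file with 3 non-CSV lines before the header:
#   Quarterly Report
#   Internal Use Only
#   (empty line)
options = { :skip_lines => 3 }
data = SmarterCSV.process('/tmp/report.csv', options)
# => returns an array of hashes, parsed starting at the header on line 4
```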

# Example 2: reading an iTunes DB dump

In this example, we use `comment_regexp` to filter out and ignore any lines starting with `#`.

```ruby
# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
filename = '/tmp/strange_db_dump'
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
  :chunk_size => 100, :key_mapping => {:export_date => nil, :name => :genre},
}
n = SmarterCSV.process(filename, options) do |chunk|
  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to a Sidekiq worker for parallel processing
end
# => returns the number of chunks
```

# Automatic Detection

By default, SmarterCSV automatically detects the row and column separators based on the data in the given input, using the default settings `col_sep: :auto` and `row_sep: :auto`.

These options can be overridden.

# Column Separator

The automatic detection of column separators considers: `',', "\t", ';', ':', '|'`.

Some CSV files may contain an unusual column separator, which could even be a control character.

# Row Separator

The automatic detection of row separators considers: `\n`, `\r\n`, `\r`.

Some CSV files may contain an unusual row separator, which could even be a control character.

# Example 1: reading an iTunes DB dump

```ruby
# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
filename = '/tmp/strange_db_dump'
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
  :chunk_size => 100, :key_mapping => {:export_date => nil, :name => :genre},
}
n = SmarterCSV.process(filename, options) do |chunk|
  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to a Sidekiq worker for parallel processing
end
# => returns the number of chunks
```

# Example 2: Reading a CSV-File with custom col_sep, row_sep

```ruby
filename = '/tmp/input_file.txt'
recordsA = SmarterCSV.process(filename, {:col_sep => "#", :row_sep => "|"})
# => returns an array of hashes
```

# Using Value Converters

Value Converters allow you to do custom transformations on specific fields, to help you massage the data so it fits the expectations of your down-stream process, such as creating a DB record.

If you use `key_mappings` and `value_converters` together, make sure that the value converters reference the keys by their final mapped names, not their original names in the CSV file.

```ruby
$ cat spec/fixtures/with_dates.csv
first,last,date,price
Ben,Miller,10/30/1998,$44.50
Tom,Turner,2/1/2011,$15.99
Ken,Smith,01/09/2013,$199.99

$ irb
> require 'smarter_csv'
> require 'date'

# define a custom converter class, which implements self.convert(value)
class DateConverter
  def self.convert(value)
    Date.strptime(value, '%m/%d/%Y') # parses the custom date format into a Date instance
  end
end

class DollarConverter
  def self.convert(value)
    value.sub('$', '').to_f # strips the dollar sign and converts to Float
  end
end

options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
first_record = data.first
first_record[:date]
 => #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
first_record[:date].class
 => Date
first_record[:price]
 => 44.5
first_record[:price].class
 => Float
```
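
To illustrate the note above about `key_mappings`: a minimal sketch, assuming a hypothetical file `/tmp/purchases.csv` whose header names the price column `purchase_price`:

```ruby
options = {
  :key_mapping      => {:purchase_price => :price},  # rename the CSV header first
  :value_converters => {:price => DollarConverter},  # reference the *mapped* key, not :purchase_price
}
data = SmarterCSV.process('/tmp/purchases.csv', options)
```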