Commit 8f5db61

updating docs

tilo committed Jul 7, 2024
1 parent 866b638 commit 8f5db61
Showing 7 changed files with 880 additions and 292 deletions.
476 changes: 184 additions & 292 deletions README.md

Large diffs are not rendered by default.

464 changes: 464 additions & 0 deletions doc/README.md

Large diffs are not rendered by default.

61 changes: 61 additions & 0 deletions doc/examples.md
@@ -0,0 +1,61 @@

### Examples

Here are some examples to demonstrate the versatility of SmarterCSV.

**It is generally recommended to rescue `SmarterCSV::Error` or its sub-classes.**

By default, SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception is raised, e.g. `NoColSepDetected`. Rescuing these exceptions ensures that you don't silently fail to process CSV files when users upload them in unexpected formats.

In rare cases you may have to manually set these values, after going through the troubleshooting procedure described above.
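
A minimal sketch of this defensive pattern (the filename and the fallback separator are hypothetical):

```ruby
require 'smarter_csv'

begin
  data = SmarterCSV.process('/tmp/upload.csv')
rescue SmarterCSV::NoColSepDetected
  # automatic detection failed; retry with an explicit column separator
  data = SmarterCSV.process('/tmp/upload.csv', {:col_sep => ';'})
rescue SmarterCSV::Error => e
  # catch-all for other SmarterCSV errors
  warn "CSV import failed: #{e.message}"
end
```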

#### Example 1: How SmarterCSV processes CSV files as an array of hashes:
Please note how each hash contains only the keys of columns with non-empty values.

```ruby
$ cat /tmp/pets.csv
first name,last name,dogs,cats,birds,fish
Dan,McAllister,2,,,
Lucy,Laweless,,5,,
Miles,O'Brian,,,,21
Nancy,Homes,2,,1,
$ irb
> require 'smarter_csv'
=> true
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
=> [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
{:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
{:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
{:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
]
```
#### Example 2: Populate a MySQL or MongoDB Database with SmarterCSV:
```ruby
# without using chunks:
filename = '/tmp/some.csv'
options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |array|
  # we're passing in a block, to process each resulting hash / row (the block receives an array of hashes)
  # when chunking is not enabled, there is only one hash in each array
  MyModel.create(array.first)
end
# => returns the number of rows processed
```
#### Example 3: Processing a CSV file and inserting batch jobs into Sidekiq:
```ruby
filename = '/tmp/input.csv' # CSV file containing ids or data to process
options = { :chunk_size => 100 }
n = SmarterCSV.process(filename, options) do |chunk|
  Sidekiq::Client.push_bulk(
    'class' => SidekiqIndividualWorkerClass,
    'args' => chunk.map { |row| [row] }, # push_bulk expects an array of argument arrays
  )
  # OR:
  # SidekiqBatchWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
end
# => returns the number of chunks processed
```
53 changes: 53 additions & 0 deletions doc/examples/batch_processing.md
@@ -0,0 +1,53 @@

# Batch Processing

Processing CSV data in batches (chunks) allows you to parallelize the workload of importing data.
This can come in handy when you don't want the import of large CSV files to slow you down.

Setting the option `chunk_size` sets the max batch size.


# Example 1: How SmarterCSV processes CSV files as chunks, returning arrays of hashes:
Please note how the returned array contains two sub-arrays, one per chunk that was read, each containing 2 hashes.
In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.

```ruby
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
=> [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
[ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
]
```

# Example 2: How SmarterCSV processes CSV files as chunks, and passes arrays of hashes to a given block:
Please note how the given block is passed the data for each chunk as a parameter (an array of hashes),
and how the `process` method returns the number of chunks when called with a block.

```ruby
> total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
    chunk.each do |h| # you can post-process the data from each row to your heart's content, and also create virtual attributes:
      h[:full_name] = [h[:first], h[:last]].join(' ') # create a virtual attribute
      h.delete(:first); h.delete(:last) # remove two keys
    end
    puts chunk.inspect # we could at this point pass the chunk to a Resque worker...
  end

[{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
[{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
=> 2
```

# Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:
```ruby
# using chunks:
filename = '/tmp/some.csv'
options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |chunk|
  # we're passing in a block, to process each resulting hash / row (the block receives an array of hashes)
  # when chunking is enabled, there are up to :chunk_size hashes in each chunk
  MyModel.collection.insert(chunk) # insert up to 100 records at a time (newer Mongo drivers use `insert_many`)
end
# => returns the number of chunks processed
```


33 changes: 33 additions & 0 deletions doc/examples/custom_csv_formats.md
@@ -0,0 +1,33 @@
# Custom / Non-Standard CSV Formats

Besides custom values for `col_sep` and `row_sep`, CSV files can deviate from the standard format in other ways, e.g.:
* a number of leading lines before the header or data section starts
* comment lines, e.g. lines starting with `#`

To handle these special cases, please use the following options.


# Example 1: Skipping leading lines
In this example, we use `skip_lines: 3` to skip and ignore the first 3 lines of the input, as shown in the sketch below.
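
A minimal sketch, assuming a file whose first 3 lines are a preamble (the filename and layout are hypothetical):

```ruby
# suppose the file begins with 3 non-CSV lines, e.g. a report title, a date, and a blank line
filename = '/tmp/report_with_preamble.csv'
data = SmarterCSV.process(filename, {:skip_lines => 3})
```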





# Example 2: Reading an iTunes DB dump

In this example, we use `comment_regexp` to filter out and ignore any lines starting with `#`.


```ruby
# Consider a file with CTRL-A as col_sep, and with CTRL-B\n as row_sep (hello iTunes!)
filename = '/tmp/strange_db_dump'
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
  :chunk_size => 100, :key_mapping => {:export_date => nil, :name => :genre},
}
n = SmarterCSV.process(filename, options) do |chunk|
  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
end
# => returns the number of chunks processed
```
42 changes: 42 additions & 0 deletions doc/examples/row_col_sep.md
@@ -0,0 +1,42 @@

# Automatic Detection

SmarterCSV defaults to automatically detecting row and column separators based on the data in the given input, using the defaults `col_sep: :auto`, `row_sep: :auto`.

These options can be overridden.

# Column Separator

The automatic detection of column separators considers `,`, `\t`, `;`, `:`, and `|`.

Some CSV files may contain an unusual column separator, which could even be a control character.

# Row Separator

The automatic detection of row separators considers `\n`, `\r\n`, and `\r`.

Some CSV files may contain an unusual row separator, which could even be a control character.

# Example 1: Reading an iTunes DB dump

```ruby
# Consider a file with CTRL-A as col_sep, and with CTRL-B\n as row_sep (hello iTunes!)
filename = '/tmp/strange_db_dump'
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
  :chunk_size => 100, :key_mapping => {:export_date => nil, :name => :genre},
}
n = SmarterCSV.process(filename, options) do |chunk|
  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to Sidekiq workers for parallel processing
end
# => returns the number of chunks processed
```

# Example 2: Reading a CSV file with custom `col_sep` and `row_sep`

```ruby
filename = '/tmp/input_file.txt'
recordsA = SmarterCSV.process(filename, {:col_sep => "#", :row_sep => "|"})
# => returns an array of hashes
```
43 changes: 43 additions & 0 deletions doc/examples/value_converters.md
@@ -0,0 +1,43 @@

# Using Value Converters

Value Converters allow you to apply custom transformations to the values of specific columns, to help you massage the data so it fits the expectations of your downstream process, such as creating a DB record.

If you use `key_mappings` and `value_converters` together, make sure that the value converters reference the keys by their final mapped names, not by their original names in the CSV file.
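
For example, assuming a CSV header `purchase_date` that you map to `:date` (a hypothetical sketch, using a `DateConverter` class like the one defined below):

```ruby
options = {
  :key_mapping => {:purchase_date => :date},     # CSV header "purchase_date" becomes :date
  :value_converters => {:date => DateConverter}, # keyed on the mapped name, not :purchase_date
}
```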

```ruby
$ cat spec/fixtures/with_dates.csv
first,last,date,price
Ben,Miller,10/30/1998,$44.50
Tom,Turner,2/1/2011,$15.99
Ken,Smith,01/09/2013,$199.99

$ irb
> require 'smarter_csv'
> require 'date'

# define a custom converter class, which implements self.convert(value)
class DateConverter
  def self.convert(value)
    Date.strptime(value, '%m/%d/%Y') # parses custom date format into a Date instance
  end
end

class DollarConverter
  def self.convert(value)
    value.sub('$', '').to_f # strips the dollar sign and converts to a Float
  end
end

options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
first_record = data.first
first_record[:date]
=> #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
first_record[:date].class
=> Date
first_record[:price]
=> 44.5
first_record[:price].class
=> Float
```
