### Examples

Here are some examples to demonstrate the versatility of SmarterCSV.

**It is generally recommended to rescue `SmarterCSV::Error` or its sub-classes.**

By default, SmarterCSV determines the `row_sep` and `col_sep` values automatically. In cases where the automatic detection fails, an exception is raised, e.g. `NoColSepDetected`. Rescuing from these exceptions ensures that you don't miss processing CSV files when users upload files with unexpected formats.

In rare cases you may have to set these values manually, after going through the troubleshooting procedure described above.
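
For instance, a minimal sketch of such a rescue (the file path, the fallback separator, and the exact exception class name `SmarterCSV::NoColSepDetected` are assumptions for illustration):

```ruby
require 'smarter_csv'

begin
  data = SmarterCSV.process('/tmp/upload.csv')
rescue SmarterCSV::NoColSepDetected
  # automatic detection failed; retry with an explicitly set column separator
  data = SmarterCSV.process('/tmp/upload.csv', {:col_sep => ';'})
rescue SmarterCSV::Error => e
  # catch-all for other SmarterCSV errors
  warn "CSV import failed: #{e.message}"
end
```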

#### Example 1a: How SmarterCSV processes CSV files as an array of hashes:

Please note how each hash contains only the keys for columns with non-null values.

```ruby
$ cat pets.csv
first name,last name,dogs,cats,birds,fish
Dan,McAllister,2,,,
Lucy,Laweless,,5,,
Miles,O'Brian,,,,21
Nancy,Homes,2,,1,
$ irb
> require 'smarter_csv'
 => true
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
 => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
      {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
      {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"},
      {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
    ]
```

#### Example 3: Populate a MySQL or MongoDB Database with SmarterCSV:

```ruby
# without using chunks:
filename = '/tmp/some.csv'
options = {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |array|
  # we're passing a block in, to process each resulting hash / row (the block receives an array of hashes)
  # when chunking is not enabled, there is only one hash in each array
  MyModel.create( array.first )
end
# => returns the number of chunks / rows processed
```

#### Example 4: Processing a CSV File, and inserting batch jobs in Sidekiq:

```ruby
filename = '/tmp/input.csv' # CSV file containing ids or data to process
options = { :chunk_size => 100 }
n = SmarterCSV.process(filename, options) do |chunk|
  Sidekiq::Client.push_bulk(
    'class' => SidekiqIndividualWorkerClass,
    'args' => chunk.map { |hash| [hash] }, # push_bulk expects an array of argument-arrays, one per job
  )
  # OR:
  # SidekiqBatchWorkerClass.process_async(chunk) # pass an array of hashes to a Sidekiq worker for parallel processing
end
# => returns the number of chunks
```

# Batch Processing

Processing CSV data in batches (chunks) allows you to parallelize the workload of importing data.
This can come in handy when you don't want to slow down the CSV import of large files.

Setting the option `chunk_size` sets the maximum batch size.

# Example 1: How SmarterCSV processes CSV files as chunks, returning arrays of hashes:

Please note how the returned array contains two sub-arrays, one per chunk that was read, each containing 2 hashes.
In case the number of rows is not cleanly divisible by `:chunk_size`, the last chunk contains fewer hashes.

```ruby
> pets_by_owner = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}})
 => [ [ {:first=>"Dan", :last=>"McAllister", :dogs=>"2"}, {:first=>"Lucy", :last=>"Laweless", :cats=>"5"} ],
      [ {:first=>"Miles", :last=>"O'Brian", :fish=>"21"}, {:first=>"Nancy", :last=>"Homes", :dogs=>"2", :birds=>"1"} ]
    ]
```
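
For illustration, a sketch of the expected output for the same 4-row `pets.csv` with `:chunk_size => 3`: the last chunk holds the single remaining row.

```ruby
> SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 3})
 => [ [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"},
       {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"},
       {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"} ],
     [ {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"} ] ]
```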

# Example 2: How SmarterCSV processes CSV files as chunks, and passes arrays of hashes to a given block:

Please note how the given block is passed the data for each chunk as its parameter (an array of hashes),
and how the `process` method returns the number of chunks when called with a block.

```ruby
> total_chunks = SmarterCSV.process('/tmp/pets.csv', {:chunk_size => 2, :key_mapping => {:first_name => :first, :last_name => :last}}) do |chunk|
    chunk.each do |h|  # you can post-process the data from each row to your heart's content, and also create virtual attributes:
      h[:full_name] = [h[:first], h[:last]].join(' ')  # create a virtual attribute
      h.delete(:first) ; h.delete(:last)  # remove two keys
    end
    puts chunk.inspect  # we could at this point pass the chunk to a Resque worker.
  end

[{:dogs=>"2", :full_name=>"Dan McAllister"}, {:cats=>"5", :full_name=>"Lucy Laweless"}]
[{:fish=>"21", :full_name=>"Miles O'Brian"}, {:dogs=>"2", :birds=>"1", :full_name=>"Nancy Homes"}]
 => 2
```

# Example 3: Populate a MongoDB Database in Chunks of 100 records with SmarterCSV:

```ruby
# using chunks:
filename = '/tmp/some.csv'
options = {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}
n = SmarterCSV.process(filename, options) do |chunk|
  # we're passing a block in, to process each resulting hash / row (the block receives an array of hashes)
  # when chunking is enabled, there are up to :chunk_size hashes in each chunk
  MyModel.collection.insert( chunk ) # insert up to 100 records at a time
end
# => returns the number of chunks processed
```

# Custom / Non-Standard CSV Formats

Besides custom values for `col_sep` and `row_sep`, other common deviations from the standard CSV format are:
* a number of leading lines before the header or data section starts
* comment lines, e.g. lines starting with `#`

To handle these special cases, please use the following options.

# Example 1:

In this example, we use `skip_lines: 3` to skip and ignore the first 3 lines in the input.
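
A minimal sketch of what that could look like (the file name and its three leading junk lines are hypothetical):

```ruby
# /tmp/report.csv -- a hypothetical file with 3 non-CSV lines before the header:
#   Quarterly Report
#   Internal Use Only
#   (empty line)
options = { :skip_lines => 3 }
data = SmarterCSV.process('/tmp/report.csv', options)
# => returns an array of hashes, parsed starting at the header on line 4
```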

# Example 2: reading an iTunes DB dump

In this example, we use `comment_regexp` to filter out and ignore any lines starting with `#`.

```ruby
# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
filename = '/tmp/strange_db_dump'
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
  :chunk_size => 100, :key_mapping => {:export_date => nil, :name => :genre},
}
n = SmarterCSV.process(filename, options) do |chunk|
  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to a Sidekiq worker for parallel processing
end
# => returns the number of chunks
```

# Automatic Detection

By default, SmarterCSV automatically detects the row and column separators based on the data in the given input, using the default settings `col_sep: :auto` and `row_sep: :auto`.

These options can be overridden.

# Column Separator

The automatic detection of column separators considers: `',', "\t", ';', ':', '|'`.

Some CSV files may contain an unusual column separator, which could even be a control character.

# Row Separator

The automatic detection of row separators considers: `\n`, `\r\n`, `\r`.

Some CSV files may contain an unusual row separator, which could even be a control character.

# Example 1: reading an iTunes DB dump

```ruby
# Consider a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
filename = '/tmp/strange_db_dump'
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
  :chunk_size => 100, :key_mapping => {:export_date => nil, :name => :genre},
}
n = SmarterCSV.process(filename, options) do |chunk|
  SidekiqWorkerClass.process_async(chunk) # pass an array of hashes to a Sidekiq worker for parallel processing
end
# => returns the number of chunks
```

# Example 2: Reading a CSV-File with custom col_sep, row_sep

```ruby
filename = '/tmp/input_file.txt'
recordsA = SmarterCSV.process(filename, {:col_sep => "#", :row_sep => "|"})
# => returns an array of hashes
```

# Using Value Converters

Value Converters allow you to do custom transformations on specific fields, to help you massage the data so it fits the expectations of your down-stream process, such as creating a DB record.

If you use `key_mappings` and `value_converters` together, make sure that the value converters reference the keys by their final mapped names, not their original names in the CSV file.

```ruby
$ cat spec/fixtures/with_dates.csv
first,last,date,price
Ben,Miller,10/30/1998,$44.50
Tom,Turner,2/1/2011,$15.99
Ken,Smith,01/09/2013,$199.99

$ irb
> require 'smarter_csv'
> require 'date'

# define a custom converter class, which implements self.convert(value)
class DateConverter
  def self.convert(value)
    Date.strptime(value, '%m/%d/%Y') # parses the custom date format into a Date instance
  end
end

class DollarConverter
  def self.convert(value)
    value.sub('$', '').to_f # strips the dollar sign and converts to Float
  end
end

options = {:value_converters => {:date => DateConverter, :price => DollarConverter}}
data = SmarterCSV.process("spec/fixtures/with_dates.csv", options)
first_record = data.first
first_record[:date]
 => #<Date: 1998-10-30 ((2451117j,0s,0n),+0s,2299161j)>
first_record[:date].class
 => Date
first_record[:price]
 => 44.5
first_record[:price].class
 => Float
```
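
To illustrate the note above about `key_mappings`: a minimal sketch, assuming a hypothetical file `/tmp/purchases.csv` whose header names the price column `purchase_price`:

```ruby
options = {
  :key_mapping      => {:purchase_price => :price},  # rename the CSV header first
  :value_converters => {:price => DollarConverter},  # reference the *mapped* key, not :purchase_price
}
data = SmarterCSV.process('/tmp/purchases.csv', options)
```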