Skip to content

Tools for working with CSV data (or other basic tabular data formats)

License

Notifications You must be signed in to change notification settings

lyrasis/csv-data-tools

Repository files navigation

csv-data-tools

chars_in_field.rb

A silly utility script to output a list of unique characters used in specified field(s) of a CSV, or the whole CSV if no fields are specified.

Initial use case: find all characters used in CollectionSpace media file names, so I could replace any characters unsafe for use in S3 key names with something else.

Do ruby chars_in_field.rb -h for usage details.

check_structure.rb

ℹ️
Supports comma, tab, or pipe delimited files currently
  • Writes out a report on whether the rows of a CSV all have the same number of columns

  • If a file has any ragged rows, it tells you how many columns the weird rows have and gives you the row numbers of up to 3 rows having that number of columns.

  • Where the example field for a non-ok file is blank, that’s the expected/correct number of columns based on the number of headers

clean_convert_report.rb

This was a mess and I deleted it. It’s being replaced by simpler scripts that do specific parts of what this did.

convert_encoding.rb

⚠️
script not finished. Do not use yet

If you have not installed or used the charlock_holmes gem before, you need to: * install icu4c if you do not have it:

brew install icu4c

  • run the following command in your terminal before running this script:

bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c

file_info.rb

Given path to a directory, a file suffix, and an optional minimum file size (in bytes)…​

Prints to screen a the following info for all files with suffix in the directory:

  • size

  • result of unix file command (file encoding, presence of "very long lines," line terminators)

Do ruby file_info.rb for parameters/options.

Example output:

dbo_ZLocalUseAttribs.txt:
  size: 814242
  info: Little-endian UTF-16 Unicode text, with CRLF line terminators


dbo_ZResultFields.txt:
  size: 13274
  info: Little-endian UTF-16 Unicode text, with CRLF line terminators


dbo_dtproperties.txt:
  size: 34462
  info: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

profiler.rb

Usage example: Defaults:

ruby profiler.rb -i ~/data/export/ -o ~/data/profiled

All .csv files in ~/data/export directory are included, and the default details mode is files. Default options sent to Ruby standard library CSV parser are:

{headers: true, header_converters: [:downcase], converters: [:stripplus],
  skip_blanks: true, empty_value: nil}

Usage example: compiled details for tab-separated .tsv files:

ruby profiler.rb -i ~/data/export/ -o ~/data/profiled -s tsv -c '{col_sep: "\t"} -d compiled'

All .tsv files in ~/data/export directory are included. The Ruby standard library CSV parser option col_sep: "\t" is merged into the default option hash shown above.

When --details file

One .csv file written to output directory per table column.

For example, if source file addresses.csv has a :city column, there is an address_city.csv file written.

The output CSV has one row per unique value found in the source column. The first column is the occurrence count of the value in the source column. The second column is the value.

When --details compile

Given a directory containing CSV files, writes out two CSV reports:

  • summary - a row for each column in source CSVs, with the following columns:

    • table - source CSV name

    • column - column name

    • column index - for putting them in the order in which they appear in source document

    • uniq vals - count of unique values found in column

    • null vals - count of empty cells in column

  • details - a row for each unique value in each column in source CSVs, with the following columns:

    • table - source CSV name

    • column - column name

    • column index - for putting them in the order in which they appear in source document

    • value - a unique value found in column (puts "NULL VALUE/EMPTY FIELD" to represent that)

    • occurrences - number of time value occurs in column

⚠️
There’s a known bug where not all apparently empty fields are getting counted as "NULL VALUE/EMPTY FIELD". The number that get left out is small and I didn’t have time to chase this down now, but will try to the next time I need this thing.

reformat_csv.pl

Reformats a list of CSVs, allowing you to change the separator and escape characters. Output is to STDOUT.

💡

This can handle parsing \n inside quoted fields that contain unescaped quotes. We did not find a Ruby CSV parsing solution that handled this particular flavor of CSV horror. (Thanks, potential client legacy system which shall not be named…​)

To run this script on all files in a directory, writing the reformatted files to another directory, see reformat_tables.rb.

Usage: reformat_csv.pl [options] FILES

Usage example: perl reformat_csv.pl --input_sep ';' --input_esc '#' ~/data/test.csv > ~/data/test_fix.csv

Options:
  • input_sep - Separator character in input CSVs (default: ,)

  • input_esc - Escape character in input CSVs (default: ")

  • output_sep - Separator character in output CSVs (default: ,)

  • output_esc - Escape character in output CSVs (default: ")

💡
To pass TAB as input_sep or output_sep, use the literal tab character by typing Ctrl-v, then TAB on the command line.

While handy, this program primarily exists to take adavantage of Text::CSV_XS’s ability to deal with unescaped quotes in fields. To do this, set input_esc to anything other than '"', for instance '#'.

Requires you have the Text::CSV_XS Perl module installed

reformat_tables.rb

This is a wrapper around reformat_csv.pl. It requires you have Perl and the Text::CSV_XS module installed.

The input/output sep and esc options are the same as described for reformat_csv.pl

The only required argument is --input (or -i), which specifies the directory containing the tabular data files you wish to reformat.

If no --output/-o value is given, a new directory called reformatted is created in your --input directory, and reformatted files are saved in new directory. Any other directory value can be provided. If the directory does not exist at run time, it will be created.

File suffix (--suffix/-s) defaults to csv.

Usage example:

ruby reformat_tables.rb -i ~/data/lafayette/export --input_sep ';' --input_esc "#" --output_sep ' '

Writes semicolon delimited .csv files with unescaped quotes to tab-delimited.

show_rows_with_column_ct.rb

Meant to be used to investigate specific files reported by clean_convert_report.rb as having bad structure (i.e. ragged columns: some row having different number of columns than other rows)

Given path to file, delimiter name, number of columns you want to see rows for, and option number of rows you want to see…​

Outputs to screen rows with the given number of columns.

This is useful for coming up with the specific find/replace mess you are going to have to implement to keep rows from being broken up in a ragged way.

Generally I use this iteratively with edits made to a migration-specific copy of clean_convert_report.rb to eliminate or minimize the number of ragged-column files I end up having to manually fix for a migration.

tables_and_columns.rb

Utility script for creating data review spreadsheet.

Given a directory containing tabular data files, outputs two CSVs:

  • tables.csv

    • table/filename

    • column count

    • row count

  • columns.csv

    • table/filename

    • column name

These become two tabs in a data review tracking/mapping Excel sheet.

Do ruby tables_and_columns.rb -h for parameters/options.

table_preview.rb

Useful for initial data review work.

Reads all files with given file suffix in the given directory. For each, prints out the file/table name, headers, and the first 25 rows of data, nicely formatted, in one text file you can scroll/search through. You don’t have to open a million files to get your head around the general shape and character of the data.

Requires csvlook from csvkit to be installed and available in your PATH

Do ruby table_preview.rb -h for parameters/options.

About

Tools for working with CSV data (or other basic tabular data formats)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published