Tools for working with CSV data (or other basic tabular data formats)
A silly utility script that outputs a list of unique characters used in the specified field(s) of a CSV, or in the whole CSV if no fields are specified.
Initial use case: find all characters used in CollectionSpace media file names, so I could replace any characters unsafe for use in S3 key names with something else.
Do ruby chars_in_field.rb -h for usage details.
ℹ️ Currently supports comma-, tab-, or pipe-delimited files.
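The core idea can be sketched in Ruby as follows. This is a minimal sketch, assuming a unique_chars helper name and keyword interface that are illustrative, not the script's actual API:

```ruby
require 'csv'
require 'set'

# Collect the set of unique characters appearing in the given fields of a
# CSV string (all fields if none are specified). Names are illustrative.
def unique_chars(csv_string, fields: nil, col_sep: ',')
  chars = Set.new
  CSV.parse(csv_string, headers: true, col_sep: col_sep).each do |row|
    values = fields ? fields.map { |f| row[f] } : row.fields
    values.compact.each { |v| v.each_char { |c| chars << c } }
  end
  chars.to_a.sort
end
```

Running this against a file-name column would surface characters like spaces or `#` that need replacing before use in S3 keys.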
- Writes out a report on whether the rows of a CSV all have the same number of columns
- If a file has any ragged rows, it tells you how many columns the odd rows have and gives you the row numbers of up to 3 rows having that number of columns
- Where the example field for a non-ok file is blank, that row count is the expected/correct number of columns, based on the number of headers
This was a mess and I deleted it. It’s being replaced by simpler scripts that do specific parts of what this did.
❗ Script not finished. Do not use yet.
❗ If you have not installed or used the …
Given the path to a directory, a file suffix, and an optional minimum file size (in bytes)…
Prints to screen the following info for all files with that suffix in the directory:
- size
- result of the unix file command (file encoding, presence of "very long lines," line terminators)
Do ruby file_info.rb for parameters/options.
Example output:
dbo_ZLocalUseAttribs.txt: size: 814242 info: Little-endian UTF-16 Unicode text, with CRLF line terminators
dbo_ZResultFields.txt: size: 13274 info: Little-endian UTF-16 Unicode text, with CRLF line terminators
dbo_dtproperties.txt: size: 34462 info: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
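The behavior described above can be sketched like so. The method name and options here are illustrative, not file_info.rb's actual interface:

```ruby
# For each file with the given suffix in a directory, report its size and
# the output of the unix `file` command. Names are illustrative.
def file_report(dir, suffix, min_size: 0)
  Dir.glob(File.join(dir, "*.#{suffix}")).sort.filter_map do |path|
    size = File.size(path)
    next if size < min_size
    info = begin
      # -b omits the filename prefix from `file`'s output
      IO.popen(['file', '-b', path], &:read).strip
    rescue Errno::ENOENT
      'file command not available'
    end
    "#{File.basename(path)}: size: #{size} info: #{info}"
  end
end
```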
Usage example, with defaults:
ruby profiler.rb -i ~/data/export/ -o ~/data/profiled
All .csv files in the ~/data/export directory are included, and the default details mode is files. Default options sent to the Ruby standard library CSV parser are:
{headers: true, header_converters: [:downcase], converters: [:stripplus], skip_blanks: true, empty_value: nil}
Usage example: compiled details for tab-separated .tsv files:
ruby profiler.rb -i ~/data/export/ -o ~/data/profiled -s tsv -c '{col_sep: "\t"}' -d compiled
All .tsv files in the ~/data/export directory are included. The Ruby standard library CSV parser option col_sep: "\t" is merged into the default option hash shown above.
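The option merging might look like the following sketch. Note that :stripplus is a custom converter the profiler registers; its implementation here (plain whitespace stripping) is an assumption, and the actual converter may do more:

```ruby
require 'csv'

# Assumed implementation of the :stripplus converter referenced in the
# default options; the real one may do more than strip whitespace.
CSV::Converters[:stripplus] = ->(value) { value.respond_to?(:strip) ? value.strip : value }

DEFAULT_CSV_OPTS = {
  headers: true, header_converters: [:downcase],
  converters: [:stripplus], skip_blanks: true, empty_value: nil
}.freeze

# User-supplied options (e.g. from -c '{col_sep: "\t"}') override or
# extend the defaults via a simple hash merge.
def csv_opts(user_opts = {})
  DEFAULT_CSV_OPTS.merge(user_opts)
end
```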
One .csv file is written to the output directory per table column. For example, if source file addresses.csv has a :city column, an address_city.csv file is written.
The output CSV has one row per unique value found in the source column. The first column is the occurrence count of the value in the source column. The second column is the value.
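The per-column rows described above can be sketched as follows (method name illustrative):

```ruby
require 'csv'

# Given parsed CSV rows and a column name, return [occurrence_count, value]
# pairs for each unique value, most frequent first.
def column_value_counts(rows, column)
  counts = rows.map { |r| r[column] }.tally
  counts.sort_by { |_value, n| -n }.map { |value, n| [n, value] }
end
```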
Given a directory containing CSV files, writes out two CSV reports:
- summary - a row for each column in the source CSVs, with the following columns:
  - table - source CSV name
  - column - column name
  - column index - for putting columns in the order in which they appear in the source document
  - uniq vals - count of unique values found in the column
  - null vals - count of empty cells in the column
- details - a row for each unique value in each column in the source CSVs, with the following columns:
  - table - source CSV name
  - column - column name
  - column index - for putting columns in the order in which they appear in the source document
  - value - a unique value found in the column (empty values are represented as "NULL VALUE/EMPTY FIELD")
  - occurrences - number of times the value occurs in the column
❗ There’s a known bug where not all apparently empty fields get counted as "NULL VALUE/EMPTY FIELD". The number that get left out is small and I didn’t have time to chase this down now, but will try the next time I need this thing.
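The summary counts can be sketched as follows. The method name is illustrative, and the real script also tracks table name and column index:

```ruby
require 'csv'

# Count unique values and empty cells in one column of parsed CSV rows.
def summarize_column(rows, column)
  values = rows.map { |r| r[column] }
  nulls  = values.count { |v| v.nil? || v.strip.empty? }
  uniq   = values.reject { |v| v.nil? || v.strip.empty? }.uniq.length
  { 'uniq vals' => uniq, 'null vals' => nulls }
end
```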
Reformats a list of CSVs, allowing you to change the separator and escape characters. Output is to STDOUT.
💡 This can handle parsing … To run this script on all files in a directory, writing the reformatted files to another directory, see reformat_tables.rb below.
Usage: reformat_csv.pl [options] FILES
Usage example: perl reformat_csv.pl --input_sep ';' --input_esc '#' ~/data/test.csv > ~/data/test_fix.csv
- input_sep - Separator character in input CSVs (default: ,)
- input_esc - Escape character in input CSVs (default: ")
- output_sep - Separator character in output CSVs (default: ,)
- output_esc - Escape character in output CSVs (default: ")
💡 To pass TAB as input_sep or output_sep, use the literal tab character by typing Ctrl-v, then TAB on the command line.
While handy, this program primarily exists to take advantage of Text::CSV_XS’s ability to deal with unescaped quotes in fields. To do this, set input_esc to anything other than '"', for instance '#'.
Requires the Text::CSV_XS Perl module to be installed.
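For comparison, the basic reformatting loop (change the separator, re-emit) looks like this in Ruby. This sketch does not replicate Text::CSV_XS's unescaped-quote recovery, which is the Perl script's real selling point:

```ruby
require 'csv'

# Re-emit CSV input with a different separator. A rough analogue of
# reformat_csv.pl's core loop, minus the escape-character handling.
def reformat(input, in_sep: ',', out_sep: "\t")
  CSV.parse(input, col_sep: in_sep).map { |row| row.to_csv(col_sep: out_sep) }.join
end
```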
This is a wrapper around reformat_csv.pl. It requires you to have Perl and the Text::CSV_XS module installed.
The input/output sep and esc options are the same as described for reformat_csv.pl.
The only required argument is --input (or -i), which specifies the directory containing the tabular data files you wish to reformat.
If no --output/-o value is given, a new directory called reformatted is created in your --input directory, and reformatted files are saved there. Any other directory value can be provided; if the directory does not exist at run time, it will be created.
The file suffix (--suffix/-s) defaults to csv.
Usage example:
ruby reformat_tables.rb -i ~/data/lafayette/export --input_sep ';' --input_esc "#" --output_sep ' '
Reads semicolon-delimited .csv files containing unescaped quotes and writes them out tab-delimited.
Meant to be used to investigate specific files reported by clean_convert_report.rb as having bad structure (i.e. ragged columns: some rows having a different number of columns than other rows).
Given the path to a file, a delimiter name, the number of columns you want to see rows for, and an optional number of rows you want to see…
Outputs to screen rows with the given number of columns.
This is useful for coming up with the specific find/replace mess you are going to have to implement to keep rows from being broken up in a ragged way.
Generally I use this iteratively, with edits made to a migration-specific copy of clean_convert_report.rb, to eliminate or minimize the number of ragged-column files I end up having to manually fix for a migration.
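The row-selection step can be sketched as follows (method name and argument shape are illustrative):

```ruby
require 'csv'

# Return rows from the input that parse to the given number of columns,
# optionally capped at a limit.
def rows_with_column_count(input, col_sep, count, limit = nil)
  hits = CSV.parse(input, col_sep: col_sep).select { |row| row.length == count }
  limit ? hits.first(limit) : hits
end
```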
Utility script for creating data review spreadsheet.
Given a directory containing tabular data files, outputs two CSVs:
- tables.csv
  - table/filename
  - column count
  - row count
- columns.csv
  - table/filename
  - column name
These become two tabs in a data review tracking/mapping Excel sheet.
Do ruby tables_and_columns.rb -h for parameters/options.
Useful for initial data review work.
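Building the two outputs can be sketched like this. The method name and the assumption that input arrives as a hash of filename to parsed table are illustrative:

```ruby
require 'csv'

# Given { filename => CSV::Table }, build the row arrays for tables.csv
# (name, column count, row count) and columns.csv (name, column name).
def tables_and_columns(tables)
  tables_rows  = tables.map { |name, t| [name, t.headers.length, t.length] }
  columns_rows = tables.flat_map { |name, t| t.headers.map { |h| [name, h] } }
  [tables_rows, columns_rows]
end
```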
Reads all files with given file suffix in the given directory. For each, prints out the file/table name, headers, and the first 25 rows of data, nicely formatted, in one text file you can scroll/search through. You don’t have to open a million files to get your head around the general shape and character of the data.
Requires csvlook from csvkit to be installed and available in your PATH.
Do ruby table_preview.rb -h for parameters/options.
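The slice that gets fed to csvlook can be sketched like this (method name illustrative; the 25-row cap matches the description above):

```ruby
require 'csv'

# Re-serialize the header plus the first n data rows of a CSV string,
# producing the preview slice to pipe to csvlook.
def preview_slice(input, n = 25)
  rows = CSV.parse(input)
  ([rows.first] + rows.drop(1).first(n)).map(&:to_csv).join
end

# The script would then pipe the slice through csvlook, e.g.:
#   IO.popen(['csvlook'], 'r+') { |io| io.write(slice); io.close_write; io.read }
```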