- Load the data into a SQLite database from the zip archive downloaded from Data.gov.
- Take a database, a name, and a sex as arguments and plot the frequency of the name over time.
- Find boy and girl names that are the most similar in terms of historical frequency.
-
PrepareData.jl : This code assumes that names.zip is in the same folder as the script.
To run, type: julia PrepareData.jl names.zip names.db
-
PlotData.jl : This code reads the database produced by PrepareData.jl.
To run, type: julia PlotData.jl names.db <name> <sex>
(Enter any name and sex. All names in the database start with an uppercase letter, so provide the name with its first letter capitalized.)
-
FindSimilarNames.jl :
- Open a terminal and type:
- (For macOS)
export JULIA_NUM_THREADS=4
(For Windows)
set JULIA_NUM_THREADS=4
- To run, type:
julia FindSimilarNames.jl
-
FindSimilarNamesExtended.jl :
- Open a terminal and type:
- (For macOS)
export JULIA_NUM_THREADS=4
(For Windows)
set JULIA_NUM_THREADS=4
- To run, type:
julia FindSimilarNamesExtended.jl
-
PrepareData.jl :
- Read the name of the input file and output file from the command line.
- Use the Julia ZipFile.jl library to scan the input zip file.
- Use the SQLite.jl library to interface with SQLite3.
- Create the BabyNames table using SQLite.jl.
- Scan the input zip file and find the files named "yob????.txt".
- For each such file, parse the content using the CSV.jl package.
- For each entry in the data file, write an entry in the table "names" recording the "year" (from the file name), "name", "sex" and "num" from the file content.
- Close the zip scanner and database connection
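The file-name and row-handling steps above can be sketched in plain Julia. This is a minimal illustration, not the code in PrepareData.jl: the helper names `year_from_filename` and `parse_entry` are hypothetical, the "????" part of the file name is assumed to be a four-digit year, and each data line is assumed to be a headerless, comma-separated `name,sex,num` record. The sample line is made up.

```julia
# Hypothetical helpers for illustration only.

# Return the year encoded in a name such as "yob1994.txt", or `nothing`
# when the name does not match (assumes "????" is always four digits).
function year_from_filename(filename::AbstractString)
    m = match(r"^yob(\d{4})\.txt$", basename(filename))
    return m === nothing ? nothing : parse(Int, m.captures[1])
end

# Turn one "name,sex,num" line into a row for the table
# (assumes comma-separated fields in exactly this order).
function parse_entry(year::Integer, line::AbstractString)
    name, sex, num = split(strip(line), ',')
    return (year = year, name = String(name), sex = String(sex), num = parse(Int, num))
end

yr = year_from_filename("yob1994.txt")
row = parse_entry(yr, "Alice,F,10")   # made-up sample line
println(row)
```

In the real script the archive would be read with ZipFile.jl, the content with CSV.jl, and each row written through SQLite.jl.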
-
PlotData.jl :
- Parse the command line arguments to extract the input.
- Open a connection to the database file using the SQLite.jl library.
- Query the database for the (year, num) pairs matching the provided name and sex.
- Sort the data by year.
- Plot the data using the Gadfly library.
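As a rough sketch of the argument-handling and query steps above, the snippet below validates the three arguments and builds a parameterized SQL string. The helper name `parse_plot_args` is hypothetical, the sex codes "M" and "F" are an assumption about the data, and the table and column names follow the "names" table described for PrepareData.jl. Ordering by year in the query is just one way to do the sort step.

```julia
# Hypothetical helper: check and normalize the command-line arguments.
function parse_plot_args(args::AbstractVector{<:AbstractString})
    length(args) == 3 || error("expected three arguments: <database> <name> <sex>")
    dbfile, name, sex = args
    # Names are stored with a leading capital, so "emma" becomes "Emma".
    name = uppercasefirst(name)
    sex = uppercase(sex)
    sex in ("M", "F") || error("sex must be M or F")   # assumed encoding
    return (dbfile = String(dbfile), name = name, sex = sex)
end

# Parameterized query; the two `?` placeholders are bound to name and sex.
query = "SELECT year, num FROM names WHERE name = ? AND sex = ? ORDER BY year"

opts = parse_plot_args(["names.db", "emma", "f"])
println(opts)
```

In a script, `ARGS` would be passed in place of the literal vector.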
-
FindSimilarNames.jl :
- Load the data from names.db into a DataFrame.
- Determine the number of distinct boy names, girl names, and years (using the DataFrame). Let these counts be Nb (number of boy names), Ng (number of girl names), and Ny (number of years).
- Build bidirectional maps boy_name => boy_index and boy_index => boy_name (and the same for girl names and years). These maps indicate the row of the Fb matrix in which the frequencies for a specific boy name are stored.
- Initialize two matrices: Fb (Nb x Ny) and Fg (Ng x Ny). These matrices will contain the frequencies of all the baby names.
- Scan the DataFrame and add counts to matrices Fb and Fg. The name frequencies are now succinctly recorded
- Compute the total number of children born in each year. Represent it as the vector Ty (indexed using the year indexing)
- Compute the matrices Pb and Pg that contain the probability of a given name per year (the ratio of the name's frequency to the total number of children born in that year). These normalized matrices account for differences in population size over time. Note that normalization is per year (i.e., the values of Pb and Pg for the same year sum to 1).
- Further, compute matrices Qb and Qg by scaling each row vector (one name across all years) to have an L2 norm of 1. With unit-length rows, the cosine similarity in the next step reduces to a dot product.
- Compute the cosine similarity (i.e., the dot product of the unit-norm rows) of all pairs of boy and girl names. Specifically, form index pairs from Qb and Qg and compute the dot product of the vectors Qb[i] and Qg[j]. Keep track of the largest value you encounter (the maximum) and the index pair where the maximum is achieved.
- Display the names (not indexes) of the boy-girl pair with the largest cosine similarity.
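The steps above can be walked through on a tiny, made-up data set. This is a sketch under stated assumptions rather than the code in FindSimilarNames.jl: the rows are invented, the helpers `index_maps`, `unit_rows`, and `most_similar` are hypothetical names, and plain tuples stand in for the DataFrame. It also assumes every name has at least one nonzero count, so no row has zero norm.

```julia
# Made-up toy rows: (year, name, sex, num).
rows = [
    (2000, "Alex", "M", 30), (2001, "Alex", "M", 10),
    (2000, "Bob",  "M", 10), (2001, "Bob",  "M", 40),
    (2000, "Ann",  "F", 60), (2001, "Ann",  "F", 20),
    (2001, "Bea",  "F", 30),
]

# Bidirectional map: the sorted vector gives index => key,
# the Dict gives key => index.
function index_maps(items)
    sorted = sort(unique(items))
    return sorted, Dict(k => i for (i, k) in enumerate(sorted))
end

years, year_index = index_maps([r[1] for r in rows])
boys,  boy_index  = index_maps([r[2] for r in rows if r[3] == "M"])
girls, girl_index = index_maps([r[2] for r in rows if r[3] == "F"])

# Frequency matrices Fb (Nb x Ny) and Fg (Ng x Ny).
Fb = zeros(length(boys), length(years))
Fg = zeros(length(girls), length(years))
for (year, name, sex, num) in rows
    if sex == "M"
        Fb[boy_index[name], year_index[year]] += num
    else
        Fg[girl_index[name], year_index[year]] += num
    end
end

# Ty as a 1 x Ny row so it broadcasts over every name.
Ty = sum(Fb, dims = 1) .+ sum(Fg, dims = 1)
Pb = Fb ./ Ty
Pg = Fg ./ Ty

# Scale each row to unit L2 norm.
unit_rows(P) = P ./ sqrt.(sum(abs2, P, dims = 2))
Qb = unit_rows(Pb)
Qg = unit_rows(Pg)

# Scan all (boy, girl) index pairs and keep the largest dot product.
function most_similar(Qb, Qg)
    best = (score = -Inf, i = 0, j = 0)
    for i in axes(Qb, 1), j in axes(Qg, 1)
        score = sum(Qb[i, k] * Qg[j, k] for k in axes(Qb, 2))
        if score > best.score
            best = (score = score, i = i, j = j)
        end
    end
    return best
end

best = most_similar(Qb, Qg)
println(boys[best.i], " / ", girls[best.j])
```

On this toy data the search selects Alex and Ann, whose rows are proportional to each other. The search is wrapped in a function so that the running maximum is an ordinary local variable, not a global reassigned inside a top-level loop.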
-
FindSimilarNamesExtended.jl :
- Same as FindSimilarNames.jl, but instead of reporting only the boy-girl pair with the highest score, find the top 1000 such pairs.
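One possible way to keep the best k pairs without storing every score is to maintain a short list sorted by score. The sketch below is an illustration, not the code in FindSimilarNamesExtended.jl: `top_pairs` is a hypothetical name, and the two small matrices are made-up stand-ins for Qb and Qg.

```julia
# Return up to k (score, i, j) triples with the highest dot products,
# sorted from highest to lowest score.
function top_pairs(Qb, Qg, k)
    best = Tuple{Float64,Int,Int}[]
    k < 1 && return best
    for i in axes(Qb, 1), j in axes(Qg, 1)
        score = sum(Qb[i, n] * Qg[j, n] for n in axes(Qb, 2))
        # Only touch the list if there is room or the score beats the current worst.
        if length(best) < k || score > best[end][1]
            pos = searchsortedfirst(best, (score, i, j); by = first, rev = true)
            insert!(best, pos, (score, i, j))
            length(best) > k && pop!(best)
        end
    end
    return best
end

# Made-up unit-norm rows standing in for Qb and Qg.
Qb = [1.0 0.0; 0.0 1.0]
Qg = [0.6 0.8; 1.0 0.0; 0.0 1.0]

top = top_pairs(Qb, Qg, 3)
for (score, i, j) in top
    println("boy ", i, ", girl ", j, ": ", score)
end
```

Calling `top_pairs(Qb, Qg, 1000)` on the full matrices would correspond to the top-1000 search. Each insertion costs time proportional to k, which stays small next to the number of pairs scanned.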
The names.zip file used in this project is provided above.