This checkpoint is designed to test your understanding of the content from the Data Serialization Formats Cumulative Lab.
Specifically, this will cover:
- Reading serialized CSV data from a file into a Python object
- Extracting information from nested data structures
In this repository, under the file path data/salaries.csv, there is a CSV data file containing salary and demographic information. When loaded into Python as a list of dictionaries, each dictionary looks something like this:
{
    'Age': '39',
    'Education': 'E - Bachelors',
    'Occupation': 'Adm-clerical',
    'Relationship': 'Not-in-family',
    'Race': 'White',
    'Sex': 'Male',
    'Target': '<=50K'
}
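Every value in a record like the one above is a string, including the numeric-looking Age. The following is a minimal sketch of why, using csv.DictReader on an in-memory sample. The two rows and the names sample_text and sample_records are made up for illustration and are not taken from data/salaries.csv.

```python
import csv
import io

# Hypothetical two-row sample; not the real contents of data/salaries.csv
sample_text = "Age,Education\n39,E - Bachelors\n50,B - HS Diploma\n"

# DictReader uses the header row as keys and yields one dict per data row
sample_records = list(csv.DictReader(io.StringIO(sample_text)))

# CSV stores only text, so even numeric-looking fields come back as strings
first = sample_records[0]
print(first["Age"], type(first["Age"]))
```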
Most of this information is irrelevant for the current task; the one piece that you need to focus on is the Education key-value pair.
Your task is to create a frequency table where the various education levels (values associated with the Education keys) are encoded as keys, and the frequencies of those education levels are encoded as values.
In the cell below, import the module used for working with CSV data in Python:
# your code here
raise NotImplementedError
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS
The file path is data/salaries.csv.
Make sure you follow these steps with the specified variable names in order to pass all tests:
- Create a file object salary_data_file by opening the file with that path
- Instantiate a DictReader (see the csv module documentation) using that file object
- Cast the DictReader to a list and assign the result to salary_data
- Close the salary_data_file
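The steps above can be sketched on a throwaway file. This is only an illustration under stated assumptions: the temporary file, its two rows, and the names example_path, example_file, example_reader, and example_data are hypothetical and are not the variable names the tests expect.

```python
import csv
import os
import tempfile

# Hypothetical sample file, written only so this sketch is self-contained
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.csv")
with open(example_path, "w", newline="") as f:
    f.write("Age,Education\n39,E - Bachelors\n50,B - HS Diploma\n")

# 1. Create a file object by opening the path
example_file = open(example_path, newline="")
# 2. Instantiate a DictReader from the file object
example_reader = csv.DictReader(example_file)
# 3. Cast the reader to a list (this reads every row while the file is open)
example_data = list(example_reader)
# 4. Close the file object
example_file.close()

print(example_data)
```

The list has to be built before the file is closed, because a DictReader reads rows lazily from the open file object.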
# Replace None with appropriate code
# Open the file
salary_data_file = None
# Instantiate a DictReader and create salary_data
salary_data = None
# Close salary_data_file
None
# your code here
raise NotImplementedError
# Visually inspecting the first few records
for record in salary_data[:5]:
    print(record)
# Checking salary_data_file
assert salary_data_file is not None
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS
# Checking salary_data
assert type(salary_data) == list
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS
Create a list unique_education_levels that contains all unique values associated with the Education key in these records, in alphabetical order.
Hint: You'll need to loop over all records (dictionaries) in salary_data and find the value associated with the Education key for each.
Hint: The .sort list method or sorted built-in function can be used to sort strings into alphabetical order. Note that .sort modifies the list in place and returns None, whereas sorted does not modify the list in place but returns a sorted version.
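The difference between sorted and .sort can be seen on a small made-up list. This sketch also uses a set to drop duplicates, which is one possible approach and not something the hints require; the list levels and the other names are hypothetical.

```python
# Hypothetical values, deliberately out of order and with repeats
levels = ["C - Some College", "A - No HS Diploma", "C - Some College", "B - HS Diploma"]

# A set drops duplicates; sorted returns a new alphabetized list
unique_sorted = sorted(set(levels))

# .sort modifies the list in place and returns None
unique_list = list(set(levels))
result = unique_list.sort()

print(unique_sorted)
print(result)
```

Assigning the return value of .sort to a variable is a common mistake, since that variable ends up holding None rather than the sorted list.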
# Replace None with appropriate code (adding more lines as needed)
unique_education_levels = None
# your code here
raise NotImplementedError
print("Unique Education Levels:")
print(unique_education_levels)
# Checking unique_education_levels
assert type(unique_education_levels) == list
assert len(unique_education_levels) == 6
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS
Create a dictionary education_level_frequencies where the keys are the unique education levels found above, and the values are the number of times that the education level appeared in the full dataset.
For example, the key A - No HS Diploma should have the associated value 4253, since that education level appears 4,253 times in the dataset.
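One way to build this kind of frequency table is to loop over the records and increment a counter per key. The sketch below runs on three made-up records; toy_records and toy_frequencies are hypothetical names, and the counts are not from the real dataset.

```python
# Hypothetical records shaped like the dictionaries in salary_data
toy_records = [
    {"Education": "B - HS Diploma"},
    {"Education": "E - Bachelors"},
    {"Education": "B - HS Diploma"},
]

toy_frequencies = {}
for record in toy_records:
    level = record["Education"]
    # Start at 0 the first time a level is seen, then add 1
    toy_frequencies[level] = toy_frequencies.get(level, 0) + 1

print(toy_frequencies)
```

Using dict.get with a default of 0 avoids a KeyError on the first occurrence of each level and keeps the values as integers.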
# Replace None with appropriate code (add more lines as needed)
education_level_frequencies = None
# your code here
raise NotImplementedError
# Testing out your code
print("The most common education level appears", max(education_level_frequencies.values()), "times")
print("The least common education level appears", min(education_level_frequencies.values()), "times")
# Checking education_level_frequencies
# Should be a dictionary overall
assert type(education_level_frequencies) == dict
x = list(education_level_frequencies.keys())
height = list(education_level_frequencies.values())
# Should have string keys
assert type(x[0]) == str
# Should have integer values
assert type(height[0]) == int
# This plotting code should work
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 5))
ax.bar(x, height)
ax.tick_params(axis='x', labelrotation=45)
ax.set_title("Distribution of Education Levels")
ax.set_ylabel("Count");
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS
Based on the above graph, which education level is most common in this dataset?
Set the value of the variable most_common to the string value of that education level. You can just type in the answer rather than finding this with code, but make sure that the test cell passes; it checks that your answer is one of the valid answers (hopefully helping you avoid a spelling mistake).
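If you would rather find the answer with code, one possible approach is to ask max for the key with the largest value. This is a sketch on made-up counts; toy_frequencies and toy_most_common are hypothetical names and the numbers are not the real frequencies.

```python
# Hypothetical counts; not the real frequencies from the dataset
toy_frequencies = {"X": 3, "Y": 7, "Z": 5}

# max iterates over the keys; key=toy_frequencies.get ranks them by value
toy_most_common = max(toy_frequencies, key=toy_frequencies.get)
print(toy_most_common)
```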
# Replace None with appropriate code
most_common = None
# your code here
raise NotImplementedError
assert type(most_common) == str
assert most_common in [
    'A - No HS Diploma',
    'B - HS Diploma',
    'C - Some College',
    'D - Associates',
    'E - Bachelors',
    'F - Graduate Degree'
]
# PUT ALL WORK FOR THE ABOVE QUESTION ABOVE THIS CELL
# THIS UNALTERABLE CELL CONTAINS HIDDEN TESTS