Add Pandas Reshaping Data tutorial
zstumgoren committed Feb 19, 2024
1 parent bb62b96 commit 00f7945
Showing 2 changed files with 292 additions and 0 deletions.
9 changes: 9 additions & 0 deletions content/README.ipynb
@@ -33,8 +33,17 @@
"* Data Wrangling and Analysis with Pandas\n",
" * [Stacking Data](pandas_stacking.ipynb)\n",
" * [Group By](pandas_groupby.ipynb)\n",
" * [Reshaping Data](pandas_reshaping.ipynb) - Melting \"wide\" data, pivoting \"long\" data\n",
" * [Homicide Rates](pandas_basics_with_homicide_rates.ipynb) - Basic wrangling and analysis with the `pandas` library. Skills covered include creating a DataFrame from scratch, subsetting rows and columns, adding calculations to new columns and merging datasets.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba7bfc34-35e9-43c6-a859-e88971d7b3cc",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
283 changes: 283 additions & 0 deletions content/pandas_reshaping.ipynb
@@ -0,0 +1,283 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "17e403d9-5ec0-43ab-8cf1-2b567718f548",
"metadata": {},
"source": [
"# Reshaping data\n",
"\n",
"Data comes in all shapes and sizes. Often, the way our data is structured is not conducive to various types of analysis and visualization.\n",
"\n",
"Say that we have a basic data set such as the one below, showing the salary each person earned in the last few years.\n",
"\n",
"## Wide data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec354984-49af-49ac-bbd5-3f49c92cf3f1",
"metadata": {},
"outputs": [],
"source": [
"salaries = [\n",
" {'name': 'Sally', '2022': 70_000, '2023': 75_000},\n",
" {'name': 'Eloise', '2022': 60_000, '2023': 80_000},\n",
" {'name': 'Ayla', '2022': 80_000, '2023': 83_000},\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75a3f7e7-6de1-42a2-9a4d-67cd4ff3b9af",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8967afb-9a92-4b11-ad2f-ba0727b1ab16",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(salaries)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "618a9b8b-1a32-48d0-a850-e73a23badfaa",
"metadata": {},
"source": [
"The above data is often referred to as \"wide\".\n",
"\n",
"Why do we call it wide? Well, we've kept this example simple, but it's easy to imagine a real-world data set that has many more columns for prior years, going all the way back to 2010 or 2000, for example. You may end up with dozens (or even hundreds) of columns, depending on the data set.\n",
"\n",
"This structure of data can be very useful for certain types of calculations.\n",
"\n",
"Can you think of any calculations that you could perform on the data in its current form?\n",
"\n",
"One interesting analysis might involve calculating how big a raise each person received -- both in absolute dollars and as a percentage relative to their original salary. You could then determine who got the biggest pay bump in total dollars and as a percentage of their original salary.\n",
"\n",
"Let's do those calculations.\n",
"\n",
"We'll start by calculating the total dollar amount of each person's raise."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fe5d3f2-f9bf-46b8-929d-f6ea34a78eae",
"metadata": {},
"outputs": [],
"source": [
"df['raise'] = df['2023'] - df['2022']\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "e937ec15-c6ea-40e6-ac27-5d65075f7d20",
"metadata": {},
"source": [
"Next, we'll figure out the percent increase in salary for each person.\n",
"\n",
"> Remember the formula for percent change: `(New - Old) / Old`\n",
"\n",
"We've already calculated and stored `New - Old` in our `raise` column, so we just need to do the division step and multiply by 100."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f32f60e-fae8-4295-9bb4-03ac2a4b0102",
"metadata": {},
"outputs": [],
"source": [
"df['pct_raise'] = df['raise'] / df['2022'] * 100\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "465bb8c7-4366-458f-bb6a-19ae41d71a2e",
"metadata": {},
"source": [
"And we could sort the data to see who received the biggest pay bump as a percentage of salary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98667d61-d07f-4297-9593-614759a4ee63",
"metadata": {},
"outputs": [],
"source": [
"df.sort_values('pct_raise', ascending=False)"
]
},
{
"cell_type": "markdown",
"id": "4c258295-c5f7-48ba-92e8-05b2e48c48c2",
"metadata": {},
"source": [
"## Long data\n",
"\n",
"Wide data, as seen above, can be quite useful. But what if we wanted to figure out the total of salaries year-by-year?\n",
"\n",
"This type of calculation is a natural fit for the `pandas.DataFrame.groupby` method.\n",
"\n",
"Alas, our wide data structure is not optimal for the `groupby` operation.\n",
"\n",
"Instead, a \"long\" data set such as below would simplify things. \n",
"\n",
"Below, notice how the columns for each year in the \"wide\" data set have been transformed into values for a new `year` column.\n",
"\n",
"The table is literally longer (and less wide), hence the name \"long\" data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f9001c43-c4d0-4fcc-8614-c7541ea4fa12",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import HTML, display\n",
"\n",
"html = [\n",
"'<table><tr><td>name</td><td>year</td><td>salary</td></tr>'\n",
"]\n",
"for row in salaries:\n",
" for year in ['2022', '2023']:\n",
" tr = f\"<tr><td>{row['name']}</td><td>{year}</td><td>{row[year]}</td></tr>\"\n",
" html.append(tr)\n",
"html.append('</table>')\n",
"display(HTML(''.join(html)))"
]
},
{
"cell_type": "markdown",
"id": "bfb8ce72-85ee-4864-8d8a-9a4541f6406f",
"metadata": {},
"source": [
"### Melting data\n",
"\n",
"To reshape data from wide to long, we can use a DataFrame's `melt` method.\n",
"\n",
"It can be a bit tricky to use, so here's an annotated snippet."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "015fc724-ef5a-40e4-b923-bd6f7d059b55",
"metadata": {},
"outputs": [],
"source": [
"long_data = df.melt(\n",
" id_vars=['name'], # Column(s) to keep as identifier variables\n",
" var_name='year', # Name of the new column that will hold the old column headers\n",
" value_vars=['2022', '2023'] # The columns to unpivot; their names become the values in the new \"year\" column\n",
").rename({'value': 'salary'}, axis=1) # Rename the resulting \"value\" column for clarity\n",
"\n",
"long_data"
]
},
{
"cell_type": "markdown",
"id": "50a9b4ee-5f11-4482-b369-0a82819eed8a",
"metadata": {},
"source": [
"Ok, now we're ready to group our data by year to figure out which year had the highest total salary.\n",
"\n",
"> Note, to improve readability, we're using parentheses to write a multi-line code statement. Python effectively treats this as one big line, without having to use line-break or escape characters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec7d0783-b014-4fc6-9d22-163363cc5a71",
"metadata": {},
"outputs": [],
"source": [
"(\n",
" long_data\n",
" .groupby('year')\n",
" .salary # Select the salary column\n",
" .sum() # Sum the salaries for each year\n",
" .reset_index() # Turn the resulting Series back into a DataFrame\n",
" .sort_values('salary', ascending=False) # Sort from highest to lowest total\n",
")"
]
},
{
"cell_type": "markdown",
"id": "47cd8e97-ac92-4e23-a734-f5d4f8ebc53d",
"metadata": {},
"source": [
"## Pivoting data\n",
"\n",
"It's also, of course, possible to go from long to wide data. \n",
"\n",
"Let's say that our original data was in \"long\" format.\n",
"\n",
"In order to calculate each person's raise between 2022 and 2023, we would want our data to be in wide format, similar to how we started this exercise.\n",
"\n",
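"In this case, the `pandas.DataFrame.pivot_table` method is our friend. By default it averages the values for each name/year pair, which leaves the salaries numerically unchanged here because each pair appears only once."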
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cf49c23b-2410-4507-8297-c65e40811a9f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"long_data.pivot_table(\n",
" index='name',\n",
" columns='year',\n",
" values='salary'\n",
").reset_index()\n"
]
},
{
"cell_type": "markdown",
"id": "54b1e109-b743-40bc-9f26-fecf423f6d8b",
"metadata": {},
"source": [
"Pivots can be tricky to get right, and there are plenty of resources online that help explain the concept. \n",
"\n",
"Here's [one example](https://hausetutorials.netlify.app/posts/2020-05-17-reshape-python-pandas-dataframe-from-long-to-wide-with-pivottable/#long-to-wide-with-pivot_table) that includes some helpful visuals and a variety of related techniques such as aggregating column and row values when you pivot."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
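
For readers who want to try the notebook's workflow outside Jupyter, here is a minimal standalone sketch of the wide-to-long-to-wide round trip. It reuses the sample salaries from the notebook. The variable names `wide`, `totals` and `wide_again` are illustrative and do not appear in the notebook, and two details differ from it by assumption: `melt` names the value column directly with `value_name` instead of a follow-up `rename`, and the reshape back to wide uses `DataFrame.pivot` rather than `pivot_table`, which is sufficient when each name/year pair occurs exactly once.

```python
import pandas as pd

# Same sample data as the notebook's first code cell.
salaries = [
    {'name': 'Sally', '2022': 70_000, '2023': 75_000},
    {'name': 'Eloise', '2022': 60_000, '2023': 80_000},
    {'name': 'Ayla', '2022': 80_000, '2023': 83_000},
]
wide = pd.DataFrame(salaries)

# Wide -> long: one row per (name, year) pair.
long_data = wide.melt(
    id_vars=['name'],             # column(s) kept as identifiers
    value_vars=['2022', '2023'],  # columns to unpivot
    var_name='year',              # new column holding the old headers
    value_name='salary',          # new column holding the old cell values
)

# Total salary per year, which is easy once the data is long.
totals = long_data.groupby('year')['salary'].sum()
print(totals)  # 2022 -> 210000, 2023 -> 238000

# Long -> wide: one row per name, one column per year.
wide_again = (
    long_data
    .pivot(index='name', columns='year', values='salary')
    .reset_index()
)
wide_again.columns.name = None  # drop the leftover "year" label on the columns

# The wide shape makes per-person comparisons across years simple again.
wide_again['raise'] = wide_again['2023'] - wide_again['2022']
wide_again['pct_raise'] = wide_again['raise'] / wide_again['2022'] * 100
print(wide_again.sort_values('pct_raise', ascending=False))
```

`pivot` raises an error if a name/year pair is duplicated, whereas `pivot_table` would aggregate the duplicates, so the choice between them depends on whether silent aggregation is acceptable for the data at hand.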
