Make working with pandas data and AWS DynamoDB easy.
This package aims a making the transfer of data between pandas dataframes and DynamoDB as simple as possible. To meet this goal, the package offers two key features:
- Automatic conversion of pandas data types to DynamoDB supported data types.
- A simple, high level interface to put data from a dataframe into a DynamoDB table and get all or selected items from a table into a dataframe.
The project's documentation is available at https://dynamo-pandas.readthedocs.io/.
python>=3.8
pandas>=1.2
boto3
python -m pip install dynamo-pandas
This will install the package and its dependencies except for boto3
which is not installed by default to avoid unnecessary installation when building Lambda layers.
To include boto3
as part of the installation, add the boto3
"extra" this way:
python -m pip install dynamo-pandas[boto3]
Consider the pandas DataFrame below.
>>> print(players_df)
player_id last_play play_time rating bonus_points
0 player_one 2021-01-18 22:47:23 2 days 17:41:55 4.3 3
1 player_two 2021-01-19 19:07:54 0 days 22:07:34 3.8 1
2 player_three 2021-01-21 10:22:43 1 days 14:01:19 2.5 4
3 player_four 2021-01-22 13:51:12 0 days 03:45:49 4.8 <NA>
The columns of the dataframe use different data types, some of which are not natively supported by DynamoDB, like numpy.datetime64, timedelta64 and pandas' nullable integers.
>>> players_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 player_id 4 non-null object
1 last_play 4 non-null datetime64[ns]
2 play_time 4 non-null timedelta64[ns]
3 rating 4 non-null float64
4 bonus_points 3 non-null Int8
dtypes: Int8(1), datetime64[ns](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 264.0+ bytes
Storing the rows of this dataframe to DynamoDB requires multiple data type conversions.
>>> from dynamo_pandas import put_df, get_df, keys
The put_df
function adds or updates the rows of a dataframe into the specified table, taking care of the required type conversions (the table must be already created and the primary key column(s) be present in the dataframe).
>>> put_df(players_df, table="players")
The get_df
function retrieves the items matching the speficied key(s) from the table into a dataframe.
>>> df = get_df(table="players", keys=[{"player_id": "player_three"}, {"player_id": "player_one"}])
>>> print(df)
bonus_points player_id last_play rating play_time
0 4 player_three 2021-01-21 10:22:43 2.5 1 days 14:01:19
1 3 player_one 2021-01-18 22:47:23 4.3 2 days 17:41:55
In the case where only a partition key is used, the keys
function simplifies the generation of the keys list.
>>> df = get_df(table="players", keys=keys(player_id=["player_two", "player_four"]))
>>> print(df)
bonus_points player_id last_play rating play_time
0 1.0 player_two 2021-01-19 19:07:54 3.8 0 days 22:07:34
1 NaN player_four 2021-01-22 13:51:12 4.8 0 days 03:45:49
The data types returned by the get_df
function are basic types and no automatic type conversion is attempted.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bonus_points 1 non-null float64
1 player_id 2 non-null object
2 last_play 2 non-null object
3 rating 2 non-null float64
4 play_time 2 non-null object
dtypes: float64(2), object(3)
memory usage: 208.0+ bytes
The dtype
parameter of the get_df
function allows specifying the desired data types.
>>> df = get_df(
... table="players",
... keys=keys(player_id=["player_two", "player_four"]),
... dtype={
... "bonus_points": "Int8",
... "last_play": "datetime64[ns, UTC]",
... "play_time": "timedelta64[ns]" # See note below.
... }
... )
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bonus_points 1 non-null Int8
1 player_id 2 non-null object
2 last_play 2 non-null datetime64[ns, UTC]
3 rating 2 non-null float64
4 play_time 2 non-null timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes
Note: Due to a known bug in pandas versions < 1.5, timedelta strings cannot be converted back to Timedelta type via this parameter (ref. pandas-dev/pandas#38509). If using pandas < 1.5, use the pandas.to_timedelta function instead:
>>> df.play_time = pd.to_timedelta(df.play_time)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bonus_points 1 non-null Int8
1 player_id 2 non-null object
2 last_play 2 non-null datetime64[ns, UTC]
3 rating 2 non-null float64
4 play_time 2 non-null timedelta64[ns]
dtypes: Int8(1), datetime64[ns, UTC](1), float64(1), object(1), timedelta64[ns](1)
memory usage: 196.0+ bytes
Omitting the keys
parameter performs a scan of the table and returns all the items.
>>> df = get_df(table="players")
>>> print(df)
bonus_points player_id last_play rating play_time
0 4.0 player_three 2021-01-21 10:22:43 2.5 1 days 14:01:19
1 NaN player_four 2021-01-22 13:51:12 4.8 0 days 03:45:49
2 3.0 player_one 2021-01-18 22:47:23 4.3 2 days 17:41:55
3 1.0 player_two 2021-01-19 19:07:54 3.8 0 days 22:07:34
Released under the terms of the MIT License.