Apache Hudi examples designed to run on AWS Elastic MapReduce (EMR) via EMR Studio and/or EMR Notebooks.
Background on key concepts: if you are new to working with Hudi, it is worth reading about Hudi's timeline, file management, index, table types, query types, copy on write, and merge on read.
If you are not familiar with the core Hudi concepts or are new to Hudi, I highly recommend watching AWS re:Invent 2019: Insert, upsert, and delete data in Amazon S3 using Amazon.
The samples in this repository are designed to run on EMR via EMR Notebooks or EMR Studio. To set up your environment, follow the AWS documentation for EMR Notebooks or EMR Studio.
You can upload the .ipynb files in this repository directly to the Jupyter environments provided by EMR Notebooks or EMR Studio.
The notebooks in copy_on_write are the best place to start. They cover working with data through Hudi, specific to copy-on-write tables. The notebooks cover:
- Writing data to S3
- Reading data from S3
- Upserting data
- Incremental querying
- Point in Time querying
- Deleting Data
Both Python and Scala notebooks are available.
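As a rough orientation before opening the notebooks, the copy-on-write operations listed above mostly come down to which options are passed to the Hudi Spark datasource. The following is a minimal sketch, not the notebooks' actual code: the helper names, the table name, the S3 path, and the column names `id`, `partition`, and `ts` are all hypothetical. Only the `hoodie.*` option keys are standard Hudi datasource options.

```python
# Hypothetical names for illustration; replace with your own table and bucket.
TABLE_NAME = "example_cow_table"
BASE_PATH = "s3://example-bucket/hudi/example_cow_table"


def cow_write_options(table_name, operation="upsert", record_key="id",
                      partition_field="partition", precombine_field="ts"):
    """Build Hudi write options for a copy-on-write table.

    operation is typically "insert", "upsert", or "delete".
    The column-name defaults are assumptions about your schema.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def incremental_read_options(begin_instant, end_instant=None):
    """Build read options for an incremental query.

    With only begin_instant, the query returns records changed after that
    commit. Adding end_instant bounds the range, which is one way to express
    a point-in-time query (for example begin "000" and end at a chosen commit).
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def write_hudi(df, base_path, options, mode="append"):
    """Write a Spark DataFrame to a Hudi table at base_path."""
    df.write.format("hudi").options(**options).mode(mode).save(base_path)


def read_hudi(spark, base_path, options=None):
    """Read a Hudi table; with no options this is a snapshot read."""
    return spark.read.format("hudi").options(**(options or {})).load(base_path)
```

In a notebook session where `spark` and a DataFrame `df` already exist, an upsert could then look like `write_hudi(df, BASE_PATH, cow_write_options(TABLE_NAME, "upsert"))`, and a delete would pass `"delete"` with a DataFrame holding the records to remove. Keeping the options in small builder functions makes it easy to see that the operations differ only in a few settings.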
The notebook in merge_on_read is the best next step once you understand the copy_on_write notebook(s). The merge_on_read notebook covers
- Writing data to S3
- Upserting data
- Snapshot queries
- Read optimized queries
- Compaction
Both Python and Scala notebooks are available.
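The merge-on-read topics above can be sketched in the same way. This is an illustrative outline under stated assumptions rather than the notebooks' code: the helper names and the column names `id`, `partition`, and `ts` are hypothetical, and inline compaction is only one of several ways Hudi can run compaction. The `hoodie.*` keys are standard Hudi options.

```python
def mor_write_options(table_name, operation="upsert", record_key="id",
                      partition_field="partition", precombine_field="ts",
                      inline_compaction=False, max_delta_commits=5):
    """Build Hudi write options for a merge-on-read table.

    When inline_compaction is True, Hudi is asked to compact log files into
    base files as part of the write, after max_delta_commits delta commits.
    The column-name defaults are assumptions about your schema.
    """
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if inline_compaction:
        options["hoodie.compact.inline"] = "true"
        options["hoodie.compact.inline.max.delta.commits"] = str(max_delta_commits)
    return options


def mor_query_options(query_type="snapshot"):
    """Build read options for a merge-on-read table.

    "snapshot" merges base files with log files at read time;
    "read_optimized" reads only the compacted base files.
    """
    if query_type not in ("snapshot", "read_optimized"):
        raise ValueError(f"unsupported query type: {query_type}")
    return {"hoodie.datasource.query.type": query_type}
```

With an existing `spark` session, a read-optimized query could then be written as `spark.read.format("hudi").options(**mor_query_options("read_optimized")).load(path)`, where `path` is the table's S3 location. Comparing its results with a snapshot query before and after compaction is a simple way to see the trade-off between the two query types.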
- Hudi SQL example(s)
- Hudi time travel example(s)