- Check out the Documentation
- Check out the main contributor DirkSCGM
This proof of concept (POC) showcases a modern, polylithic data lake built on AWS services such as Glue, EMR, Step Functions, and Lambda, together with Docker, Python, PySpark, Apache Hudi, and Terraform. The data lake provides a single source of truth and makes it easy to integrate and analyze data from various sources.
The repository is structured to reflect the software development life cycle, with sections for extract, transform, and load pipelines; configuration; infrastructure; and testing. We also provide detailed instructions for optimizing JDBC ETL pipelines and troubleshooting common issues.
Our data lake design improves data accessibility and flexibility, simplifies integrating data from various sources, and supports storing and analyzing large amounts of data at scale. This can help organizations better understand their data and make more informed decisions.
Beyond the technical aspects of our data lake, this repository is also a place for collaboration, learning, and growth. We believe in the benefits of open-source technologies and are always looking to improve our skills as data engineers.
As part of our commitment to learning and growth, we welcome contributions from the community. Whether you're a seasoned data engineer or just getting started, we encourage you to take a look at our codebase and offer suggestions for improvement.
We also believe in the power of collaboration and are always looking for ways to work with others in the field. If you're interested in partnering with us or contributing to this repository, we'd love to hear from you!
Overall, this repository is a place for us to share our knowledge and expertise, as well as learn from others in the community. We're excited to see what we can accomplish together!
Thanks for checking out our repository! 🙌