Useful links

Calculating Resource Needs:

The links below all provide similar 'recipes' for estimating the resources your Spark job will need and for setting its parameters to get the most 'bang for the buck'.

https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory

https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/

https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
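As a rough illustration of the kind of recipe these posts walk through, here is a minimal sketch in plain Python. The function names (`plan_executors`, `to_spark_submit_args`) are hypothetical, and the defaults are assumptions drawn from common rules of thumb rather than values fixed by Spark: one core and 1 GB per node reserved for the OS and daemons, 5 cores per executor, one executor slot left for the YARN application master, and a memory-overhead fraction subtracted from each executor's share. Check the linked posts and your cluster manager's settings before relying on any of these numbers.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    num_executors: int
    executor_cores: int
    executor_memory_gb: int


def plan_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10,
                   reserved_cores=1, reserved_mem_gb=1):
    """Rule-of-thumb executor sizing. All defaults are assumptions."""
    # Leave some capacity on every node for the OS and cluster daemons.
    usable_cores = cores_per_node - reserved_cores
    usable_mem_gb = mem_gb_per_node - reserved_mem_gb

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Keep one executor slot free for the application master / driver.
    num_executors = nodes * executors_per_node - 1
    if num_executors < 1:
        raise ValueError("cluster too small for this recipe")

    # Split node memory between executors, then subtract the off-heap
    # overhead (approximated here as a fraction of each executor's share).
    share_gb = usable_mem_gb / executors_per_node
    heap_gb = int(share_gb * (1 - overhead_fraction))

    return ExecutorPlan(num_executors, cores_per_executor, heap_gb)


def to_spark_submit_args(plan):
    """Render the plan as spark-submit command-line flags."""
    return [
        "--num-executors", str(plan.num_executors),
        "--executor-cores", str(plan.executor_cores),
        "--executor-memory", f"{plan.executor_memory_gb}g",
    ]


if __name__ == "__main__":
    # Example cluster: 6 worker nodes, 16 cores and 64 GB each.
    plan = plan_executors(6, 16, 64, overhead_fraction=0.07)
    print(" ".join(to_spark_submit_args(plan)))
```

For the example cluster this yields 17 executors with 5 cores and 19 GB each. Treat the result as a starting point for experimentation, not a final configuration.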

Tutorials and Real-World Examples:

Calculating the spread of a security at the time of a transaction: https://databricks.com/blog/2019/10/09/democratizing-financial-time-series-analysis-with-databricks.html

Setting up a Stream of Tweets: https://www.linkedin.com/pulse/apache-spark-streaming-twitter-python-laurent-weichberger/

Analyzing hashtags on a Stream of Tweets: https://towardsdatascience.com/hands-on-big-data-streaming-apache-spark-at-scale-fd89c15fa6b0

Real-time ingestion and ETL of unstructured logs into a data warehouse: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008551/3601578643761083/latest.html

Hooking Spark up to an AWS Kinesis stream: https://aws.amazon.com/blogs/big-data/querying-amazon-kinesis-streams-directly-with-sql-and-spark-streaming/

NLP on Spark to summarize strategic reports: https://databricks.com/notebooks/esg_notebooks/01_esg_report.html

RDD API Cheat Sheet:

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

SparkSQL API Cheat Sheet:

https://intellipaat.com/mediaFiles/2019/03/PySpark-SQL-cheat-sheet.jpg