A curated list of awesome tools and libraries for specific domains
- pandas
- editing data in jupyter notebooks: https://github.com/quantopian/qgrid
- python
- plotting raster https://github.com/fmaussion/salem
- raster handling http://xarray.pydata.org/en/stable/
- multi dimensional arrays http://xarray.pydata.org/en/stable/
- spatial data including joins (works with dask) http://geopandas.org
- cleaning of addresses: https://github.com/openvenues/libpostal
- postgis
- multi dimensional
- hadoop
- http://www.geomesa.org
- https://github.com/DataSystemsLab/GeoSpark
- https://github.com/harsha2010/magellan
- https://github.com/locationtech/geowave
- https://github.com/locationtech/geotrellis
- https://github.com/Esri/spatial-framework-for-hadoop and https://github.com/Esri/gis-tools-for-hadoop, as well as their Java API https://github.com/Esri/geometry-api-java
- https://github.com/ngageoint/mrgeo
- spark on windows
- http://www.nltk.org/book/
- https://github.com/keon/awesome-nlp
- https://github.com/JohnSnowLabs/spark-nlp
- https://github.com/databricks/spark-corenlp (check license extra carefully for commercial setup)
- pyspark with https://spacy.io
- https://explosion.ai
- https://github.com/clulab/processors
- https://github.com/google/sling
- https://github.com/facebookresearch/faiss
- https://github.com/bplank/bilstm-aux
- https://github.com/facebookresearch/fastText
- https://github.com/facebookresearch/InferSent
- https://github.com/google/sentencepiece
- https://github.com/zalandoresearch/flair
- https://github.com/Microsoft/BlingFire
- https://github.com/nmslib/nmslib
- interactive t-SNE https://js.tensorflow.org
- UMAP (faster alternative to t-SNE) https://github.com/lmcinnes/umap
- fast hierarchical DBSCAN (HDBSCAN) https://github.com/scikit-learn-contrib/hdbscan
- parsing HTML
- clustering
- general operations
- monitoring https://checkmk.com/de
- certificates
- https://certbot.eff.org and https://letsencrypt.org for free and automated HTTPS/SSL certificates
- hadoop monitoring
- testing
- data quality
- packer base images
- data platform
small
- sktime and sktime-dl
- https://github.com/google/temporian
- prediction
- https://www.youtube.com/watch?v=0zpg9ODE6Ww
- https://www.youtube.com/watch?v=68ABAU_V8qI
- https://cran.r-project.org/web/packages/forecast/index.html
- https://github.com/facebook/prophet
- https://pythonawesome.com/probabilistic-time-series-modeling-in-python/
- bayesian structural https://cran.r-project.org/web/packages/bsts/index.html
- multi prediction
- https://github.com/ellisp/forecastxgb-r-package
- https://github.com/dmbee/seglearn
- https://pytorch-forecasting.readthedocs.io/en/latest/index.html
- feature extraction
- https://github.com/blue-yonder/tsfresh
- https://robjhyndman.com/hyndsight/tspackages/
- https://www.featuretools.com (also worth a look in general)
- e2e ML pipeline generation: https://evalml.alteryx.com, https://evalml.alteryx.com/en/stable/demos/fraud.html
- anomalies
hadoop
- handling & prediction
- https://github.com/sryza/spark-timeseries
- https://spark-summit.org/2016/events/huohua-a-distributed-time-series-analysis-framework-for-spark/
- https://github.com/twosigma/flint
- https://databricks.gitbooks.io/databricks-spark-reference-applications/content/timeseries/index.html
- correlation https://github.com/Sotera/correlation-approximation
- https://github.com/sryza/spark-timeseries
- anomaly detection
- examples
- storage
model metadata
- https://github.com/IDSIA/sacred
- http://studio.ml (also hyperparameter optimization)
- https://github.com/mitdbg/modeldb
- https://dataversioncontrol.com
- https://www.comet.ml
- https://aetros.com
- https://github.com/ModelChimp/modelchimp
- https://flyte.org/
model building
- feature engineering
- small
- http://scikit-learn.org/stable/
- production processes https://github.com/quantumblacklabs/kedro
- R
- python
- hadoop
- https://spark.apache.org/mllib/
- https://github.com/amidst/toolbox
- fast aggregation in spark https://github.com/tdunning/t-digest
- ensembling
- specific great models
- gradient boosted trees
- xgboost
- lightgbm
- catboost https://github.com/catboost/catboost
- https://github.com/fabsig/GPBoost
- gradient boosted trees
- visualization of results
model serving
- own API wrapper around original model code
- http://clipper.ai
- https://mlflow.org
- https://www.acumos.org
- https://polyaxon.com
- http://vespa.ai
- https://github.com/RedisLabsModules/redis-ml
- https://riseml.com
- https://github.com/Hydrospheredata/mist
- https://github.com/Azure/ai-toolkit-iot-edge
- https://www.dominodatalab.com and various other cloud data science work benches
- https://datmo.com
- https://aws.amazon.com/de/sagemaker/
model serialization
hyperparameter tuning
- https://github.com/kubeflow/katib
- https://sigopt.com
- https://github.com/scikit-optimize/scikit-optimize
- https://github.com/Yelp/MOE
e2e
- https://www.seldon.io
- http://pipeline.ai
- https://datmo.com
- https://docs.ray.io/en/master/serve/index.html
ml solutions
bridging python / R and big data
- http://blog.madhukaraphatak.com/pipe-in-spark/
- sparklyr
- https://github.com/apple/turicreate out-of-core models on medium-sized data
graph processing
- hadoop
- non hadoop
- https://neo4j.com (single master, multi slave cluster possible)
- tutorial
- telco hadoop geospatial
- https://www.youtube.com/watch?v=VtvP54Xo3Ek&feature=youtu.be
- streaming and declarative models: https://www.youtube.com/watch?v=Do7C4UJyWCM
- ml
- ml pipelines https://www.youtube.com/watch?v=cpR6Vkp7ImA
- shingles and pipelines https://www.youtube.com/watch?v=qkrh35IF2SU, https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science
- gradient boosting comparison: https://www.youtube.com/watch?v=5CWwwtEM2TA
- streaming
- kafka https://www.youtube.com/watch?v=MNPI925PFD0
- spark streaming in depth https://www.youtube.com/watch?v=hyZU_bw1-ow
- python https://github.com/mrocklin/streamz
- SQL hadoop & BI https://www.youtube.com/watch?v=v40HWIlsE_w&t=0s&list=PLSAiKuajRe2kGgi-GhMVE8IXzr5Pb3b5y&index=13
- BMW self driving car & spark https://www.youtube.com/watch?v=ub2ufKrrAIs&t=0s&list=PLSAiKuajRe2kGgi-GhMVE8IXzr5Pb3b5y&index=27
- python
- https://python-graph-gallery.com for inspiration
- seaborn
- R
- ggplot2 + great themes
- javascript
bi & dashboarding
- https://metabase.com
- https://looker.com
- python
- https://github.com/stitchfix/pyxley
notebooks
- jupyter
- zeppelin
type safety
- stan
- pymc3
- https://github.com/uber/pyro
- https://www.cockroachlabs.com (spanner)
- https://www.snowflake.net/de/
- https://snowplowanalytics.com/products/snowplow-open-source/
- hbase-spark
- via phoenix spark
- https://github.com/hortonworks/shc-release/tree/HDP-2.6.3.0-235-tag
- postgres on GPUs http://www.brytlyt.com
- improved cassandra scylla http://www.scylladb.com
- https://www.mapd.com/platform/
- https://clickhouse.yandex
- https://github.com/biokoda/actordb
- https://clemenswinter.com/2018/07/09/how-to-analyze-billions-of-records-per-second-on-a-single-desktop-pc/amp/
- streaming graphs https://github.com/NationalSecurityAgency/lemongraph
time series DBs
big real time analytics and data integration
- https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
- https://www.quora.com/Should-I-use-Gobblin-or-Spark-Streaming-to-injest-data-from-Kafka-to-HDFS/answer/Prithiviraj-Damodaran
- typesafe configuration
- https://cir.is/docs/validation
- https://github.com/pureconfig/pureconfig
- https://github.com/actionml/universal-recommender
- https://github.com/DataSystemsLab/recdb-postgresql
- apache atlas
- cloudera navigator
- https://www.waterlinedata.com (hadoop only)
- https://alation.com (all)
- https://www.privitar.com
- data mining