Explore the test data and brainstorm RTDIP component ideas #11

Closed
luccalb opened this issue Oct 23, 2024 · 2 comments
luccalb commented Oct 23, 2024

Explore the test data provided by Shell and brainstorm ideas for RTDIP components that improve data quality or identify trends and anomalies.

@luccalb luccalb converted this from a draft issue Oct 23, 2024
@luccalb luccalb changed the title from "Explore the test data and brainstorm RDTIP component ideas" to "Explore the test data and brainstorm RTDIP component ideas" Oct 23, 2024

Timm638 commented Oct 29, 2024

Some brainstorming, with scikit-learn as inspiration:

  • Dimensionality reduction (remove redundant data, e.g. by finding which sources correlate strongly with each other)
  • Normalization of data (z-score standardization, min-max scaling, ...)
  • Other preprocessing methods: mapping scalar data into bins, one-hot encoding
  • Trend identification: linear regression, ARIMA
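
To make a few of these ideas concrete, here is a minimal, dependency-free sketch of z-score standardization, min-max scaling, binning, one-hot encoding and a least-squares trend line. All function names are made up for illustration and are not part of RTDIP; in a real component one would more likely reach for scikit-learn's `StandardScaler`, `MinMaxScaler`, `KBinsDiscretizer` and `OneHotEncoder`. ARIMA is not covered here.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def z_score(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant series: nothing to scale
    return [(v - mu) / sigma for v in values]


def min_max(values, lo=0.0, hi=1.0):
    """Rescale values linearly into the interval [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]
    scale = (hi - lo) / (v_max - v_min)
    return [lo + (v - v_min) * scale for v in values]


def to_bins(values, edges):
    """Map each value to a bin index; `edges` must be sorted ascending."""
    return [bisect_right(edges, v) for v in values]


def one_hot(labels):
    """Return the sorted categories and one 0/1 row per input label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


def linear_trend(times, values):
    """Ordinary least squares fit; returns (slope, intercept)."""
    t_mean, v_mean = mean(times), mean(values)
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean
```

A positive slope from `linear_trend` would indicate an upward trend over the chosen window; how large a slope counts as significant is left open here.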

Other notes:

  • When we implement these functions, in which format should we work with the data? Convert everything into a pandas DataFrame and then back to the original format?

@chris-1187 chris-1187 self-assigned this Oct 31, 2024
@chris-1187

RTDIP component ideas:

Persistent agent with datastore (probably pushed back, as Shell does not see a need for a DB right now):

  • The idea was to instantiate an InfluxDB during pipeline creation and store monitoring and other data there. InfluxDB is a minimal time-series database with a Python API and Grafana support (for optional visualisations).
  • Covered by issue "Store monitoring outputs in a standardized format" #26.
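
As a rough illustration of what storing monitoring data in InfluxDB could look like, the sketch below formats one monitoring record as an InfluxDB line protocol string using only the standard library. The measurement, tag and field names are hypothetical, and actually sending the line to a database (for example through an InfluxDB client library) is deliberately left out.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value in line protocol syntax."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)  # NaN and infinity are not handled in this sketch
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable output
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical monitoring record produced by a data quality component.
line = to_line_protocol(
    "monitoring",
    {"pipeline": "p1", "component": "range check"},
    {"flagged_rows": 3, "passed": False, "ratio": 0.25},
    1700000000000000000,
)
```

Tags are indexed and suit low-cardinality metadata such as the pipeline or component name, while the measured quantities go into fields.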

Missing value imputation with imputeFD or MICE through Apache SystemDS (formerly SystemML) integration:

  • Apache SystemDS is an ML system for the end-to-end data science lifecycle, including data cleansing
  • It runs on top of Apache Spark and can be integrated through its Python bindings
  • Optimized for both big data and single-node operations
  • Prerequisite: missing values are flagged according to a defined pattern
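
The prerequisite in the last bullet can be sketched as a two-step flow: flag values that match a missing-value pattern, then fill only the flagged positions. The sketch below does not use SystemDS, and it swaps in plain linear interpolation as a stand-in for imputeFD or MICE. The sentinel value `-9999.0` and both function names are assumptions made for illustration, and the interpolation assumes evenly spaced samples.

```python
import math


def flag_missing(values, sentinels=(None, -9999.0)):
    """Return a parallel list of booleans, True where a value is missing."""
    flags = []
    for v in values:
        is_nan = isinstance(v, float) and math.isnan(v)
        flags.append(is_nan or v in sentinels)
    return flags


def interpolate_flagged(values, flags):
    """Fill flagged entries by linear interpolation over the list index.

    Gaps at the start or end copy the nearest valid value instead.
    """
    valid = [i for i, missing in enumerate(flags) if not missing]
    if not valid:
        raise ValueError("no valid values to impute from")
    filled = list(values)
    for i, missing in enumerate(flags):
        if not missing:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


raw = [1.0, None, 3.0, -9999.0, -9999.0, 6.0, float("nan")]
flags = flag_missing(raw)
clean = interpolate_flagged(raw, flags)
```

Keeping the flags separate from the filled values means a downstream component can still tell measured values from imputed ones.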

Optional expansion: general SystemDS integration and the ability to run any ML algorithm through its DML (Declarative Machine Learning) scripting language

@luccalb luccalb closed this as completed Nov 6, 2024
@github-project-automation github-project-automation bot moved this from Awaiting Review to Feature Archive in amos2024ws01-feature-board Nov 6, 2024