Increase parsing speed #546

Open
FlorianK13 opened this issue Jul 4, 2024 · 3 comments
Labels
🚀 feature New feature or request


@FlorianK13

This task contains several steps:

  1. Research approaches that might increase parsing speed. Parsing is currently done by the pandas.read_xml method here. Possible alternatives:
  • polars, duckdb, or pyspark might offer XML parsers that are faster
  • plain XML parsing in Python (without pandas)
  • ...
  2. Writing to the SQLite database is currently done by the pandas DataFrame.to_sql method here. There might be faster methods, depending on the outcome of step 1.
  3. Build a benchmark in a separate repository. Use a benchmark XML file from the Marktstammdatenregister and test different parsing implementations against it.
  4. Decide on the best method and implement it in open-mastr.
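
To make the "plain XML parsing without pandas" option concrete, here is a minimal sketch that streams records with the standard library's `xml.etree.ElementTree.iterparse` and bulk-inserts them with `sqlite3`'s `executemany`. It is not the open-mastr implementation: the sample XML, the tag names (`Units`, `Unit`, `Id`, `Power`), the table name, and the helper functions `iter_records` and `write_records` are all assumptions made up for illustration, and real Marktstammdatenregister files will look different.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data; real Marktstammdatenregister exports differ.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<Units>
  <Unit><Id>A1</Id><Power>10.5</Power></Unit>
  <Unit><Id>B2</Id><Power>7.0</Power></Unit>
</Units>
"""


def iter_records(source, record_tag):
    """Yield one dict per <record_tag> element while streaming through the file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the record's children to keep memory use low


def write_records(conn, table, columns, records):
    """Create the table if needed and insert all records in one transaction."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_list})')
    rows = [tuple(rec.get(c) for c in columns) for rec in records]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = write_records(
    conn, "units", ["Id", "Power"], iter_records(io.BytesIO(SAMPLE_XML), "Unit")
)
print(inserted, conn.execute('SELECT "Id", "Power" FROM units').fetchall())
```

Whether this is actually faster than pandas.read_xml followed by to_sql is exactly what the benchmark in step 3 would need to show. Note that every value is stored as text in this sketch; type conversion would have to be added separately.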

AlexandraImbrisca commented Sep 29, 2024

Hi! I started working on this task and decided to change the steps slightly:

  1. Construct the benchmark
    • Use the Marktstammdatenregister to construct a few datasets of various sizes - ✅ (link)
    • Create a script to automate the calculation and comparison of the parsing speed between various optimisations - ✅ (link)
  2. Explore faster methods of parsing the XML
    • Research the options and implement the changes
    • Run the benchmark and analyse the results
  3. Explore faster methods of writing to the sqlite database
    • Research the options and implement the changes
    • Run the benchmark and analyse the results
  4. Decide on the best method and add it to this repository
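
A benchmark script along the lines of step 1 could look roughly like the sketch below. It is not the linked script: the synthetic dataset generator `make_dataset`, the two candidate parsers, and the `benchmark` helper are hypothetical stand-ins, and the XML layout is invented. It only illustrates the idea of timing several implementations over datasets of different sizes with `time.perf_counter` and keeping the best of a few repeats.

```python
import io
import time
import xml.etree.ElementTree as ET


def make_dataset(n_rows):
    """Build a synthetic XML document (hypothetical layout) with n_rows records."""
    rows = "".join(
        f"<Unit><Id>{i}</Id><Power>{i * 1.5}</Power></Unit>" for i in range(n_rows)
    )
    return f"<Units>{rows}</Units>".encode("utf-8")


def parse_full_tree(data):
    """Candidate 1: load the whole tree into memory, then walk it."""
    root = ET.fromstring(data)
    return [{c.tag: c.text for c in unit} for unit in root.iter("Unit")]


def parse_streaming(data):
    """Candidate 2: stream through the document and clear finished records."""
    records = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "Unit":
            records.append({c.tag: c.text for c in elem})
            elem.clear()
    return records


def benchmark(candidates, datasets, repeats=3):
    """Return the best wall-clock time in seconds per (dataset, candidate) pair."""
    results = {}
    for ds_name, data in datasets.items():
        for fn_name, fn in candidates.items():
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                fn(data)
                timings.append(time.perf_counter() - start)
            results[(ds_name, fn_name)] = min(timings)
    return results


datasets = {"small": make_dataset(100), "medium": make_dataset(5_000)}
candidates = {"full_tree": parse_full_tree, "streaming": parse_streaming}
results = benchmark(candidates, datasets)
for (ds_name, fn_name), seconds in sorted(results.items()):
    print(f"{ds_name:>6} {fn_name:<10} {seconds * 1000:.2f} ms")
```

Taking the minimum over several repeats reduces noise from other processes. For the real benchmark, the synthetic generator would be replaced by the Marktstammdatenregister datasets, and the candidates by the parsing and writing variants under evaluation.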

@FlorianK13

I attended the DACH Energy Informatics Conference and took away two points:

  • Many researchers use open-mastr
  • The feature request I heard most often was whether we can reduce the time it takes to download and parse the data

I think people will be really happy if this issue is successful 😃


nesnoj commented Oct 15, 2024

> I attended the DACH Energy Informatics Conference and took away two points:
>
>   • Many researchers use open-mastr
>   • The feature request I heard most often was whether we can reduce the time it takes to download and parse the data
>
> I think people will be really happy if this issue is successful 😃

Sounds great. I don't think we can do much about the download speed, but I'm really looking forward to the parsing enhancement @AlexandraImbrisca 😄
