Performance regression in Malloy, slow "Runtime" #1983

Open
mtoy-googly-moogly opened this issue Oct 29, 2024 · 4 comments

@mtoy-googly-moogly
Collaborator

mtoy-googly-moogly commented Oct 29, 2024

This is from https://malloy-community.slack.com/archives/C025JAK8G0N/p1730177561599079?thread_ts=1730177000.040159&cid=C025JAK8G0N, but I've copied the relevant details here.

This query

run: num -> {
    group_by: sub.sic4.SIC1
    aggregate: unique_tags is count(tag)
    where: sub.form = "10-K" --sub.filed = @2018
    nest: by_cos is {
        group_by: sub.name
        limit: 5
        aggregate: unique_tags is count(tag)
        --# bar_chart
        nest: by_year is {
            group_by: year_accepted is sub.filed.year
            aggregate: unique_tags is count(tag)
            limit: 10
        }
    }
}

runs in 29 seconds in the Malloy VS Code extension version ending in 074, and in 200 seconds in current Malloy.

Here's the generated SQL ...

fast-version.txt
slow-version.txt

@mtoy-googly-moogly
Collaborator Author

mtoy-googly-moogly commented Oct 29, 2024

It looks like the "fast" and "slow" SQL run at the same speed if we put them in run: duckdb.sql().

Both take more than 200 seconds of "runtime".

Running against an older version of Malloy, both the "slow" and "fast" have 30 seconds of runtime, so the slowness is all in result processing, not in query execution.
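One way to check a split like this outside of Malloy is to time the query and the result handling separately. The following is only a rough sketch: it uses Python's built-in sqlite3 as a stand-in for DuckDB, and the table, columns, and the `time_phases` helper are made up for illustration. It is not how Malloy computes its "Runtime" figure.

```python
import sqlite3
import time


def time_phases(conn, sql, process_row):
    """Return (execute_seconds, processing_seconds, row_count) for one query.

    Hypothetical helper: the first interval covers running the SQL and
    fetching all rows, the second covers whatever is done with each row
    afterwards (a stand-in for result processing / rendering).
    """
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()  # forces the query to complete
    executed = time.perf_counter()
    for row in rows:
        process_row(row)
    done = time.perf_counter()
    return executed - start, done - executed, len(rows)


# Tiny in-memory table, invented for the example (not the real "num" data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE num (tag TEXT, form TEXT)")
conn.executemany(
    "INSERT INTO num VALUES (?, ?)",
    [(f"tag{i % 7}", "10-K") for i in range(1000)],
)

exec_s, proc_s, row_count = time_phases(
    conn,
    "SELECT tag, COUNT(*) FROM num WHERE form = '10-K' GROUP BY tag",
    lambda row: str(row),  # placeholder for real result processing
)
print(f"execute: {exec_s:.6f}s, process: {proc_s:.6f}s, rows: {row_count}")
```

If the first number is about the same for both versions of the SQL and only the second one grows, that would point at result processing rather than query execution.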

@mtoy-googly-moogly changed the title from "Performance regression in Malloy generated SQL" to "Performance regression in Malloy, slow "Runtime"" on Oct 29, 2024
@whscullin
Collaborator

This could very well be a DuckDB regression. I'd need a local repro to be able to verify that (or any other hypothesis, I guess).

The "version" numbers are epoch seconds, so it'd be helpful to have more than the last 3 digits to reliably track down any specific version. The VS Code site doesn't seem to give me an easy way to track down old releases past the last few.

6 months ago we were on either DuckDB 0.9.x or 0.10.x, and we're currently on 1.0.0, so a lot has happened on that front. I could do the hopeful thing and try updating to the latest DuckDB to see if it was something fixed post 1.0.0.

@mrtimo

mrtimo commented Nov 2, 2024

Copied from Slack: On the latest pre-release I’m getting 166 seconds now, which is an improvement. On the Malloy VS Code extension version v0.2.1712075074 I’m getting 30 seconds. On the most recent production release I’m getting 187 seconds.
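Since the version numbers were described earlier in the thread as epoch seconds, the last field of v0.2.1712075074 can be turned into a date to help locate the matching release. A small sketch, assuming the trailing dot-separated field really is a Unix timestamp in seconds (the function name is made up for this example):

```python
from datetime import datetime, timezone


def build_time_from_version(version: str) -> datetime:
    """Read the last dot-separated field of a version string as epoch seconds.

    Assumption: the suffix is a Unix timestamp in seconds, as described
    in the thread. Returns a timezone-aware UTC datetime.
    """
    suffix = version.rsplit(".", 1)[-1]
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


built = build_time_from_version("v0.2.1712075074")
print(built.isoformat())  # 2024-04-02T16:24:34+00:00
```

Under that assumption the fast build dates from early April 2024, roughly in line with the "6 months ago" estimate above.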

@mrtimo

mrtimo commented Nov 5, 2024

I've been testing today and trying to create a test repo. My current hypothesis is that it has something to do with the way the .parquet files were created. It may have been one of two ways: 1) I went from .tsv to dataframe to parquet (using the pandas save-as-parquet function), or 2) I made the parquet files with an older version of DuckDB. I'm thinking issues caused by option 1 are why the SQL runs in the same time, but the presentation layer slows down when it is reading from pandas-created parquet files. Will test on fresh parquet files soon.
