Improve report performance with high-cardinality import joins #4848
base: master
Conversation
Nope, all good.
…thnames

Ref: https://3.basecamp.com/5308029/buckets/39750953/card_tables/cards/8052057081

JOINs in ClickHouse are slow. In one degenerate case I found a user had over 20 million unique paths in an import, which resulted in extremely slow JOINs. This introduces a sort-of hacky solution to it by limiting the amount of data analyzed.

Query timing without this change:

```
9 rows in set. Elapsed: 11.383 sec. Processed 49.16 million rows, 5.75 GB (4.32 million rows/s., 505.29 MB/s.)
Peak memory usage: 14.75 GiB.
```

After:

```
9 rows in set. Elapsed: 0.572 sec. Processed 49.18 million rows, 5.75 GB (86.03 million rows/s., 10.06 GB/s.)
Peak memory usage: 9.01 GiB.
```
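To illustrate the idea, here is a minimal Ecto-style sketch of the shape of that change — not the PR's actual code: the module, field names, choice of metric, and the `100` multiplier are all assumptions. Each side is ordered and capped *before* the JOIN, so ClickHouse never has to join millions of distinct pathnames to produce a small top-N report.

```elixir
defmodule CappedJoinSketch do
  @moduledoc false
  import Ecto.Query

  # Assumed multiplier: large enough that the true top-N groups are very likely
  # to survive the cap on both sides of the JOIN.
  @multiplier 100

  def top_pages(events_q, imported_q, page_limit) do
    capped_events =
      events_q
      |> order_by([e], desc: e.pageviews)
      |> limit(^(page_limit * @multiplier))

    capped_imported =
      imported_q
      |> order_by([i], desc: i.pageviews)
      |> limit(^(page_limit * @multiplier))

    from e in subquery(capped_events),
      full_join: i in subquery(capped_imported),
      on: e.pathname == i.page,
      select: %{
        page: coalesce(e.pathname, i.page),
        # `selected_as/2` aliases the summed metric so we can ORDER BY it --
        # presumably the role of the `selected_as` aliases set up in the next commit.
        pageviews: selected_as(coalesce(e.pageviews, 0) + coalesce(i.pageviews, 0), :pageviews)
      },
      order_by: [desc: selected_as(:pageviews)],
      limit: ^page_limit
  end
end
```

With this shape, the final ORDER BY / LIMIT only ever sees at most `2 * @multiplier * page_limit` joined rows instead of the full path cardinality.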
… special_metrics.ex
This sets up selected_as aliases which will be used in a subsequent commit
dc8470b to c2be4e5
c2be4e5 to 2afbef2
```elixir
# This function speeds up cases where we group by a very high-cardinality column, by limiting
# the JOINed main and imported results.
#
# For example, when grouping by pathname and every path is dynamically generated, this can reduce
# time spent JOINing billions of rows in ClickHouse by an order of magnitude.
#
# Mathematically, we can't limit main and imported results to LIMIT N, since the true top N values
# can arise from outside the top N items of either subquery. Instead we multiply the limit by a
# reasonably large constant to ensure they _should_ be captured.
#
# This isn't always correct (there are degenerate cases), but it should be a reasonable trade-off
# that doesn't affect user experience in any conceivable real-world case.
#
# Note we only apply this optimization in cases where we can deterministically ORDER BY. This covers
# opening the Plausible dashboard but not more complicated use-cases.
defp paginate_optimization(q, query) do
```
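Since the diff cuts off at the function head, here is one way the body could look — a sketch only: the module name, `@pagination_multiplier`, the `query` struct fields, and `deterministic_order_by?/1` are assumed stand-ins, not the PR's actual implementation.

```elixir
defmodule PaginateOptimizationSketch do
  @moduledoc false
  import Ecto.Query

  # Assumed constant: the "reasonably large" multiplier from the comment above.
  @pagination_multiplier 100

  def paginate_optimization(q, query) do
    case query do
      %{pagination: %{limit: page_limit}} when is_integer(page_limit) ->
        if deterministic_order_by?(query) do
          # Cap the subquery at limit * C so the true top-N groups should still
          # be present on both sides of the JOIN.
          limit(q, ^(page_limit * @pagination_multiplier))
        else
          # Without a deterministic ORDER BY we cannot safely cut rows off.
          q
        end

      _ ->
        q
    end
  end

  # Stand-in for whatever check the real implementation performs to ensure the
  # subquery ordering is fully determined by the requested metrics/dimensions.
  defp deterministic_order_by?(%{order_by: order_by}), do: order_by not in [nil, []]
  defp deterministic_order_by?(_), do: false
end
```

Returning `q` unchanged in the non-deterministic case keeps the optimization opt-in, matching the note above about only applying it when the ORDER BY is deterministic.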
As I tried to wrap my mind around this, I came up with the following function doc string.
```elixir
@doc """
ClickHouse queries with a GROUP BY <column> can take several seconds when the column has a cardinality of 1M or more.
Performing a JOIN on two such queries further increases the expense, into 10+ seconds.
It's not impossible to reach such cardinality: when all site pathnames are unique, it is reached at 1M pageviews.

If we're interested only in the top X results of an ordered query, where X is small,
for example the top 9 or top 100 in dashboard reports,
we can still maintain reasonable accuracy by ordering the subqueries and limiting them to the top C * X results from each.

C must be large enough that the top X groups in subquery A appear in subquery B, if they exist in table B at all.
If they exist in table B but are cut off, we hope C is large enough that
the result for the last group of subquery B (at index: C * X - 1),
when added to any of the top X groups in subquery A,
makes no difference in the ordering of the top X groups.

This function detects these expensive situations and applies a limit in the subqueries.
"""
defp paginate_optimization(q, query) do
```
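To make the docstring's arithmetic concrete (the numbers are illustrative, not taken from the PR): for a dashboard report that wants the top $X = 9$ pages with a multiplier $C = 100$, each subquery is ordered and capped at

$$
C \cdot X = 100 \cdot 9 = 900 \ \text{rows}.
$$

The merged top 9 is exact whenever every true top-9 group appears among those 900 rows on both sides; the lossy test discussed further down shows what happens when a group is cut off.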
It occurred to me that we could verify whether C is large enough in the real world by having the optimized query run concurrently with the unoptimized one and comparing the top items, OR by verifying that the item at index C * X - 1 in either subquery is insignificant compared to the top X results.
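A rough sketch of the first idea — the `run_report/3` helper, the option name, and the timeout are hypothetical, purely to show the shape of such a check:

```elixir
defmodule CapVerificationSketch do
  @moduledoc false
  require Logger

  # Run the capped and uncapped variants side by side and log when their top
  # results diverge, to learn whether C is large enough in the real world.
  def verified_report(site, query, limit) do
    uncapped_task =
      Task.async(fn -> run_report(site, query, paginate_optimization: false) end)

    capped = run_report(site, query, paginate_optimization: true)
    uncapped = Task.await(uncapped_task, :timer.minutes(1))

    if Enum.take(capped, limit) != Enum.take(uncapped, limit) do
      Logger.warning(
        "paginate_optimization produced divergent top #{limit} results for site #{site.id}"
      )
    end

    capped
  end

  # Hypothetical helper standing in for whatever runs the report query.
  defp run_report(_site, _query, _opts), do: []
end
```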
test "page breakdown with paginate_optimization", %{ | ||
conn: conn, | ||
site: site | ||
} do | ||
site_import = insert(:site_import, site: site) | ||
|
||
populate_stats( | ||
site, | ||
site_import.id, | ||
[ | ||
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]), | ||
build(:imported_pages, page: "/99", date: ~D[2021-01-01]) | ||
] ++ | ||
Enum.map(1..200, fn i -> | ||
build(:imported_pages, page: "/#{i}", pageviews: 1, date: ~D[2021-01-01]) | ||
end) | ||
) | ||
|
||
conn = | ||
post(conn, "/api/v2/query", %{ | ||
"site_id" => site.domain, | ||
"metrics" => ["pageviews"], | ||
"date_range" => "all", | ||
"dimensions" => ["event:page"], | ||
"include" => %{"imports" => true}, | ||
"pagination" => %{limit: 1} | ||
}) | ||
|
||
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [3]}] | ||
end |
I took the liberty of splitting the test in two to show that there is a sometimes-lossy case. The greater the length of the enum in the second (lossy) case, the greater the chance that it comes up as 1, not 2.
test "page breakdown with paginate_optimization", %{ | |
conn: conn, | |
site: site | |
} do | |
site_import = insert(:site_import, site: site) | |
populate_stats( | |
site, | |
site_import.id, | |
[ | |
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]), | |
build(:imported_pages, page: "/99", date: ~D[2021-01-01]) | |
] ++ | |
Enum.map(1..200, fn i -> | |
build(:imported_pages, page: "/#{i}", pageviews: 1, date: ~D[2021-01-01]) | |
end) | |
) | |
conn = | |
post(conn, "/api/v2/query", %{ | |
"site_id" => site.domain, | |
"metrics" => ["pageviews"], | |
"date_range" => "all", | |
"dimensions" => ["event:page"], | |
"include" => %{"imports" => true}, | |
"pagination" => %{limit: 1} | |
}) | |
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [3]}] | |
end | |
test "page breakdown with paginate_optimization (ideal case)", %{ | |
conn: conn, | |
site: site | |
} do | |
site_import = insert(:site_import, site: site) | |
populate_stats( | |
site, | |
site_import.id, | |
[ | |
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]) | |
] ++ | |
Enum.map(1..100, fn i -> | |
build(:imported_pages, page: "/#{i}", pageviews: 1, visitors: 1, date: ~D[2021-01-01]) | |
end) | |
) | |
conn = | |
post(conn, "/api/v2/query", %{ | |
"site_id" => site.domain, | |
"metrics" => ["pageviews"], | |
"date_range" => "all", | |
"dimensions" => ["event:page"], | |
"include" => %{"imports" => true}, | |
"pagination" => %{limit: 1} | |
}) | |
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [2]}] | |
end | |
test "page breakdown with paginate_optimization (lossy case)", %{ | |
conn: conn, | |
site: site | |
} do | |
site_import = insert(:site_import, site: site) | |
populate_stats( | |
site, | |
site_import.id, | |
[ | |
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]) | |
] ++ | |
Enum.map(1..200, fn i -> | |
build(:imported_pages, page: "/#{i}", pageviews: 1, visitors: 1, date: ~D[2021-01-01]) | |
end) | |
) | |
conn = | |
post(conn, "/api/v2/query", %{ | |
"site_id" => site.domain, | |
"metrics" => ["pageviews"], | |
"date_range" => "all", | |
"dimensions" => ["event:page"], | |
"include" => %{"imports" => true}, | |
"pagination" => %{limit: 1} | |
}) | |
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [1]}] or | |
json_response(conn, 200)["results"] == [ | |
%{"dimensions" => ["/99"], "metrics" => [2]} | |
] | |
end |