Improve report performance with high-cardinality import joins #4848

Open

macobo wants to merge 4 commits into master from report-performance-joins
Conversation

macobo (Contributor) commented Nov 21, 2024

Ref: https://3.basecamp.com/5308029/buckets/39750953/card_tables/cards/8052057081

JOINs in ClickHouse are slow. In one degenerate case I found a user had over 20 million unique paths in an import, which resulted in extremely slow JOINs. This introduces a sort-of hacky solution to it by limiting the amount of data analyzed.

Query timing without this change:

```
9 rows in set. Elapsed: 11.383 sec. Processed 49.16 million rows, 5.75 GB (4.32 million rows/s., 505.29 MB/s.)
Peak memory usage: 14.75 GiB.
```

After:

```
9 rows in set. Elapsed: 0.572 sec. Processed 49.18 million rows, 5.75 GB (86.03 million rows/s., 10.06 GB/s.)
Peak memory usage: 9.01 GiB.
```
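To make the mechanism concrete, here is a minimal, hypothetical Ecto sketch of the idea, not the actual code in this PR: each already-ordered subquery is capped at `limit * multiplier` rows before the main and imported results are JOINed. The module name, `cap_subquery/2` and the multiplier value are all assumptions for illustration.

```elixir
# Illustrative sketch only; the real logic lives in this PR's
# paginate_optimization/2. Names and the multiplier value are assumptions.
defmodule Sketch.PaginateOptimization do
  import Ecto.Query

  # A "reasonably large" multiplier, per the description above (assumed value).
  @limit_multiplier 100

  # Cap an already-ordered subquery at limit * multiplier rows so that the
  # subsequent JOIN of main and imported results stays small.
  def cap_subquery(q, %{pagination: %{limit: page_limit}}) when is_integer(page_limit) do
    limit(q, ^(page_limit * @limit_multiplier))
  end

  # No deterministic limit available: leave the subquery untouched.
  def cap_subquery(q, _query), do: q
end
```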

macobo (Contributor, Author) commented Nov 25, 2024

Marking as a draft for a bit. Need to figure out whether there are cases where the removed splitting-removes-pagination logic would break things that tests aren't covering.

Nope, all good.

@macobo macobo marked this pull request as draft November 25, 2024 10:02
@macobo macobo marked this pull request as ready for review November 25, 2024 10:05
@macobo macobo marked this pull request as draft November 25, 2024 11:37
…thnames

Ref: https://3.basecamp.com/5308029/buckets/39750953/card_tables/cards/8052057081

JOINs in ClickHouse are slow. In one degenerate case I found a user had
over 20 million unique paths in an import, which resulted in extremely slow
JOINs. This introduces a sort-of hacky solution to it by limiting the
amount of data analyzed.

Query timing without this change:
```
9 rows in set. Elapsed: 11.383 sec. Processed 49.16 million rows, 5.75 GB (4.32 million rows/s., 505.29 MB/s.)
Peak memory usage: 14.75 GiB.
```

After:
```
9 rows in set. Elapsed: 0.572 sec. Processed 49.18 million rows, 5.75 GB (86.03 million rows/s., 10.06 GB/s.)
Peak memory usage: 9.01 GiB.
```
This sets up selected_as aliases which will be used in a subsequent commit
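For context on what the `selected_as` aliases enable, below is a generic illustration of Ecto's `selected_as/2` (attach an alias in SELECT) and `selected_as/1` (reference it in ORDER BY); the table and field names are made up and this is not code from this PR. Ordering by a select alias is what later allows a deterministic ORDER BY ... LIMIT inside each subquery.

```elixir
# Generic illustration of Ecto's selected_as/2 and selected_as/1;
# table and field names are invented for the example.
defmodule Sketch.SelectedAsExample do
  import Ecto.Query

  def top_pages_query do
    from(e in "events",
      group_by: e.pathname,
      select: %{
        pathname: selected_as(e.pathname, :dim_pathname),
        pageviews: selected_as(count(e.id), :pageviews)
      },
      # Referencing the SELECT alias gives the deterministic ordering that the
      # paginate optimization needs before it can safely add a LIMIT.
      order_by: [desc: selected_as(:pageviews)],
      limit: 100
    )
  end
end
```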
@macobo macobo force-pushed the report-performance-joins branch 3 times, most recently from dc8470b to c2be4e5 Compare November 26, 2024 10:26
@macobo macobo marked this pull request as ready for review November 26, 2024 10:36
@macobo macobo changed the title from "Improve report performance in cases where site has a lot of unique pathnames" to "Improve report performance with high-cardinality import joins" Nov 26, 2024
Comment on lines +321 to +336
# This function speeds up cases where we group by a very high-cardinality column, by limiting the JOINed
# main and imported results.
#
# For example, when grouping by pathname and every path is dynamically generated, this can reduce
# time spent JOINing billions of rows in ClickHouse by an order of magnitude.
#
# Mathematically, we can't limit main and imported results to LIMIT N since the true top N values
# can arise from outside the top N items of either subquery. Instead we multiply the limit by a
# reasonably large constant to ensure it _should_ be captured.
#
# This isn't always correct either (degenerate cases exist), but it should be a reasonable trade-off that
# doesn't affect user experience in any conceivable real-world case.
#
# Note we only apply this optimization in cases where we can deterministically ORDER BY. This covers
# opening the Plausible dashboard but not more complicated use-cases.
defp paginate_optimization(q, query) do
A contributor commented:

As I tried to wrap my mind around this, I came up with the following function doc string.

Suggested change
@doc """
ClickHouse queries with a GROUP BY <column> can take several seconds when the column has a cardinality of 1M or more.
Performing a JOIN on two such queries further increases the expense, into 10+ seconds.
It's not impossible to have high cardinality: when all site pathnames are unique, this can be reached at 1M pageviews.
If we're interested only in the top X results of an ordered query, where the X is small,
for example top 9 or top 100 in dashboard reports,
we can still maintain reasonable accuracy by ordering the subqueries and limiting them to the top C * X results from each.
C must be large enough that the top X groups in subquery A appear in subquery B, if they exist in table B at all.
If they exist in table B, but are cut off, we hope C is large enough that
the result for the last group of subquery B (at index: C * X - 1),
when added to any of the top X groups in subquery A,
makes no difference in the ordering of the top X groups.
This function detects these expensive situations and applies a limit in the subqueries.
"""
defp paginate_optimization(q, query) do

A contributor commented:

It occurred to me that we could verify whether C is large enough in the real world by running the optimized query concurrently with the unoptimized one and comparing the top items, OR by verifying that the item at index C * X - 1 in either subquery is insignificant compared to the top X results.
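A rough sketch of the first idea (shadow-running the unoptimized query and comparing), purely hypothetical: `run_query/3`, the `:optimize` option and the logging are illustrative stand-ins, not existing Plausible functions.

```elixir
# Hypothetical shadow-comparison sketch; run_query/3, the :optimize option
# and the warning message are illustrative stand-ins.
defmodule Sketch.PaginateOptimizationAudit do
  require Logger

  def audit(site, query) do
    optimized = Task.async(fn -> run_query(site, query, optimize: true) end)
    baseline = Task.async(fn -> run_query(site, query, optimize: false) end)

    [optimized_rows, baseline_rows] = Task.await_many([optimized, baseline], 30_000)

    if top_dimensions(optimized_rows) != top_dimensions(baseline_rows) do
      Logger.warning("paginate_optimization changed the top results for #{site.domain}")
    end

    optimized_rows
  end

  defp top_dimensions(rows), do: Enum.map(rows, & &1.dimensions)

  defp run_query(_site, _query, _opts) do
    # Placeholder: a real audit would execute the ClickHouse query with and
    # without the subquery LIMIT applied.
    []
  end
end
```

Since this doubles the query cost, it would presumably only run behind a flag or for a sampled fraction of dashboard loads.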

Comment on lines +932 to +961
test "page breakdown with paginate_optimization", %{
conn: conn,
site: site
} do
site_import = insert(:site_import, site: site)

populate_stats(
site,
site_import.id,
[
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]),
build(:imported_pages, page: "/99", date: ~D[2021-01-01])
] ++
Enum.map(1..200, fn i ->
build(:imported_pages, page: "/#{i}", pageviews: 1, date: ~D[2021-01-01])
end)
)

conn =
post(conn, "/api/v2/query", %{
"site_id" => site.domain,
"metrics" => ["pageviews"],
"date_range" => "all",
"dimensions" => ["event:page"],
"include" => %{"imports" => true},
"pagination" => %{limit: 1}
})

assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [3]}]
end
A contributor commented:

I took the liberty of splitting the test into two to show that there is a sometimes-lossy case. The greater the length of the enum in the second case, the greater the chance that the result comes up as 1, not 2 (see the note after the suggested tests).

Suggested change
test "page breakdown with paginate_optimization", %{
conn: conn,
site: site
} do
site_import = insert(:site_import, site: site)
populate_stats(
site,
site_import.id,
[
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]),
build(:imported_pages, page: "/99", date: ~D[2021-01-01])
] ++
Enum.map(1..200, fn i ->
build(:imported_pages, page: "/#{i}", pageviews: 1, date: ~D[2021-01-01])
end)
)
conn =
post(conn, "/api/v2/query", %{
"site_id" => site.domain,
"metrics" => ["pageviews"],
"date_range" => "all",
"dimensions" => ["event:page"],
"include" => %{"imports" => true},
"pagination" => %{limit: 1}
})
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [3]}]
end
test "page breakdown with paginate_optimization (ideal case)", %{
conn: conn,
site: site
} do
site_import = insert(:site_import, site: site)
populate_stats(
site,
site_import.id,
[
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00])
] ++
Enum.map(1..100, fn i ->
build(:imported_pages, page: "/#{i}", pageviews: 1, visitors: 1, date: ~D[2021-01-01])
end)
)
conn =
post(conn, "/api/v2/query", %{
"site_id" => site.domain,
"metrics" => ["pageviews"],
"date_range" => "all",
"dimensions" => ["event:page"],
"include" => %{"imports" => true},
"pagination" => %{limit: 1}
})
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [2]}]
end
test "page breakdown with paginate_optimization (lossy case)", %{
conn: conn,
site: site
} do
site_import = insert(:site_import, site: site)
populate_stats(
site,
site_import.id,
[
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00])
] ++
Enum.map(1..200, fn i ->
build(:imported_pages, page: "/#{i}", pageviews: 1, visitors: 1, date: ~D[2021-01-01])
end)
)
conn =
post(conn, "/api/v2/query", %{
"site_id" => site.domain,
"metrics" => ["pageviews"],
"date_range" => "all",
"dimensions" => ["event:page"],
"include" => %{"imports" => true},
"pagination" => %{limit: 1}
})
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [1]}] or
json_response(conn, 200)["results"] == [
%{"dimensions" => ["/99"], "metrics" => [2]}
]
end
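Why the lossy case can return 1: with pagination limit X = 1, the imported subquery is cut to its top C * X rows, and since all 200 imported pages tie at 1 pageview, whether "/99" survives that cut is essentially arbitrary. A back-of-the-envelope illustration, where the multiplier value is an assumption rather than the PR's actual constant:

```elixir
# Back-of-the-envelope illustration; `multiplier` is an assumed value,
# not the constant used in the PR.
limit = 1
multiplier = 100      # assumed "reasonably large constant"
tied_groups = 200     # imported pages that all tie at 1 pageview
kept = min(limit * multiplier, tied_groups)

# Rough probability that "/99" is among the kept imported rows,
# assuming the tie is broken arbitrarily.
IO.puts("P(/99 kept in imported subquery) ~= #{kept / tied_groups}")
```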
