Improve report performance with high-cardinality import joins #4848
base: master
Conversation
Nope, all good.
…thnames

Ref: https://3.basecamp.com/5308029/buckets/39750953/card_tables/cards/8052057081

JOINs in ClickHouse are slow. In one degenerate case I found a user had over 20 million unique paths in an import, which resulted in extremely slow JOINs. This introduces a sort-of hacky solution to it by limiting the amount of data analyzed.

Query timing without this change:

```
9 rows in set. Elapsed: 11.383 sec. Processed 49.16 million rows, 5.75 GB (4.32 million rows/s., 505.29 MB/s.)
Peak memory usage: 14.75 GiB.
```

After:

```
9 rows in set. Elapsed: 0.572 sec. Processed 49.18 million rows, 5.75 GB (86.03 million rows/s., 10.06 GB/s.)
Peak memory usage: 9.01 GiB.
```
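To illustrate the idea, here is a minimal Ecto-style sketch of the shape of that change — not the PR's actual code: the module, field names, choice of metric, and the `100` multiplier are all assumptions. Each side is ordered and capped *before* the JOIN, so ClickHouse never has to join millions of distinct pathnames to produce a small top-N report.

```elixir
defmodule CappedJoinSketch do
  @moduledoc false
  import Ecto.Query

  # Assumed multiplier: large enough that the true top-N groups are very likely
  # to survive the cap on both sides of the JOIN.
  @multiplier 100

  def top_pages(events_q, imported_q, page_limit) do
    capped_events =
      events_q
      |> order_by([e], desc: e.pageviews)
      |> limit(^(page_limit * @multiplier))

    capped_imported =
      imported_q
      |> order_by([i], desc: i.pageviews)
      |> limit(^(page_limit * @multiplier))

    from e in subquery(capped_events),
      full_join: i in subquery(capped_imported),
      on: e.pathname == i.page,
      select: %{
        page: coalesce(e.pathname, i.page),
        # `selected_as/2` aliases the summed metric so we can ORDER BY it --
        # presumably the role of the `selected_as` aliases set up in the next commit.
        pageviews: selected_as(coalesce(e.pageviews, 0) + coalesce(i.pageviews, 0), :pageviews)
      },
      order_by: [desc: selected_as(:pageviews)],
      limit: ^page_limit
  end
end
```

With this shape, the final ORDER BY / LIMIT only ever sees at most `2 * @multiplier * page_limit` joined rows instead of the full path cardinality.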
… special_metrics.ex
This sets up selected_as aliases which will be used in a subsequent commit
dc8470b to c2be4e5
c2be4e5 to 2afbef2
```elixir
# This function speeds up cases where we group by a very high-cardinality column, by limiting
# the JOINed main and imported results.
#
# For example, when grouping by pathname and every path is dynamically generated, this can reduce
# time spent JOINing billions of rows in ClickHouse by an order of magnitude.
#
# Mathematically, we can't limit main and imported results to LIMIT N, since the true top N values
# can arise from outside the top N items of either subquery. Instead we multiply the limit by a
# reasonably large constant to ensure they _should_ be captured.
#
# This isn't always correct (there are degenerate cases), but it should be a reasonable trade-off
# that doesn't affect user experience in any conceivable real-world case.
#
# Note we only apply this optimization in cases where we can deterministically ORDER BY. This covers
# opening the Plausible dashboard but not more complicated use-cases.
defp paginate_optimization(q, query) do
```
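Since the diff cuts off at the function head, here is one way the body could look — a sketch only: the module name, `@pagination_multiplier`, the `query` struct fields, and `deterministic_order_by?/1` are assumed stand-ins, not the PR's actual implementation.

```elixir
defmodule PaginateOptimizationSketch do
  @moduledoc false
  import Ecto.Query

  # Assumed constant: the "reasonably large" multiplier from the comment above.
  @pagination_multiplier 100

  def paginate_optimization(q, query) do
    case query do
      %{pagination: %{limit: page_limit}} when is_integer(page_limit) ->
        if deterministic_order_by?(query) do
          # Cap the subquery at limit * C so the true top-N groups should still
          # be present on both sides of the JOIN.
          limit(q, ^(page_limit * @pagination_multiplier))
        else
          # Without a deterministic ORDER BY we cannot safely cut rows off.
          q
        end

      _ ->
        q
    end
  end

  # Stand-in for whatever check the real implementation performs to ensure the
  # subquery ordering is fully determined by the requested metrics/dimensions.
  defp deterministic_order_by?(%{order_by: order_by}), do: order_by not in [nil, []]
  defp deterministic_order_by?(_), do: false
end
```

Returning `q` unchanged in the non-deterministic case keeps the optimization opt-in, matching the note above about only applying it when the ORDER BY is deterministic.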
As I tried to wrap my mind around this, I came up with the following function doc string.
```elixir
@doc """
ClickHouse queries with a GROUP BY <column> can take several seconds when the column has a cardinality of 1M or more.
Performing a JOIN on two such queries further increases the expense, into 10+ seconds.
It's not impossible to reach such cardinality: when all site pathnames are unique, it is reached at 1M pageviews.

If we're interested only in the top X results of an ordered query, where X is small,
for example the top 9 or top 100 in dashboard reports,
we can still maintain reasonable accuracy by ordering the subqueries and limiting them to the top C * X results from each.

C must be large enough that the top X groups in subquery A appear in subquery B, if they exist in table B at all.
If they exist in table B but are cut off, we hope C is large enough that
the result for the last group of subquery B (at index: C * X - 1),
when added to any of the top X groups in subquery A,
makes no difference in the ordering of the top X groups.

This function detects these expensive situations and applies a limit in the subqueries.
"""
defp paginate_optimization(q, query) do
```
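To make the docstring's arithmetic concrete (the numbers are illustrative, not taken from the PR): for a dashboard report that wants the top $X = 9$ pages with a multiplier $C = 100$, each subquery is ordered and capped at

$$
C \cdot X = 100 \cdot 9 = 900 \ \text{rows}.
$$

The merged top 9 is exact whenever every true top-9 group appears among those 900 rows on both sides; the lossy test discussed further down shows what happens when a group is cut off.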
It occurred to me that we could verify whether C is large enough in the real world by having the optimized query run concurrently with the unoptimized one and comparing the top items, OR by verifying that the item at index C * X - 1 in either subquery is insignificant compared to the top X results.
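A rough sketch of the first idea — the `run_report/3` helper, the option name, and the timeout are hypothetical, purely to show the shape of such a check:

```elixir
defmodule CapVerificationSketch do
  @moduledoc false
  require Logger

  # Run the capped and uncapped variants side by side and log when their top
  # results diverge, to learn whether C is large enough in the real world.
  def verified_report(site, query, limit) do
    uncapped_task =
      Task.async(fn -> run_report(site, query, paginate_optimization: false) end)

    capped = run_report(site, query, paginate_optimization: true)
    uncapped = Task.await(uncapped_task, :timer.minutes(1))

    if Enum.take(capped, limit) != Enum.take(uncapped, limit) do
      Logger.warning(
        "paginate_optimization produced divergent top #{limit} results for site #{site.id}"
      )
    end

    capped
  end

  # Hypothetical helper standing in for whatever runs the report query.
  defp run_report(_site, _query, _opts), do: []
end
```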
test "page breakdown with paginate_optimization", %{ | ||
conn: conn, | ||
site: site | ||
} do | ||
site_import = insert(:site_import, site: site) | ||
|
||
populate_stats( | ||
site, | ||
site_import.id, | ||
[ | ||
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]), | ||
build(:imported_pages, page: "/99", date: ~D[2021-01-01]) | ||
] ++ | ||
Enum.map(1..200, fn i -> | ||
build(:imported_pages, page: "/#{i}", pageviews: 1, date: ~D[2021-01-01]) | ||
end) | ||
) | ||
|
||
conn = | ||
post(conn, "/api/v2/query", %{ | ||
"site_id" => site.domain, | ||
"metrics" => ["pageviews"], | ||
"date_range" => "all", | ||
"dimensions" => ["event:page"], | ||
"include" => %{"imports" => true}, | ||
"pagination" => %{limit: 1} | ||
}) | ||
|
||
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [3]}] | ||
end |
I took the liberty of splitting the test in two to show that there is a sometimes-lossy case. The greater the length of the enum in the second (lossy) case, the greater the chance that it comes up as 1, not 2.
test "page breakdown with paginate_optimization", %{ | |
conn: conn, | |
site: site | |
} do | |
site_import = insert(:site_import, site: site) | |
populate_stats( | |
site, | |
site_import.id, | |
[ | |
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]), | |
build(:imported_pages, page: "/99", date: ~D[2021-01-01]) | |
] ++ | |
Enum.map(1..200, fn i -> | |
build(:imported_pages, page: "/#{i}", pageviews: 1, date: ~D[2021-01-01]) | |
end) | |
) | |
conn = | |
post(conn, "/api/v2/query", %{ | |
"site_id" => site.domain, | |
"metrics" => ["pageviews"], | |
"date_range" => "all", | |
"dimensions" => ["event:page"], | |
"include" => %{"imports" => true}, | |
"pagination" => %{limit: 1} | |
}) | |
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [3]}] | |
end | |
test "page breakdown with paginate_optimization (ideal case)", %{ | |
conn: conn, | |
site: site | |
} do | |
site_import = insert(:site_import, site: site) | |
populate_stats( | |
site, | |
site_import.id, | |
[ | |
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]) | |
] ++ | |
Enum.map(1..100, fn i -> | |
build(:imported_pages, page: "/#{i}", pageviews: 1, visitors: 1, date: ~D[2021-01-01]) | |
end) | |
) | |
conn = | |
post(conn, "/api/v2/query", %{ | |
"site_id" => site.domain, | |
"metrics" => ["pageviews"], | |
"date_range" => "all", | |
"dimensions" => ["event:page"], | |
"include" => %{"imports" => true}, | |
"pagination" => %{limit: 1} | |
}) | |
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [2]}] | |
end | |
test "page breakdown with paginate_optimization (lossy case)", %{ | |
conn: conn, | |
site: site | |
} do | |
site_import = insert(:site_import, site: site) | |
populate_stats( | |
site, | |
site_import.id, | |
[ | |
build(:pageview, pathname: "/99", timestamp: ~N[2021-01-01 00:00:00]) | |
] ++ | |
Enum.map(1..200, fn i -> | |
build(:imported_pages, page: "/#{i}", pageviews: 1, visitors: 1, date: ~D[2021-01-01]) | |
end) | |
) | |
conn = | |
post(conn, "/api/v2/query", %{ | |
"site_id" => site.domain, | |
"metrics" => ["pageviews"], | |
"date_range" => "all", | |
"dimensions" => ["event:page"], | |
"include" => %{"imports" => true}, | |
"pagination" => %{limit: 1} | |
}) | |
assert json_response(conn, 200)["results"] == [%{"dimensions" => ["/99"], "metrics" => [1]}] or | |
json_response(conn, 200)["results"] == [ | |
%{"dimensions" => ["/99"], "metrics" => [2]} | |
] | |
end |