Add labels and references for SLIs/SLOs
timebertt committed Feb 10, 2024
1 parent a2b4cbc commit f715ac6
Showing 5 changed files with 60 additions and 27 deletions.
18 changes: 9 additions & 9 deletions content/20-fundamentals.md
Expand Up @@ -294,9 +294,9 @@ I.e., the aim of these tests is not to measure the scalability of Kubernetes but
At the time of writing, the Kubernetes community defines three official SLIs with corresponding SLOs that are satisfied when the load is below the recommended thresholds:
[@k8scommunity]

1. The latency of processing mutating API calls for single objects (`create`, `update`, `patch`, `delete`) for every (resource, verb) pair (excluding virtual and extended resources), measured as the 99th percentile per cluster-day, is at maximum 1 second.
2. The latency of processing non-streaming read-only API calls (`get`, `list`) for every (resource, scope) pair (excluding virtual and extended resources), measured as the 99th percentile per cluster-day, is at maximum 1 second (for requests reading a single object) or at maximum 30 seconds (for requests reading all objects from a single namespace or all objects in the cluster).
3. The latency of starting pods without persistent volumes that don't require cluster autoscaling or preemption, excluding image pulling and init containers, until observed by a watch request, measured as the 99th percentile per cluster-day, is at maximum 5 seconds.
I. \slok{mutating}The latency of processing mutating API calls for single objects (`create`, `update`, `patch`, `delete`) for every (resource, verb) pair (excluding virtual and extended resources), measured as the 99th percentile per cluster-day, is at maximum 1 second.
II. \slok{read}The latency of processing non-streaming read-only API calls (`get`, `list`) for every (resource, scope) pair (excluding virtual and extended resources), measured as the 99th percentile per cluster-day, is at maximum 1 second (for requests reading a single object) or at maximum 30 seconds (for requests reading all objects from a single namespace or all objects in the cluster).
III. \slok{startup}The latency of starting pods without persistent volumes that don't require cluster autoscaling or preemption, excluding image pulling and init containers, until observed by a watch request, measured as the 99th percentile per cluster-day, is at maximum 5 seconds.

More SLIs and SLOs are being worked on but have not been defined precisely yet and are thus not guaranteed.
These SLIs include in-cluster network programming and execution latency, in-cluster DNS programming and lookup latency, and API-related latencies of watch requests, admission plugins, and webhooks.
Expand All @@ -318,13 +318,13 @@ As a prerequisite for these performance indicators to be meaningful, the officia
Most importantly, the control plane must facilitate reasonable API request processing latency.
To consider a controller setup as performing adequately, the following SLOs need to be satisfied:

1. The time of enqueuing object keys for reconciliation for every controller, measured as the 99th percentile per cluster-day, is at maximum 1 second.
2. The latency of realizing the desired state of objects for every controller, excluding reconciliation time of controlled objects, until observed by a watch request, measured as the 99th percentile per cluster-day, is at maximum $x$, where $x$ depends on the controller.
1. \sloc{queue}The time of enqueuing object keys for reconciliation for every controller, measured as the 99th percentile per cluster-day, is at maximum 1 second.
2. \sloc{recon}The latency of realizing the desired state of objects for every controller, excluding reconciliation time of controlled objects, until observed by a watch request, measured as the 99th percentile per cluster-day, is at maximum $x$, where $x$ depends on the controller.

The queue duration (SLI 1) is comparable to the API request latency SLIs of Kubernetes.
The queue duration (SLI \refsloc*{queue}) is comparable to the API request latency SLIs of Kubernetes (SLI \refslok*{mutating}, \refslok*{read}).
It captures the system's responsiveness, where a low queue duration results in a better user experience.
If the time object keys are queued for reconciliation is too high, changes to the objects' desired state are not processed promptly, and changes to objects' observed state are not recorded promptly.
The reconciliation latency (SLI 2) is comparable to Kubernetes' pod startup latency SLI.
The reconciliation latency (SLI \refsloc*{recon}) is comparable to Kubernetes' pod startup latency SLI (SLI \refslok*{startup}).
It measures how fast the system can bring the desired state of objects to reality.
However, it strongly depends on the type of controller.
For example, a simple controller owning a small set of objects should only take 5 seconds at maximum to configure them as desired, while a controller orchestrating a large set of objects or external infrastructure might take up to 1 minute to reach the desired state.
Expand All @@ -351,8 +351,8 @@ This doesn't have a direct impact on the SLIs.
However, when consuming more memory than available, the controller might fail due to out-of-memory faults.
When the load on a controller grows by increasing the object churn rate (\refdimn{churn}), more watch events for relevant objects are transferred over the network.
The processing of the additional watch events also results in a higher CPU usage for decoding and for performing reconciliations.
If the number of worker routines is not high enough to facilitate the needed rate of reconciliations, the queue time (SLI 1) increases.
Also, if performing reconciliations is computationally intensive, the extra CPU usage might exhaust the available CPU cycles, increasing the reconciliation latency (SLI 2).
If the number of worker routines is not high enough to facilitate the needed rate of reconciliations, the queue time (SLI \refsloc*{queue}) increases.
Also, if performing reconciliations is computationally intensive, the extra CPU usage might exhaust the available CPU cycles, increasing the reconciliation latency (SLI \refsloc*{recon}).

More resources can be added to the setup to expand the load capacity of the controller setup or to fulfill the SLOs under increased load.
One option is to allocate more memory for the controller, which can increase the maximum number of objects that a controller's watch cache can store.
Expand Down
18 changes: 9 additions & 9 deletions content/60-evaluation.md
Expand Up @@ -97,11 +97,11 @@ This is used to verify that configured SLOs are satisfied during a load test exp

For the measurements to be meaningful, the Kubernetes cluster SLOs themselves, as described in [@sec:kubernetes-scalability], must be satisfied.
I.e., it must be ensured that the cluster itself, where the controllers are running, is performing well.
While the latency of API requests (SLI 1 and 2) is relevant for the experiment setup, pod startup latency (SLI 3) is irrelevant as the load tests do not trigger pod startups.
While the latency of API requests (SLI \refslok*{mutating}, \refslok*{read}) is relevant for the experiment setup, pod startup latency (SLI \refslok*{startup}) is irrelevant as the load tests do not trigger pod startups.

```yaml
queries:
- name: latency-mutating # SLO 1
- name: latency-mutating # SLO I
type: instant
slo: 1
query: |
Expand All @@ -112,7 +112,7 @@ queries:
subresource!~"log|exec|portforward|attach|proxy"
}[$__range]
))) > 0
- name: latency-read-resource # SLO 2 - resource scope
- name: latency-read-resource # SLO II - resource scope
type: instant
slo: 1
query: |
Expand All @@ -123,7 +123,7 @@ queries:
subresource!~"log|exec|portforward|attach|proxy"
}[$__range]
))) > 0
- name: latency-read-namespace-cluster # SLO 2 - namespace and cluster scope
- name: latency-read-namespace-cluster # SLO II - namespace and cluster scope
type: instant
slo: 30
query: |
Expand Down Expand Up @@ -182,8 +182,8 @@ queries:
: Queries for measuring controller load {#lst:load-queries}

To ensure the controller setup is performing well under the generated load, the SLIs for controllers defined in [@sec:controller-scalability] are also measured.
The time that object keys are enqueued for reconciliation (SLI 1) is directly derived from the queue-related metrics exposed by controller-runtime.
For SLI 2, the experiment tool measures the time until changes to the desired state of `Websites` are reconciled and ready.
The time that object keys are enqueued for reconciliation (SLI \refsloc*{queue}) is directly derived from the queue-related metrics exposed by controller-runtime.
For SLI \refsloc*{recon}, the experiment tool measures the time until changes to the desired state of `Websites` are reconciled and ready.

The API server automatically increases an object's generation for its creation and for every specification change.
The experiment tool stores the time it triggered the change for all object generations.
Expand Down Expand Up @@ -225,11 +225,11 @@ queries:

[@Lst:controller-slo-queries] shows the queries verifying the described controller SLOs.
Similar to verifying the control plane's SLOs, the measurements are taken over the load test duration instead of per cluster-day.
Note that the measurement for SLI 2 is stricter than the definition in [@sec:controller-scalability].
Note that the measurement for SLI \refsloc*{recon} is stricter than the definition in [@sec:controller-scalability].
Initially, the reconciliation latency SLI excluded the reconciliation time of controlled objects, as they are outside the scope of the measured controller's responsibility.
However, in the load tests, the reconciliation time of controlled objects – namely, `Deployments` – is short because they do not run any replicas.
The `Deployment` controller must only observe the object and set the `Available` condition to `True`.
Hence, the measurement used for verifying SLO 2 includes the reconciliation time of the `Deployments` of `Websites` for simplicity.
Hence, the measurement used for verifying SLO \refsloc*{recon} includes the reconciliation time of the `Deployments` of `Websites` for simplicity.
Furthermore, the user's performance expectations are the same regardless of whether the controller uses sharding.
Therefore, the measurement includes the sharding assignment latency related to the sharder's webhook or the sharder's controller, respectively.

Expand Down Expand Up @@ -434,7 +434,7 @@ TODO (but optional)

- similar to scale out scenario
- horizontal autoscaling of controller according to load
- HPA on queue duration (SLI 1)
- HPA on queue duration (SLI \refsloc*{queue})
- evaluate coordination on object movements

## Discussion
Expand Down
2 changes: 1 addition & 1 deletion pandoc/defaults.yaml
@@ -1,4 +1,4 @@
from: markdown+link_attributes+native_divs+raw_tex+tex_math_dollars+inline_code_attributes+grid_tables+fenced_code_attributes
from: markdown+link_attributes+native_divs+raw_tex+tex_math_dollars+inline_code_attributes+grid_tables+fenced_code_attributes+fancy_lists

metadata:
link-citations: true
Expand Down
43 changes: 38 additions & 5 deletions pandoc/includes/header.tex
Expand Up @@ -275,12 +275,12 @@
\newcommand{\req}[1]{%
\refstepcounter{req}%
\label{req:#1}%
Req. \arabic{req}:
Req.~\arabic{req}:
}
\makeatletter
\newcommand{\refreq}{\@ifstar\refreq@star\refreq@nostar}
\newcommand{\refreq@nostar}[1]{%
req. \ref{req:#1}%
req.~\ref{req:#1}%
}
\newcommand{\refreq@star}[1]{%
\ref{req:#1}%
Expand All @@ -292,12 +292,12 @@
\newcommand{\evt}[1]{%
\refstepcounter{evt}%
\label{evt:#1}%
Event \arabic{evt}:
Event~\arabic{evt}:
}
\makeatletter
\newcommand{\refevt}{\@ifstar\refevt@star\refevt@nostar}
\newcommand{\refevt@nostar}[1]{%
evt. \ref{evt:#1}%
evt.~\ref{evt:#1}%
}
\newcommand{\refevt@star}[1]{%
\ref{evt:#1}%
Expand All @@ -313,13 +313,46 @@
\makeatletter
\newcommand{\refdimn}{\@ifstar\refdimn@star\refdimn@nostar}
\newcommand{\refdimn@nostar}[1]{%
dimension \ref{dimn:#1}%
dimension~\ref{dimn:#1}%
}
\newcommand{\refdimn@star}[1]{%
\ref{dimn:#1}%
}
\makeatother

% Kubernetes SLOs (roman numerals)
\newcounter{slok}
\renewcommand{\theslok}{\Roman{slok}}
\newcommand{\slok}[1]{%
\refstepcounter{slok}%
\label{slok:#1}%
}
\makeatletter
\newcommand{\refslok}{\@ifstar\refslok@star\refslok@nostar}
\newcommand{\refslok@nostar}[1]{%
SLO~\ref{slok:#1}%
}
\newcommand{\refslok@star}[1]{%
\ref{slok:#1}%
}
\makeatother

% Controller SLOs (arabic numerals)
\newcounter{sloc}
\newcommand{\sloc}[1]{%
\refstepcounter{sloc}%
\label{sloc:#1}%
}
\makeatletter
\newcommand{\refsloc}{\@ifstar\refsloc@star\refsloc@nostar}
\newcommand{\refsloc@nostar}[1]{%
SLO~\ref{sloc:#1}%
}
\newcommand{\refsloc@star}[1]{%
\ref{sloc:#1}%
}
\makeatother

%%% spacing
\usepackage{setspace}
\onehalfspacing % spacing between lines
Expand Down
6 changes: 3 additions & 3 deletions results/basic/apiserver-slos.yaml
@@ -1,5 +1,5 @@
queries:
- name: latency-mutating # SLO 1
- name: latency-mutating # SLO I
type: instant
slo: 1
query: |
Expand All @@ -10,7 +10,7 @@ queries:
subresource!~"log|exec|portforward|attach|proxy"
}[$__range]
))) > 0
- name: latency-read-resource # SLO 2 - resource scope
- name: latency-read-resource # SLO II - resource scope
type: instant
slo: 1
query: |
Expand All @@ -21,7 +21,7 @@ queries:
subresource!~"log|exec|portforward|attach|proxy"
}[$__range]
))) > 0
- name: latency-read-namespace-cluster # SLO 2 - namespace and cluster scope
- name: latency-read-namespace-cluster # SLO II - namespace and cluster scope
type: instant
slo: 30
query: |
Expand Down
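
Note on the macros added in pandoc/includes/header.tex: the following is a minimal standalone sketch of how `\slok` and `\refslok` appear to fit together, not the thesis build itself. It copies the macro definitions from this commit; the `article` class, the `enumerate` environment with a redefined `\labelenumi`, and the shortened item texts are assumptions made only so that the example compiles on its own (the actual sources use pandoc's `fancy_lists` extension to produce the roman-numbered list).

```latex
\documentclass{article}

% Kubernetes SLOs (roman numerals), definitions as in this commit
\newcounter{slok}
\renewcommand{\theslok}{\Roman{slok}}
\newcommand{\slok}[1]{%
  \refstepcounter{slok}% step the counter and make it the current label target
  \label{slok:#1}%      % store the roman numeral under the key slok:<name>
}
\makeatletter
\newcommand{\refslok}{\@ifstar\refslok@star\refslok@nostar}
\newcommand{\refslok@nostar}[1]{SLO~\ref{slok:#1}}% unstarred: prefix plus number
\newcommand{\refslok@star}[1]{\ref{slok:#1}}%       starred: number only
\makeatother

\begin{document}

% Assumption: a plain enumerate stands in for pandoc's fancy_lists output.
\begin{enumerate}
  \renewcommand{\labelenumi}{\Roman{enumi}.}
  \item \slok{mutating}Latency of mutating API calls (text shortened).
  \item \slok{read}Latency of read-only API calls (text shortened).
\end{enumerate}

% After two LaTeX runs, the unstarred form yields "SLO I"
% and the starred form yields only the numeral "II".
See \refslok{mutating} and SLI \refslok*{read}.

\end{document}
```

Because `\slok` steps its own counter independently of the list numbering, the labels only match the visible list numerals as long as every list item calls `\slok` exactly once and in order.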
