add promql-to-scrape #74

Merged
merged 5 commits into from
Nov 17, 2023
3 changes: 3 additions & 0 deletions cloud/observability/promql-to-dd-go/prometheus/http.go
@@ -45,6 +45,9 @@ func (c *HttpClient) Do(ctx context.Context, req *http.Request) (*http.Response,
	if ctx != nil {
		req = req.WithContext(ctx)
	}

	req.Header.Set("User-Agent", "promql-to-dd")

	resp, err := c.Client.Do(req)
	defer func() {
		if resp != nil {
11 changes: 11 additions & 0 deletions cloud/observability/promql-to-scrape/Dockerfile
@@ -0,0 +1,11 @@
FROM golang:1.21-alpine

WORKDIR /usr/src/app

COPY go.mod go.sum ./
RUN go mod download && go mod verify

COPY . .
RUN go build -v -o /usr/local/bin/promql-to-scrape ./cmd/promql-to-scrape/main.go

ENTRYPOINT ["/usr/local/bin/promql-to-scrape"]
47 changes: 47 additions & 0 deletions cloud/observability/promql-to-scrape/README.md
@@ -0,0 +1,47 @@
# promql-to-scrape

This basic application is meant to provide an example for how one could use the Temporal Cloud Observability endpoint to expose a typical Prometheus `/metrics` endpoint.

**This example is provided as-is, without support. It is intended as reference material only.**

## How to Use

Grab your client cert and key and place them at `client.crt` and `tls.key`. You will also need the account number of the Temporal Cloud account that has the observability endpoint enabled.

```
go mod tidy
go build -o promql-to-scrape cmd/promql-to-scrape/main.go
./promql-to-scrape -client-cert client.crt -client-key tls.key \
  -prom-endpoint https://<account>.tmprl.cloud/prometheus --config-file examples/config.yaml --debug
~~~
time=2023-11-16T17:43:20.260-06:00 level=DEBUG msg="successful metric retrieval" time=3.529039083s
```

This means you can now hit http://localhost:9001/metrics on your machine and see your metrics.

### Important Usability Information

**Important:** Scrape this endpoint with a **60s** scrape interval unless you are meaningfully modifying this code. The example queries all use a 1 minute rate window, and the scrape interval should match it.

**Very Important:** The data you will see here is approximately 1 minute delayed (should you conform to the guidance above). Because metrics are aggregated before they are presented to you, this application evaluates each query 60 seconds in the past. Otherwise aggregation would not yet be complete, and the queries would return no results.

## Deployment

Some example Kubernetes manifests are provided in the `/examples` directory. Filling in your certificates and account should get you going pretty quickly.

## Generating Config

There is a second binary you can build that can help you build a default configuration of queries to scrape and export.

```
go build -o genconfig cmd/genconfig/main.go
./genconfig -client-cert client.crt -client-key tls.key -prom-endpoint https://<account>.tmprl.cloud/prometheus
...
```

This will generate an example config at `config.yaml` that you may use. It looks for all the existing metrics and generates a reasonable query for you to export.
- For counters, it queries `rate(counter[1m])`
- For gauges, it simply queries for `gauge`
- For histograms, it does a p99 aggregated by `temporal_namespace` and `operation`: `histogram_quantile(0.99, sum(rate(metric[1m])) by (le, operation, temporal_namespace))`

Modify at your own risk. You may find you'd like to add a global latency across all namespaces for instance. You can add those queries to your config file.
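
As an illustration of the global latency idea, the sketch below builds a p99 query that aggregates only by `le`, so it spans all namespaces and operations, and prints it as a config entry. The `metric` type is a local stand-in for the config's `metric_name`/`query` pair, and both the `globalP99` helper and the `_global` name suffix are hypothetical choices, not something this example ships.

```go
package main

import "fmt"

// metric mirrors the metric_name/query pair used in the config file.
// It is a local stand-in for illustration only.
type metric struct {
	MetricName string
	Query      string
}

// globalP99 is a hypothetical helper. It builds a p99 latency query that is
// aggregated only by `le`, i.e. across all namespaces and operations.
func globalP99(histogram string) metric {
	return metric{
		// The "_global" suffix is an illustrative naming choice.
		MetricName: fmt.Sprintf("%s:histogram_quantile_p99_1m_global", histogram),
		Query:      fmt.Sprintf("histogram_quantile(0.99, sum(rate(%s[1m])) by (le))", histogram),
	}
}

func main() {
	m := globalP99("temporal_cloud_v0_service_latency_bucket")
	// Print the entry in the same shape as the items under `metrics:`.
	fmt.Printf("- metric_name: %s\n  query: %s\n", m.MetricName, m.Query)
}
```

The printed entry can be appended under `metrics:` in the generated `config.yaml`.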
84 changes: 84 additions & 0 deletions cloud/observability/promql-to-scrape/cmd/genconfig/main.go
@@ -0,0 +1,84 @@
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"sort"

	"github.com/temporalio/samples-server/cloud/observability/promql-to-scrape/internal"

	"gopkg.in/yaml.v3"
)

func main() {
	set := flag.NewFlagSet("app", flag.ExitOnError)
	promURL := set.String("prom-endpoint", "", "Required Prometheus API endpoint for the server eg. https://<account>.tmprl.cloud/prometheus")
	serverRootCACert := set.String("server-root-ca-cert", "", "Optional path to root server CA cert")
	clientCert := set.String("client-cert", "", "Required path to client cert")
	clientKey := set.String("client-key", "", "Required path to client key")
	serverName := set.String("server-name", "", "Optional server name to use for verifying the server's certificate")
	insecureSkipVerify := set.Bool("insecure-skip-verify", false, "Skip verification of the server's certificate and host name")

	if err := set.Parse(os.Args[1:]); err != nil {
		log.Fatalf("failed parsing args: %s", err)
	} else if *promURL == "" || *clientCert == "" || *clientKey == "" {
		log.Fatalf("-prom-endpoint, -client-cert and -client-key are required")
	}

	client, err := internal.NewAPIClient(
		internal.APIConfig{
			TargetHost:         *promURL,
			ServerRootCACert:   *serverRootCACert,
			ClientCert:         *clientCert,
			ClientKey:          *clientKey,
			ServerName:         *serverName,
			InsecureSkipVerify: *insecureSkipVerify,
		},
	)
	if err != nil {
		log.Fatalf("Failed to create Prometheus client: %s", err)
	}

	counters, gauges, histograms, err := client.ListMetrics("temporal_cloud_v0")
	if err != nil {
		log.Fatalf("Failed to pull metric names: %s", err)
	}
	fmt.Println(counters)
	fmt.Println(gauges)
	fmt.Println(histograms)

	conf := internal.Config{}

	for _, counter := range counters {
		conf.Metrics = append(conf.Metrics, internal.Metric{
			MetricName: fmt.Sprintf("%s:rate1m", counter),
			Query:      fmt.Sprintf("rate(%s[1m])", counter),
		})
	}
	for _, gauge := range gauges {
		conf.Metrics = append(conf.Metrics, internal.Metric{
			MetricName: gauge,
			Query:      gauge,
		})
	}
	for _, histogram := range histograms {
		conf.Metrics = append(conf.Metrics, internal.Metric{
			MetricName: fmt.Sprintf("%s:histogram_quantile_p99_1m", histogram),
			Query:      fmt.Sprintf("histogram_quantile(0.99, sum(rate(%s[1m])) by (le, operation, temporal_namespace))", histogram),
		})
	}

	sort.Sort(internal.ByMetricName(conf.Metrics))

	yamlData, err := yaml.Marshal(&conf)
	if err != nil {
		log.Fatalf("error marshalling yaml: %v", err)
	}

	err = os.WriteFile("config.yaml", yamlData, 0644)
	if err != nil {
		log.Fatalf("error writing config.yaml: %v", err)
	}
}
59 changes: 59 additions & 0 deletions cloud/observability/promql-to-scrape/cmd/promql-to-scrape/main.go
@@ -0,0 +1,59 @@
package main

import (
	"flag"
	"log"
	"os"

	"github.com/temporalio/samples-server/cloud/observability/promql-to-scrape/internal"

	"golang.org/x/exp/slog"
)

func main() {
	set := flag.NewFlagSet("promql-to-scrape", flag.ExitOnError)
Suggestion: Could share the duplicate args between commands.

	promURL := set.String("prom-endpoint", "", "Required Prometheus API endpoint for the server eg. https://<account>.tmprl.cloud/prometheus")
	configFile := set.String("config-file", "", "Config file for promql-to-scrape")
	serverRootCACert := set.String("server-root-ca-cert", "", "Optional path to root server CA cert")
	clientCert := set.String("client-cert", "", "Required path to client cert")
	clientKey := set.String("client-key", "", "Required path to client key")
	serverName := set.String("server-name", "", "Optional server name to use for verifying the server's certificate")
	insecureSkipVerify := set.Bool("insecure-skip-verify", false, "Skip verification of the server's certificate and host name")
	serverAddr := set.String("bind", "0.0.0.0:9001", "address:port to expose the metrics server on")
	debugLogging := set.Bool("debug", false, "Toggle debug logging")

	if err := set.Parse(os.Args[1:]); err != nil {
		log.Fatalf("failed parsing args: %v", err)
	} else if *clientCert == "" || *clientKey == "" || *configFile == "" || *promURL == "" {
		log.Fatalf("-client-cert, -client-key, -config-file, -prom-endpoint are required")
	}

	logLevel := slog.LevelInfo
	if *debugLogging {
		logLevel = slog.LevelDebug
	}
	h := slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: logLevel})
	slog.SetDefault(slog.New(h))

	client, err := internal.NewAPIClient(
		internal.APIConfig{
			TargetHost:         *promURL,
			ServerRootCACert:   *serverRootCACert,
			ClientCert:         *clientCert,
			ClientKey:          *clientKey,
			ServerName:         *serverName,
			InsecureSkipVerify: *insecureSkipVerify,
		},
	)
	if err != nil {
		log.Fatalf("failed to create Prometheus client: %v", err)
	}

	conf, err := internal.LoadConfig(*configFile)
	if err != nil {
		log.Fatalf("failed to load config file: %v", err)
	}

	s := internal.NewPromToScrapeServer(client, conf, *serverAddr)
	s.Start()
}
43 changes: 43 additions & 0 deletions cloud/observability/promql-to-scrape/examples/config.yaml
@@ -0,0 +1,43 @@
metrics:
  - metric_name: temporal_cloud_v0_frontend_service_error_count:rate1m
    query: rate(temporal_cloud_v0_frontend_service_error_count[1m])
  - metric_name: temporal_cloud_v0_frontend_service_pending_requests
    query: temporal_cloud_v0_frontend_service_pending_requests
  - metric_name: temporal_cloud_v0_frontend_service_request_count:rate1m
    query: rate(temporal_cloud_v0_frontend_service_request_count[1m])
  - metric_name: temporal_cloud_v0_poll_success_count:rate1m
    query: rate(temporal_cloud_v0_poll_success_count[1m])
  - metric_name: temporal_cloud_v0_poll_success_sync_count:rate1m
    query: rate(temporal_cloud_v0_poll_success_sync_count[1m])
  - metric_name: temporal_cloud_v0_poll_timeout_count:rate1m
    query: rate(temporal_cloud_v0_poll_timeout_count[1m])
  - metric_name: temporal_cloud_v0_resource_exhausted_error_count:rate1m
    query: rate(temporal_cloud_v0_resource_exhausted_error_count[1m])
  - metric_name: temporal_cloud_v0_schedule_action_success_count:rate1m
    query: rate(temporal_cloud_v0_schedule_action_success_count[1m])
  - metric_name: temporal_cloud_v0_schedule_buffer_overruns_count:rate1m
    query: rate(temporal_cloud_v0_schedule_buffer_overruns_count[1m])
  - metric_name: temporal_cloud_v0_schedule_missed_catchup_window_count:rate1m
    query: rate(temporal_cloud_v0_schedule_missed_catchup_window_count[1m])
  - metric_name: temporal_cloud_v0_service_latency_bucket:histogram_quantile_p99_1m
    query: histogram_quantile(0.99, sum(rate(temporal_cloud_v0_service_latency_bucket[1m])) by (le, operation, temporal_namespace))
  - metric_name: temporal_cloud_v0_service_latency_count:rate1m
    query: rate(temporal_cloud_v0_service_latency_count[1m])
  - metric_name: temporal_cloud_v0_service_latency_sum:rate1m
    query: rate(temporal_cloud_v0_service_latency_sum[1m])
  - metric_name: temporal_cloud_v0_state_transition_count:rate1m
    query: rate(temporal_cloud_v0_state_transition_count[1m])
  - metric_name: temporal_cloud_v0_total_action_count:rate1m
    query: rate(temporal_cloud_v0_total_action_count[1m])
  - metric_name: temporal_cloud_v0_workflow_cancel_count:rate1m
    query: rate(temporal_cloud_v0_workflow_cancel_count[1m])
  - metric_name: temporal_cloud_v0_workflow_continued_as_new_count:rate1m
    query: rate(temporal_cloud_v0_workflow_continued_as_new_count[1m])
  - metric_name: temporal_cloud_v0_workflow_failed_count:rate1m
    query: rate(temporal_cloud_v0_workflow_failed_count[1m])
  - metric_name: temporal_cloud_v0_workflow_success_count:rate1m
    query: rate(temporal_cloud_v0_workflow_success_count[1m])
  - metric_name: temporal_cloud_v0_workflow_terminate_count:rate1m
    query: rate(temporal_cloud_v0_workflow_terminate_count[1m])
  - metric_name: temporal_cloud_v0_workflow_timeout_count:rate1m
    query: rate(temporal_cloud_v0_workflow_timeout_count[1m])
49 changes: 49 additions & 0 deletions cloud/observability/promql-to-scrape/examples/configmap.yaml
@@ -0,0 +1,49 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: promql-to-scrape-config
data:
  config.yaml: |
    metrics:
      - metric_name: temporal_cloud_v0_frontend_service_error_count:rate1m
        query: rate(temporal_cloud_v0_frontend_service_error_count[1m])
      - metric_name: temporal_cloud_v0_frontend_service_pending_requests
        query: temporal_cloud_v0_frontend_service_pending_requests
      - metric_name: temporal_cloud_v0_frontend_service_request_count:rate1m
        query: rate(temporal_cloud_v0_frontend_service_request_count[1m])
      - metric_name: temporal_cloud_v0_poll_success_count:rate1m
        query: rate(temporal_cloud_v0_poll_success_count[1m])
      - metric_name: temporal_cloud_v0_poll_success_sync_count:rate1m
        query: rate(temporal_cloud_v0_poll_success_sync_count[1m])
      - metric_name: temporal_cloud_v0_poll_timeout_count:rate1m
        query: rate(temporal_cloud_v0_poll_timeout_count[1m])
      - metric_name: temporal_cloud_v0_resource_exhausted_error_count:rate1m
        query: rate(temporal_cloud_v0_resource_exhausted_error_count[1m])
      - metric_name: temporal_cloud_v0_schedule_action_success_count:rate1m
        query: rate(temporal_cloud_v0_schedule_action_success_count[1m])
      - metric_name: temporal_cloud_v0_schedule_buffer_overruns_count:rate1m
        query: rate(temporal_cloud_v0_schedule_buffer_overruns_count[1m])
      - metric_name: temporal_cloud_v0_schedule_missed_catchup_window_count:rate1m
        query: rate(temporal_cloud_v0_schedule_missed_catchup_window_count[1m])
      - metric_name: temporal_cloud_v0_service_latency_bucket:histogram_quantile_p99_1m
        query: histogram_quantile(0.99, sum(rate(temporal_cloud_v0_service_latency_bucket[1m])) by (le, operation, temporal_namespace))
      - metric_name: temporal_cloud_v0_service_latency_count:rate1m
        query: rate(temporal_cloud_v0_service_latency_count[1m])
      - metric_name: temporal_cloud_v0_service_latency_sum:rate1m
        query: rate(temporal_cloud_v0_service_latency_sum[1m])
      - metric_name: temporal_cloud_v0_state_transition_count:rate1m
        query: rate(temporal_cloud_v0_state_transition_count[1m])
      - metric_name: temporal_cloud_v0_total_action_count:rate1m
        query: rate(temporal_cloud_v0_total_action_count[1m])
      - metric_name: temporal_cloud_v0_workflow_cancel_count:rate1m
        query: rate(temporal_cloud_v0_workflow_cancel_count[1m])
      - metric_name: temporal_cloud_v0_workflow_continued_as_new_count:rate1m
        query: rate(temporal_cloud_v0_workflow_continued_as_new_count[1m])
      - metric_name: temporal_cloud_v0_workflow_failed_count:rate1m
        query: rate(temporal_cloud_v0_workflow_failed_count[1m])
      - metric_name: temporal_cloud_v0_workflow_success_count:rate1m
        query: rate(temporal_cloud_v0_workflow_success_count[1m])
      - metric_name: temporal_cloud_v0_workflow_terminate_count:rate1m
        query: rate(temporal_cloud_v0_workflow_terminate_count[1m])
      - metric_name: temporal_cloud_v0_workflow_timeout_count:rate1m
        query: rate(temporal_cloud_v0_workflow_timeout_count[1m])
47 changes: 47 additions & 0 deletions cloud/observability/promql-to-scrape/examples/deployment.yaml
@@ -0,0 +1,47 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: promql-to-scrape
  labels:
    app: promql-to-scrape
spec:
  replicas: 1
  selector:
    matchLabels:
      app: promql-to-scrape
  template:
    metadata:
      labels:
        app: promql-to-scrape
    spec:
      containers:
        - name: promql-to-scrape
          image: ghcr.io/temporalio/promql-to-scrape:7c0e91a
          args:
            - --client-cert=/var/run/secrets/ca_crt
            - --client-key=/var/run/secrets/ca_key
            - --prom-endpoint=https://<account>.tmprl.cloud/prometheus
            - --config-file=/etc/promql-to-scrape/config.yaml
            - --debug
          ports:
            - containerPort: 9001
          volumeMounts:
            - name: secrets
              mountPath: /var/run/secrets
              readOnly: true
            - name: config-volume
              mountPath: /etc/promql-to-scrape
          resources:
            limits:
              cpu: "100m"
              memory: "256Mi"
      volumes:
        - name: secrets
          secret:
            secretName: promql-to-scrape-secrets
        - name: config-volume
          configMap:
            name: promql-to-scrape-config
            items:
              - key: config.yaml
                path: config.yaml
10 changes: 10 additions & 0 deletions cloud/observability/promql-to-scrape/examples/secret.yaml
@@ -0,0 +1,10 @@
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: promql-to-scrape-secrets
  labels:
    app: promql-to-scrape
data:
  ca_crt: "<cert | base64>"
  ca_key: "<key | base64>"
17 changes: 17 additions & 0 deletions cloud/observability/promql-to-scrape/go.mod
@@ -0,0 +1,17 @@
module github.com/temporalio/samples-server/cloud/observability/promql-to-scrape

go 1.21

require (
	github.com/prometheus/client_golang v1.17.0
	github.com/prometheus/common v0.45.0
	golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa
	gopkg.in/yaml.v3 v3.0.1
)

require (
	github.com/json-iterator/go v1.1.12 // indirect
	github.com/kr/text v0.2.0 // indirect
	github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
	github.com/modern-go/reflect2 v1.0.2 // indirect
)