Setup monitoring #275

Open
41 of 51 tasks
mvgijssel opened this issue May 21, 2023 · 23 comments
@mvgijssel
Member

mvgijssel commented May 21, 2023

Use a SaaS offering to notify when the provisioner is not behaving as expected. Try

Setup notifications with PagerDuty.

Interesting setup using Datadog synthetic tests in the CI https://www.datadoghq.com/blog/run-synthetic-tests-in-circeci-pipelines-with-datadog/

Try to use OpenTelemetry inside pytest so it’s easy to switch vendors. Teleport has support for this: https://goteleport.com/docs/management/diagnostics/tracing/.

This is a great list of vendors: https://github.com/magsther/awesome-opentelemetry#ui

TODO

  • Setup SaaS for monitoring, tracing and logs
  • Setup provisioner system monitoring
  • Setup teleport health check
  • Setup teleport alert when health check fails
  • Resolve alert automatically
  • Send notification when alert fires
  • Setup New Relic agent
  • Use Renovate bazel modules
  • Remove logz io telegraf setup
  • Setup github exporter to track deployment metrics
  • Connect github exporter to New Relic metrics
  • Fix broken master
  • Update renovate to also capture docker-compose.yml.j2
  • Replace teleport connection test with deploy_test
  • Use 1Password for deploy identity
  • Add tests for Docker
  • Add tests for new relic agent
  • Add tests for new relic container
  • Add test for teleport health
  • Remove provisioner telegraf code
  • Remove secrets from 1Password Vault
  • Remove logz io account
  • Use 1Password on GitHub actions instead of BuildBuddy secret
  • Create BuildBuddy protobuf client
  • Run //provisioner:validate against production
  • Setup testinfra tracing
  • Run //provisioner:validate on a schedule
  • Setup alert when provisioner validation fails
  • Setup alert when master branch fails
  • Add workspace.bzlmod to renovatebot
  • Setup schedule for renovatebot
  • Setup deployment marker for New Relic
  • Setup Teleport tracing in New Relic
  • Setup resource constraints docker compose files
  • Track CPU temperature of the Raspberry Pi (https://github.com/lukasmalkmus/rpi_exporter)
  • Remove buildbuddy grpc client
  • Reply to BuildBuddy community about invocation api (https://buildbuddy.slack.com/archives/CUY16GNK1/p1686226616704259?thread_ts=1685995570.849599&cid=CUY16GNK1)
  • Setup cronjob for regular provisioner reboots when necessary (stop this one during deployment?)
  • Setup cronjob for regular docker system prune
  • Forward all docker container logs to New Relic
  • Forward all system logs from provisioner to New Relic
  • Add tests for New Relic log forwarding
  • Fix cron setup by fixing teleport pyinfra connector
  • Connect teleport service to teleport logs
  • Setup latency/success SLIs for Teleport service
  • Setup cpu temperature alert for Raspberry Pi
  • Update bootstrap doc
  • Ensure provisioner-validate ci timeout is higher
  • Ensure provisioner-validate has a per-test timeout so the total test suite does not time out and prevent metrics from being submitted
  • Deal with the case when the provisioner is offline: error when data/metrics don’t come in?
  • Teleport health issue not triggered to PagerDuty?
@mvgijssel mvgijssel self-assigned this May 21, 2023
@mvgijssel mvgijssel converted this from a draft issue May 21, 2023
@mvgijssel
Member Author

For the Teleport health check we can use the Telegraf http_response input!
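
As a rough illustration of what such a check measures: the dashboard query further down treats `http_response_result_code` as healthy when it is 0. The Python sketch below mimics that shape with the standard library only. The function name `probe`, the returned keys, and the non-zero code values are assumptions made for this example, not Telegraf's actual implementation.

```python
import socket
import time
import urllib.error
import urllib.request

# Only "0 means success" mirrors the metric used in this issue; the
# non-zero values below are made up for the sketch.
SUCCESS = 0
CONNECTION_FAILED = 1
TIMEOUT = 2
BAD_STATUS = 3


def probe(url: str, timeout: float = 5.0) -> dict:
    """Send one GET request and summarise the outcome as a flat dict."""
    started = time.monotonic()
    status = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
        code = SUCCESS
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        status = exc.code
        code = BAD_STATUS
    except urllib.error.URLError as exc:
        # Connection-level failure; a timeout while connecting lands here too.
        timed_out = isinstance(exc.reason, socket.timeout)
        code = TIMEOUT if timed_out else CONNECTION_FAILED
    except socket.timeout:
        # Timeout while reading the response.
        code = TIMEOUT
    except OSError:
        code = CONNECTION_FAILED
    return {
        "result_code": code,
        "http_status": status,
        "response_time": time.monotonic() - started,
    }
```

Calling `probe("http://localhost:3000/healthz")` on the provisioner would then return `result_code` 0 only while the health endpoint answers with a 2xx status.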

@mvgijssel
Member Author

Setup basic health dashboard for Teleport

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 333761,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Fn0r6zw4z"
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 4,
      "options": {
        "alertInstanceLabelFilter": "",
        "alertName": "",
        "dashboardAlerts": false,
        "groupBy": [],
        "groupMode": "default",
        "maxItems": 20,
        "sortOrder": 1,
        "stateFilter": {
          "error": true,
          "firing": true,
          "inactive": true,
          "noData": true,
          "normal": true,
          "pending": true
        }
      },
      "title": "Alerts",
      "type": "alertlist"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "Fn0r6zw4z"
      },
      "fieldConfig": {
        "defaults": {
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 1
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "textMode": "auto"
      },
      "pluginVersion": "8.5.1",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "Fn0r6zw4z"
          },
          "editorMode": "builder",
          "expr": "min(http_response_result_code{host=\"provisioner\", server=\"http://localhost:3000/healthz\"})",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Teleport Health Code",
      "type": "stat"
    }
  ],
  "schemaVersion": 36,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-30m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "System Health",
  "uid": "EmoPWGwVk",
  "version": 4,
  "weekStart": ""
}

@mvgijssel
Member Author

Asking in the InfluxDB Community why there is overlap in some of the metrics https://influxcommunity.slack.com/archives/CH99HUH8V/p1684753005874559

@mvgijssel
Member Author

Also filed a support ticket with Logz.io: https://support.logz.io/hc/en-us/requests/60657

@mvgijssel
Member Author

From https://groups.google.com/g/prometheus-users/c/JcV51GNnXNM

The current staleness handling means that the time series will still be returned by instant vectors for 5 minutes. I'd suggest putting the run number as the value of a single timeseries.
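
A small sketch of the difference the quoted advice is pointing at, rendered in the Prometheus text exposition format. The metric names `ci_run_info` and `ci_run_number` are hypothetical and only serve the example.

```python
def render_per_run_series(run_number: int) -> str:
    """Run number as a label: every run creates a brand-new time series."""
    return f'ci_run_info{{run="{run_number}"}} 1\n'


def render_single_series(run_number: int) -> str:
    """Run number as the sample value: one series, updated on every run."""
    return f"ci_run_number {run_number}\n"


def series_identity(line: str) -> str:
    """Everything before the sample value identifies the time series."""
    return line.rsplit(" ", 1)[0]


runs = [101, 102, 103]
per_run = {series_identity(render_per_run_series(n)) for n in runs}
single = {series_identity(render_single_series(n)) for n in runs}
print(len(per_run), "series with the label form")
print(len(single), "series with the value form")
```

With the label form, the series of earlier runs are still returned by instant queries during the staleness window, which is one way overlapping results can show up. With the value form there is only ever one series to query.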

This was referenced May 22, 2023
@mvgijssel
Member Author

mvgijssel commented May 28, 2023

New Relic seems to have a good offering as well and supports the New Relic Gate https://docs.newrelic.com/whats-new/2023/04/whats-new-04-20-github-integration/ to protect deployments.

Migration:

@mvgijssel
Member Author

Asking in promhippie/github_exporter#213 how to interpret the data from the GitHub exporter in order to set up an alert in New Relic when a workflow fails.

@tboerger

tboerger commented Jun 1, 2023

Since you are mentioning promhippie/github_exporter just for actions metrics, this exporter can also provide various metrics generally for your GitHub orgs and repos :)

@mvgijssel
Member Author

Created SLOs in New Relic for provisioner deployment and validation
image

@mvgijssel
Member Author

Uninstalled microk8s in the provisioner and checking if this is picked up by New Relic and PagerDuty!

@mvgijssel
Member Author

Works! Got a page from PagerDuty once the invalid provisioner state was detected
image

@mvgijssel
Member Author

Trying to set up a sub-account for dev/test doesn't work, because New Relic is on the free tier:

image

@mvgijssel
Member Author

Seems snap is broken at the moment https://status.snapcraft.io/ so unable to finish deploy and restore monitors 😅

@mvgijssel
Member Author

Tuned the SLO targets to be (a lot more) lenient. Because the traffic for the SLIs is low, setting the target to 99% correct leaves a very small error budget: a handful of failures is enough to exhaust it. Trying with these new numbers
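
To make the error-budget point concrete: the number of bad events a target tolerates is the event count times (1 - target), rounded down. The event counts below are hypothetical and not taken from the New Relic data.

```python
from fractions import Fraction


def allowed_failures(total_events: int, target: float) -> int:
    """Bad events tolerated in the window before the target is missed."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    # Fraction(str(...)) keeps 0.99 exact and avoids float rounding surprises.
    budget = total_events * (1 - Fraction(str(target)))
    return int(budget)  # truncates, which rounds down for non-negative values


for total in (50, 1000):
    for target in (0.99, 0.9):
        print(total, "events at", target, "->", allowed_failures(total, target))
```

With only 50 events in the window, a 99% target tolerates zero failures, while a 90% target tolerates five. That is why low-traffic SLIs need a more lenient target to be usable for alerting.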

image

@mvgijssel
Member Author

Maybe update the testinfra and pyinfra Teleport clients to use ssh directly and proxy through Teleport. Generating the config doesn’t work, but maybe we can generate the connection string as well?

@mvgijssel
Member Author

Can use SSH multiplexing if using SSH directly to speed up subsequent connections for both pyinfra and testinfra!
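
A sketch of how a Python client could assemble an ssh command line with multiplexing enabled. `ControlMaster`, `ControlPath` and `ControlPersist` are standard OpenSSH client options; the helper name `multiplexed_ssh_argv`, the default socket path, and the persist time are assumptions for this example. Note that `ControlMaster=auto` only has an effect when a `ControlPath` is set, either here or in the ssh_config file.

```python
def multiplexed_ssh_argv(
    target: str,
    ssh_config: str,
    command: list,
    control_path: str = "~/.ssh/cm-%C",
    persist_seconds: int = 60,
) -> list:
    """Return an argv list for ssh that reuses one master connection."""
    return [
        "ssh",
        "-F", ssh_config,
        # Reuse an existing master connection, or become the master.
        "-o", "ControlMaster=auto",
        # Socket used to share the connection; %C is a hash of the
        # connection parameters, expanded by ssh itself.
        "-o", f"ControlPath={control_path}",
        # Keep the master open for a while after the last session exits.
        "-o", f"ControlPersist={persist_seconds}",
        target,
        *command,
    ]


argv = multiplexed_ssh_argv(
    "ubuntu@provisioner.provisioner", "tmp/ssh_config", ["exit", "0"]
)
print(" ".join(argv))
```

The list can be handed to `subprocess.run(argv)` as-is; only the first call pays the full connection setup cost.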

@mvgijssel
Member Author

devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=auto' -F tmp/ssh_config  ubuntu@provisioner.provisioner exit 0

real    0m0.396s
user    0m0.006s
sys     0m0.013s
devcontainer@d42ed4873fd2:/workspaces/setup$ time ssh -o 'ControlMaster=no' -F tmp/ssh_config  ubuntu@provisioner.provisioner exit 0

real    0m0.734s
user    0m0.016s
sys     0m0.017s
devcontainer@d42ed4873fd2:/workspaces/setup$ time tsh ssh ubuntu@provisioner exit 0

real    0m0.950s
user    0m0.257s
sys     0m0.224s

So using the SSH client with multiplexing enabled is roughly 2.4x faster than using the tsh ssh command (0.396s vs 0.950s), and about 1.9x faster than plain ssh without multiplexing (0.734s).
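
The ratios between the three measurements can be checked with a few lines of Python. The numbers are the "real" times from the transcript above; each is a single run, so they are indicative rather than a benchmark.

```python
# Wall-clock ("real") seconds copied from the transcript above.
TIMINGS = {
    "ssh, ControlMaster=auto": 0.396,
    "ssh, ControlMaster=no": 0.734,
    "tsh ssh": 0.950,
}


def speedup(slow: float, fast: float) -> float:
    """How many times faster `fast` is than `slow`."""
    return slow / fast


baseline = TIMINGS["tsh ssh"]
for name, seconds in TIMINGS.items():
    print(f"{name}: {speedup(baseline, seconds):.1f}x vs tsh ssh")
```

This prints 2.4x for multiplexed ssh, 1.3x for plain ssh, and 1.0x for tsh ssh itself.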

@mvgijssel
Member Author

Can use Paramiko to parse the ssh_config file from Teleport https://snyk.io/advisor/python/paramiko/functions/paramiko.SSHConfig
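
A minimal sketch of that lookup, assuming Paramiko 2.7 or newer is installed (for `SSHConfig.from_text`). The helper name `lookup_host` and the sample config are hypothetical; the sample is not real Teleport output, it only shows the kind of options one might read back.

```python
def lookup_host(ssh_config_text: str, hostname: str) -> dict:
    """Resolve the effective ssh options for one host from config text."""
    # Imported lazily so that only callers of this helper need Paramiko.
    import paramiko

    config = paramiko.SSHConfig.from_text(ssh_config_text)
    options = config.lookup(hostname)  # keys are lower-cased option names
    return {
        "hostname": options["hostname"],
        "user": options.get("user"),
        "port": int(options.get("port", 22)),
        "identity_files": list(options.get("identityfile", [])),
    }


# Hypothetical sample, not generated by Teleport.
SAMPLE_CONFIG = """\
Host *.provisioner
    User ubuntu
    Port 3022
    IdentityFile /tmp/identity
"""
```

For a file on disk, `paramiko.SSHConfig.from_path("tmp/ssh_config")` can replace the `from_text` call.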

@mvgijssel
Member Author

Unsure how to generate the SSH config file for the identity file though, see gravitational/teleport#27659

@mvgijssel
Member Author

Update secrets macro to

secrets({
  "FOO": "bar",
  "/tmp/secret": "filesecret",
  "./rel/secret": "relative file secret",
})
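
The macro's semantics are not spelled out here, so the following is only one plausible reading, sketched in Python: keys that look like paths are written to files, and all other keys become environment variables. The function name `split_secrets` and the path rule are assumptions for this example.

```python
def split_secrets(secrets: dict) -> tuple:
    """Split a secrets mapping into (environment variables, file secrets)."""
    env, files = {}, {}
    for key, value in secrets.items():
        # Assumption: a key starting with "/", "./" or "../" is a file path.
        if key.startswith(("/", "./", "../")):
            files[key] = value
        else:
            env[key] = value
    return env, files


env, files = split_secrets({
    "FOO": "bar",
    "/tmp/secret": "filesecret",
    "./rel/secret": "relative file secret",
})
print("env:", env)
print("files:", files)
```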
