
Agent gets unhealthy temporarily because Beat monitoring sockets are not available #5332

Closed
amolnater-qasource opened this issue Aug 21, 2024 · 14 comments · Fixed by #5999, elastic/beats#41570 or #6160
Assignees
Labels
bug Something isn't working impact:medium QA:Ready For Testing Code is merged and ready for QA to validate Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 77469
COMMIT: 1f1a3594615afceda26308abf163497114b5858f

Preconditions:

  1. 8.16.0 SNAPSHOT Kibana cloud environment should be available.
  2. Remote Elasticsearch Output should be added.
  3. Agent should be installed with a policy that has the System integration.

Steps to reproduce:

  1. Navigate to the Agent policies tab.
  2. Update the output for the System integration and set it to Remote Elasticsearch.
  3. Observe that the agent becomes unhealthy.
  4. Observe, however, that data is still available on the remote ES cluster.

NOTE:

  • No impact on data is observed.
  • The issue is consistently reproducible on two different agents.

Expected Result:
The agent should remain healthy when switching the integration output to Remote ES.

Logs:
elastic-agent-diagnostics-2024-08-21T09-40-35Z-00.zip

Screen Recording:

ip-172-31-18-23.-.Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-08-21.15-09-58.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2024-08-21.15-11-25.mp4

Feature:
elastic/kibana#143905

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team impact:medium labels Aug 21, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@karanbirsingh-qasource Please review.

@ghost

ghost commented Aug 21, 2024

Secondary review of this ticket is done.

@pierrehilbert
Contributor

Are we still unhealthy after restarting the Agent?
Do we see the same behavior when directly setting up the Remote Elasticsearch output instead of switching to it?

@amolnater-qasource
Author

Hi @pierrehilbert
Thank you for looking into this.

Are we still unhealthy after restarting the Agent?

  • No, the agent returns to healthy after some time.
  • On restart, the agent immediately returns to Healthy as well.

Do we see the same behavior when directly setting up the Remote Elasticsearch output instead of switching to it?

We revalidated by setting the Remote Elasticsearch output first and then installing the agent.
This works fine, and the agent remains Healthy throughout.

Agent Logs:
elastic-agent-diagnostics-2024-08-28T09-15-50Z-00.zip

Please let us know if anything else is required from our end.
Thanks!!

@cmacknz
Member

cmacknz commented Sep 3, 2024

Looking at the logs, this doesn't have anything to do with remote ES; it is caused by the monitoring endpoints of the Beats not being available fast enough:

logs/elastic-agent-8.16.0-SNAPSHOT-063287/elastic-agent-20240821.ndjson
3278:{"log.level":"warn","@timestamp":"2024-08-21T09:36:49.086Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (HEALTHY->DEGRADED): Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /opt/Elastic/Agent/data/tmp/YnNOCxAwgihP9s-7dwAEjXjR-siLP2_L.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"DEGRADED","old_state":"HEALTHY"},"ecs.version":"1.6.0"}
3279:{"log.level":"warn","@timestamp":"2024-08-21T09:36:49.086Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->DEGRADED): Error fetching data for metricset http.json: error making http request: Get \"http://unix/stats\": dial unix /opt/Elastic/Agent/data/tmp/DaWi_h1Ho3tUlAynCECuHtgXQYuKcWEJ.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"DEGRADED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
3338:{"log.level":"info","@timestamp":"2024-08-21T09:36:49.651Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (DEGRADED->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"CONFIGURING","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
3352:{"log.level":"info","@timestamp":"2024-08-21T09:36:49.960Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (DEGRADED->CONFIGURING): Configuring","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"CONFIGURING","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
3366:{"log.level":"warn","@timestamp":"2024-08-21T09:36:50.657Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (CONFIGURING->DEGRADED): Error fetching data for metricset beat.stats: error making http request: Get \"http://unix/stats\": dial unix /opt/Elastic/Agent/data/tmp/YnNOCxAwgihP9s-7dwAEjXjR-siLP2_L.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"DEGRADED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
3367:{"log.level":"warn","@timestamp":"2024-08-21T09:36:50.671Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (CONFIGURING->DEGRADED): Error fetching data for metricset http.json: error making http request: Get \"http://unix/inputs\": dial unix /opt/Elastic/Agent/data/tmp/YnNOCxAwgihP9s-7dwAEjXjR-siLP2_L.sock: connect: no such file or directory","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metrics-monitoring-agent","type":"input","state":"DEGRADED","old_state":"CONFIGURING"},"ecs.version":"1.6.0"}
3405:{"log.level":"info","@timestamp":"2024-08-21T09:36:59.352Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed beat/metrics-monitoring-metrics-monitoring-beats (DEGRADED->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"beat/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"beat/metrics-monitoring-metrics-monitoring-beats","type":"input","state":"HEALTHY","old_state":"DEGRADED"},"ecs.version":"1.6.0"}
3406:{"log.level":"info","@timestamp":"2024-08-21T09:36:59.410Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":663},"message":"Unit state changed http/metrics-monitoring-metrics-monitoring-agent (DEGRADED->HEALTHY): Healthy","log":{"source":"elastic-agent"},"component":{"id":"http/metrics-monitoring","state":"HEALTHY"},"unit":{"id":"http/metrics-monitoring-metri

@ycombinator
Contributor

is caused by the monitoring endpoints of the Beats not being available fast enough:

@cmacknz is there any work to be done here or is this normal, expected behavior?

@cmacknz
Member

cmacknz commented Oct 21, 2024

This behavior is currently causing flakiness in our integration tests, so I think we should address it.

This is the first example I've seen outside of our tests, which means any fix can't be in our test infrastructure; it has to go in the agent itself.

I think the rate at which we poll for this endpoint to exist is the new 60s metrics interval, so you may have to poll the agent health for over a minute to see if it recovers.

A fix might look like retrying the connection to this socket faster in the agent. It is also possible Beat startup has gotten slower.
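
For illustration only, a minimal sketch of what "retry the socket connection faster" could look like on the agent side; the package name, the waitForSocket helper, and the polling cadence are assumptions, not existing elastic-agent code:

```go
// Hypothetical sketch: poll a Beat's monitoring unix socket until it is
// dialable, instead of waiting for the next 60s metrics collection to notice
// that it now exists.
package monitoring

import (
	"context"
	"net"
	"time"
)

// waitForSocket retries dialing the unix socket at socketPath every interval
// until it succeeds or ctx is cancelled. Returns nil once a dial succeeds.
func waitForSocket(ctx context.Context, socketPath string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		conn, err := net.DialTimeout("unix", socketPath, interval)
		if err == nil {
			conn.Close()
			return nil // socket is up; monitoring can proceed
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // give up; caller can report degraded
		case <-ticker.C:
		}
	}
}
```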

@ycombinator
Contributor

Added it to the current sprint.

@cmacknz cmacknz changed the title Agent gets unhealthy temporarily on switching integration output to Remote ES. Agent gets unhealthy temporarily because Beat monitoring sockets are not available Oct 21, 2024
@pchila
Member

pchila commented Nov 6, 2024

Had a quick look around the code of elastic-agent and beats to try and sum up the issue:

  • @cmacknz is right in saying that the degraded state comes from the http endpoints not being available because of a restart due to the output change; the involved snippets of code are:
    • elastic-agent
      componentListWithMonitoring := map[string]string{
          fmt.Sprintf("beat/%s", monitoringMetricsUnitID): "metricbeat",
          fmt.Sprintf("http/%s", monitoringMetricsUnitID): "metricbeat",
          monitoringFilesUnitsID:                          "filebeat",
      }
      for k, v := range componentIDToBinary {
          componentListWithMonitoring[k] = v
      }
      for unit, binaryName := range componentListWithMonitoring {
          if !isSupportedMetricsBinary(binaryName) {
              continue
          }
          endpoints := []interface{}{prefixedEndpoint(utils.SocketURLWithFallback(unit, paths.TempDir()))}
          name := strings.ReplaceAll(strings.ReplaceAll(binaryName, "-", "_"), "/", "_") // conform with index naming policy
          if isSupportedBeatsBinary(binaryName) {
              beatsStreams = append(beatsStreams, map[string]interface{}{
                  idKey: fmt.Sprintf("%s-", monitoringMetricsUnitID) + name,
                  "data_stream": map[string]interface{}{
                      "type":      "metrics",
                      "dataset":   fmt.Sprintf("elastic_agent.%s", name),
                      "namespace": monitoringNamespace,
                  },
                  "metricsets": []interface{}{"stats"},
                  "hosts":      endpoints,
                  "period":     metricsCollectionIntervalString,
                  "index":      fmt.Sprintf("metrics-elastic_agent.%s-%s", name, monitoringNamespace),
    • beats https://github.com/elastic/beats/blob/1f033c9eed2ccd5b5acca830b47ec3c2261ef026/metricbeat/mb/module/wrapper.go#L254-L269
      So metricbeat sets itself to degraded and will reset to healthy only on the next metrics fetch (which happens by default after 60s)
  • Changing the interval between metrics is already possible via configuration in the monitoring config
    if metricsPeriod, found := monitoringMap[monitoringMetricsPeriodKey]; found {
        if metricsPeriodStr, ok := metricsPeriod.(string); ok {
            metricsCollectionIntervalString = metricsPeriodStr
        }
    }

At this point I see a few options:

  • Reduce the default interval for metrics collection: we would lose some or all of the savings implemented with Increase metrics interval to 60s #3578
  • if the issue is only in tests: a simple monitoring config override is the best bet, as it doesn't impact anyone else and doesn't require changes to production code
  • make the state change based on the metricset fetch outcome configurable and disable it for the monitoring inputs, so that the agent reports the monitoring Beats as degraded only when they miss check-ins
  • Implement a retry mechanism in metricbeat for failed metricsets (see the sketch at the end of this comment):
    1. Retry n times within a reasonable span of time (every <configurable number> seconds, up to some deadline shorter than the normal metric interval)
    2. Allow metricset fetches to reschedule themselves with a shorter interval in case of error: this would impact the regular sampling currently used in metricbeat, with consequences for aggregations that assume regular spacing between samples

@ycombinator @pierrehilbert @jlind23 @cmacknz what do you think would be an acceptable solution for this?
Can we sacrifice some of the ES space saved by slowing down metricset ingestion in favor of better responsiveness?
Is it ok to introduce some variance in the cadence of metricset ingestion in case of errors?
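
For illustration, a rough sketch of what the retry option (4.1 above) could look like inside metricbeat; the fetchWithRetry helper, its parameters, and where it would be wired in are assumptions for discussion, not existing beats code:

```go
// Hypothetical sketch of option 4.1: retry a failed metricset fetch a few
// times with a short delay, staying well under the normal 60s period, before
// surfacing the error (and the DEGRADED state) to the Elastic Agent.
package module

import (
	"context"
	"time"
)

// fetchWithRetry runs fetch up to attempts times, waiting delay between
// failures. The last error is returned only if every attempt fails.
func fetchWithRetry(ctx context.Context, fetch func() error, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fetch(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no need to wait after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
	}
	return err // caller reports DEGRADED only after all retries failed
}
```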

@pchila
Member

pchila commented Nov 6, 2024

A very valid suggestion from @pkoutsovasilis that can be added to the list of options:
we could also configure the number of failures before metricbeat sets itself as DEGRADED (for example 2-3 failures in a row for our monitoring inputs) so that we can avoid showing DEGRADED status for transient errors
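
For illustration, a rough sketch of how such a consecutive-failure threshold could work; the failureTracker type and the threshold value are assumptions, not the actual beats implementation:

```go
// Hypothetical sketch: only report DEGRADED after N consecutive fetch
// failures, so a single transient error (e.g. the monitoring socket not
// existing yet) does not flip the monitoring input's status.
package module

import "sync"

type failureTracker struct {
	mu        sync.Mutex
	threshold int // e.g. 2-3 consecutive failures for the agent's monitoring inputs
	failures  int
}

// recordResult updates the consecutive-failure count and returns true when
// the caller should report DEGRADED.
func (t *failureTracker) recordResult(err error) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if err == nil {
		t.failures = 0
		return false
	}
	t.failures++
	return t.failures >= t.threshold
}
```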

@cmacknz
Member

cmacknz commented Nov 6, 2024

if the issue is only in tests:

This bug report is an instance of this happening outside of our tests; our tests are just built to detect it easily, and I think the problem is ultimately transient.

Can we sacrifice some of the ES space saved by slowing the metricsets ingestion in ES in favor of better responsiveness ?

No, the space savings are cost savings in the end.

we could also configure the number of failures before metricbeat sets itself as DEGRADED (for example 2-3 failures in a row for our monitoring inputs) so that we can avoid showing DEGRADED status for transient errors

I like this; it seems simple and works as long as this problem is self-correcting.

@blakerouse
Contributor

Or another correct solution would be to stop restarting the beats when the output changes ;-)

@amolnater-qasource amolnater-qasource added the QA:Ready For Testing Code is merged and ready for QA to validate label Nov 20, 2024
@cmacknz
Member

cmacknz commented Nov 26, 2024

Looks like this is still happening #6049 (comment)
