
[Fleet]: On making changes to Elastic Defend, Endpoints get unhealthy. #5754

Closed
harshitgupta-qasource opened this issue Oct 10, 2024 · 15 comments
Labels
  • bug: Something isn't working
  • impact:high: Short-term priority; add to current release, or definitely next.
  • QA:Validated: Validated by the QA Team
  • Team:Elastic-Agent-Control-Plane: Label for the Agent Control Plane team

Comments

@harshitgupta-qasource

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 78993
COMMIT: 6eb8471c3124046eca03cccf20e0cc4f9706bcd5

Artifact: https://snapshots.elastic.co/8.16.0-106cdbc2/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip

Host: Windows 10 - Test Signing ON

Preconditions:

  1. 8.16.0 SNAPSHOT Cloud environment should be available.
  2. Agent should be installed.

Steps to reproduce:

  1. Navigate to the Fleet > Agents tab.
  2. Select any agent policy.
  3. Click on Add integration.
  4. Select the Elastic Defend integration and add it to multiple agent policies.
  5. Wait for 10-15 minutes.
  6. Observe that, on adding Elastic Defend as a shared integration, Endpoints become unhealthy.

Expected Result:
On adding Elastic Defend as a shared integration, Endpoints should remain healthy.

Note:

  • No issue is observed when Elastic Defend is added to only a single agent policy.

Screenshots: (attached)

Agent logs:
elastic-agent-diagnostics-2024-10-10T09-41-16Z-00.zip

@harshitgupta-qasource added the bug, impact:high, and Team:Elastic-Agent-Control-Plane labels on Oct 10, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@harshitgupta-qasource
Author

@amolnater-qasource Kindly review

@amolnater-qasource

Secondary Review for this ticket is Done.

@pierrehilbert
Contributor

@harshitgupta-qasource could you please confirm that this is not happening when the integration is added to only one policy?

@harshitgupta-qasource
Author

Hi @pierrehilbert

Thank you for looking into this issue.

We attempted to reproduce this issue by adding Elastic Defend to a single agent policy on the latest 8.16.0 SNAPSHOT Kibana cloud environment and found it not reproducible.
Observation

  • On adding Elastic Defend to a single agent policy, the agent remains healthy throughout.

Screenshot: (attached)

Build details:
VERSION: 8.16.0-SNAPSHOT
BUILD: 79128
COMMIT: d7556c5782195e1a8526c4b52a976597e32ba242
ARTIFACT: https://snapshots.elastic.co/8.16.0-f1fafc82/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip

Agent logs:
elastic-agent-diagnostics-2024-10-14T07-12-37Z-00.zip

Please let us know if anything else is required from our end.
Thank you

@pierrehilbert
Contributor

@kpollich do you know what we are doing differently when applying an integration policy to two agent policies?
@nfritts do you know, from an Endpoint perspective, what could cause this kind of issue?

@intxgo
Contributor

intxgo commented Oct 14, 2024

No issue can be found on the Endpoint side from the attached diagnostics.

The endpoint config elastic-endpoint.yaml output setting looks OK:

output:
  elasticsearch:
    api_key: <REDACTED>
    hosts:
    - <REDACTED>
    preset: balanced
    type: elasticsearch

The endpoint policy response policy_response.json indicates success:

    {
        "message": "Successfully configured output connection",
        "name": "configure_output",
        "status": "success"
    },
    {
        "message": "Successfully connected to Agent",
        "name": "agent_connectivity",
        "status": "success"
    },
    {
        "message": "Successfully executed all workflows",
        "name": "workflow",
        "status": "success"
    }

P.S. Too bad the failed configure_output entry was not expanded in the screenshot, but the policy ID from the screenshot doesn't seem to match the policy response recorded by Endpoint:

    "endpoint_policy_version": "1",
    "id": "8c38a630-bb56-4588-afc8-d9772640b8b0",
    "name": "tesr;n",

oh tesr;n, I've been looking at the wrong diagnostics file 😉
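
As an aside, a quick way to avoid mixing up diagnostics bundles is to print the policy ID and name that Endpoint actually recorded. Here is a minimal sketch in Python, assuming the bundle layout and JSON shape seen above (the Endpoint -> policy -> applied path is an assumption, not a documented schema):

# Sketch: list the applied policy ID/name from each diagnostics zip so the
# right bundle can be matched to the right screenshot.
import json
import sys
import zipfile

def print_applied_policy(diag_zip: str) -> None:
    with zipfile.ZipFile(diag_zip) as zf:
        for entry in zf.namelist():
            if entry.endswith("policy_response.json"):
                doc = json.loads(zf.read(entry))
                applied = doc.get("Endpoint", {}).get("policy", {}).get("applied", {})
                print(f"{diag_zip}: id={applied.get('id')} name={applied.get('name')}")

for path in sys.argv[1:]:
    print_applied_policy(path)  # e.g. elastic-agent-diagnostics-...zip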

@intxgo
Contributor

intxgo commented Oct 14, 2024

In the first zip, Endpoint indeed indicated an output failure:

    {
        "message": "Failed to read output configuration. No valid output configuration found",
        "name": "configure_output",
        "status": "failure"
    },

The config:

output:
  "": {}

The log:

{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":408,"name":"Response.cpp"}}},"message":"Response.cpp:408 Policy action configure_output: failure - Failed to read output configuration","process":{"pid":16452,"thread":{"id":21148}}}
{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"error","origin":{"file":{"line":962,"name":"PolicyComms.cpp"}}},"message":"PolicyComms.cpp:962 No valid comms client configured","process":{"pid":16452,"thread":{"id":21148}}}
{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":966,"name":"PolicyComms.cpp"}}},"message":"PolicyComms.cpp:966     Queue:","process":{"pid":16452,"thread":{"id":21148}}}
{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":967,"name":"PolicyComms.cpp"}}},"message":"PolicyComms.cpp:967       size:                  : 3200","process":{"pid":16452,"thread":{"id":21148}}}
{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":968,"name":"PolicyComms.cpp"}}},"message":"PolicyComms.cpp:968       flush:","process":{"pid":16452,"thread":{"id":21148}}}
{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":969,"name":"PolicyComms.cpp"}}},"message":"PolicyComms.cpp:969         min_events:          : 1600","process":{"pid":16452,"thread":{"id":21148}}}
{"@timestamp":"2024-10-10T09:33:12.9721816Z","agent":{"id":"44a105dc-6e4c-4d4f-87bd-65bf04620238","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":970,"name":"PolicyComms.cpp"}}},"message":"PolicyComms.cpp:970         timeout:             : 10000 ms","process":{"pid":16452,"thread":{"id":21148}}}
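
For illustration, here is a minimal validation sketch in Python (a hypothetical helper using PyYAML, not Endpoint's actual C++ code) of why an outputs map keyed by an empty string with an empty body surfaces as "No valid output configuration found":

# Hypothetical sketch: an output entry with an empty name and an empty
# settings block carries no usable connection info, so no output is selected.
import yaml

def find_valid_output(config: dict):
    outputs = config.get("output") or {}
    for name, settings in outputs.items():
        if not name or not settings:
            continue  # "": {} is exactly the broken shape shown above
        if settings.get("type") == "elasticsearch" and settings.get("hosts"):
            return name, settings
    return None

broken = yaml.safe_load('output:\n  "": {}\n')
print(find_valid_output(broken))  # None -> "No valid output configuration found"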

@intxgo
Contributor

intxgo commented Oct 14, 2024

It's weird, as in both zips the Agent pre-config.yaml has exactly the same settings (just a different host URL):

outputs:
    default:
        api_key: <REDACTED>
        hosts:
            - https://2edcae624e0540cca2848e9d7b82eebe.us-west2.gcp.elastic-cloud.com:443
        preset: balanced
        type: elasticsearch

Unfortunately, Endpoint does not log the original form of the config as received via V2. We can make an ad-hoc build with that logging if needed; let me know, @pierrehilbert.
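
If it helps to double-check that observation, here is a small Python sketch (PyYAML assumed; the file paths are placeholders for the extracted pre-config.yaml files) that compares the two outputs blocks while masking the fields expected to differ:

# Sketch: compare outputs.default from two diagnostics bundles, masking the
# host URLs and API key, which are expected to differ between environments.
import yaml

def outputs_shape(path: str) -> dict:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    out = dict(cfg["outputs"]["default"])
    out["hosts"] = ["<masked>"] * len(out.get("hosts", []))
    out.pop("api_key", None)
    return out

print(outputs_shape("zip1/pre-config.yaml") == outputs_shape("zip2/pre-config.yaml"))
# True -> identical settings apart from host URL and credentials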

@harshitgupta-qasource
Author

Hi Team

While testing on the latest 8.16.0 SNAPSHOT Kibana cloud build, we had further observations:

Observation

  • On adding Elastic Defend to a single agent policy and making changes to that agent policy, without using a shared integration, the agent inconsistently becomes unhealthy.

Agent logs:
elastic-agent-diagnostics-2024-10-15T10-44-13Z-00.zip

Screenshot: (attached)

Build details:
VERSION: 8.16.0
BUILD: 79135
COMMIT: 1a3efcceb40a5a6c5ee55a44c3fe7642206008e5

Please let us know if anything else is required from our end.
Thank you

@kpollich
Member

@harshitgupta-qasource - Could you provide the full agent policy YML from Fleet via the "show policy" button? I'd like to see whether this is an issue with Fleet's policy compilation, or whether the output configuration breaks further along in Fleet Server/Agent.

@harshitgupta-qasource
Author

harshitgupta-qasource commented Oct 15, 2024

Hi @kpollich,
Please find the complete policy YML attached below.
elastic-agent.zip

Thanks

@kpollich
Member

Thank you! I think this is the relevant section of the policy YML:

id: 1d1d61e3-9f45-4e27-9235-418a8fcfb6cb
revision: 5
outputs:
  default:
    type: elasticsearch
    hosts:
      - >-
        https://4a68fa885ccd4fee90c8da91b3ebdf9e.us-west2.gcp.elastic-cloud.com:443
    preset: balanced
fleet:
  hosts:
    - >-
      https://07f4ecface6f4a3c8b4b591b5d9c4adf.fleet.us-west2.gcp.elastic-cloud.com:443

This looks correct to me at first glance, but I wonder if the YAML multiline scalar indicator prefixing the output host is causing issues when Endpoint parses it out. AFAIK this is existing behavior (at least my 8.15.0 cloud cluster produces the outputs block in exactly the same way), but I wonder if there is a regression somewhere related to parsing these output blocks when the value is preceded by >- and a newline.
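
For what it's worth, a compliant YAML parser should treat the >- folded scalar identically to the plain form, as this quick PyYAML check suggests (illustrative only; Endpoint's own parser is a separate implementation and could behave differently):

# A ">-" folded block scalar with strip chomping parses to the same string
# as the plain form, so the prefix alone shouldn't change the parsed output.
import yaml

folded = yaml.safe_load(
    "hosts:\n"
    "  - >-\n"
    "    https://example.elastic-cloud.com:443\n"
)
plain = yaml.safe_load("hosts:\n  - https://example.elastic-cloud.com:443\n")
assert folded == plain == {"hosts": ["https://example.elastic-cloud.com:443"]}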

@amolnater-qasource changed the title from "[Fleet]: On adding elastic defend as shared integration, Endpoints get unhealthy." to "[Fleet]: On making changes to elastic defend Endpoints get unhealthy." on Oct 21, 2024
@nfritts

nfritts commented Oct 24, 2024

I believe this is fixed now (via https://github.com/elastic/endpoint-dev/pull/15144). The fix is live in BC2. Can you please retest with BC2 and see if the issue is resolved?

@harshitgupta-qasource
Author

Hi Team,

We have re-validated this issue on the latest 8.16.0 BC2 Kibana cloud environment and found it fixed now.

Observations:

  • On adding Elastic Defend as a shared integration and making changes to Elastic Defend, Endpoints remain healthy.

Build details:
VERSION: 8.16.0 BC2
BUILD: 79434
COMMIT: 59220e984f2e3ca8b99fe904d077a5979f5f298d

Screenshot: (attached)

Agent logs:
elastic-agent-diagnostics-2024-10-25T08-09-28Z-00.zip

Hence, we are marking this issue as QA:Validated.

Thanks

@harshitgupta-qasource added the QA:Validated label on Oct 25, 2024