operator-ip: recover from failure until akash-network/support#105 is solved #276

Open
andy108369 opened this issue Apr 20, 2024 · 0 comments

andy108369 (Collaborator) commented Apr 20, 2024

The provider pod does not respond on its status endpoints (8443/status and 8444 for gRPC) when operator-ip is in use and has failed to initialize. This typically happens on the first boot or reboot of a worker node: the network has not fully initialized yet, so operator-ip cannot reach the DNS server ("169.254.25.10:53: server misbehaving"):

Operator-ip Logs:

$ kubectl -n akash-services logs deployment/operator-ip
I[2024-04-20|08:48:29.791] using in cluster kube config                 cmp=provider
D[2024-04-20|08:48:41.089] service being found via autodetection        operator=ip service=metal-lb
I[2024-04-20|08:48:41.093] clients                                      operator=ip kube="kube client 0xc001168000 ns=lease" metallb="metal LB client 0xc001168060"
I[2024-04-20|08:48:41.093] HTTP listening                               operator=ip address=:8080
I[2024-04-20|08:48:41.093] ip operator start                            operator=ip
I[2024-04-20|08:48:41.093] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:48:41.093] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:48:41.105] dns discovery failed                         operator=ip cmp=service-discovery-agent error="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving" portName=monitoring service-name=controller namespace=metallb-system
D[2024-04-20|08:48:41.105] satisfying pending requests                  operator=ip cmp=service-discovery-agent qty=1
E[2024-04-20|08:48:41.105] observation stopped                          operator=ip err="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving"
I[2024-04-20|08:48:41.105] delaying                                     operator=ip
I[2024-04-20|08:48:44.107] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:48:44.107] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:49:44.118] observation stopped                          operator=ip err="context deadline exceeded"
I[2024-04-20|08:49:44.119] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:49:44.119] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:50:44.126] observation stopped                          operator=ip err="context deadline exceeded"
I[2024-04-20|08:50:44.126] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:50:44.126] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:51:12.567] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:14.567] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:16.568] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:18.569] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:20.570] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:22.572] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:24.572] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:26.574] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:28.576] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:30.576] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:32.577] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:34.578] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:36.580] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:38.580] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:40.581] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:42.582] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:44.133] observation stopped                          operator=ip err="context deadline exceeded"
I[2024-04-20|08:51:44.133] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:51:44.133] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:51:44.584] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:46.585] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:48.585] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:50.587] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:52.588] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:54.589] barrier is locked, can't service request     operator=ip path=/health

Provider logs:

$ kubectl -n akash-services logs akash-provider-0 --tail=100 -f
...
I[2024-04-20|08:51:13.666] found orders                                 module=bidengine-service cmp=provider count=28
I[2024-04-20|08:51:13.667] fetched provider attributes                  module=bidengine-service cmp=provider provider=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:51:13.669] grpc listening on "0.0.0.0:8444"             cmp=provider
I[2024-04-20|08:51:14.567] check result                                 cmp=provider operator=ip status=503
E[2024-04-20|08:51:14.567] not yet ready                                cmp=provider cmp=waiter waitable="<*ip.client 0xc00167d6c0>" error="ip operator is not yet alive"

Fix:

$ kubectl -n akash-services rollout restart deployment/operator-ip
deployment.apps/operator-ip restarted

We probably need a workaround that restarts the operator-ip pod whenever the "barrier is locked, can't service request" message appears in its logs, similar to how it is done for the akash-provider pod. In some cases, however, this message may indicate a genuine error, and we probably don't want operator-ip restarting too often, so we need to see whether an exponential backoff mechanism is feasible.
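
The workaround described above could be sketched as a small watchdog script. This is a minimal sketch under stated assumptions, not the project's implementation: the function names, the 60-second base delay, the one-hour cap, the 30-second poll interval, and the 2-minute log window are all hypothetical values chosen for illustration. Only the namespace, the deployment name, and the log message come from this issue.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog sketch: restart operator-ip when its recent logs contain
# the "barrier is locked" message, waiting exponentially longer between restarts.
# All tunables below are assumptions, not values used by the Akash project.

NAMESPACE="akash-services"           # from the issue
DEPLOYMENT="deployment/operator-ip"  # from the issue
PATTERN="barrier is locked, can't service request"
BASE_DELAY=60    # assumed: seconds to wait after the first restart
MAX_DELAY=3600   # assumed: upper bound on the wait

# backoff_delay N: print the wait (seconds) after the N-th consecutive restart,
# counting from 0: BASE_DELAY * 2^N, capped at MAX_DELAY.
backoff_delay() {
  local attempt=$1 delay=$BASE_DELAY
  while [ "$attempt" -gt 0 ] && [ "$delay" -lt "$MAX_DELAY" ]; do
    delay=$((delay * 2))
    attempt=$((attempt - 1))
  done
  if [ "$delay" -gt "$MAX_DELAY" ]; then
    delay=$MAX_DELAY
  fi
  echo "$delay"
}

# barrier_locked: succeed if the text on stdin contains the error message.
barrier_locked() {
  grep -qF "$PATTERN"
}

# watch_operator_ip: poll the logs forever; restart and back off on a match,
# and reset the backoff once the recent logs are clean again.
watch_operator_ip() {
  local attempt=0
  while true; do
    if kubectl -n "$NAMESPACE" logs "$DEPLOYMENT" --since=2m | barrier_locked; then
      kubectl -n "$NAMESPACE" rollout restart "$DEPLOYMENT"
      sleep "$(backoff_delay "$attempt")"
      attempt=$((attempt + 1))
    else
      attempt=0
      sleep 30
    fi
  done
}
```

Calling watch_operator_ip starts the loop. With these assumed values the waits after consecutive restarts grow as 60 s, 120 s, 240 s and so on, up to one hour, which limits how often the pod is restarted if the message reflects a genuine error rather than the boot-time DNS race.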

Or, ideally, operator-ip may already expose a healthz-style endpoint that can be queried instead.
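
The logs above suggest that operator-ip listens for HTTP on port 8080 and receives requests on the /health path, and that the provider sees status=503 while the barrier is locked. Assuming that path returns a 2xx status once the operator is ready (an assumption, not confirmed in this issue), a probe could look like the sketch below. The function names and the local port 18080 are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical probe sketch. Port 8080 and the /health path are taken from the
# operator-ip logs in this issue; that a 2xx reply means "ready" is an assumption.

NAMESPACE="akash-services"
DEPLOYMENT="deployment/operator-ip"
LOCAL_PORT=18080   # assumed to be a free local port

# is_healthy_status CODE: succeed only for a 2xx HTTP status code.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *)           return 1 ;;
  esac
}

# probe_operator_ip: forward the port, fetch /health, print the status code,
# and return success only if that code is 2xx.
probe_operator_ip() {
  local pf_pid code
  kubectl -n "$NAMESPACE" port-forward "$DEPLOYMENT" "$LOCAL_PORT:8080" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give port-forward a moment to start listening
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    "http://127.0.0.1:$LOCAL_PORT/health")
  kill "$pf_pid" 2>/dev/null
  echo "$code"
  is_healthy_status "$code"
}
```

If the assumption holds, the same path could also back a Kubernetes liveness probe on the operator-ip container, so that the kubelet restarts the container without any log scraping.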

Refs.
akash-network/support#105
