operator-ip: recover from failure until akash-network/support#105 is solved #276

andy108369 · 2024-04-20T08:59:00Z

The provider pod does not respond on its 8443/status and 8444 (gRPC) status endpoints when operator-ip is in use and has failed to initialize. This typically happens on the first boot or reboot of a worker node: the network has not fully initialized yet, so operator-ip cannot reach the DNS server (169.254.25.10:53: server misbehaving):

operator-ip logs:

$ kubectl -n akash-services logs deployment/operator-ip
I[2024-04-20|08:48:29.791] using in cluster kube config                 cmp=provider
D[2024-04-20|08:48:41.089] service being found via autodetection        operator=ip service=metal-lb
I[2024-04-20|08:48:41.093] clients                                      operator=ip kube="kube client 0xc001168000 ns=lease" metallb="metal LB client 0xc001168060"
I[2024-04-20|08:48:41.093] HTTP listening                               operator=ip address=:8080
I[2024-04-20|08:48:41.093] ip operator start                            operator=ip
I[2024-04-20|08:48:41.093] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:48:41.093] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:48:41.105] dns discovery failed                         operator=ip cmp=service-discovery-agent error="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving" portName=monitoring service-name=controller namespace=metallb-system
D[2024-04-20|08:48:41.105] satisfying pending requests                  operator=ip cmp=service-discovery-agent qty=1
E[2024-04-20|08:48:41.105] observation stopped                          operator=ip err="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving"
I[2024-04-20|08:48:41.105] delaying                                     operator=ip
I[2024-04-20|08:48:44.107] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:48:44.107] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:49:44.118] observation stopped                          operator=ip err="context deadline exceeded"
I[2024-04-20|08:49:44.119] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:49:44.119] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:50:44.126] observation stopped                          operator=ip err="context deadline exceeded"
I[2024-04-20|08:50:44.126] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:50:44.126] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:51:12.567] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:14.567] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:16.568] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:18.569] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:20.570] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:22.572] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:24.572] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:26.574] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:28.576] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:30.576] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:32.577] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:34.578] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:36.580] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:38.580] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:40.581] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:42.582] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:44.133] observation stopped                          operator=ip err="context deadline exceeded"
I[2024-04-20|08:51:44.133] associated provider                          operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:51:44.133] fetching existing IP passthroughs            operator=ip
E[2024-04-20|08:51:44.584] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:46.585] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:48.585] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:50.587] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:52.588] barrier is locked, can't service request     operator=ip path=/health
E[2024-04-20|08:51:54.589] barrier is locked, can't service request     operator=ip path=/health

Provider logs:

$ kubectl -n akash-services logs akash-provider-0 --tail=100 -f
...
I[2024-04-20|08:51:13.666] found orders                                 module=bidengine-service cmp=provider count=28
I[2024-04-20|08:51:13.667] fetched provider attributes                  module=bidengine-service cmp=provider provider=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:51:13.669] grpc listening on "0.0.0.0:8444"             cmp=provider
I[2024-04-20|08:51:14.567] check result                                 cmp=provider operator=ip status=503
E[2024-04-20|08:51:14.567] not yet ready                                cmp=provider cmp=waiter waitable="<*ip.client 0xc00167d6c0>" error="ip operator is not yet alive"

Fix:

$ kubectl -n akash-services rollout restart deployment/operator-ip
deployment.apps/operator-ip restarted

We probably need a workaround that restarts the operator-ip pod whenever the "barrier is locked, can't service request" message appears in its logs, similar to how this is done for the akash-provider pod. However, in some cases this error may be a genuine one, and we don't want operator-ip to restart too often, so we need to see whether an exponential backoff mechanism can be used.
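A minimal sketch of such a watchdog, assuming it runs somewhere with kubectl access to the cluster. The function names, the 60 s starting delay and the 3600 s cap are hypothetical choices for illustration, not taken from the akash-provider workaround; only the namespace, deployment name and log message come from this issue.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: restart operator-ip when its recent logs show
# the "barrier is locked" error, doubling the wait after every restart.

NAMESPACE="akash-services"            # from the commands in this issue
DEPLOYMENT="deployment/operator-ip"   # from the commands in this issue
PATTERN="barrier is locked, can't service request"
MIN_DELAY=60                          # assumed starting delay, in seconds
MAX_DELAY=3600                        # assumed upper bound, in seconds

# Print the next delay: double the current one ($1), capped at the maximum ($2).
next_delay() {
  doubled=$(( $1 * 2 ))
  if [ "$doubled" -gt "$2" ]; then echo "$2"; else echo "$doubled"; fi
}

# Succeed if the log text on stdin contains the barrier error.
has_barrier_error() {
  grep -qF -- "$PATTERN"
}

# Loop forever: inspect the logs written during the last wait, restart on the
# error and back off, or fall back to the minimum delay once the logs are clean.
watch_operator_ip() {
  delay=$MIN_DELAY
  while true; do
    if kubectl -n "$NAMESPACE" logs "$DEPLOYMENT" --since="${delay}s" 2>/dev/null \
        | has_barrier_error; then
      kubectl -n "$NAMESPACE" rollout restart "$DEPLOYMENT"
      delay=$(next_delay "$delay" "$MAX_DELAY")
    else
      delay=$MIN_DELAY
    fi
    sleep "$delay"
  done
}
```

Calling watch_operator_ip starts the loop. Because the delay doubles after each restart, a persistent (real) error leads to at most one restart per MAX_DELAY seconds instead of a tight restart loop.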

Or, ideally, operator-ip may have a healthz-style endpoint that can be queried instead.
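The logs above suggest one candidate: operator-ip listens on :8080, serves /health, and the provider sees status=503 while the barrier is locked. A rough sketch of a check built on that follows. The URL (for example behind a port-forward), the function names, and the assumption that the endpoint returns a 2xx code once operator-ip is alive are all unverified.

```shell
#!/bin/sh
# Hypothetical health check sketch for operator-ip's /health endpoint.

# Assumed URL; the port and path come from the logs, the host is a placeholder.
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"

# Print the HTTP status code for the health URL ("000" if the request fails).
health_status() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$HEALTH_URL"
}

# Succeed only for a 2xx status code (assumed to mean "alive").
is_healthy() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Restart operator-ip when the health endpoint does not report a 2xx status.
check_and_restart() {
  code=$(health_status)
  if ! is_healthy "$code"; then
    echo "operator-ip unhealthy (HTTP $code), restarting"
    kubectl -n akash-services rollout restart deployment/operator-ip
  fi
}
```

If the endpoint behaves as assumed, the same path could also back a Kubernetes liveness probe, in which case the kubelet would handle the restarts and their back-off.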

Refs.
akash-network/support#105


andy108369 mentioned this issue Apr 20, 2024

ip-operator (provider-services 0.2.1) won't self-recover akash-network/support#105

Open
