The provider pod does not respond on its 8443 (`/status`) and 8444 (gRPC) status endpoints when operator-ip is in use and the latter failed to initialize. This typically happens on the first worker node (re)boot: the network has not fully come up yet, so operator-ip cannot reach DNS (`169.254.25.10:53: server misbehaving`):
operator-ip logs:
```
$ kubectl -n akash-services logs deployment/operator-ip
I[2024-04-20|08:48:29.791] using in cluster kube config cmp=provider
D[2024-04-20|08:48:41.089] service being found via autodetection operator=ip service=metal-lb
I[2024-04-20|08:48:41.093] clients operator=ip kube="kube client 0xc001168000 ns=lease" metallb="metal LB client 0xc001168060"
I[2024-04-20|08:48:41.093] HTTP listening operator=ip address=:8080
I[2024-04-20|08:48:41.093] ip operator start operator=ip
I[2024-04-20|08:48:41.093] associated provider operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:48:41.093] fetching existing IP passthroughs operator=ip
E[2024-04-20|08:48:41.105] dns discovery failed operator=ip cmp=service-discovery-agent error="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving" portName=monitoring service-name=controller namespace=metallb-system
D[2024-04-20|08:48:41.105] satisfying pending requests operator=ip cmp=service-discovery-agent qty=1
E[2024-04-20|08:48:41.105] observation stopped operator=ip err="lookup _monitoring._TCP.controller.metallb-system.svc.cluster.local on 169.254.25.10:53: server misbehaving"
I[2024-04-20|08:48:41.105] delaying operator=ip
I[2024-04-20|08:48:44.107] associated provider operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:48:44.107] fetching existing IP passthroughs operator=ip
E[2024-04-20|08:49:44.118] observation stopped operator=ip err="context deadline exceeded"
I[2024-04-20|08:49:44.119] associated provider operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:49:44.119] fetching existing IP passthroughs operator=ip
E[2024-04-20|08:50:44.126] observation stopped operator=ip err="context deadline exceeded"
I[2024-04-20|08:50:44.126] associated provider operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:50:44.126] fetching existing IP passthroughs operator=ip
E[2024-04-20|08:51:12.567] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:14.567] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:16.568] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:18.569] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:20.570] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:22.572] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:24.572] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:26.574] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:28.576] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:30.576] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:32.577] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:34.578] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:36.580] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:38.580] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:40.581] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:42.582] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:44.133] observation stopped operator=ip err="context deadline exceeded"
I[2024-04-20|08:51:44.133] associated provider operator=ip addr=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:51:44.133] fetching existing IP passthroughs operator=ip
E[2024-04-20|08:51:44.584] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:46.585] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:48.585] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:50.587] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:52.588] barrier is locked, can't service request operator=ip path=/health
E[2024-04-20|08:51:54.589] barrier is locked, can't service request operator=ip path=/health
```
Provider logs:
```
$ kubectl -n akash-services logs akash-provider-0 --tail=100 -f
...
I[2024-04-20|08:51:13.666] found orders module=bidengine-service cmp=provider count=28
I[2024-04-20|08:51:13.667] fetched provider attributes module=bidengine-service cmp=provider provider=akash15tl6v6gd0nte0syyxnv57zmmspgju4c3xfmdhk
I[2024-04-20|08:51:13.669] grpc listening on "0.0.0.0:8444" cmp=provider
I[2024-04-20|08:51:14.567] check result cmp=provider operator=ip status=503
E[2024-04-20|08:51:14.567] not yet ready cmp=provider cmp=waiter waitable="<*ip.client 0xc00167d6c0>" error="ip operator is not yet alive"
```
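To confirm the symptom directly, one can query the operator-ip health endpoint from inside the cluster. A minimal sketch, assuming the service is named `operator-ip` in `akash-services` and using `curlimages/curl` as a throwaway pod; port 8080 and the `/health` path come from the logs above:

```shell
# While the barrier is locked, /health answers 503; this is the same result
# the provider's check reports above ("check result ... status=503").
kubectl -n akash-services run health-check --rm -i --restart=Never \
  --image=curlimages/curl -- \
  -s -o /dev/null -w '%{http_code}\n' \
  http://operator-ip.akash-services.svc.cluster.local:8080/health
```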
Fix:

We probably need a workaround that restarts the operator-ip pod whenever the "barrier is locked, can't service request" message appears in its logs, similar to what is done for the akash-provider pod. However, in some cases this error may be genuine, and we don't want operator-ip restarting too often; an exponential-backoff mechanism for the restarts is worth investigating (see the sketch below).
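A minimal sketch of such a watcher, assuming it runs somewhere with kubectl access; the error pattern and the deployment/namespace names come from the logs above, while the backoff parameters are arbitrary assumptions:

```shell
#!/usr/bin/env bash
# Watch operator-ip logs and restart the deployment when the barrier error
# appears, doubling the wait between restarts up to a cap.
set -euo pipefail

NS=akash-services
DEPLOY=deployment/operator-ip
PATTERN="barrier is locked, can't service request"
DELAY=30        # initial backoff, seconds (assumption)
MAX_DELAY=1800  # cap the backoff at 30 minutes (assumption)

while true; do
  if kubectl -n "$NS" logs "$DEPLOY" --since=2m 2>/dev/null | grep -qF "$PATTERN"; then
    echo "$(date -Is) barrier error seen; restarting ${DEPLOY} (next backoff: ${DELAY}s)"
    kubectl -n "$NS" rollout restart "$DEPLOY"
    sleep "$DELAY"
    DELAY=$(( DELAY * 2 > MAX_DELAY ? MAX_DELAY : DELAY * 2 ))
  else
    DELAY=30    # healthy window observed: reset the backoff
    sleep 60
  fi
done
```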
Or, ideally, operator-ip could expose a healthz-style endpoint that can be queried directly. Per the logs above it already serves `/health` on `:8080`, so that endpoint could potentially be wired into a liveness probe, as sketched below.
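A sketch of that endpoint-based approach, patching a liveness probe onto the existing deployment; the `/health` path and port 8080 come from the logs above, while the container index and the probe thresholds are assumptions:

```shell
# With this probe, kubelet itself restarts the container (with its built-in
# exponential backoff) once /health keeps failing, instead of an external
# log watcher doing it.
kubectl -n akash-services patch deployment operator-ip --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/livenessProbe",
   "value": {
     "httpGet": {"path": "/health", "port": 8080},
     "initialDelaySeconds": 60,
     "periodSeconds": 20,
     "failureThreshold": 6
   }}
]'
```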
Refs: akash-network/support#105