Wait for a function to become healthy in scale-up event #213

alexellis · 2021-11-01T11:02:10Z

Signed-off-by: Alex Ellis (OpenFaaS Ltd) alexellis2@gmail.com

Description

Wait for a function to become healthy in scale-up event

Motivation and Context

Prior to this change, after scaling a function up and
returning the API call, a function may still not be ready to
serve traffic. This resulted in HTTP errors, for a percentage
of the time, especially if the task was deleted instead of
being just paused.

Pausing was instant, but during re-creation the function needs
some time to start up.

This change puts a health check into the hot path for the
scale event. It is blocking, so scaling up will have some
additional latency, but will return with a ready endpoint
much more of the time than previously.

This approach means that faasd doesn't have to run a set of
exec or HTTP healthchecks continually, and use CPU for
each of them, even when a function is idle.

How Has This Been Tested?

Tested with the nodeinfo function, by killing the task
and then invoking the function. Prior to this, the
function may give an error code some of the time.

If you want to test this patch, follow these instructions:

https://github.com/openfaas/faasd/blob/master/docs/PATCHES.md

Then deploy your function, kill its task, and then invoke it. Let's say you deployed bot as a function

faas-cli deploy -f bot.yml

# View the task
sudo ctr -n openfaas-fn task ls

# Delete the task by killing the process
sudo ctr -n openfaas-fn task rm bot -f

# Invoke the function to cause a scale up (and the new code that waits until the function is healthy)
time curl -i http://127.0.0.1:8080/function/bot

If you're using an openfaas template, it already implements a health endpoint, there's no need to add anything to your code.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

Commits:

I've read the CONTRIBUTION guide
My commit message has a body and describe how this was tested and why it is required.
I have signed-off my commits with git commit -s for the Developer Certificate of Origin (DCO)

Code:

My code follows the code style of this project.
I have added tests to cover my changes.

Docs:

My change requires a change to the documentation.
I have updated the documentation accordingly.

Prior to this change, after scaling a function up and returning the API call, a function may still not be ready to serve traffic. This resulted in HTTP errors, for a percentage of the time, especially if the task was deleted instead of being just paused. Pausing was instant, but during re-creation the function needs some time to start up. This change puts a health check into the hot path for the scale event. It is blocking, so scaling up will have some additional latency, but will return with a ready endpoint much more of the time than previously. This approach means that faasd doesn't have to run a set of exec or HTTP healthchecks continually, and use CPU for each of them, even when a function is idle. Tested with the nodeinfo function, by killing the task and then invoking the function. Prior to this, the function may give an error code some of the time. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

rgee0 · 2021-11-03T20:24:32Z

Happy Path

Check currently running tasks:

$ sudo ctr -n openfaas-fn task ls
TASK            PID     STATUS    
nodeinfo        3914    RUNNING
env             4056    RUNNING
derek-sha256    4452    RUNNING

Lets rm env:

sudo ctr -n openfaas-fn task rm env -f
WARN[0000] task env exit with non-zero exit code 137

Check its gone:

$ sudo ctr -n openfaas-fn task ls
TASK            PID     STATUS    
derek-sha256    4452    RUNNING
nodeinfo        3914    RUNNING

Call the env function:

time curl -i http://127.0.0.1:8080/function/env
HTTP/1.1 200 OK
Content-Length: 425
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Nov 2021 19:15:18 GMT
X-Call-Id: a5bb5d32-8dc7-4e1d-87b7-f7f0d3121954
X-Duration-Seconds: 0.001052
X-Start-Time: 1635966918789614976

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
fprocess=env
HOME=/home/app
Http_X_Forwarded_For=10.62.0.1:50736
Http_X_Forwarded_Host=127.0.0.1:8080
Http_X_Start_Time=1635966918789614976
Http_User_Agent=curl/7.68.0
Http_Accept=*/*
Http_Accept_Encoding=gzip
Http_X_Call_Id=a5bb5d32-8dc7-4e1d-87b7-f7f0d3121954
Http_Method=GET
Http_ContentLength=0
Http_Content_Length=0
Http_Path=/
Http_Host=10.62.0.15:8080

real	0m0.259s
user	0m0.012s
sys	0m0.001s

Has the task been reinstated?

$ sudo ctr -n openfaas-fn task ls
TASK            PID      STATUS    
nodeinfo        3914     RUNNING
env             39201    RUNNING
derek-sha256    4452     RUNNING

Load test

Create a script:

#!/bin/bash

N=10

echo Attempting scale up event ${N} times...

for ((i=1; i<=$N; i++ ));
do
 ctr -n openfaas-fn task rm env -f > /dev/null 2>&1
 sleep 1
 STATUS=$(curl -is http://127.0.0.1:8080/function/env | head -n 1|cut -d$' ' -f2)
 PID=$(ctr -n openfaas-fn task ls | grep 'env' | awk '{print $2}')
 echo "PID:${PID} - STATUS:${STATUS}"
done

Run the script over a small set echoing each time to demonstrate efficacy:

$ ./loop.sh
Attempting scale up event 10 times...
PID:52761 - STATUS:200
PID:52947 - STATUS:200
PID:53128 - STATUS:200
PID:53297 - STATUS:200
PID:53469 - STATUS:200
PID:53640 - STATUS:200
PID:53812 - STATUS:200
PID:53984 - STATUS:200
PID:54150 - STATUS:200
PID:54323 - STATUS:200

Edit script to only echo when a non-200 return code is encountered:

#!/bin/bash

N=1000

echo Attempting scale up event ${N} times...

for ((i=1; i<=$N; i++ ));
do
 ctr -n openfaas-fn task rm env -f > /dev/null 2>&1
 sleep 1
 STATUS=$(curl -is http://127.0.0.1:8080/function/env | head -n 1|cut -d$' ' -f2)
 PID=$(ctr -n openfaas-fn task ls | grep 'env' | awk '{print $2}')
 if [ "${STATUS}" != "200" ]
 then
 echo "PID:${PID} - STATUS:${STATUS}"
 fi
done
echo "Test Completed"

Run the script (add time as a sanity check):

$ time ./loop.sh
Attempting scale up event 1000 times...
PID:59226 - STATUS:500
PID: - STATUS:404
PID: - STATUS:404
PID: - STATUS:
PID: - STATUS:404
PID: - STATUS:
PID:191771 - STATUS:500
PID:213436 - STATUS:500
Test Completed

real	25m35.304s
user	0m30.736s
sys	0m28.744s

Rerun the N=1000 script with the iteration number in the output:

time ./loop.sh
Attempting scale up event 1000 times...
PID:251329 - STATUS:500 - Iteration:138
PID: - STATUS:404 - Iteration:293
PID: - STATUS:404 - Iteration:695
PID:364081 - STATUS:500 - Iteration:786
Test Completed

real	28m52.654s
user	0m29.693s
sys	0m27.052s

alexellis · 2021-11-03T20:27:50Z

What a great testing method 💪

Perhaps there's something going wrong here that I need to dig into.

How does 1000x test run compare when you use the original binary?

rgee0 · 2021-11-03T21:03:01Z

Edit: Further investigation meant a rerun was required. Rerun results provided here.

Putting the 0.14.3 binary back in place and rerunning the N=1000 test:

$ time ./loop.sh
Attempting scale up event 1000 times...
PID:458207 - STATUS:500 - Iteration:32
PID: - STATUS: - Iteration:252
PID: - STATUS: - Iteration:344
PID: - STATUS:404 - Iteration:361
PID: - STATUS: - Iteration:540
Test Completed

real	32m15.716s
user	0m30.298s
sys	0m28.397s

tmiklas · 2021-11-04T21:05:07Z

I was trying to avoid failing but... failed :-P

Here's observation on faasd version: 0.14.3 commit: ea62c1b
https://asciinema.org/a/J6KJL3KZgFmFO1TL0pBTQg0SU

Here's repeat on build from wait-for-healthy branch
https://asciinema.org/a/FyxgzsEiOUVh82PCPDlk39I1w

Unfortunately regardless of how long I wait after faasd starts up, the first hit on env function produces the same outcome (fail) and any later one works just fine

root@debian:~/faasd# curl http://localhost:8080/function/env
Can't reach service for: env.

root@debian:~/faasd# curl http://localhost:8080/function/env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
fprocess=env
HOME=/home/app
[...]

This is why I hacked it by adding script that after faasd starts will tickle every registered function with GET request. Ugly but works...

rgee0 · 2021-11-05T07:27:20Z

Following the steps of @tmiklas I've found that that error corresponds to a HTTP/500.

So far I've only found Can't reach service for: here:

https://github.com/openfaas/faas-provider/blob/bc0f26cae70c1b3ba2a595cea8bfb8c04925d5c6/proxy/proxy.go#L168-L168

Which is introduced in this change, see:

https://github.com/openfaas/faasd/pull/213/files#diff-d4774f198ccba12d22912607ba38ceac90b936cc5f0198415df45e22629a3770R18

However, I cannot currently see a path to that particular error message.

rgee0 · 2021-11-05T07:34:07Z

I ought to mention that the same happens for me using 0.14.3:

root@derek:~# faasd version
  __                     _ 
 / _| __ _  __ _ ___  __| |
| |_ / _` |/ _` / __|/ _` |
|  _| (_| | (_| \__ \ (_| |
|_|  \__,_|\__,_|___/\__,_|

faasd version: 0.14.3	commit: ea62c1b12dd1ae1de794e6dd260351cfbd1a6759
root@derek:~# killall -9 faasd; systemctl restart faasd  
root@derek:~# ctr -n openfaas-fn task ls  
TASK            PID       STATUS    
nodeinfo        3914      RUNNING
env             635600    RUNNING
derek-sha256    4452      RUNNING
root@derek:~# systemctl status faasd 
● faasd.service - faasd
     Loaded: loaded (/lib/systemd/system/faasd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-11-05 07:32:01 UTC; 18s ago
   Main PID: 635793 (faasd)
      Tasks: 8 (limit: 9513)
     Memory: 22.0M (limit: 500.0M)
     CGroup: /system.slice/faasd.service
             └─635793 /usr/local/bin/faasd up

Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "localhost"="127.0.0.1"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "faasd-provider"="10.62.0.1"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "nats"="10.62.12.21"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "queue-worker"="10.62.12.22"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "basic-auth-plugin"="10.62.12.23"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "prometheus"="10.62.12.24"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "gateway"="10.62.12.25"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Proxy from: 127.0.0.1:9090, to: prometheus:9090 (10.62.12.24)
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 faasd: waiting for SIGTERM or SIGINT
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.12.25)
root@derek:~# curl -i http://localhost:8080/function/env  
HTTP/1.1 500 Internal Server Error
Content-Length: 30
Content-Type: text/plain; charset=utf-8
Date: Fri, 05 Nov 2021 07:32:34 GMT
X-Call-Id: 0fed9735-c9e5-497d-9202-b4182535b055
X-Content-Type-Options: nosniff
X-Start-Time: 1636097554054763230

Can't reach service for: env.

and the logs:

Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 faasd: waiting for SIGTERM or SIGINT
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.12.25)
Nov 05 07:32:08 derek openfaas:basic-auth-plugin[636244]: 2021/11/05 07:32:08 Validated request 200.
Nov 05 07:32:08 derek openfaas:gateway[636474]: 2021/11/05 07:32:08 error with upstream request to: /system/function/env, Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:32:08 derek openfaas:gateway[636474]: 2021/11/05 07:32:08 Forwarded [GET] to /system/function/env?namespace=openfaas-fn - [502] - 0.002867s seconds
Nov 05 07:32:10 derek openfaas:basic-auth-plugin[636244]: 2021/11/05 07:32:10 Validated request 200.
Nov 05 07:32:10 derek openfaas:gateway[636474]: 2021/11/05 07:32:10 error with upstream request to: /system/functions, Get "http://faasd-provider:8081/system/functions?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:32:10 derek openfaas:gateway[636474]: 2021/11/05 07:32:10 Forwarded [GET] to /system/functions?namespace=openfaas-fn - [502] - 0.002327s seconds
Nov 05 07:32:10 derek openfaas:gateway[636474]: 2021/11/05 07:32:10 List functions responded with code 502, body: 
Nov 05 07:32:11 derek openfaas:basic-auth-plugin[636244]: 2021/11/05 07:32:11 Validated request 200.
Nov 05 07:32:11 derek openfaas:gateway[636474]: 2021/11/05 07:32:11 error with upstream request to: /system/function/env, Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:32:11 derek openfaas:gateway[636474]: 2021/11/05 07:32:11 Forwarded [GET] to /system/function/env?namespace=openfaas-fn - [502] - 0.002428s seconds
Nov 05 07:32:11 derek systemd[1]: faasd-provider.service: Scheduled restart job, restart counter is at 3.
Nov 05 07:32:11 derek systemd[1]: Stopped faasd-provider.
Nov 05 07:32:11 derek systemd[1]: Started faasd-provider.
Nov 05 07:32:11 derek faasd[636580]: 2021/11/05 07:32:11 faasd-provider starting..        Service Timeout: 1m0s
Nov 05 07:32:11 derek faasd[636580]: faasd version: 0.14.3        commit: ea62c1b12dd1ae1de794e6dd260351cfbd1a6759
Nov 05 07:32:11 derek faasd[636580]: 2021/11/05 07:32:11 Writing network config...
Nov 05 07:32:11 derek faasd[636580]: 2021/11/05 07:32:11 Listening on TCP port: 8081

rgee0 · 2021-11-05T07:41:42Z

I can also trigger the HTTP/404 in this way, again on 0.14.3:

 killall -9 faasd; systemctl restart faasd  
root@derek:~# ctr -n openfaas-fn task ls  
TASK            PID       STATUS    
nodeinfo        3914      RUNNING
env             637515    STOPPED
derek-sha256    4452      RUNNING
root@derek:~# systemctl status faasd 
● faasd.service - faasd
     Loaded: loaded (/lib/systemd/system/faasd.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-11-05 07:40:36 UTC; 4s ago
   Main PID: 638443 (faasd)
      Tasks: 7 (limit: 9513)
     Memory: 18.9M (limit: 500.0M)
     CGroup: /system.slice/faasd.service
             └─638443 /usr/local/bin/faasd up

Nov 05 07:40:38 derek faasd[638443]: Starting: nats
Nov 05 07:40:38 derek faasd[638443]: 2021/11/05 07:40:38 Created container: nats
Nov 05 07:40:39 derek faasd[638443]: 2021/11/05 07:40:39 nats has IP: 10.62.12.38
Nov 05 07:40:39 derek faasd[638443]: 2021/11/05 07:40:39 Task: nats        Container: nats
Nov 05 07:40:39 derek faasd[638443]: Starting: prometheus
Nov 05 07:40:39 derek faasd[638443]: 2021/11/05 07:40:39 Created container: prometheus
Nov 05 07:40:40 derek faasd[638443]: 2021/11/05 07:40:40 prometheus has IP: 10.62.12.39
Nov 05 07:40:40 derek faasd[638443]: 2021/11/05 07:40:40 Task: prometheus        Container: prometheus
Nov 05 07:40:40 derek faasd[638443]: Starting: gateway
Nov 05 07:40:40 derek faasd[638443]: 2021/11/05 07:40:40 Created container: gateway
root@derek:~# curl -i http://localhost:8080/function/env  
HTTP/1.1 404 Not Found
Date: Fri, 05 Nov 2021 07:40:42 GMT
Content-Length: 168
Content-Type: text/plain; charset=utf-8

error finding function env.openfaas-fn: Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection

And logs:

Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 queue-worker has IP: 10.62.12.41
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Task: queue-worker        Container: queue-worker
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Loading basic authentication credentials
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Starting queue-worker. Version: 0.12.2        Git Commit: ada9a31504c99b57c3ae6fb7559d7684988e5683
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Connect: nats://nats:4222
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Subscribing to: faas-request at nats://nats:4222
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Wait for  5m5s
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Listening on [faas-request], clientID=[faas-worker-queue-worker], qgroup=[faas] maxInFlight=[1]
Nov 05 07:40:41 derek openfaas:nats[638761]: [1] 2021/11/05 07:40:41.779531 [INF] STREAM: Channel "faas-request" has been created
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Supervisor init done in: 5.108281179s
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Looking up IP for: "gateway"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver rebuilding map
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "localhost"="127.0.0.1"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "faasd-provider"="10.62.0.1"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "basic-auth-plugin"="10.62.12.37"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "nats"="10.62.12.38"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "prometheus"="10.62.12.39"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "gateway"="10.62.12.40"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "queue-worker"="10.62.12.41"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Looking up IP for: "prometheus"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Proxy from: 127.0.0.1:9090, to: prometheus:9090 (10.62.12.39)
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 faasd: waiting for SIGTERM or SIGINT
Nov 05 07:40:42 derek faasd[638443]: 2021/11/05 07:40:42 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.12.40)
Nov 05 07:40:42 derek systemd-networkd[608]: veth76b7beab: Gained IPv6LL
Nov 05 07:40:42 derek openfaas:gateway[639002]: 2021/11/05 07:40:42 http://faasd-provider:8081/system/function/env?namespace=openfaas-fn Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:40:42 derek openfaas:gateway[639002]: 2021/11/05 07:40:42 GetReplicas [env.openfaas-fn] took: 0.002563s
Nov 05 07:40:42 derek openfaas:gateway[639002]: 2021/11/05 07:40:42 Scaling: error finding function env.openfaas-fn: Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:40:45 derek openfaas:gateway[639002]: 2021/11/05 07:40:45 Get "http://faasd-provider:8081/system/namespaces": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:40:45 derek openfaas:gateway[639002]: 2021/11/05 07:40:45 Get "http://faasd-provider:8081/system/functions?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused

alexellis · 2022-05-30T09:23:10Z

cc @welteki

The solution that we're exploring with probing in the gateway, may also help with this issue in faasd.

openfaas/faas#1721

alexellis added 2 commits November 1, 2021 11:00

Add comment to explain how method works

bbd3b4f

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>

alexellis requested a review from LucasRoesler November 1, 2021 11:05

alexellis assigned alexellis and rgee0 Nov 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wait for a function to become healthy in scale-up event #213

Wait for a function to become healthy in scale-up event #213

alexellis commented Nov 1, 2021 •

edited

Loading

rgee0 commented Nov 3, 2021 •

edited

Loading

alexellis commented Nov 3, 2021

rgee0 commented Nov 3, 2021 •

edited

Loading

tmiklas commented Nov 4, 2021 •

edited

Loading

rgee0 commented Nov 5, 2021

rgee0 commented Nov 5, 2021 •

edited

Loading

rgee0 commented Nov 5, 2021 •

edited

Loading

alexellis commented May 30, 2022

Wait for a function to become healthy in scale-up event #213

Are you sure you want to change the base?

Wait for a function to become healthy in scale-up event #213

Conversation

alexellis commented Nov 1, 2021 • edited Loading

Description

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

rgee0 commented Nov 3, 2021 • edited Loading

Happy Path

Load test

alexellis commented Nov 3, 2021

rgee0 commented Nov 3, 2021 • edited Loading

tmiklas commented Nov 4, 2021 • edited Loading

rgee0 commented Nov 5, 2021

rgee0 commented Nov 5, 2021 • edited Loading

rgee0 commented Nov 5, 2021 • edited Loading

alexellis commented May 30, 2022

alexellis commented Nov 1, 2021 •

edited

Loading

rgee0 commented Nov 3, 2021 •

edited

Loading

rgee0 commented Nov 3, 2021 •

edited

Loading

tmiklas commented Nov 4, 2021 •

edited

Loading

rgee0 commented Nov 5, 2021 •

edited

Loading

rgee0 commented Nov 5, 2021 •

edited

Loading