Wait for a function to become healthy in scale-up event #213
base: master
Prior to this change, after scaling a function up and returning the API call, the function might still not be ready to serve traffic. This resulted in HTTP errors some of the time, especially if the task had been deleted instead of just paused. Pausing was instant, but during re-creation the function needs some time to start up.

This change puts a health check into the hot path for the scale event. It is blocking, so scaling up will have some additional latency, but it will return with a ready endpoint much more often than before.

This approach means that faasd doesn't have to run a set of exec or HTTP health checks continually, using CPU for each of them even when a function is idle.

Tested with the nodeinfo function, by killing the task and then invoking the function. Prior to this change, the function would return an error code some of the time.

Signed-off-by: Alex Ellis (OpenFaaS Ltd) <alexellis2@gmail.com>
Happy Path
Check currently running tasks:
Let's
Check it's gone:
Call the
Has the task been reinstated?
Load test
Create a script:

#!/bin/bash
N=10
echo "Attempting scale up event ${N} times..."
for ((i = 1; i <= N; i++)); do
  # Remove the running task so the next request forces a scale-up.
  ctr -n openfaas-fn task rm env -f > /dev/null 2>&1
  sleep 1
  # Invoke the function and keep only the HTTP status code from the status line.
  STATUS=$(curl -is http://127.0.0.1:8080/function/env | head -n 1 | cut -d' ' -f2)
  # Look up the PID of the re-created task.
  PID=$(ctr -n openfaas-fn task ls | grep 'env' | awk '{print $2}')
  echo "PID:${PID} - STATUS:${STATUS}"
done

Run the script over a small set, echoing each time to demonstrate efficacy:
Edit the script to only echo when a non-200 return code is encountered:

#!/bin/bash
N=1000
echo "Attempting scale up event ${N} times..."
for ((i = 1; i <= N; i++)); do
  # Remove the running task so the next request forces a scale-up.
  ctr -n openfaas-fn task rm env -f > /dev/null 2>&1
  sleep 1
  # Invoke the function and keep only the HTTP status code from the status line.
  STATUS=$(curl -is http://127.0.0.1:8080/function/env | head -n 1 | cut -d' ' -f2)
  PID=$(ctr -n openfaas-fn task ls | grep 'env' | awk '{print $2}')
  # Report failures only.
  if [ "${STATUS}" != "200" ]; then
    echo "PID:${PID} - STATUS:${STATUS}"
  fi
done
echo "Test Completed"

Run the script (add
Rerun the N=1000 script with the iteration number in the output:
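One way the loop could include the iteration number is sketched below. The helper name `report_failure` and the `i:` output prefix are assumptions made for illustration, not the format used in the original run.

```shell
#!/bin/bash
# Hypothetical helper: report_failure ITERATION PID STATUS
# Prints one line when STATUS is not 200, and nothing otherwise.
report_failure() {
  local iteration=$1
  local pid=$2
  local status=$3
  if [ "$status" != "200" ]; then
    echo "i:${iteration} - PID:${pid} - STATUS:${status}"
  fi
}
```

Inside the existing loop, the `if` block would then be replaced by a call such as `report_failure "$i" "$PID" "$STATUS"`.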
What a great testing method 💪 Perhaps there's something going wrong here that I need to dig into. How does the 1000x test run compare when you use the original binary?
Edit: Further investigation meant a rerun was required. Rerun results provided here. Putting the 0.14.3 binary back in place and rerunning the N=1000 test:
I was trying to avoid failing but... failed :-P
Here's an observation on faasd version: 0.14.3 commit: ea62c1b
Here's a repeat on a build from
Unfortunately, regardless of how long I wait after faasd starts up, the first hit on
This is why I hacked it by adding a script that after
Following the steps of @tmiklas, I've found that the error corresponds to an HTTP/500. So far I've only found
Which is introduced in this change, see:
However, I cannot currently see a path to that particular error message.
I ought to mention that the same happens for me using 0.14.3:
and the logs:
I can also trigger the HTTP/404 in this way, again on 0.14.3:
And logs:
cc @welteki The solution that we're exploring, with probing in the gateway, may also help with this issue in faasd.
Signed-off-by: Alex Ellis (OpenFaaS Ltd) alexellis2@gmail.com
Description
Wait for a function to become healthy in scale-up event
Motivation and Context
Prior to this change, after scaling a function up and
returning the API call, the function might still not be ready to
serve traffic. This resulted in HTTP errors some of the time,
especially if the task had been deleted instead of
just paused.
Pausing was instant, but during re-creation the function needs
some time to start up.
This change puts a health check into the hot path for the
scale event. It is blocking, so scaling up will have some
additional latency, but will return with a ready endpoint
much more of the time than previously.
This approach means that faasd doesn't have to run a set of
exec or HTTP healthchecks continually, and use CPU for
each of them, even when a function is idle.
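The blocking check described above can be illustrated with a small polling loop. This is only a sketch in shell, not the code from this change: the names `wait_until_ready` and `http_probe`, the attempt count, and the interval are all assumptions.

```shell
#!/bin/bash
# Hypothetical helper: wait_until_ready MAX_ATTEMPTS INTERVAL_SECONDS PROBE_CMD [ARGS...]
# Runs the probe until it succeeds or the attempts are used up.
# Returns 0 as soon as the probe succeeds, 1 if it never does.
wait_until_ready() {
  local max_attempts=$1
  local interval=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical probe: succeeds only when the URL answers without an HTTP error
# (curl -f exits non-zero on responses of 400 and above).
http_probe() {
  curl -sf -o /dev/null --max-time 1 "$1"
}
```

A caller would block on something like `wait_until_ready 50 0.1 http_probe "$HEALTH_URL"` before reporting the scale-up as complete, where `HEALTH_URL` stands for the function's health endpoint. The cost is paid once per scale event rather than continuously, which is the trade-off the description argues for.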
How Has This Been Tested?
Tested with the nodeinfo function, by killing the task
and then invoking the function. Prior to this, the
function may give an error code some of the time.
If you want to test this patch, follow these instructions:
https://github.com/openfaas/faasd/blob/master/docs/PATCHES.md
Then deploy your function, kill its task, and then invoke it. Let's say you deployed bot as a function.
If you're using an OpenFaaS template, it already implements a health endpoint, so there's no need to add anything to your code.
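The kill-then-invoke step can be sketched as a small helper, reusing the `ctr` and `curl` invocations from the load test script earlier in this thread. The function name `kill_and_invoke` and the `CTR`, `CURL`, and `GATEWAY` overrides are assumptions added for illustration.

```shell
#!/bin/bash
# Hypothetical helper: kill_and_invoke FUNCTION_NAME
# Removes the function's task, then invokes it through the gateway
# and prints the HTTP status code of the response.
kill_and_invoke() {
  local fn=$1
  local gateway=${GATEWAY:-http://127.0.0.1:8080}
  # Remove the running task so the next request forces a scale-up.
  # Errors are ignored in case the task is already gone.
  "${CTR:-ctr}" -n openfaas-fn task rm "$fn" -f > /dev/null 2>&1 || true
  # -w '%{http_code}' makes curl print only the status code.
  "${CURL:-curl}" -s -o /dev/null -w '%{http_code}\n' "${gateway}/function/${fn}"
}
```

With the patch applied, `kill_and_invoke bot` would be expected to print 200 on most runs, since the scale-up now waits for the function to become healthy.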