Update Seldon ROCKs to 1.17.1 version for CKF release 1.8 #37
Comments
mlserver-*
Regarding MLServer ROCKs, as we see in release 1.17.0, their version was bumped to 1.3.5. To compare against the previous version, we diffed the upstream Dockerfiles with `diff Dockerfile1.2.x Dockerfile1.3.5 >> changes.diff`. |
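The Dockerfile comparison mentioned above can also be scripted; a minimal sketch using Python's `difflib`, with hypothetical file contents standing in for the real upstream Dockerfiles:

```python
import difflib

# Hypothetical Dockerfile contents for illustration only -- the real files
# come from the upstream MLServer releases being compared.
old = ["FROM python:3.8-slim\n", "RUN pip install mlserver==1.2.4\n"]
new = ["FROM python:3.8-slim\n", "RUN pip install mlserver==1.3.5\n"]

# unified_diff yields patch-style lines: headers, context, -removals, +additions
diff = list(difflib.unified_diff(old, new,
                                 fromfile="Dockerfile.1.2.x",
                                 tofile="Dockerfile.1.3.5"))
print("".join(diff), end="")
```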
tensorflow-serving
Regarding the tensorflow-serving ROCK, we see that release 1.17.1 still uses `imageVersion` 2.1.0. |
sklearnserver
Regarding the sklearn-server ROCK, while investigating with @i-chvets, we realised that the current rockcraft.yaml file is based on the upstream Dockerfile.conda and doesn't take the Dockerfile into account. Looking at the upstream Makefile though, we see that in order to build the image, they use both (with Dockerfile.conda as a …).

EDIT: This has been resolved and the rockcraft.yaml updated. We have also documented its implementation in #47. |
This also needs to be addressed; even though it is not really ROCK work, it is related to 1.8: canonical/seldon-core-operator#200 |
tox.ini files
Regarding tox environments, we updated the pytest commands according to updates in the seldon-core-operator tox.ini file, which means that we used:
|
tensorflow-serving
We noticed that this could be the reason why, when we introduced this ROCK, the first tensorflow-serving version that worked was 2.13.0 (going up from 2.1.0). |
seldon-core-operator
@orfeas-k Here are some of my findings.
These are the errors seen on the SeldonDeployment while the tests are running (all pods are up):
More investigation is required into why version 1.17.1 fails to complete deployment of SeldonDeployments. |
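To see at a glance which conditions are blocking a SeldonDeployment, one option is to pull its status and filter the failing entries. A minimal Python sketch; the field names follow the condition blocks quoted in this thread, while the helper itself is hypothetical:

```python
def failing_conditions(status):
    """Return (type, reason) for every condition whose status is not 'True'."""
    return [(c["type"], c.get("reason", ""))
            for c in status.get("conditions", [])
            if c.get("status") != "True"]

# Example shaped like the SeldonDeployment status reported in this issue.
status = {"conditions": [
    {"type": "Available", "status": "False", "reason": "MinimumReplicasUnavailable"},
    {"type": "Progressing", "status": "True", "reason": "ReplicaSetUpdated"},
]}
print(failing_conditions(status))
```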
seldon-core-operator
Great job @i-chvets. These findings shed some light on the situation. Let's also note that the only change introduced in this update is the … Regarding logs, we've also seen this in the conditions of the `seldondeployment` resources, which is similar to, but not exactly the same as, the above. |
EDIT: This doesn't hold for the final PR that updates sklearnserver.

Looks like this is also the case for the following status:

```yaml
status:
  conditions:
  - lastTransitionTime: "2023-09-13T10:14:20Z"
    lastUpdateTime: "2023-09-13T10:14:20Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  - lastTransitionTime: "2023-09-13T10:14:20Z"
    lastUpdateTime: "2023-09-13T10:14:20Z"
    message: ReplicaSet "mlflow-default-0-classifier-665d987847" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
```

After this, all subsequent runs of the tests (with different parameters) fail. |
mlserver-sklearn
Trying to run the tests, it looks like the corresponding seldondeployment pod (`sklearn-default-0-classifier-5b85bb86b5-f5nmm`) never becomes ready:

```
ubuntu@ip-172-31-31-120:~$ kubectl -n test-seldon-servers-ooek logs sklearn-default-0-classifier-5b85bb86b5-f5nmm --all-containers
{"level":"error","ts":1694607015.8338234,"logger":"SeldonRestApi","msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused","stacktrace":"net/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.handleCORSRequests.func1\n\t/workspace/api/rest/middlewares.go:64\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.CORSMethodMiddleware.func1.1\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/middleware.go:51\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.xssMiddleware.func1\n\t/workspace/api/rest/middlewares.go:87\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.(*CloudeventHeaderMiddleware).Middleware.func1\n\t/workspace/api/rest/middlewares.go:47\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/seldonio/seldon-core/executor/api/rest.puidHeader.func1\n\t/workspace/api/rest/middlewares.go:79\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2047\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\t/go/pkg/mod/github.com/gorilla/mux@v1.8.0/mux.go:210\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2879\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1930"}
```

Describing the same pod, we get these events:

```
ubuntu@ip-172-31-31-120:~$ k -n test-seldon-servers-ooek describe pod sklearn-default-0-classifier-5b85bb86b5-f5nmm
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m52s                  default-scheduler  Successfully assigned test-seldon-servers-ooek/sklearn-default-0-classifier-5b85bb86b5-f5nmm to ip-172-31-31-120
  Normal   Pulled     3m52s                  kubelet            Container image "seldonio/rclone-storage-initializer:1.14.1" already present on machine
  Normal   Created    3m52s                  kubelet            Created container classifier-model-initializer
  Normal   Started    3m52s                  kubelet            Started container classifier-model-initializer
  Normal   Pulled     3m51s                  kubelet            Container image "mlserver-sklearn_1.3.5_20.04_1_amd64:1.3.5_20.04_1" already present on machine
  Normal   Created    3m51s                  kubelet            Created container classifier
  Normal   Started    3m51s                  kubelet            Started container classifier
  Normal   Pulled     3m51s                  kubelet            Container image "docker.io/seldonio/seldon-core-executor:1.14.0" already present on machine
  Normal   Created    3m51s                  kubelet            Created container seldon-container-engine
  Normal   Started    3m51s                  kubelet            Started container seldon-container-engine
  Warning  Unhealthy  2m53s (x8 over 3m28s)  kubelet            Readiness probe failed: Get "http://10.1.63.206:9000/v2/health/ready": dial tcp 10.1.63.206:9000: connect: connection refused
  Warning  Unhealthy  2m53s (x8 over 3m28s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
```

As expected, in the pod's conditions we see:
```yaml
conditions:
- lastProbeTime: null
  lastTransitionTime: "2023-09-13T12:09:52Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2023-09-13T12:09:50Z"
  message: 'containers with unready status: [classifier seldon-container-engine]'
  reason: ContainersNotReady
  status: "False"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2023-09-13T12:09:50Z"
  message: 'containers with unready status: [classifier seldon-container-engine]'
  reason: ContainersNotReady
  status: "False"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2023-09-13T12:09:50Z"
  status: "True"
  type: PodScheduled
```

and in the SeldonDeployment's status:
```yaml
address:
  url: http://sklearn-default.test-seldon-servers-ooek.svc.cluster.local:8000/v2/models/classifier/infer
conditions:
- lastTransitionTime: "2023-09-13T12:09:50Z"
  message: Deployment does not have minimum availability.
  reason: MinimumReplicasUnavailable
  status: "False"
  type: DeploymentsReady
- lastTransitionTime: "2023-09-13T12:09:50Z"
  reason: No HPAs defined
  status: "True"
  type: HpasReady
- lastTransitionTime: "2023-09-13T12:09:50Z"
  reason: No KEDA resources defined
  status: "True"
  type: KedaReady
- lastTransitionTime: "2023-09-13T12:09:50Z"
  reason: No PDBs defined
  status: "True"
  type: PdbsReady
- lastTransitionTime: "2023-09-13T12:09:50Z"
  message: Deployment does not have minimum availability.
  reason: MinimumReplicasUnavailable
  status: "False"
  type: Ready
- lastTransitionTime: "2023-09-13T12:09:50Z"
  reason: Not all services created
  status: "False"
  type: ServicesReady
```

Subsequent runs also failed (e.g. the run …), logging …
|
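The "connection refused" readiness failures above simply mean nothing is listening on the probed port yet. A minimal Python analogue of such a TCP readiness check (the helper name is ours, not part of Seldon or Kubernetes):

```python
import socket

def tcp_ready(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False

# A port with a live listener reports ready; once it closes, the probe fails.
server = socket.socket()
server.bind(("127.0.0.1", 0))  # grab an ephemeral port
server.listen(1)
port = server.getsockname()[1]
print(tcp_ready("127.0.0.1", port))  # listener up
server.close()
```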
Regarding the issue above, in the process of updating the charm's manifests, I've hit the same issues when running the tests. The logs I see in seldon-core-operator:
|
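The executor logs quoted in this thread are structured JSON, so they can be filtered mechanically when triaging; a small sketch (the sample line is abridged from the log above, and the `errors` helper is ours):

```python
import json

sample = ('{"level":"error","ts":1694607015.8338234,"logger":"SeldonRestApi",'
          '"msg":"Ready check failed","error":"dial tcp [::1]:9000: connect: connection refused"}')

def errors(lines):
    """Yield (logger, msg, error) for every error-level JSON log line."""
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines (e.g. plain-text log output)
        if rec.get("level") == "error":
            yield rec.get("logger"), rec.get("msg"), rec.get("error")

for logger, msg, err in errors([sample]):
    print(f"{logger}: {msg} ({err})")
```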
- Skip doing rockcraft.yaml updates since release 1.17.1 still uses `imageVersion` 2.1.0
- Update the `version` as per canonical/bundle-kubeflow#747
- Update `base` since using `:` is deprecated
- Refactor `tox.ini` according to canonical/oidc-authservice-rock#14 and canonical/bundle-kubeflow#763
- Update `test_rock.py` according to latest changes in chisme canonical/charmed-kubeflow-chisme#81

Refs #37
…66)
- Update ROCK according to upstream changes
- Introduce parts that we missed in the ROCK
- Use ubuntu 20.04 as base due to #39
- Refactor tox.ini according to canonical/oidc-authservice-rock#14 and canonical/bundle-kubeflow#763
- Update `test_rock.py` according to latest changes in chisme canonical/charmed-kubeflow-chisme#81

Details for changes in #37. Closes #54
MLServer-* command
We will omit …
|
Thank you for reporting your feedback! The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5186.
|
Update ROCK according to upstream changes plus:
- Introduce parts that we missed in the ROCK
- Use ubuntu 20.04 as base due to #39
- Refactor tox.ini according to canonical/oidc-authservice-rock#14 and canonical/bundle-kubeflow#763
- Update `test_rock.py` according to latest changes in chisme canonical/charmed-kubeflow-chisme#81
- Pin starlette due to #80

Ref #37 Closes #53
Update ROCK according to upstream changes plus:
- Introduce parts that we missed in the ROCK
- Use ubuntu 20.04 as base due to #39
- Refactor tox.ini according to canonical/oidc-authservice-rock#14 and canonical/bundle-kubeflow#763
- Update `test_rock.py` according to latest changes in chisme canonical/charmed-kubeflow-chisme#81
- Pin starlette due to #80

Ref #37 Closes #51
Update ROCK according to upstream changes plus:
- Introduce parts that we missed in the ROCK
- Use ubuntu 20.04 as base due to #39
- Refactor tox.ini according to canonical/oidc-authservice-rock#14 and canonical/bundle-kubeflow#763
- Update `test_rock.py` according to latest changes in chisme canonical/charmed-kubeflow-chisme#81
- Pin starlette due to #80

Ref #37 Closes #52
All ROCKs have been updated in linked PRs. |
This issue tracks the process of updating Seldon ROCKs to Seldon's 1.17 for CKF release 1.8. For the process, we're following our internal Kubeflow ROCK Images Best Practices document, which has a section about the Upgrade of ROCK Images.
The changes that this process will introduce should match what upstream has for release 1.17.1.