
✨ Add workspace phase update logic #3183

Merged
merged 3 commits into from
Nov 23, 2024

Conversation

mjudeikis
Contributor

@mjudeikis mjudeikis commented Oct 17, 2024

Summary

Add a reconciler that switches the workspace phase to "Unavailable" if any workspace conditions are not ready. This way, if some of the auxiliary conditions are Unavailable, the phase changes accordingly.

Adds a mounts controller to propagate status to the workspace when the mount is not ready. This way we don't use the workspace at all if the mount is not responding.

Adds code to the index/proxy to not serve workspaces that are in the Unavailable state, so we avoid using workspaces with erroneous backends.

partially based on kcp-dev/enhancements#6

Related issue(s)

Fixes #

Release Notes

Add a workspace phase reporter reconciler
Add the Unavailable phase to the API
Exclude Unavailable workspaces from serving via the proxy, so nothing is served that is not supposed to be.
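The proxy exclusion in the last note can be sketched roughly like this; the type and function names below are illustrative stand-ins, not the actual pkg/index code:

```go
package main

import "fmt"

// LogicalClusterPhaseType mirrors the kcp phase string type; everything
// below is a simplified sketch, not the real index implementation.
type LogicalClusterPhaseType string

const (
	PhaseReady       LogicalClusterPhaseType = "Ready"
	PhaseUnavailable LogicalClusterPhaseType = "Unavailable"
)

// lookupURL returns the shard URL to proxy to, or false when the workspace
// must not be served at all.
func lookupURL(phase LogicalClusterPhaseType, shardURL string) (string, bool) {
	if phase == PhaseUnavailable {
		// Unavailable workspaces are excluded from serving, so requests
		// never reach an erroneous backend.
		return "", false
	}
	return shardURL, true
}

func main() {
	_, served := lookupURL(PhaseUnavailable, "https://shard-1")
	fmt.Println(served) // false
	url, served := lookupURL(PhaseReady, "https://shard-1")
	fmt.Println(url, served) // https://shard-1 true
}
```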

@kcp-ci-bot kcp-ci-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 17, 2024
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from b1944b3 to 08b37fe Compare October 19, 2024 12:23
@kcp-ci-bot kcp-ci-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 19, 2024
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch 3 times, most recently from e6b7c5c to 9b951a7 Compare October 19, 2024 20:06
@mjudeikis mjudeikis changed the title ✨ Add workspace statuses updater ✨ Add workspace phase update logic Oct 19, 2024
@mjudeikis
Contributor Author

mjudeikis commented Oct 19, 2024

@sttts @embik so this might need a discussion.
With Mountpoint, we need a way to prevent users from interacting with the workspace if the mount is dead. The current behaviour is that if the URL does not respond, you get an error from the reverse proxy inside the proxy. But this does not fully cover all cases, as the URL might be alive while we still don't want users to interact with the workspace (maintenance, or just temporarily off).

In addition, it's very confusing (I found this from running this in my fork for some time already): if the mountpoint goes down due to the network, a dead agent, you name it, the Workspace still shows Ready, but if you try to access it you get a proxy error. So we need a way to indicate to users: "this is not healthy".

The solution I proposed was to add a new phase, Unavailable. So the flow would be:
Scheduling -> Initializing -> Ready <--> Unavailable. A Workspace can jump between Ready and Unavailable (like a node). This is achieved by extending phase reconciliation to detect if any condition on the workspace object is not True and flip the phase accordingly.

Even more, we don't even serve these workspaces at the proxy layer if they are Unavailable. This also allows us to do maintenance modes at the shard-level APIs.

This allows some third-party component, or in our case a mount controller, to inject a condition of its own and drive these workspace phase changes. It works quite nicely, but there is a risk that somebody adds some non-blocking condition, like "update is needed for Workspace APIs", which is more informational and not terminal.

So I think we should not go further than the original enhancement: https://github.com/kcp-dev/enhancements/pull/6/files#diff-716b46559ae0795860e35aebf72fce98bdf288d935756a40b917410386e10870R265

In addition, to avoid making the workspace unavailable for non-terminal conditions, only conditions with the prefix Workspace will be acted on.
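The rule from the last two paragraphs can be sketched as follows; the Condition shape and the prefix check are simplified stand-ins for the actual reconciler, not code from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified stand-in for the kcp API condition type.
type Condition struct {
	Type   string
	Status string // "True", "False", "Unknown"
}

const (
	PhaseReady       = "Ready"
	PhaseUnavailable = "Unavailable"
)

// nextPhase sketches the proposed reconciliation: once a workspace is
// Ready, any not-True condition whose type carries the "Workspace" prefix
// flips the phase to Unavailable, and it flips back when all such
// conditions are True again. Conditions without the prefix are treated as
// informational and ignored.
func nextPhase(current string, conditions []Condition) string {
	if current != PhaseReady && current != PhaseUnavailable {
		return current // Scheduling/Initializing are left untouched
	}
	for _, c := range conditions {
		if strings.HasPrefix(c.Type, "Workspace") && c.Status != "True" {
			return PhaseUnavailable
		}
	}
	return PhaseReady
}

func main() {
	conds := []Condition{{Type: "WorkspaceMountReady", Status: "False"}}
	fmt.Println(nextPhase(PhaseReady, conds)) // Unavailable
	conds[0].Status = "True"
	fmt.Println(nextPhase(PhaseUnavailable, conds)) // Ready
}
```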

@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from cfc160f to fce06d3 Compare October 20, 2024 07:16
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from fce06d3 to b36adce Compare October 20, 2024 14:39
@sttts
Member

sttts commented Oct 21, 2024

I think we have to distinguish between being switched to the new phase for some hard reason, like maintenance, versus soft reasons like network issues, or when the agent is not reachable. The former is definitely a phase, the latter is not.

There are many actors potentially who might experience problems to reach a mount. Think of some HA env where one component thinks the mount is down and the other side thinks it is up. If one of them can mark the mount unavailable for all components, that amplifies errors.

@mjudeikis
Contributor Author

I think we have to distinguish between being switched to the new phase for some hard reason, like maintenance, versus soft reasons like network issues, or when the agent is not reachable. The former is definitely a phase, the latter is not.

I think the final result of this is that the author needs to decide whether accessing the workspace should be allowed or not.
Imagine this situation: my workspace is backed by a mount. It's like in Linux, where I have a file but a symlink on top of it points to some other file. My file still exists, but it's not accessible because the link sits on top of it. Now, if the link malfunctions and the workspace is still served via the front-proxy, I might unintentionally get access to the original "file" and would not see what I expect to see. In this case I want to either show the right file or not show it at all.

So all in all, is there anything you suggest changing in this proposed implementation? If documentation, would our public docs be enough?

➜  mjudeikis tree
.
├── test
│   └── test.txt
└── test2
    └── test2.txt
➜  mjudeikis ln -sfn ../test2/test2.txt ./test/test.txt
➜  mjudeikis tree
.
├── test
│   └── test.txt -> ../test2/test2.txt
└── test2
    └── test2.txt

There are many actors potentially who might experience problems to reach a mount. Think of some HA env where one component thinks the mount is down and the other side thinks it is up. If one of them can mark the mount unavailable for all components, that amplifies errors.

I think this is now implemented in a way that deals with that. But it's the author's ownership to make that decision.

@sttts
Member

sttts commented Oct 22, 2024

"Unavailable" as a phase seems to cover the soft reasons too. I want to discourage implementors from implementing health checks in some controller and then setting that phase indirectly through a condition.

Note: a condition from a health check is fine.

Maybe "Inaccessible" is better? Or leave it with Unavailable but be very clear in the docs that this is not meant for health checks.

pkg/index/index.go Outdated Show resolved Hide resolved
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from 690c6bd to 0db20f1 Compare October 23, 2024 17:48
@mjudeikis
Contributor Author

/retest

// When we promote this to workspace structure, we should make this check smarter and better tested.
if (cluster.String() == ws.Spec.Cluster) && (ws.Annotations[tenancyv1alpha1.ExperimentalWorkspaceMountAnnotationKey] != "" && mountObjString == ws.Annotations[tenancyv1alpha1.ExperimentalWorkspaceMountAnnotationKey]) {
return
}
Contributor Author

These were an "overoptimization". It backfired now that the condition and the annotation are updated in two different reconcile cycles (to avoid updating both in one): the annotation gets updated, and this check then prevents moving forward and updating the error-code cache.

Overengineering from my side :/

// This should be simplified once we promote this to workspace structure.
clusterWorkspaceMountAnnotation: map[logicalcluster.Name]map[string]string{},
shardClusterWorkspaceMountAnnotation: map[string]map[logicalcluster.Name]map[string]string{},
Contributor Author

We have a few duplicates here. We could build a nicer abstraction.
Most data is shard -> cluster -> workspace -> data, so some in-memory mini store engine could help. Right now it's very hard to make changes and make sure we clean, add, and update when we need to.

Something to think about during insomnia nights.
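The shard -> cluster -> workspace -> data shape mentioned above could be wrapped once in a small generic store, roughly like this (a hypothetical abstraction, not code from this PR):

```go
package main

import "fmt"

// store is a hypothetical sketch of an "in-memory store mini-engine":
// one generic shard -> cluster -> workspace -> value map with set/get/
// delete, instead of several hand-maintained nested maps.
type store[V any] struct {
	data map[string]map[string]map[string]V // shard -> cluster -> workspace
}

func newStore[V any]() *store[V] {
	return &store[V]{data: map[string]map[string]map[string]V{}}
}

func (s *store[V]) set(shard, cluster, ws string, v V) {
	if s.data[shard] == nil {
		s.data[shard] = map[string]map[string]V{}
	}
	if s.data[shard][cluster] == nil {
		s.data[shard][cluster] = map[string]V{}
	}
	s.data[shard][cluster][ws] = v
}

func (s *store[V]) get(shard, cluster, ws string) (V, bool) {
	v, ok := s.data[shard][cluster][ws] // nil-map reads are safe in Go
	return v, ok
}

// delete removes one entry and prunes empty parent maps, so cleanup
// lives in one place instead of being repeated per nested map.
func (s *store[V]) delete(shard, cluster, ws string) {
	delete(s.data[shard][cluster], ws)
	if len(s.data[shard][cluster]) == 0 {
		delete(s.data[shard], cluster)
	}
	if len(s.data[shard]) == 0 {
		delete(s.data, shard)
	}
}

func main() {
	s := newStore[string]()
	s.set("shard-1", "root:org", "ws-a", "mount-annotation")
	v, ok := s.get("shard-1", "root:org", "ws-a")
	fmt.Println(v, ok) // mount-annotation true
	s.delete("shard-1", "root:org", "ws-a")
	_, ok = s.get("shard-1", "root:org", "ws-a")
	fmt.Println(ok) // false
}
```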

@mjudeikis
Contributor Author

/retest
show reconciler in tests? looking

pkg/index/index.go Outdated Show resolved Hide resolved
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from 480cd96 to 8c40ccb Compare October 24, 2024 09:46
@mjudeikis
Contributor Author

/retest

Member

@embik embik left a comment

This looks fine to me for now. I agree with @sttts that we should clearly document expectations for implementors once we have stabilised this a bit and are ready for people to integrate with it.

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2024
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: cea8ad2dfbed0a3dd9f3ee892695151f0f888cc5

// LogicalClusterPhaseUnavailable phase is used to indicate that the logical cluster is unavailable to be used.
// It will not be served via the front-proxy when in this state.
// Possible state transitions are from Ready to Unavailable and from Unavailable to Ready.
LogicalClusterPhaseUnavailable LogicalClusterPhaseType = "Unavailable"
Member

I still think we have to be very clear that this is NOT a phase for temporary unavailability backed by e.g. some probing.

Contributor Author

Updated

if current.Status == v1.ConditionTrue {
return reconcileStatusContinue, nil
}
conditions.MarkTrue(workspace, tenancyv1alpha1.MountConditionReady)
Member

isn't this and the if above the same?

@kcp-ci-bot kcp-ci-bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2024
@mjudeikis
Copy link
Contributor Author

I agree with @sttts that we should clearly document expectations for implementors once we have this stabilised a bit and are ready for people to integrate with it.

I think we might need to move to WorkspaceType before this, but agreed.

// It returns true if the phase was changed, false otherwise.
func updateTerminalConditionsAndPhase(workspace *tenancyv1alpha1.Workspace) bool {
func terminalConditionPhase(workspace *tenancyv1alpha1.Workspace) bool {
Member

@sttts sttts Nov 23, 2024

not really what I meant. Either call it terminalConditionPhase and make it side-effect-free (i.e. returning the phase string), or updateTerminalConditionPhase (without the "And").

Contributor Author

Taking the second one :) As the reconcilers already play a lot with pointers and edit existing objects, this just felt natural.
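The two naming options discussed in this thread read roughly like this (hypothetical simplified types, not the PR's actual code):

```go
package main

import "fmt"

// Workspace is a simplified stand-in for the tenancy API type.
type Workspace struct {
	Phase      string
	Conditions map[string]bool // condition type -> is True
}

// terminalConditionPhase is the side-effect-free variant: it only computes
// what the phase should be.
func terminalConditionPhase(ws *Workspace) string {
	for _, isTrue := range ws.Conditions {
		if !isTrue {
			return "Unavailable"
		}
	}
	return "Ready"
}

// updateTerminalConditionPhase is the mutating variant: the name signals
// the in-place update, and it reports whether the phase changed.
func updateTerminalConditionPhase(ws *Workspace) bool {
	next := terminalConditionPhase(ws)
	if ws.Phase == next {
		return false
	}
	ws.Phase = next
	return true
}

func main() {
	ws := &Workspace{Phase: "Ready", Conditions: map[string]bool{"WorkspaceMountReady": false}}
	fmt.Println(terminalConditionPhase(ws)) // Unavailable (computes only)
	fmt.Println(updateTerminalConditionPhase(ws), ws.Phase)
}
```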

@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from 7a42025 to 141f375 Compare November 23, 2024 16:56
@sttts
Member

sttts commented Nov 23, 2024

/lgtm
/approve

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2024
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: 14964f76c011b2238bdad778ec1cf3c71441424e

@kcp-ci-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 23, 2024
@kcp-ci-bot kcp-ci-bot merged commit 5bc6985 into kcp-dev:main Nov 23, 2024
18 checks passed