
✨ Add workspace phase update logic #3183

Merged
merged 3 commits into from
Nov 23, 2024

Conversation

mjudeikis
Contributor

@mjudeikis mjudeikis commented Oct 17, 2024

Summary

Add a reconciler that switches the workspace phase to "Unavailable" if any workspace conditions are not ready. This way, if some of the auxiliary conditions are Unavailable, the phase changes accordingly.

Adds a mounts controller to propagate status to the workspace when the mount is not ready. This way we don't use the workspace at all if the mount is not responding.

Adds code to the index/proxy to not serve workspaces that are in the Unavailable state, so we avoid using workspaces with erroneous backends.

partially based on kcp-dev/enhancements#6

Related issue(s)

Fixes #

Release Notes

Add a workspace phase reporter reconciler
Add the Unavailable phase to the API
Exclude Unavailable workspaces from serving via the proxy, so nothing is served that is not supposed to be.
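The proxy exclusion in the last note can be sketched roughly like this; the type and function names below are illustrative stand-ins, not the actual pkg/index code:

```go
package main

import "fmt"

// LogicalClusterPhaseType mirrors the kcp phase string type; everything
// below is a simplified sketch, not the real index implementation.
type LogicalClusterPhaseType string

const (
	PhaseReady       LogicalClusterPhaseType = "Ready"
	PhaseUnavailable LogicalClusterPhaseType = "Unavailable"
)

// lookupURL returns the shard URL to proxy to, or false when the workspace
// must not be served at all.
func lookupURL(phase LogicalClusterPhaseType, shardURL string) (string, bool) {
	if phase == PhaseUnavailable {
		// Unavailable workspaces are excluded from serving, so requests
		// never reach an erroneous backend.
		return "", false
	}
	return shardURL, true
}

func main() {
	_, served := lookupURL(PhaseUnavailable, "https://shard-1")
	fmt.Println(served) // false
	url, served := lookupURL(PhaseReady, "https://shard-1")
	fmt.Println(url, served) // https://shard-1 true
}
```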

@kcp-ci-bot kcp-ci-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the DCO. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Oct 17, 2024
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from b1944b3 to 08b37fe Compare October 19, 2024 12:23
@kcp-ci-bot kcp-ci-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 19, 2024
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch 3 times, most recently from e6b7c5c to 9b951a7 Compare October 19, 2024 20:06
@mjudeikis mjudeikis changed the title ✨ Add workspace statuses updater ✨ Add workspace phase update logic Oct 19, 2024
@mjudeikis
Contributor Author

mjudeikis commented Oct 19, 2024

@sttts @embik so this might need a discussion.
With Mountpoint, we need a way to prevent users from interacting with the workspace if the mount is dead. The current behaviour is that if the URL does not respond, you get an error from the reverse proxy inside the proxy. But this does not fully cover all cases, as the URL might be alive while we still don't want users to interact with the workspace (maintenance, or just temporarily off).

In addition, it's very confusing (I found this from running this in my fork for some time already): if the mountpoint goes down due to the network, a dead agent, you name it, the Workspace still shows Ready, but if you try to access it you get a proxy error. So we need a way to indicate to users: "this is not healthy".

The solution I proposed was to add a new phase, Unavailable. So the flow would be:
Scheduling -> Initializing -> Ready <--> Unavailable. A Workspace can jump between Ready and Unavailable (like a node). This is achieved by extending phase reconciliation to detect if any condition on the workspace object is not True and flip the phase accordingly.

Even more, we don't even serve these workspaces at the proxy layer if they are Unavailable. This also allows us to do maintenance modes at the shard-level APIs.

This allows some third-party component, or in our case a mount controller, to inject a condition of its own and drive these workspace phase changes. It works quite nicely, but there is a risk that somebody adds some non-blocking condition, like "update is needed for Workspace APIs", which is more informational and not terminal.

So I think we should not go further than the original enhancement: https://github.com/kcp-dev/enhancements/pull/6/files#diff-716b46559ae0795860e35aebf72fce98bdf288d935756a40b917410386e10870R265

In addition, to avoid making the workspace unavailable for non-terminal conditions, only conditions with the prefix Workspace will be acted on.
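The rule from the last two paragraphs can be sketched as follows; the Condition shape and the prefix check are simplified stand-ins for the actual reconciler, not code from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified stand-in for the kcp API condition type.
type Condition struct {
	Type   string
	Status string // "True", "False", "Unknown"
}

const (
	PhaseReady       = "Ready"
	PhaseUnavailable = "Unavailable"
)

// nextPhase sketches the proposed reconciliation: once a workspace is
// Ready, any not-True condition whose type carries the "Workspace" prefix
// flips the phase to Unavailable, and it flips back when all such
// conditions are True again. Conditions without the prefix are treated as
// informational and ignored.
func nextPhase(current string, conditions []Condition) string {
	if current != PhaseReady && current != PhaseUnavailable {
		return current // Scheduling/Initializing are left untouched
	}
	for _, c := range conditions {
		if strings.HasPrefix(c.Type, "Workspace") && c.Status != "True" {
			return PhaseUnavailable
		}
	}
	return PhaseReady
}

func main() {
	conds := []Condition{{Type: "WorkspaceMountReady", Status: "False"}}
	fmt.Println(nextPhase(PhaseReady, conds)) // Unavailable
	conds[0].Status = "True"
	fmt.Println(nextPhase(PhaseUnavailable, conds)) // Ready
}
```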

@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from cfc160f to fce06d3 Compare October 20, 2024 07:16
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from fce06d3 to b36adce Compare October 20, 2024 14:39
@sttts
Member

sttts commented Oct 21, 2024

I think we have to distinguish between being switched to the new phase for some hard reason, like maintenance, versus soft reasons like network issues, or when the agent is not reachable. The former is definitely a phase, the latter is not.

There are many actors potentially who might experience problems to reach a mount. Think of some HA env where one component thinks the mount is down and the other side thinks it is up. If one of them can mark the mount unavailable for all components, that amplifies errors.

@mjudeikis
Contributor Author

I think we have to distinguish between being switched to the new phase for some hard reason, like maintenance, versus soft reasons like network issues, or when the agent is not reachable. The former is definitely a phase, the latter is not.

I think the final result of this is that the author needs to decide whether accessing the workspace should be allowed or not.
Imagine this situation: my workspace is backed by a mount. It's like in Linux, where I have a file but a symlink on top of it points to some other file. My file still exists, but it's not accessible because the link sits on top of it. Now, if the link malfunctions and the workspace is still served via the front-proxy, I might unintentionally get access to the original "file" and would not see what I expect to see. In this case I want to either show the right file or not show it at all.

So all in all, is there anything you suggest changing in this proposed implementation? If documentation, would our public docs be enough?

➜  mjudeikis tree
.
├── test
│   └── test.txt
└── test2
    └── test2.txt
➜  mjudeikis ln -sfn ../test2/test2.txt ./test/test.txt
➜  mjudeikis tree
.
├── test
│   └── test.txt -> ../test2/test2.txt
└── test2
    └── test2.txt

There are many actors potentially who might experience problems to reach a mount. Think of some HA env where one component thinks the mount is down and the other side thinks it is up. If one of them can mark the mount unavailable for all components, that amplifies errors.

I think this is now implemented in a way that deals with that. But it's the author's ownership to make that decision.

@sttts
Member

sttts commented Oct 22, 2024

"Unavailable" as a phase seems to cover the soft reasons too. I want to discourage implementors from implementing health checks in some controller and then setting that phase indirectly through a condition.

Note: a condition from a health check is fine.

Maybe "Inaccessible" is better? Or leave it with Unavailable but be very clear in the docs that this is not meant for health checks.

pkg/index/index.go Outdated Show resolved Hide resolved
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from 690c6bd to 0db20f1 Compare October 23, 2024 17:48
@mjudeikis
Contributor Author

/retest

// When we promote this to workspace structure, we should make this check smarter and better tested.
if (cluster.String() == ws.Spec.Cluster) && (ws.Annotations[tenancyv1alpha1.ExperimentalWorkspaceMountAnnotationKey] != "" && mountObjString == ws.Annotations[tenancyv1alpha1.ExperimentalWorkspaceMountAnnotationKey]) {
return
}
Contributor Author

These were an "overoptimization". It backfired now that the condition and the annotation are updated in two different reconcile cycles (to avoid updating both in one): the annotation gets updated, and this check then prevents moving forward and updating the error-code cache.

Overengineering from my side :/

// This should be simplified once we promote this to workspace structure.
clusterWorkspaceMountAnnotation: map[logicalcluster.Name]map[string]string{},
shardClusterWorkspaceMountAnnotation: map[string]map[logicalcluster.Name]map[string]string{},
Contributor Author

We have a few duplicates here. We could build a nicer abstraction.
Most data is shard -> cluster -> workspace -> data, so some in-memory mini store engine could help. Right now it's very hard to make changes and make sure we clean, add, and update when we need to.

Something to think about during insomnia nights.
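The shard -> cluster -> workspace -> data shape mentioned above could be wrapped once in a small generic store, roughly like this (a hypothetical abstraction, not code from this PR):

```go
package main

import "fmt"

// store is a hypothetical sketch of an "in-memory store mini-engine":
// one generic shard -> cluster -> workspace -> value map with set/get/
// delete, instead of several hand-maintained nested maps.
type store[V any] struct {
	data map[string]map[string]map[string]V // shard -> cluster -> workspace
}

func newStore[V any]() *store[V] {
	return &store[V]{data: map[string]map[string]map[string]V{}}
}

func (s *store[V]) set(shard, cluster, ws string, v V) {
	if s.data[shard] == nil {
		s.data[shard] = map[string]map[string]V{}
	}
	if s.data[shard][cluster] == nil {
		s.data[shard][cluster] = map[string]V{}
	}
	s.data[shard][cluster][ws] = v
}

func (s *store[V]) get(shard, cluster, ws string) (V, bool) {
	v, ok := s.data[shard][cluster][ws] // nil-map reads are safe in Go
	return v, ok
}

// delete removes one entry and prunes empty parent maps, so cleanup
// lives in one place instead of being repeated per nested map.
func (s *store[V]) delete(shard, cluster, ws string) {
	delete(s.data[shard][cluster], ws)
	if len(s.data[shard][cluster]) == 0 {
		delete(s.data[shard], cluster)
	}
	if len(s.data[shard]) == 0 {
		delete(s.data, shard)
	}
}

func main() {
	s := newStore[string]()
	s.set("shard-1", "root:org", "ws-a", "mount-annotation")
	v, ok := s.get("shard-1", "root:org", "ws-a")
	fmt.Println(v, ok) // mount-annotation true
	s.delete("shard-1", "root:org", "ws-a")
	_, ok = s.get("shard-1", "root:org", "ws-a")
	fmt.Println(ok) // false
}
```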

@mjudeikis
Contributor Author

/retest
show reconciler in tests? looking

pkg/index/index.go Outdated Show resolved Hide resolved
@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from 480cd96 to 8c40ccb Compare October 24, 2024 09:46
@mjudeikis
Contributor Author

/retest

Member

@embik embik left a comment

This looks fine to me for now. I agree with @sttts that we should clearly document expectations for implementors once we have stabilised this a bit and are ready for people to integrate with it.

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2024
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: cea8ad2dfbed0a3dd9f3ee892695151f0f888cc5

// LogicalClusterPhaseUnavailable phase is used to indicate that the logical cluster is unavailable to be used.
// It will not be served via the front-proxy when in this state.
// Possible state transitions are from Ready to Unavailable and from Unavailable to Ready.
LogicalClusterPhaseUnavailable LogicalClusterPhaseType = "Unavailable"
Member

I still think we have to be very clear that this is NOT a phase for temporary unavailability backed by e.g. some probing.

Contributor Author

Updated

if current.Status == v1.ConditionTrue {
return reconcileStatusContinue, nil
}
conditions.MarkTrue(workspace, tenancyv1alpha1.MountConditionReady)
Member

isn't this and the if above the same?

@kcp-ci-bot kcp-ci-bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2024
@mjudeikis
Copy link
Contributor Author

I agree with @sttts that we should clearly document expectations for implementors once we have this stabilised a bit and are ready for people to integrate with it.

I think we might need to move to WorkspaceType before this, but agreed.

// It returns true if the phase was changed, false otherwise.
func updateTerminalConditionsAndPhase(workspace *tenancyv1alpha1.Workspace) bool {
func terminalConditionPhase(workspace *tenancyv1alpha1.Workspace) bool {
Member

@sttts sttts Nov 23, 2024

not really what I meant. Either call it terminalConditionPhase and make it side-effect-free (i.e. returning the phase string), or updateTerminalConditionPhase (without the "And").

Contributor Author

Taking the second one :) As the reconcilers already play a lot with pointers and edit existing objects, this just felt natural.
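The two naming options discussed in this thread read roughly like this (hypothetical simplified types, not the PR's actual code):

```go
package main

import "fmt"

// Workspace is a simplified stand-in for the tenancy API type.
type Workspace struct {
	Phase      string
	Conditions map[string]bool // condition type -> is True
}

// terminalConditionPhase is the side-effect-free variant: it only computes
// what the phase should be.
func terminalConditionPhase(ws *Workspace) string {
	for _, isTrue := range ws.Conditions {
		if !isTrue {
			return "Unavailable"
		}
	}
	return "Ready"
}

// updateTerminalConditionPhase is the mutating variant: the name signals
// the in-place update, and it reports whether the phase changed.
func updateTerminalConditionPhase(ws *Workspace) bool {
	next := terminalConditionPhase(ws)
	if ws.Phase == next {
		return false
	}
	ws.Phase = next
	return true
}

func main() {
	ws := &Workspace{Phase: "Ready", Conditions: map[string]bool{"WorkspaceMountReady": false}}
	fmt.Println(terminalConditionPhase(ws)) // Unavailable (computes only)
	fmt.Println(updateTerminalConditionPhase(ws), ws.Phase)
}
```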

@mjudeikis mjudeikis force-pushed the mjudeikis/workspace.status.merges branch from 7a42025 to 141f375 Compare November 23, 2024 16:56
@sttts
Member

sttts commented Nov 23, 2024

/lgtm
/approve

@kcp-ci-bot kcp-ci-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2024
@kcp-ci-bot
Contributor

LGTM label has been added.

Git tree hash: 14964f76c011b2238bdad778ec1cf3c71441424e

@kcp-ci-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sttts

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 23, 2024
@kcp-ci-bot kcp-ci-bot merged commit 5bc6985 into kcp-dev:main Nov 23, 2024
18 checks passed