ActorSystem: improve timers (scheduler) #12091

Open
snaury opened this issue Nov 28, 2024 · 0 comments
Labels
area/actorsystem Actor System related issues


Currently the actor system uses a special thread (the scheduler) that waits until a deadline is reached and then sends the scheduled messages. However, the scheduler has multiple problems:

  • It uses multiple lock-free queues that have no way to wake the scheduler. As a result the scheduler wakes up often and consumes CPU even when completely idle. We need a way to wake the sleeping scheduler only when a new event is scheduled with an earlier deadline than the one it is currently waiting for.
  • Historically the scheduler has had very poor support for timers that are far in the future, and for that reason we have "long" timers. These are actors that reschedule a timer multiple times as long as it is not cancelled. Most of the time these timers are cancelled and never reach their deadline, but the additional actor layer consumes resources unnecessarily.
  • Scheduled events support so-called cookies (2-way and 3-way) for cancellation. However, these cookies don't really cancel timers, they just mark them as no longer necessary. In truth these timers stay in the queue and consume resources (primarily memory).
  • The memory situation is actually pretty bad: under heavy traffic (~200k rps) nodes almost look like they leak memory, because unnecessary scheduled events consume memory until their deadline. The deadline is often far away (minutes), so they accumulate.

This needs to be fixed. We need to support long timers without extra actor layers. When a scheduled event is cancelled, its memory should be freed as soon as possible. Instead of complicated 2-way and 3-way cookies (which nobody uses because they are complicated), scheduling an event should always return a handle that may be used to cancel the event (as long as it has not been sent already).

I think I have a way to make all of this fast and scalable:

  • Per-thread lock-free stacks, which are fast to push handles to. These handles are intrusive.
  • Handles are almost like 2-way cookies. When a handle is pushed to the queue it temporarily has two owners (the actor and the scheduler); as long as there's at least one owner it must not be deallocated.
  • A handle should have a const deadline, some scheduler fields (for intrusive heap/list management), and an atomic state used to communicate between the two parties.
  • The scheduler pops these handles from the queue and adds them to the heap/list, calculating the next deadline.
  • When the deadline has been reached the scheduler should mark the state as "sent" (the Triggered state below) and send the corresponding event. Depending on the previous state it should either free the handle or leave it for someone else to free.
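The handle layout and the push side described above could look roughly like the following. This is only a minimal sketch: `TTimerHandle`, `THandleStack`, `Push` and `PopAll` are hypothetical names, the deadline is assumed to be a plain integer timestamp, and the stack is assumed to be drained by a single consumer (the scheduler) that takes the whole list at once, which sidesteps the ABA problem of popping single nodes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical handle layout; field names are illustrative only.
struct TTimerHandle {
    const uint64_t Deadline;         // fixed when the handle is created
    TTimerHandle* Next = nullptr;    // intrusive link, used while in the stack
    std::atomic<uint32_t> State{0};  // shared between the actor and the scheduler

    explicit TTimerHandle(uint64_t deadline) : Deadline(deadline) {}
};

// Lock-free intrusive stack: any number of pushers, one consumer.
class THandleStack {
public:
    // Pushes a handle; returns true if the stack was empty before the push.
    bool Push(TTimerHandle* h) noexcept {
        TTimerHandle* head = Head_.load(std::memory_order_relaxed);
        do {
            h->Next = head;  // link to the current head before publishing
        } while (!Head_.compare_exchange_weak(
            head, h, std::memory_order_release, std::memory_order_relaxed));
        return head == nullptr;
    }

    // Detaches everything pushed so far; the list comes back in LIFO order.
    TTimerHandle* PopAll() noexcept {
        return Head_.exchange(nullptr, std::memory_order_acquire);
    }

private:
    std::atomic<TTimerHandle*> Head_{nullptr};
};
```

The scheduler would walk the list returned by `PopAll` and move each handle into its heap. The "was empty" result of `Push` is one possible hook for deciding whether a wake-up is worth considering; whether that is useful depends on how wake-ups end up being signalled.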

The states are:

  • Enqueued: initial state when the handle is first added to the queue
  • Enqueued+Cancelled: the timer was cancelled while in the queue (scheduler will free the handle after it pops it from the queue)
  • Enqueued+Abandoned: the timer became uncancellable while in the queue (scheduler is responsible for freeing the handle after the deadline)
  • Scheduled: the handle is waiting in the scheduler's heap until the deadline
  • Scheduled+Cancelled: the timer was cancelled while scheduled; the event should not be sent when the deadline is reached, and the handle should be removed after it has been popped from the queue a second time
  • Scheduled+Abandoned: the timer is uncancellable while in the heap, the handle should be freed after the deadline
  • Triggered: the handle reached the deadline and the event has been sent (or will be sent soon); it cannot be cancelled anymore and the actor is responsible for freeing the handle
  • Triggered+Cancelled: optional support for 3-way cookie behavior, e.g. when Cancel is called after the event has been triggered, this can be detected when the event is received. Probably not needed: cancelling after the trigger should just free the handle, and there's no state pointer after that, so there is no need for this state.
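One way to picture the scheduler side of these states is as a phase plus two flag bits packed into the atomic state. The sketch below is an assumption, not a settled design: the bit encoding, `EAction`, `OnDequeued` and `OnDeadline` are all hypothetical names, and the optional Triggered+Cancelled state is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical encoding: two low bits for the phase, two flag bits on top.
constexpr uint32_t Enqueued  = 0;
constexpr uint32_t Scheduled = 1;
constexpr uint32_t Triggered = 2;
constexpr uint32_t Cancelled = 4;  // set by the actor
constexpr uint32_t Abandoned = 8;  // set by the actor

// What the scheduler should do with the handle next.
enum class EAction {
    AddToHeap,           // keep the handle until its deadline
    Free,                // drop it from the heap if still there, then free it
    SendActorFrees,      // send the event; the actor owns the handle now
    SendSchedulerFrees,  // send the event, then free the handle
    SkipSend,            // cancelled; freed when popped from the queue again
};

// Called for every handle the scheduler pops from the queue.
EAction OnDequeued(std::atomic<uint32_t>& state) {
    uint32_t cur = state.load(std::memory_order_acquire);
    for (;;) {
        if (cur & Cancelled) {
            // Enqueued+Cancelled, or Scheduled+Cancelled on the second pop.
            return EAction::Free;
        }
        // Enqueued -> Scheduled, keeping the Abandoned flag if it is set.
        uint32_t next = Scheduled | (cur & Abandoned);
        if (state.compare_exchange_weak(cur, next,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            return EAction::AddToHeap;
        }
    }
}

// Called when a handle in the heap reaches its deadline.
EAction OnDeadline(std::atomic<uint32_t>& state) {
    uint32_t cur = state.load(std::memory_order_acquire);
    for (;;) {
        if (cur & Cancelled) {
            return EAction::SkipSend;
        }
        if (state.compare_exchange_weak(cur, Triggered,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            // cur still holds the state observed just before the switch.
            return (cur & Abandoned) ? EAction::SendSchedulerFrees
                                     : EAction::SendActorFrees;
        }
    }
}
```

Returning an action instead of acting directly keeps the state machine separate from heap and memory management, so the transitions can be checked in isolation.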

When the actor-side handle is abandoned without cancellation, the handle is marked as abandoned (and the scheduler becomes responsible for freeing the handle's memory), unless it has been triggered already. When an event is cancelled before the scheduler has dequeued it, the handle is marked as cancelled and left for the scheduler to sort out. When an already scheduled event is cancelled, the handle is marked as such and pushed to the queue a second time (to handle races); the scheduler will then remove it from the heap and free any associated memory. To process this correctly we would need to wake the scheduler thread up. Cancelling an already triggered event is a special case only needed for 3-way cookie behavior (however, one side of this cookie is usually in the event itself, which may be a challenge to replicate without extra allocations).
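The actor side of this paragraph can be sketched as two small operations on the atomic state. As before this is only an illustration under assumptions: the bit encoding, `ECancelResult`, `Cancel` and `Abandon` are hypothetical, and the 3-way cookie case is not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical encoding: two low bits for the phase, two flag bits on top.
constexpr uint32_t Enqueued  = 0;
constexpr uint32_t Scheduled = 1;
constexpr uint32_t Triggered = 2;
constexpr uint32_t PhaseMask = 3;
constexpr uint32_t Cancelled = 4;
constexpr uint32_t Abandoned = 8;

enum class ECancelResult {
    LeftForScheduler,  // still in the queue; the scheduler frees it on pop
    MustRequeue,       // in the heap; push it again and wake the scheduler
    AlreadyTriggered,  // too late to cancel; the actor frees the handle
};

// Actor side: cancel the timer behind this state word.
ECancelResult Cancel(std::atomic<uint32_t>& state) {
    uint32_t cur = state.load(std::memory_order_acquire);
    for (;;) {
        if ((cur & PhaseMask) == Triggered) {
            return ECancelResult::AlreadyTriggered;
        }
        if (state.compare_exchange_weak(cur, cur | Cancelled,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            return (cur & PhaseMask) == Enqueued
                ? ECancelResult::LeftForScheduler
                : ECancelResult::MustRequeue;
        }
    }
}

// Actor side: drop the handle without cancelling.
// Returns true if the actor must free the handle itself.
bool Abandon(std::atomic<uint32_t>& state) {
    uint32_t cur = state.load(std::memory_order_acquire);
    for (;;) {
        if ((cur & PhaseMask) == Triggered) {
            return true;  // the scheduler is done with it
        }
        if (state.compare_exchange_weak(cur, cur | Abandoned,
                std::memory_order_acq_rel, std::memory_order_acquire)) {
            return false;  // the scheduler frees it after the deadline
        }
    }
}
```

A caller that gets `MustRequeue` would push the handle to the queue again and wake the scheduler, which matches the "pushed a second time" step described above.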

Now the scheduler will need to communicate its current deadline, and use an interruptible sleep until that deadline. When pushing events with an earlier deadline (or when cancelling already scheduled events) we would need to interrupt the sleep, which is only needed once per sleep cycle. Sleeps will be as coarse as needed, e.g. when there's one event per minute (unlikely, but whatever) the scheduler will only wake up once a minute. The scheduler also doesn't need to wake up to drain the queue unless it needs to free memory (cancelled events).
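A rough sketch of such an interruptible sleep, built on a mutex and a condition variable purely for illustration (the real mechanism could just as well be a futex or an eventfd). `TSchedulerSleep`, `NotifyDeadline`, `Wake`, `SleepUntil` and the `hasPending` callback are hypothetical. The sketch uses sequentially consistent atomics and assumes the producer pushes to the queue before calling `NotifyDeadline`, with ordering strong enough that either the scheduler's `hasPending` check sees the push or the producer sees the published deadline.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <limits>
#include <mutex>

class TSchedulerSleep {
public:
    using TClock = std::chrono::steady_clock;
    using TTicks = TClock::rep;

    // Producer side, called after pushing an event with this deadline.
    // Wakes the scheduler only if it sleeps towards a later deadline.
    bool NotifyDeadline(TClock::time_point deadline) {
        if (deadline.time_since_epoch().count() >= Published_.load()) {
            return false;  // common path: no wake-up needed
        }
        return Wake();
    }

    // Unconditional wake-up, e.g. after cancelling a scheduled event.
    // Returns false if a wake-up was already requested for this cycle.
    bool Wake() {
        if (WakeRequested_.exchange(true)) {
            return false;
        }
        // Taking the mutex closes the gap between the sleeper checking
        // the flag and actually blocking on the condition variable.
        { std::lock_guard<std::mutex> guard(Mutex_); }
        CondVar_.notify_one();
        return true;
    }

    // Scheduler side. Returns true if there is work before the deadline.
    bool SleepUntil(TClock::time_point deadline,
                    const std::function<bool()>& hasPending) {
        std::unique_lock<std::mutex> lock(Mutex_);
        Published_.store(deadline.time_since_epoch().count());
        // Re-check the queue after publishing to avoid a lost wake-up.
        bool interrupted = hasPending() ||
            CondVar_.wait_until(lock, deadline,
                                [this] { return WakeRequested_.load(); });
        Published_.store(kAwake);  // while awake nobody needs to wake us
        WakeRequested_.store(false);
        return interrupted;
    }

private:
    static constexpr TTicks kAwake = std::numeric_limits<TTicks>::min();
    std::mutex Mutex_;
    std::condition_variable CondVar_;
    std::atomic<TTicks> Published_{kAwake};
    std::atomic<bool> WakeRequested_{false};
};
```

The `WakeRequested_` flag is what limits producers to one notification per sleep cycle: the first caller pays for the mutex and the notify, everyone after that returns immediately. A wake-up that arrives while the scheduler is already awake only makes the next `SleepUntil` return early, which is harmless.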

@snaury snaury added the area/actorsystem Actor System related issues label Nov 28, 2024
@snaury snaury self-assigned this Nov 28, 2024