Connection state recovery uses (nearly) unindexable query yielding a (near) full collection scan #17

Open
genisd opened this issue Sep 27, 2023 · 2 comments


genisd commented Sep 27, 2023

The query used for connection state recovery (restoreSession()) cannot be served efficiently by an index.

Specifically, even with an index on data.opts.rooms, the following where clause cannot be resolved from that index.

        {
          $or: [
            {
              "data.opts.rooms": {
                $size: 0,
              },
            },
            {
              "data.opts.rooms": {
                $in: session.rooms,
              },
            },
          ],
        },

The data.opts.rooms size == 0 condition cannot be looked up from an index 😞. By default MongoDB does a full index scan over the _id index, which is essentially a full collection scan. I tried a "not exists" check as an alternative, but that didn't improve it either (tested on MongoDB 6).
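One way to confirm this kind of behavior is to inspect the output of explain("executionStats") for the query. The helper below is only a sketch: summarizePlan and walkPlan are hypothetical names, the sample object uses made-up numbers, and it assumes the usual explain layout (queryPlanner.winningPlan with nested inputStage/inputStages, plus executionStats.totalDocsExamined and nReturned).

```javascript
// Visit every node of an explain() plan tree.
// Plan nodes nest via `inputStage` (one child) or `inputStages` (several, e.g. under OR).
function walkPlan(node, visit) {
  if (!node || typeof node !== "object") return;
  visit(node);
  walkPlan(node.inputStage, visit);
  for (const child of node.inputStages || []) walkPlan(child, visit);
}

// Hypothetical helper: list the stages and indexes of the winning plan and
// report how many documents were examined per returned document.
function summarizePlan(explainOutput) {
  const winning = explainOutput.queryPlanner.winningPlan;
  // Assumption: some server versions nest the classic plan under `queryPlan`.
  const root = winning.queryPlan || winning;
  const stages = [];
  const indexes = [];
  walkPlan(root, (node) => {
    if (node.stage) stages.push(node.stage);
    if (node.stage === "IXSCAN" && node.indexName) indexes.push(node.indexName);
  });
  const stats = explainOutput.executionStats || {};
  const docsPerResult =
    (stats.totalDocsExamined || 0) / Math.max(stats.nReturned || 0, 1);
  return { stages, indexes, docsPerResult };
}

// Made-up explain output resembling a full scan over the _id index.
const sample = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "_id_" },
    },
  },
  executionStats: { nReturned: 4, totalKeysExamined: 100000, totalDocsExamined: 100000 },
};

const summary = summarizePlan(sample);
console.log(summary.indexes, summary.docsPerResult);
```

A large docsPerResult together with only the _id_ index in the list would point to the scan described above, whereas a plan that uses the data.opts.rooms index should examine far fewer documents per result.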

The comment on line 830 is not accurate: one cannot create a compound index containing two array fields (data.opts.rooms and data.opts.except). And even if one could, the index does not appear to be usable for the data.opts.rooms size == 0 condition.

My understanding is that the rooms field is only left empty for "roomless" global broadcasts.

We don't use roomless broadcasts ourselves, so we removed the OR data.opts.rooms size == 0 clause in our own deployment.

I'm wondering what the best way to improve the implementation here would be.

I was thinking that, instead of leaving the rooms empty for a global broadcast, we could use a reserved magic room name such as GLOBAL_BROADCAST. That way we can remove the OR clause and always include the GLOBAL_BROADCAST magic string in the rooms and except filters.
With that approach, an index on data.opts.rooms would reduce the candidate documents to those that need to be replayed.
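The idea above could be sketched roughly as follows. This is only an illustration under assumptions: GLOBAL_BROADCAST is the reserved name proposed here, roomsToPersist and buildRoomsFilter are hypothetical helpers that do not exist in the adapter, and the except filter is left out.

```javascript
// Reserved room name proposed in this issue (an assumption, not an existing
// constant); it must never collide with a real room name.
const GLOBAL_BROADCAST = "GLOBAL_BROADCAST";

// Write side (hypothetical): a broadcast without rooms is stored under the
// sentinel room, so data.opts.rooms is never an empty array.
function roomsToPersist(rooms) {
  return rooms.length === 0 ? [GLOBAL_BROADCAST] : [...rooms];
}

// Read side (hypothetical): a single $in clause replaces the $or with $size,
// so an index on data.opts.rooms can be used for the lookup.
function buildRoomsFilter(sessionRooms) {
  return { "data.opts.rooms": { $in: [GLOBAL_BROADCAST, ...sessionRooms] } };
}

console.log(roomsToPersist([]));
console.log(JSON.stringify(buildRoomsFilter(["room1", "room2"])));
```

Documents written before such a change would still have an empty rooms array, so a migration or a transition period would presumably be needed.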

Would that be feasible? I'm hoping for feedback and/or better ideas for this.

@darrachequesne
Member

Hi! You are right, a query with a $size operator cannot use an index.

Reference: https://www.mongodb.com/docs/v7.0/reference/operator/query/size/

I guess we could also add a global: true attribute to the data.opts object for global broadcasts.

{
  $or: [
    {
      "data.opts.global": {
        $eq: true
      }
    },
    {
      "data.opts.global": {
        $eq: false
      },
      "data.opts.rooms": {
        $in: session.rooms,
      },
    },
  ],
},
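For illustration, the write and read sides of this variant might look like the sketch below. The helper names (withGlobalFlag, buildGlobalFlagFilter, matchesSession) are hypothetical, and matchesSession is just a plain JavaScript predicate that mirrors the filter so its semantics can be checked without a database. Whether an index (for example a compound one on data.opts.global and data.opts.rooms) actually serves both branches would still need to be verified with explain().

```javascript
// Write side (hypothetical): tag the stored opts with an explicit boolean
// instead of relying on an empty rooms array.
function withGlobalFlag(opts) {
  return { ...opts, global: opts.rooms.length === 0 };
}

// Read side (hypothetical): the filter from the comment above, built for a
// given list of session rooms.
function buildGlobalFlagFilter(sessionRooms) {
  return {
    $or: [
      { "data.opts.global": { $eq: true } },
      {
        "data.opts.global": { $eq: false },
        "data.opts.rooms": { $in: sessionRooms },
      },
    ],
  };
}

// Plain-JS mirror of the filter, for checking the semantics only.
function matchesSession(doc, sessionRooms) {
  const { global, rooms } = doc.data.opts;
  if (global === true) return true;
  return global === false && rooms.some((room) => sessionRooms.includes(room));
}

const docs = [
  { data: { opts: withGlobalFlag({ rooms: [] }) } },        // global broadcast
  { data: { opts: withGlobalFlag({ rooms: ["room1"] }) } }, // room the session joined
  { data: { opts: withGlobalFlag({ rooms: ["room9"] }) } }, // unrelated room
];
console.log(docs.filter((doc) => matchesSession(doc, ["room1"])).length);
```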

What do you think?

Out of curiosity, what is your message rate/duration of retention?

@genisd
Author

genisd commented Oct 3, 2023

> I guess we could also add a global: true attribute in the data.opts objects, for global broadcasts.

I'd have to check for the indexability of this. But at first glance it sounds plausible.

> Out of curiosity, what is your message rate/duration of retention?

Our rate isn't high right now.
Our retention might be on the long end: 24 hours. To be honest, I'm not sure what the reasoning for that is; we should probably reduce it to 12 hours or so. My guess is that our data warehouse syncs the messages on a polling basis for analytics.
