Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TODO: sorting through populate fields (Ex: sort users through role.name) #316

Open
lsarrazi opened this issue Jan 24, 2023 · 16 comments
Open

Comments

@lsarrazi
Copy link

Hi everyone here,
I got super excited on your project recently and started to write an API with it.
Good job for all of what have already been done.
I wanted to know if there is tracks to implement sorting through populated fields ? (or just linking model fields at least)

@lsarrazi
Copy link
Author

lsarrazi commented Jan 29, 2023

While doing research to patch this problem I found an even bigger problem (I think)

in the _getAllHandler function in handler-helper.js we have 2 requests, one first to get the documents ids (and linking documents) and a second one to get the documents themselves. The problem is that only the second request is paginated / sorted / excluded etc correctly. Meaning that the first request will fetch all the data of the database anyway, which may imply that the endpoints will become unusable as the database grow.
I think this part need to be completly reinvented

@JKHeadley
Copy link
Owner

Hi @lsarrazi ! Thanks for your interest in the project!

There is a task to add sorting through populated fields, but no work has been done on that yet.

For the _getAllHandler issue you mentioned, it's been quite a long time since that code has been updated, so I wouldn't be surprised if there were some potential performance issues there. I will take a look and see what could be done. If you have any suggestions in mind I would be happy to hear them.

Just as an FYI, I'm currently working on a major update to the framework, so depending on the results of this task I might be able to add this as part of that update.

Thanks!

@lsarrazi
Copy link
Author

lsarrazi commented Jan 30, 2023

Hello @JKHeadley , this project is really awesome, I do believe. Glad to hear your still working on it :)
I think that I found a solution using the Aggregation API of mongo which solve lot of our issues (I hope). I couldnt find a way to make it work with the mongoose wrappers. Fortunately we can use the Aggregation API with mongoose using the model.aggregate function.
Before sharing this with you I need to make sure that Im not forgetting an important point:

  • It need to sort ($sort) and THEN paginate ($skip, $limit...) correctly and efficiently, from any criteria, including deeply nested properties (ex: $sort=thing.very.deep)
  • It need to be able to embed any field in the document, including deeply nested properties using $embed=thing.deep for example
  • It need to select and excludes fields from the result (with $select or $exclude)
  • It need to match (or search using $text and $searchFields) for any specified field
  • Other simpler things like $count or $flatten

Is there something missing here or a detail you want to specify ?

I also want to ask, whats for you the difference (pros/cons) of to output for _getAllHandler:
1st approach: (statu quo)

[
    { // child model
       association: { // linking model
          relashionship: 'friend'
       },
       name: 'bob',
    },
    ...
]

versus:
2nd approach:

[
    { // linking model
       child: { // child model
          name: 'bob'
       },
       relashionship: 'friend'
    },
    ...
]

@JKHeadley
Copy link
Owner

Thanks @lsarrazi ! All thoughts are greatly appreciated here.

I have thought about this for a while, but haven't truly brainstormed it yet. I'll try to share some of my thoughts here with you, though please keep in mind this is not a thorough review of all aspects that need to be considered.

When it comes to fetching nested associations, there are 3 major topics that come to mind:

  • $embed queries
  • getAll endpoints
  • Authorization for both of these

If we were to properly handle nested associations, I think you are correct in assuming we need to handle all the same functionality/query parameters that a normal query supports. In effect, I think this means we would need to support objects in addition to strings as part of the $embed parameter. For example:

$embed = ['pets.veterinarian', {
    'association': 'friends',
    '$sort': 'lastName',
    '$select': 'email',
    '$limit': 5,
    '$text': 'john',
    '$embed': ['role', 'pets']
}]

In the example above, the original $embed parameter accepts pets.verterinarian as a string input (currently supported format), as well as an object that includes other query options to apply to the embedded friends association, and this object further includes an $embed parameter that takes some strings.

This is just an example, but I think you can see how this could be further nested.

This of course is only an example of a way to update the query format to accept more advanced embedding. On the implementation side I am not sure yet what the best approach will be, whether it's updating the current usage of the mongoose populate functionality, or switching altogether to use the Aggregation API as you mentioned. I will need to re-familiarize myself with both to have a better feeling for this.

Assuming we are able to effectively implement these updates, I think this would solve the initial issue that you pointed out with the _getAllHandler since we could simply update the first query to include pagination, etc.

Now, this updated $embed functionality is distinct from updating other parameters to be more deeply nested, such as the $sort='thing.very.deep' example you gave.

On a related note, we would need to update query validation to handle the new query parameter functionality based on the type of association being queried. For example, it should cause an error to include a $sort parameter within an $embed parameter if the association being embedded is ONE_ONE. Ex:

$embed = [{
    'association': 'role',
    '$sort': 'name'
}]

Similarly, attempting to sort via a ONE_MANY or MANY_MANY association should throw an error as well, Ex:

$sort='friends.role.name'

Lastly, I certainly haven't thought out all the implications these updates create for other features such as Authorization, but I think it's important to bring up. This actually might already be an existing issue. For example, we might have a schema that includes a user model and each user can have multiple images via an image model, however the endpoints might be configured such that images can only be accessed by their 'owner' while at the same time any user can access the 'list users' endpoint. In this case a user can bypass the authorization applied to the image endpoints simply by adding $embed='images' to their 'list users' query.

It may be a solution to include authorization checks within the process of constructing the mongoose/mongo query. Essentially for each embedded field we would have to check the authorization rules applied to that field and cross-reference the scopes applied to the user.

Of course authorization doesn't apply when using the mongoose wrappers directly, unless 'restCall=true' is applied.

As a final thought, from my experience so far the most time consuming aspect of adding/updating features such as this is making sure each new case is tested thoroughly. I have mainly been approaching this through e2e tests but such a big change would likely need more unit tests as well.

As a final final thought :) it might be helpful to look at the prisma framework for inspiration. I believe they have already solved many of these issues, however they are purely an ORM, whereas rest-hapi is intended to provide a full REST API solution.

Thanks again for your interest. Looking forward to hearing your thoughts if you would like to share :)

@JKHeadley
Copy link
Owner

JKHeadley commented Jan 31, 2023

As for your question about the output variations for 'getAll', this is a good question. The second approach actually looks closer to the results you receive when you $embed an association as part of a query.

They both have pros and cons. To begin with, here are the current issues I see:

  • The format between getAll and $embed is not consistent. It would be better if the formats were the same (unless I'm am forgetting a good reason against this).
  • The duplicated _id fields are confusing and unnecessary.
  • The format should be configurable (i.e. the user should be able to explicitly state what format they prefer.

Some pros and cons between the formats:

  • option 2 can be confusing since the embedded data is nested deeper than one might expect, however the field names are straightforward
  • option 1 is more intuitive if there are no extra fields (the linking model fields), however the linking model embedded object can be confusing since the field name is the model name. If this option is selected, the user should be able to specify the field name as something other than the model name (i.e. an alias).

@lsarrazi
Copy link
Author

lsarrazi commented Jan 31, 2023

Thanks for the answers @JKHeadley :)
I didn't imaginated such a complete syntax for the $embed parameter, but it might be a good idea, I dont really know what's the absolute best way to go.
I do know prisma which is great but I dont really know what to conclude for our problems from that. However I found a solution that I want to communicate to you. you'll have to focus on my explanations because I didn't made a general implementation but a specific one (for a specific case only).
So to begin, imagine you have a ownerModel called ranking.
You also have a childModel called entity. ranking has entity childs, this association is a MANY_MANY named entitys. The linkingModel is called ranking_entity and do have extra fields. Amongst those extra fields, there is a field called rating which is a number.

Here is the code we can use to get all the linking documents ranking_entity and child documents entity of a given ranking in only one aggregation query. It will embed the child model as a field (approach 2), sorted ($sort=entitys.rating), paginated ($limit=2 and $skip=2), in only one query, and (I believe) efficient.

Lets jump in the _getAllHandler function and replace everything by:

let mongooseQuery = ownerModel.aggregate([
  {
    $match: { // match the owner model, just like a regular model.findById()
      _id: new mongoose.Types.ObjectId(ownerId),
    },
  },
  { // This guy will do the equivalent of a "populate", It do a Left Outer Join to get the children ranking_entity of the owner
    $lookup: {
      from: "ranking_entity",
      localField: "_id",
      foreignField: "ranking",
      as: "entitys",
    },
  },
  { // This one is like a "Array.map", it will enable us to iterate throught the ranking_entitys
    $unwind: {
      path: "$entitys",
    },
  },
  { // Another "populate" to retrieve the child model "entity" of each "ranking_entity". Worth noting this block can be replicated to embed anything we want in our linking or child model, for any field in them.
    $lookup: {
      from: "entities",
      localField: "entitys.entity",
      foreignField: "_id",
      as: "entitys.entity",
    },
  },
  { // $lookup do output arrays of elements, not a single element, so we basically replace the array by its first child, as the linking model always as one and only one corresponding child model. As the last one, it is Worth noting this block can be EXTENDED (not replicated this time, maybe trickier) to embed anything we want in our linking or child model, for any field in them.
    $set: {
      entitys: {
        entity: {
          $arrayElemAt: ["$entitys.entity", 0],
        },
      },
    },
  },
  { // here we do sorting, on "entitys.rating" which is field on the linking model, but it do actually work on ANY deeply nested field, including the ones we might $embed. See the example commented below
    $sort: {
      "entitys.rating": -1, // <-- sort entitys by there linking model "rating" field
      // "entitys.entity.name" <-- sort entitys by their names, which is not in the linking model but on the child model
    },
  },
  { // Pagination at the end of pipeline, cannot be done before, otherwise it would sort the documents on the paginated set, not the whole set
    $skip: 2,
  },
  {
    $limit: 2,
  },
  { // We regroup all the documents in an array
    $group: {
      _id: "$_id",
      entitys: {
        $push: "$entitys",
      },
    },
  },
])
// Here I skipped lot of checks just to make the code clear, but its basically just that:
let result;

result = await mongooseQuery.exec()

result = result[0];

result = result['entitys']
    
return result; // here the response returned by my _getAllHandler function

Other query parameters , like field match or field $exclude can be easily implemented with a $match and $unset aggregation AFTER or BEFORE the sorting stage (need to do some benchmarks here maybe)
$count and flatten are also simple to implement.

The output of this query on my database is the following, just for you to look at the format: (serialized)

[
    {
        "_id": "63c5988eb293294147ef190d",
        "entity": {
            "_id": "63c597cfaaf417e3dc7050de",
            "name": "Cell",
            "description": "occaecat id commodo enim minim",
            "createdAt": "2023-01-16T18:30:39.558Z",
            "__v": 0
        },
        "ranking": "63c349c8c4d1bdb75e8cee75",
        "__v": 0,
        "rating": 1500,
        "rd": 400,
        "vol": 0.06
    },
    {
        "_id": "63c34b2fb293294147ef0253",
        "entity": {
            "_id": "63c34918df112e6d9c20c316",
            "name": "Vegeta",
            "description": "non",
            "createdAt": "2023-01-15T00:30:16.851Z",
            "__v": 0
        },
        "ranking": "63c349c8c4d1bdb75e8cee75",
        "__v": 0,
        "rating": 1350.1746357829322,
        "rd": 228.93669034890752,
        "vol": 0.05999915643996186
    }
]

As you can notice it follow the second option of output that I mentionned.

Lets talk complexity for a second,
lets n be the size of the queried set, and N the size of the whole linking model Set

if my intuition is good we should drop from N*log(N) to n*log(N), only if the indexes are correctly set on the sorted field of course.
From the mongodb documentation:

When a $sort precedes a $limit and there are no intervening stages that modify the number of documents, the optimizer can coalesce the $limit into the $sort. This allows the $sort operation to only maintain the top n results as it progresses, where n is the specified limit, and ensures that MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.

<-- That's why I believe we should make the pagination and sort stage stuck together, just to be sure :p

And of course, because there is this problem that I mentionned on my very first message, the simplest query complexity should drop from N to n (which is galactic improvement on even a medium sized database).

I think the only challenge here is to generalize the embed mechanism for any deeply nested field using the $lookup and $set stages. Otherwise everything seems good to me, is there something that I didn't thought of ?

I didn't thought about any authorization mechanism on the document fields, but maybe you'll have ideas on that I hope :)

@JKHeadley
Copy link
Owner

@lsarrazi This is great!

At a high level here we are basically discussing the mongoose populate approach vs the native mongo $lookup. I'm not a db expert but several times I've tried researching the difference between the two approaches, specifically performance wise. My instinct tells me that $lookup should be a more performant option for most cases, however it seems difficult to find evidence of this. I think part of the reason is that it greatly depends on the situation is is used in. For sure the only clear difference that I know of is that populate can work on sharded collections while $lookup cannot.

Regardless, I think it would be very beneficial to implement solutions using both, and allow the user to configure which implementation they prefer. Again, my instinct is that using aggregations in general will be more performant since I imagine it takes advantage of internal MongoDB optimizations (such as the ones you point out).

As far as the N*log(N) to n*log(N) optimization goes, that is certainly an improvement. However (unless I'm mistaken) I believe this is simply a design flaw in the logic, and can be improved in either approach (populate or aggregations).

As you mentioned, the trick here will be to generalize the approach you have demonstrated, and to make sure it supports all current features/functionality. I think this certainly can be done. My first thoughts are approaching it using a type of 'query builder' utility. There might be some existing libraries we can take advantage of, or we could just develop our own. In essence, this would take in all the request parameters and build a single aggregation query.

As far as authorization goes, I think we could find a way to integrate authorization checks within the query builder utility.

One final note about performance. I do consider performance to be very important, however my priority for this project has always been functionality/robustness as much as possible. I think this approach is fitting simply because rest-hapi is designed to always allow the user to take advantage of the flexibility MongoDB provides, and one of the main advantages of that flexibility is that it allows you to optimize your schema to a specific application, which will almost always (in my opinion) be the biggest factor in performance. As a quick example, using denormalization (via the duplicate fields feature), already enables you to sort by deeply nested fields (Ex: user.role.name <=> user.roleName) in a much more efficient way since the field is directly accessible in the immediate document.

I greatly appreciate your contributions here. Please let me know if there's anything I've missed (or I'm just wrong, haha). If you would like to contribute to the code itself I'd be happy to assist you however I can. I'm very excited about these ideas and I hope to make progress on them as soon as I'm done with my current updates.

@lsarrazi
Copy link
Author

Thanks, it's a pleasure 😊
I think the best move to split the tasks is to let you make a specification of what this query builder should support as features since you have a better overall understanding of the projet, and let me participate to the "isolated" implementation of the query builder. you would do everything around (such as the _getAllHandler function implementation) to ensure consistency with the rest of the project. Does this organization seems ok to you?

To be honest I was not aware of the duplicate fields feature at all, would the sorting benefit from indexes on the duplicated fields ?

I do not believe that $lookup is more performant than populate as it is based internally on $lookup. but I do believe that this query in particular is more performant than the existings ones. And in fact the existing method might be unusable on a quite large database. I didn't find a way to make a query with populate and the mongoose API that satisfy all the criterias I mentionned, but someone might find a way. I do find the aggregation way clever tho 😀

One last thought on the validation part: There might be an infinite number of possibility for the $embed query with nested fields, for example: $embed=user.friend.friend.friend, it might be impossible to verify that the field really exist at the JOI schema generation time.
What do you think about it ? Let me know 👍

@JKHeadley
Copy link
Owner

JKHeadley commented Feb 1, 2023

Sounds great! I'll try to put aside some time to specify the query builder requirements for you.

The sorting benefit from denormalizing the data through duplicate fields is that you no longer have to perform a $lookup or populate to be able to sort by a field in an association. So as an example

user document:

{
	'name': 'John Q',
	'role': '<some object id>'
}

If you want to sort a user query by 'role.name', you would first have to perform a $lookup or populate to get the role data. If you use duplicate fields to denormalize the 'role.name' field into the user document, then you would have:

user document with duplicate field:

{
	'name': 'John Q',
	'role': '<some object id>',
	'roleName: 'Admin'
}

Now you can simply sort the user query by the 'roleName' field, which avoids the extra processing involved with $lookup or populate. The duplicate fields feature will automatically update the roleName field if the referenced role.name value ever changes. This of course makes updates less performant, but that is a tradeoff that the user can make based on their application needs.

As far as JOI validation for nested $embed (or even nested $sort), I don't know if there is a solution, but I think that's okay. JOI validation is just the first line of defense, we can always add some sort of query checker logic before any db requests are made, or there might be a way to integrate that logic into JOI somehow.

Anyway, thanks again for the ideas. Feel free to continue the conversation or ask about anything else.

@lsarrazi
Copy link
Author

lsarrazi commented Feb 4, 2023

Hi there, I just wanted to notice you I think I successfully generalized the embedding mechanism for any deeply nested fields. The sorting part is also working without any extra effort. I essentially re-created a populate function to generate $lookup and $set aggregation stages and I'm able to validate if a deeply nested field do exist or not, before executing the mongo query. So it might be also a way to validate the authorization for a given model or field on a model, to not embed things that are not allowed for example.
At the end of the day, the code is not that scary. I'll keep you in the loop, have a good day !

@JKHeadley
Copy link
Owner

@lsarrazi That's great news! Thanks for keeping me updated.

Here are some initial requirements for the query builder. We can discuss and modify as needed:


Below is a list of requirements for a rest-hapi query builder. There is no particular order to the list

  • $flatten: if set to true, remove extra fields/linking model data
  • $count: If set to true, only a count of the query results will be returned.
  • (field "where" queries): match exact values for a single field. If possible, we should support matching nested fields (i.e. role.name=Admin)
  • $embed: an array that either includes strings of associations to embed or objects defining nested queries. Nested queries should support all applicable query parameters (i.e. $embed, $select, etc...)
  • $term: A regex search parameter. Slower than $text search but supports partial matches and doesn't require indexing. This can be refined using the $searchFields parameter.
  • $searchFields: A set of fields to apply the $term search parameter to. If this parameter is not included, the $term search parameter is applied to all searchable fields.
  • $text: A full text search parameter. Takes advantage of indexes for efficient searching. Also implements stemming with searches. Prefixing search terms with a "-" will exclude results that match that term.
  • $sort: A set of fields to sort by. Including field name indicates it should be sorted ascending, while prepending '-' indicates descending. The default sort direction is 'ascending' (lowest value to highest value). Listing multiple fields prioritizes the sort starting with the first field listed. We should support nested sorting as well (i.e. $sort=role.name)
  • $select: A list of basic fields to be included in each resource.
  • $limit: The maximum number of records to return. This is typically used in pagination.
  • $page: The number of records to skip based on the $limit parameter. This is typically used in pagination.
  • $skip: The number of records to skip in the database. This is typically used in pagination.

Notes:

Since we will be supporting pagination for embedded associations, this means the response format of embedded results should support pagination. For example, instead of :

{
  "name": "Foo",
  "friends": [
    {
      "name": "Bar",
    },
    {
      "name": "Baz"
    }
  ]

We would have:

{
  "name": "Foo",
  "friends": {
    "docs": [
      {
        "name": "Bar"
      },
      {
        "name": "Baz"
      }
    ],
    "pages": {...},
    "items": {...}
  }
}

This should be the default format, but we should allow the user to configure a paginateEmbed config variable to disable it for backwards compatibility.


I haven't investigated much how the $term and $text parameters might work when applied to embedded associations.

@lsarrazi
Copy link
Author

Hello @JKHeadley, I hope you're doing good.
Thanks for the list of requirements you sent me.
Im not sure exactly how to implement association embedding and I think it would complexify a lot the aggregation query building. I think the best thing to do here is to let that for a future version of the query builder, at least after we got a stable version of it, is that ok for you ?
I also wanted to ask whats the benefit for association embedding instead of just running a second API call to the corresponding association endpoint in the API?

I have 2 questions:

  • would you like to replace _listHandler with the new query builder as well or solely _getAllHandler ?

  • what should be the behavior of $select if you choose to select a deeply nested field, example:
    $select=deep.name on

    {
            rootName: "Name",
            deep: {
                 name: "MyName",
                 age: 18
            }
    }

    for example ?

    Should it be this: (1)

    {
            rootName: "Name",
            deep: {
                 name: "MyName",
            }
    }

    or this: (2)

    {
            deep: {
                 name: "MyName",
            }
    }

Have a good day, I stay tuned :)

@lsarrazi
Copy link
Author

Hey @JKHeadley , what's the news about the features your developping ? Are you done with it ?
Best regards

@JKHeadley
Copy link
Owner

Hi @lsarrazi, thanks for checking in. The current features are coming along slowly but surely. Life has picked up recently so I don't have as much time lately, but that just seems to be how these projects go.

Apologies, I thought I responded to the previous message.

No worries if you feel it's too much to work in association embedding at this time. It's usually best to build a solid foundation first anyway.

The benefits of association embedding can be twofold: 1) fewer http requests and 2) it opens the possibility of taking advantage of a more optimized db query.

As for the questions:

  1. A fully implemented query builder should replace the logic for all the query handlers (find, list, get)
  2. I don't think it will be useful to try to implement a nested select in this manner. Each $select should only reference top level fields. This will keep things simpler and allow for a more recursive approach when we implement embedded associations.

@lsarrazi
Copy link
Author

lsarrazi commented Apr 4, 2023

Hi there, I made some work here: https://github.com/lsarrazi/rest-hapi/tree/aggregation-query-builder

Its far from done but I think I got a good part of the features we need. I dont have much time those days sorry for the long time not replying, tell me your thoughts on it. Have a good day

@JKHeadley
Copy link
Owner

@lsarrazi thanks! I will try to check it out soon. No worries for any delays, that's how this work goes sometimes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants