Optimize Gnocchi aggregates API #1307
Conversation
@tobias-urdin can you look into this one? It is a nice-to-have in Gnocchi.
We need to merge this one first: #1308
just to confirm, this is to enable resource history filtering?
@Mergifyio rebase
❌ Unable to rebase: user
@Mergifyio rebase
✅ Branch has been successfully rebased
Force-pushed from 44b1f5a to fa9a23d
Force-pushed from d75edc2 to 0f2b249
Hi Rafael, nice work here, I have only a simple suggestion
Force-pushed from 273185a to 607d54f
Hello @chungg and @tobias-urdin, I guess this patch is ready for your reviews again.
I will check sometime this week, but it's probably best to get @chungg's review as well. Please rebase on the now-merged changes for SQLAlchemy 2.0 support, remove some duplicate commits that seem to be on here, squash commits, and add some clear commit messages. Thanks!
Force-pushed from bc92a75 to 69b0b7c
@tobias-urdin and @chungg, rebase and squash done!
Force-pushed from 958200c to 7648012
Hello guys,
you probably don't need me to approve. i'm just getting familiar with this again.
i guess my only other feedback would be it'd be good to have a new test rather than just making sure existing ones don't break.
Force-pushed from 282c51b to aece611
@chungg and @tobias-urdin I have rebased the patch. Can you guys check it?
Hello @tobias-urdin, @chungg, @jd, do you guys think that we need to make some more adjustments to the code so it can get merged?
There is no way I could review this TBH, it's been so long. I'm ok to approve and merge it though, as I don't want to block any improvement.
Thanks! BTW, we have been constantly improving Gnocchi for years so far, and we have more coming. Would it be possible for one of us to become a committer with merge permissions in the project here?
I don't have the time to go through this in detail or test it, but just browsing through it and the fact that the testing is green (hoping that most of what this touches is tested) makes me OK to just merge.
I do however have some remarks.
- The commit messages are very undescriptive, and it looks like this PR is addressing more things than just the optimization itself.
- The revision_start -> search_start renaming, etc.: why was that done, can you explain? Did we merge broken code with the group history PR before? If so, why is that not in a separate PR instead of being bundled into this PR, where it is unrelated to the optimization itself?
- This adds a lot of LOG.debug calls, in pretty much the entire code path. I understand that might be good for debugging, but is it really necessary that we merge all that?
Overall it's kind of hard to understand which parts of this change are just formatting cleanup and which parts fix bugs; I would hope that would be cleaner in the future.
I'm all for giving out more access; my primary goal is still maintenance and keeping the project alive, especially to make sure it doesn't block anything in the OpenStack ecosystem so the project doesn't get dropped from there (keeping up with things such as the SQLAlchemy 2.0 migration).
I do really appreciate that you're taking the effort to address larger flaws in Gnocchi, like this one; I didn't even know about it until this PR. But I would hope that a high standard in commits is still kept even if the project is in maintenance mode.
```
@@ -211,19 +211,22 @@ def add_measure(self, measure, date_map, start, stop):
         self.measures.append(measure)

     def join_sequential_groups(self, group):
         group.sort(key=lambda key: key['revision_start'])
```
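For context, a minimal sketch of what the sort shown above does, assuming each group entry is a dict carrying a 'revision_start' timestamp (the dict layout here is an assumption for illustration, not Gnocchi's actual structure):

```python
# Illustrative only: order a group's entries chronologically by
# 'revision_start' so that subsequent entries can be compared/merged
# in chronological order.
import datetime

group = [
    {"revision_start": datetime.datetime(2023, 8, 11, 2), "flavor_id": "m1"},
    {"revision_start": datetime.datetime(2023, 8, 11, 0), "flavor_id": "m1"},
    {"revision_start": datetime.datetime(2023, 8, 11, 1), "flavor_id": "m1"},
]

group.sort(key=lambda entry: entry["revision_start"])
# group is now ordered 00:00, 01:00, 02:00.
```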
Can you explain these changes?
Yes, this is done to use/group the results of the search according to the search start/end ranges. The way it was implemented before was not optimized, as it was loading all resources and only applying the start/end filtering in Python code.
I still don't understand the revision_* -> search_* changes.
I also think this PR is too broad in what it does and should be cleaned up and made more minimal. For example, the optimization to _get_history_result_mapper and the changes to the aggregate/grouping code.
The problem is that I do not understand what you do not understand.
This PR is addressing a situation with the Gnocchi aggregates API. This API can present slowness over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments.
The situation we reported happens in some cases when we have a very large number of resources in the cloud environment. The slowness happens because when the aggregates API is called, it does not apply the filters sent in the API directly in the MySQL query; it only applies some filtering. Therefore, it will bring more elements in the response than what is needed. For instance, the start/stop filters are not applied, and others as well, which makes Gnocchi manipulate the whole dataset (resources) in the MySQL database, with the filtering then executed in Python. In some situations we have seen that databases almost broke or got into a stuck/stale state.
To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away. By that, I mean applying the filters directly in the MySQL query. Therefore, we reduce the result-set size and the number of elements manipulated by the MySQL query. After applying the patch, we noticed a 5-6 fold reduction in the response time of the Gnocchi aggregates API for large-scale cloud environments.
What don't you understand?
Do you need an example of a query that we normally execute? This is one example of a query that we used to discover the reported issue:
```
/v1/aggregates?details=False&granularity=3600&start=2023-08-11T00%3A00%3A00%2B00%3A00&stop=2023-08-11T01%3A00%3A00%2B00%3A00&use_history=True&groupby=id&groupby=display_name&groupby=flavor_id&groupby=flavor_name&groupby=user_id&groupby=project_id&groupby=revision_start&groupby=availability_zone&groupby=image_ref&groupby=flavor_vcpus&groupby=flavor_ram&groupby=operating_system_name&groupby=operating_system_distro&groupby=operating_system_type&groupby=operating_system_version&groupby=mssql_version"
  -H "Content-Type: application/json" -H "Accept: application/json, */*" -H "X-Auth-Token: ${ACCESS_TOKEN_KEYSTONE}"
  -d '{"operations":["aggregate","max",["metric","dynamic_pollster.compute.services.instance.status","max"]],"search":{"and":[{"=":{"type":"instance"}},{"=":{"project_id":"<project_id>"}}]},"resource_type":"instance"}'
```
The above query is one that might be generated by CloudKitty, for instance.
As you can see, with a lot of groupbys, MySQL will have trouble processing the large dataset generated by Gnocchi queries, as they do not properly apply the filters sent in the API request.
Furthermore, can you explain to me why you think that the PR is too broad?
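To make the idea concrete, here is a rough sketch of the difference between filtering in Python after loading everything and pushing the filters into the SQL query. The model and column names (Resource, started_at, ended_at, project_id) are simplified assumptions for illustration, not Gnocchi's actual indexer code:

```python
# Rough sketch only -- not Gnocchi's indexer code. Model and column names
# are assumptions made for the illustration.
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()


class Resource(Base):
    __tablename__ = "resource"

    id = sa.Column(sa.String(36), primary_key=True)
    type = sa.Column(sa.String(255))
    project_id = sa.Column(sa.String(36))
    started_at = sa.Column(sa.DateTime)
    ended_at = sa.Column(sa.DateTime, nullable=True)


def find_resources_python_side(session, start, stop, project_id):
    # Pre-patch style: load every resource of the type, then filter in Python.
    rows = session.scalars(
        sa.select(Resource).where(Resource.type == "instance")).all()
    return [r for r in rows
            if r.project_id == project_id
            and r.started_at <= stop
            and (r.ended_at is None or r.ended_at >= start)]


def find_resources_sql_side(session, start, stop, project_id):
    # Post-patch style: express the same filters directly in the SQL query,
    # so the database only returns rows overlapping [start, stop].
    query = sa.select(Resource).where(
        Resource.type == "instance",
        Resource.project_id == project_id,
        Resource.started_at <= stop,
        sa.or_(Resource.ended_at.is_(None), Resource.ended_at >= start),
    )
    return session.scalars(query).all()
```

The second form is what lets the database prune the result set before Python ever sees it.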
I give you a soft-yes and will let @tobias-urdin and other interested people weigh in.
Let me start by saying that Gnocchi is very convoluted and quite complicated code-wise. The idea is simple, but the way it was coded makes it very difficult to understand what is being done, how to fix the issues we find, or how to develop the extensions that are needed. We are trying to address that by adding logs, changing/creating functions with more descriptive names, and so on. However, it is a task that is going to take much more time until we reach a state where things get better.
Regarding the addition of logs, it goes towards that situation. It is very difficult to troubleshoot Gnocchi; therefore, adding those logs will help people work with Gnocchi in the future. Furthermore, they (the logs) do not cause any harm. You can run at the INFO log level (debug=false), and then you will not see those DEBUG logs.
Those flaws that nobody knows exist, which we are fixing/patching, come from the fact that most OpenStack setups are not taking full advantage of Gnocchi (as far as we have seen). We are only facing them because we are pushing Gnocchi to its limit without cleaning/removing things (the setup) at every upgrade with a newer installation and so on. Moreover, we have a generic billing pipeline that receives data not just from OpenStack, but from other systems that compose the cloud.
Regarding the description: it is clear for people using Gnocchi who have some understanding of its inner workings. Every metric pushed to Gnocchi is created for a resource. Every time we create a resource, the created_at field is marked, and every time we update it, a revision is created for it. When we do a search via the Gnocchi search API, before this patch, the search was going through all resources in the system. However, that is costly. We can filter the resources affected by the search using the created_at/revision_start and ended_at/revision_end fields when they are provided by the user calling the API. This is the optimization that we are doing here. Gnocchi was filtering the results in Python, loading all measures, for all resources, for all timestamps. What we propose is to filter the result-set right away in the DB.
Regarding the quality standard, can you describe/show us what needs to be improved? Then we can work to address it and bring the standard to the level you expect it to be.
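As a small illustration of that created_at/revision_start and ended_at/revision_end filtering (a sketch with simplified types, not Gnocchi's actual models):

```python
# Minimal sketch, not Gnocchi code: each resource update produces a revision
# with a [revision_start, revision_end) lifetime, and a search window should
# only touch the revisions that overlap it.
import dataclasses
import datetime
from typing import Optional


@dataclasses.dataclass
class Revision:
    resource_id: str
    revision_start: datetime.datetime
    revision_end: Optional[datetime.datetime]  # None means "current revision"


def overlapping_revisions(revisions, search_start, search_end):
    """Keep only revisions whose lifetime overlaps [search_start, search_end].

    This is the predicate the patch pushes down into SQL instead of applying
    it in Python after loading every revision of every resource.
    """
    return [r for r in revisions
            if r.revision_start <= search_end
            and (r.revision_end is None or r.revision_end >= search_start)]
```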
I do not know what that means =). However, even if we do not deserve a committer role, we just need to understand what the requirements are before a merge can be done. We have more features and optimizations to push to Gnocchi, and sometimes the time it takes to get them in is quite long.
I think the root problem is that the project is barely maintained at this stage. All the people who worked on Gnocchi years ago and wrote most of that code are gone in the wild. I'm not concerned by the project's fate, maintenance, etc., at this stage, so I'm not going to block anything. As I said, I'm giving a soft yes, meaning "I think it's better to have this merged because you seem to care about the project more than anyone else, but also, I'm not smart enough to say yay that works let's merge it."
Thanks for the message! I understand what you mean. We are trying to reduce the scope for every new PR here. Gnocchi is a very interesting system. I mean, it has an interesting design and idea/concept, in which we see a lot of potential. However, code-wise it has problems that we are working to address and improve.
Hello guys,
I was revisiting this PR, and the only file that was changed that is not directly related to the patch proposed here is the following:
I can extract it if needed, but it was an opportunity to improve the documentation, and we just did it. All of the rest is related to the changeset proposed. All of the logs were needed/helped us to understand the situation, and they are being added at the DEBUG log level. Therefore, if one does not want them, it is a matter of disabling DEBUG.
Yes, if you have changes that can be extracted, do this; that'll help for sure!
Done. The documentation improvement was extracted to: #1373 |
Pull request has been modified.
Hello guys, can we merge this one? Or, do you think that we need some further improvements here? The part of the documentation was already extracted to #1373. |
As said, I'm ok with that since I don't think it'd be fair to block it just on the basis we don't have time to review it.
Thanks for your support! If there is anything else missing here, please let me know. We have more patches to open for Gnocchi, but we are doing them one by one, to avoid managing/dealing with conflicts on multiple fronts.
* Optimize Gnocchi aggregates API

  The Gnocchi aggregates API can present slowness over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments. This happens because Gnocchi does not apply filters in the MySQL queries on the resource tables; that is, start/stop filters are not applied, and others as well, which makes Gnocchi manipulate the whole dataset in the MySQL database before the filtering is executed in Python. To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away; thus, the dataset manipulated by MySQL will be drastically reduced. After applying the patch, we noticed a 5-6 fold reduction in the response time for the Gnocchi aggregates API.

* Use proper start timeframe when handling different resources together
* Address Daniel's review
* Add some extra information regarding re-aggregation with the use of use_history
* Address chungg's review
* Revert a change that does not work
* Remove documentation change
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left there, and this has a direct impact on the MySQL queries executed by the aggregates API. This patch introduces a new parameter called `metric_inactive_after`, which defines for how long a metric can go without receiving new datapoints before we consider it inactive. Then, when all metrics of a resource are in the inactive state, we can mark/consider the resource as removed.
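A tiny sketch of the inactivity idea described in that change; the helper names and evaluation logic here are assumptions for illustration, only the `metric_inactive_after` option name comes from the message above:

```python
# Hypothetical illustration of the "metric_inactive_after" idea: a metric is
# considered inactive once it has not received datapoints for longer than the
# configured threshold, and a resource whose metrics are all inactive can be
# treated as removed.
import datetime


def metric_is_inactive(last_measure_at, now, metric_inactive_after_seconds):
    threshold = datetime.timedelta(seconds=metric_inactive_after_seconds)
    return (now - last_measure_at) > threshold


def resource_looks_removed(last_measures, now, metric_inactive_after_seconds):
    # last_measures: timestamp of the newest datapoint for each metric of the
    # resource. If every metric is inactive, the resource can be marked ended.
    return all(metric_is_inactive(ts, now, metric_inactive_after_seconds)
               for ts in last_measures)
```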
The Gnocchi aggregates API can present slowness over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments. This happens because Gnocchi does not apply filters in the MySQL queries on the resource tables; that is, start/stop filters are not applied, and others as well, which makes Gnocchi manipulate the whole dataset in the MySQL database before the filtering is executed in Python.
To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away; thus, the dataset manipulated by MySQL will be drastically reduced. After applying the patch, we noticed a 5-6 fold reduction in the response time for the Gnocchi aggregates API.