Optimize Gnocchi aggregates API #1307

Conversation

rafaelweingartner
Contributor

@rafaelweingartner rafaelweingartner commented Jul 10, 2023

The Gnocchi aggregates API can become slow over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments. This happens because Gnocchi does not apply filters in the MySQL queries against the resource tables; the start/stop filters are not applied, among others, which makes Gnocchi manipulate the whole dataset in the MySQL database and then execute the filtering in Python.

To cope with that situation, we are proposing an optimization that improves the query executed by Gnocchi in MySQL and applies the filtering right away; thus, the dataset manipulated by MySQL is drastically reduced. After applying the patch, we noticed a 5-6x reduction in the response time of the Gnocchi aggregates API.
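
To give a rough idea of the change, here is a minimal SQLAlchemy sketch of the filter pushdown (illustrative only, not the actual patch; the resource_history table and the revision_start/revision_end column names are assumptions for the example):

import sqlalchemy as sa

def list_resource_revisions(session, resource_history, start=None, stop=None):
    # Push the requested time frame into the SQL query so MySQL only returns
    # the revisions that overlap [start, stop), instead of returning everything
    # and filtering in Python.
    query = sa.select(resource_history)
    if stop is not None:
        # Drop revisions that only started after the requested window ended.
        query = query.where(resource_history.c.revision_start < stop)
    if start is not None:
        # Keep revisions that are still open, or that ended after the window started.
        query = query.where(sa.or_(
            resource_history.c.revision_end.is_(None),
            resource_history.c.revision_end > start))
    return session.execute(query).all()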

@rafaelweingartner
Contributor Author

@tobias-urdin can you look into this one? It is a nice-to-have in Gnocchi.

@rafaelweingartner
Contributor Author

We need to merge this one first: #1308

Member

@chungg chungg left a comment

just to confirm, this is to enable resource history filtering?

gnocchi/rest/aggregates/api.py (outdated, resolved)
gnocchi/indexer/sqlalchemy.py (outdated, resolved)
gnocchi/indexer/sqlalchemy.py (outdated, resolved)
gnocchi/indexer/sqlalchemy.py (outdated, resolved)
@tobias-urdin
Contributor

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Aug 1, 2023

rebase

❌ Unable to rebase: user tobias-urdin is unknown.

Please make sure tobias-urdin has logged in Mergify dashboard.

@tobias-urdin
Contributor

@Mergifyio rebase

@mergify
Contributor

mergify bot commented Aug 1, 2023

rebase

✅ Branch has been successfully rebased

@tobias-urdin tobias-urdin force-pushed the filter_aggregates_with_time_frame-upstream branch from 44b1f5a to fa9a23d on August 1, 2023 07:49
@rafaelweingartner rafaelweingartner force-pushed the filter_aggregates_with_time_frame-upstream branch 3 times, most recently from d75edc2 to 0f2b249 on August 2, 2023 18:35
Contributor

@pedro-martins pedro-martins left a comment

Hi Rafael, nice work here. I have only a simple suggestion.

gnocchi/indexer/sqlalchemy.py (outdated, resolved)
gnocchi/indexer/sqlalchemy.py (outdated, resolved)
@rafaelweingartner rafaelweingartner force-pushed the filter_aggregates_with_time_frame-upstream branch from 273185a to 607d54f on August 4, 2023 00:28
@rafaelweingartner
Contributor Author

Hello @chungg and @tobias-urdin I guess this patch is ready for your reviews again.

@tobias-urdin
Contributor

Hello @chungg and @tobias-urdin I guess this patch is ready for your reviews again.

I will check sometime this week but it's probably best to get @chungg review as well.

Please rebase on the now-merged changes for SQLAlchemy 2.0 support, remove some duplicate commits that seem to be on here, squash commits, and add some clear commit messages.

Thanks!

@rafaelweingartner rafaelweingartner force-pushed the filter_aggregates_with_time_frame-upstream branch 8 times, most recently from bc92a75 to 69b0b7c on August 9, 2023 15:52
@rafaelweingartner
Contributor Author

@tobias-urdin and @chungg rebase and squash done!

@rafaelweingartner rafaelweingartner force-pushed the filter_aggregates_with_time_frame-upstream branch from 958200c to 7648012 on August 15, 2023 16:56
@rafaelweingartner
Contributor Author

Hello guys,
Do we need any other amendments to this patch?

Member

@chungg chungg left a comment

you probably don't need me to approve. i'm just getting familiar with this again.

i guess my only other feedback would be it'd be good to have a new test rather than just making sure existing ones don't break.

gnocchi/rest/aggregates/api.py (resolved)
gnocchi/indexer/sqlalchemy.py (resolved)
gnocchi/rest/aggregates/api.py (resolved)
gnocchi/rest/aggregates/api.py (outdated, resolved)
gnocchi/json.py (outdated, resolved)
@rafaelweingartner
Contributor Author

Thank you @chungg for your reviews! I have extracted the code you suggested into a new PR at #1361.

Besides that, are we missing something else here?

@rafaelweingartner rafaelweingartner force-pushed the filter_aggregates_with_time_frame-upstream branch from 282c51b to aece611 on January 19, 2024 13:21
@rafaelweingartner
Contributor Author

@chungg and @tobias-urdin I have rebased the patch. Can you guys check it?

@rafaelweingartner
Contributor Author

Hello @tobias-urdin, @chungg, @jd, do you guys think that we need to make some more adjustments to the code so it can get merged?

@jd
Member

jd commented Jan 29, 2024

There is no way I could review this TBH, it's been so long. I'm ok to approve and merge it though as I don't want to block any improvement.
Anyone else want to weigh in before I do it?

@rafaelweingartner
Contributor Author

There is no way I could review this TBH, it's been so long. I'm ok to approve and merge it though as I don't want to block any improvement. Anyone else want to weigh in before I do it?

Thanks!

BTW, we have been constantly improving Gnocchi for years now, and we have more coming. Would it be possible for one of us to become a committer with merge permissions in the project here?

Contributor

@tobias-urdin tobias-urdin left a comment

I don't have the time to go through this in detail or test it, but just browsing through it, and the fact that the testing is green (hoping that most things this touches are tested), makes me OK to just merge.

I do however have some remarks.

  • The commit messages are very undescriptive, and it looks like this PR is addressing more things than just the optimization itself
  • The revision_start -> search_start change, etc.: why was that done, can you explain? Did we merge broken code with the group history PR before? If so, why is that not in a separate PR instead of being bundled into this PR, where it is unrelated to the optimization itself?
  • This adds a lot of LOG.debug calls, in pretty much the entire code path. I understand that might be good for debugging, but is it really necessary that we merge all that?

Overall it's kind of hard to understand which parts of this change are just formatting cleanup and which fix bugs; I would hope that will be cleaner in the future.

I'm all for giving out more access; my primary goal is still maintenance and keeping the project alive, especially to make sure it doesn't block anything in the OpenStack ecosystem so the project doesn't get dropped from there (keeping up with things such as the SQLAlchemy 2.0 migration).

I do really appreciate that you're taking the effort to address larger flaws in Gnocchi, like this one (I didn't even know about it until this PR), but I would hope that a high standard in commits is still kept even if the project is in maintenance mode.

@@ -211,19 +211,22 @@ def add_measure(self, measure, date_map, start, stop):
        self.measures.append(measure)

    def join_sequential_groups(self, group):
        group.sort(key=lambda key: key['revision_start'])
Contributor

Can you explain these changes?

Contributor Author

Yes, this is done to group the results of the search according to the search start/end ranges. The way it was implemented before was not optimized, as it loaded all resources and only filtered on the start/end parameters in Python code.
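
For illustration only (a hypothetical helper, not the code in this PR), the grouping by the search window boils down to keeping a revision only when it overlaps the requested search_start/search_end range and clipping it to that range:

def clip_revision(revision_start, revision_end, search_start=None, search_end=None):
    # Return the (start, end) overlap between a revision and the search window,
    # or None when the revision falls completely outside of it.
    # A None bound means "open ended".
    if search_start is not None and revision_end is not None \
            and revision_end <= search_start:
        return None  # revision ended before the window started
    if search_end is not None and revision_start >= search_end:
        return None  # revision started after the window ended
    start = revision_start if search_start is None else max(revision_start, search_start)
    if revision_end is None:
        end = search_end
    elif search_end is None:
        end = revision_end
    else:
        end = min(revision_end, search_end)
    return start, end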

Contributor

I still don't understand the revision_* -> search_* changes?

I also think this PR is too broad in what it does and should be cleaned up and made more minimal. For example, the optimization to _get_history_result_mapper and the changes to the aggregate/grouping code.

Contributor Author

The problem is that I do not understand what you do not understand.

This PR is addressing a situation with the Gnocchi aggregates API. This API can become slow over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments.

The situation we reported happens in some cases when we have a very large number of resources in the cloud environment. The slowness happens because, when the aggregates API is called, it does not apply the filters sent in the API directly in the MySQL query; it only applies some of the filtering. Therefore, it will bring back more elements in the response than what is needed. For instance, the start/stop filters are not applied, among others, which makes Gnocchi manipulate the whole dataset (resources) in the MySQL database and then execute the filtering in Python. In some situations we have seen databases almost break or get into a stuck/stale state.

To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away. By that, I mean applying the filters directly in the MySQL query. Therefore, we reduce the result-set size and the number of elements manipulated by the MySQL query. After applying the patch, we noticed a 5-6x reduction in the response time of the Gnocchi aggregates API for large-scale cloud environments.

What don't you understand?

Do you need an example of a query that we normally execute? This is one example of a query that we used to discover the reported issue:

/v1/aggregates?details=False&granularity=3600&start=2023-08-11T00%3A00%3A00%2B00%3A00&stop=2023-08-11T01%3A00%3A00%2B00%3A00&use_history=True&groupby=id&groupby=display_name&groupby=flavor_id&groupby=flavor_name&groupby=user_id&groupby=project_id&groupby=revision_start&groupby=availability_zone&groupby=image_ref&groupby=flavor_vcpus&groupby=flavor_ram&groupby=operating_system_name&groupby=operating_system_distro&groupby=operating_system_type&groupby=operating_system_version&groupby=mssql_version" -H "Content-Type: application/json" -H "Accept: application/json, */*" -H "X-Auth-Token: ${ACCESS_TOKEN_KEYSTONE}" -d '{"operations":["aggregate","max",["metric","dynamic_pollster.compute.services.instance.status","max"]],"search":{"and":[{"=":{"type":"instance"}},{"=":{"project_id":"<project_id>"}}]},"resource_type":"instance"}'

The above query is one that might be generated by CloudKitty, for instance.

As you can see, with a lot of groupbys, MySQL will have trouble processing the large dataset generated by Gnocchi queries, as they do not properly apply the filters sent in the API request.

Furthermore, can you explain to me why you think the PR is too broad?

@jd
Member

jd commented Jan 29, 2024

There is no way I could review this TBH, it's been so long. I'm ok to approve and merge it though as I don't want to block any improvement. Anyone else want to weigh in before I do it?

Thanks!

BTW, we have been constantly improving Gnocchi for years now, and we have more coming. Would it be possible for one of us to become a committer with merge permissions in the project here?

I give you a soft-yes and will let @tobias-urdin and others interested weigh in.

@rafaelweingartner
Contributor Author

I don't have the time to go through this in detail or test it, but just browsing through it, and the fact that the testing is green (hoping that most things this touches are tested), makes me OK to just merge.

I do however have some remarks.

* The commit messages are very undescriptive, and it looks like this PR is addressing more things than just the optimization itself

* The revision_start -> search_start change, etc.: why was that done, can you explain? Did we merge broken code with the group history PR before? If so, why is that not in a separate PR instead of being bundled into this PR, where it is unrelated to the optimization itself?

* This adds a lot of LOG.debug calls, in pretty much the entire code path. I understand that might be good for debugging, but is it really necessary that we merge all that?

Overall it's kind of hard to understand which parts of this change are just formatting cleanup and which fix bugs; I would hope that will be cleaner in the future.

I'm all for giving out more access; my primary goal is still maintenance and keeping the project alive, especially to make sure it doesn't block anything in the OpenStack ecosystem so the project doesn't get dropped from there (keeping up with things such as the SQLAlchemy 2.0 migration).

I do really appreciate that you're taking the effort to address larger flaws in Gnocchi, like this one (I didn't even know about it until this PR), but I would hope that a high standard in commits is still kept even if the project is in maintenance mode.

Let me start by saying that Gnocchi is very convoluted and quite complicated code-wise. The idea is simple, but the way it was coded makes it very difficult to understand what is being done, how to fix the issues we find, or how to develop the extensions that are needed. We are trying to address that by adding logs, changing/creating functions with more descriptive names, and so on. However, it is a task that will take much more time until we reach a state where things are better.

Regarding the addition of logs, it goes toward that same goal. It is very difficult to troubleshoot Gnocchi; therefore, the addition of those logs will help people work with Gnocchi in the future. Furthermore, the logs do not cause any harm: you can run at INFO log level (debug=false), and then you will not see those DEBUG logs.

Those flaws that nobody knew existed, which we are fixing/patching, come from the fact that most OpenStack setups are not taking full advantage of Gnocchi (as far as we have seen). We are only facing them because we push Gnocchi to its limits without cleaning/removing things (the setup) at every upgrade with a newer installation, and so on. Moreover, we have a generic billing pipeline that receives data not just from OpenStack, but also from other systems that compose the cloud.

And regarding the description: it is clear to people using Gnocchi who have some understanding of its inner workings. Every metric pushed to Gnocchi is created for a resource. Every time we create a resource, its created_at field is set, and every time we update it, a revision is created for it. Before this patch, a search via the Gnocchi search API went through all resources in the system; however, that is costly. We can filter the resources affected by the search using the created_at/revision_start and ended_at/revision_end fields when they are provided by the user calling the API. This is the optimization that we are doing here. Gnocchi was filtering the results in Python, loading all measures for all resources for all timestamps; what we propose is to filter the result set right away in the DB.

Regarding the quality standard, can you describe/show us what needs to be improved? Then we can work to address it and bring the standard to the level you expect.

@rafaelweingartner
Contributor Author

There is no way I could review this TBH, it's been so long. I'm ok to approve and merge it though as I don't want to block any improvement. Anyone else want to weigh in before I do it?

Thanks!
BTW, we have been constantly improving Gnocchi for years now, and we have more coming. Would it be possible for one of us to become a committer with merge permissions in the project here?

I give you a soft-yes and will let @tobias-urdin and others interested weigh in.

I do not know what that means =). However, even if we do not deserve a committer role, we just need to understand what the requirements are before a merge can be done.

We have more features and optimizations to push to Gnocchi, and sometimes the time it takes to get them in is quite long.

@jd
Member

jd commented Feb 2, 2024

I think the root problem is that the project is barely maintained at this stage. All the people who worked on Gnocchi years ago and wrote most of that code are gone in the wild.
This means nobody is smart enough or has the time to review all that code, especially because it's a huge change that seems (I did not dig through it) to address multiple problems at the same time.

I'm not concerned by the project's fate, maintenance, etc, at this stage, so I'm not going to block anything. As I said, I'm giving a soft yes meaning "I think it's better to have this merged because you seem to care about the project more than anyone else, but also, I'm not smart enough to say yay that works let's merge it."

@rafaelweingartner
Contributor Author

I think the root problem is that the project is barely maintained at this stage. All the people who worked on Gnocchi years ago and wrote most of that code are gone in the wild. This means nobody is smart enough or has the time to review all that code, especially because it's a huge change that seems (I did not dig through it) to address multiple problems at the same time.

I'm not concerned by the project's fate, maintenance, etc, at this stage, so I'm not going to block anything. As I said, I'm giving a soft yes meaning "I think it's better to have this merged because you seem to care about the project more than anyone else, but also, I'm not smart enough to say yay that works let's merge it."

Thanks for the message! I understand what you mean. We are trying to reduce the scope of every new PR here. Gnocchi is a very interesting system; it has an interesting design and concept in which we see a lot of potential. However, code-wise it has problems that we are working to address and improve.

@rafaelweingartner
Contributor Author

Hello guys,
Do you want us to make any other changes to this PR before it is merged?

@rafaelweingartner
Contributor Author

I was revisiting this PR, and the only file that was changed but is not directly related to the patch proposed here is the following:

  • doc/source/rest.j2.

I can extract that if needed, but it was an opportunity to improve the documentation, and we just did it. All the rest is related to the proposed changeset. All of the logs were needed and helped us understand the situation, and they are added at DEBUG log level. Therefore, if one does not want them, it is a matter of disabling DEBUG.

@jd
Member

jd commented Feb 13, 2024

Yes, if you have changes that can be extracted, do this; that'll help for sure!

@rafaelweingartner
Contributor Author

Yes, if you have changes that can be extracted, do this; that'll help for sure!

Done. The documentation improvement was extracted to: #1373

@mergify mergify bot dismissed tobias-urdin’s stale review February 14, 2024 11:56

Pull request has been modified.

@rafaelweingartner
Contributor Author

Hello guys, can we merge this one? Or do you think that we need some further improvements here?

The part of the documentation was already extracted to #1373.

Member

@jd jd left a comment

As said, I'm ok with that since I don't think it'd be fair to block it just on the basis we don't have time to review it.

@rafaelweingartner
Contributor Author

As said, I'm ok with that since I don't think it'd be fair to block it just on the basis we don't have time to review it.

Thanks for your support! If there is anything else missing here, please let me know. We have more patches to open for Gnocchi, but we are doing them one by one, to avoid dealing with conflicts on multiple fronts.

@jd jd merged commit 7a289c9 into gnocchixyz:master Mar 22, 2024
23 checks passed
rafaelweingartner added a commit to rafaelweingartner/gnocchi that referenced this pull request Apr 30, 2024
* Optimize Gnocchi aggregates API

The Gnocchi aggregates API can become slow over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments. This happens because Gnocchi does not apply filters in the MySQL queries against the resource tables; the start/stop filters are not applied, among others, which makes Gnocchi manipulate the whole dataset in the MySQL database and then execute the filtering in Python.

To cope with that situation, we are proposing an optimization that improves the query executed by Gnocchi in MySQL and applies the filtering right away; thus, the dataset manipulated by MySQL is drastically reduced. After applying the patch, we noticed a 5-6x reduction in the response time of the Gnocchi aggregates API.

* Use proper start timeframe when handling different resources together

* address Daniel's review

* Add some extra information regarding re-aggregation with the use of use_history

* Chungg review

* revert a change that does not work

* Remove documentation change
rafaelweingartner added a commit to rafaelweingartner/gnocchi that referenced this pull request May 28, 2024
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and this has a direct impact on the MySQL queries executed by the aggregates API.

This patch introduces a new parameter called `metric_inactive_after`, which defines how long a metric can go without receiving new datapoints before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark/consider the resource as removed.
rafaelweingartner added a commit to rafaelweingartner/gnocchi that referenced this pull request Jul 3, 2024
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and this has a direct impact on the MySQL queries executed by the aggregates API.

This patch introduces a new parameter called `metric_inactive_after`, which defines how long a metric can go without receiving new datapoints before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark/consider the resource as removed.
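
To illustrate the rule described in the commit message above (a hypothetical sketch, not the merged code): a resource can be considered removed once every one of its metrics has gone longer than metric_inactive_after seconds without receiving new datapoints.

import datetime

def resource_is_inactive(metric_last_datapoints, metric_inactive_after, now=None):
    # metric_last_datapoints: timestamp of the newest datapoint for each metric
    # of the resource; metric_inactive_after: threshold in seconds.
    if not metric_last_datapoints:
        return False  # no metrics at all: nothing to decide on
    now = now or datetime.datetime.utcnow()
    threshold = datetime.timedelta(seconds=metric_inactive_after)
    return all(now - last >= threshold for last in metric_last_datapoints)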