Optimize Gnocchi aggregates API #1307
Conversation
@tobias-urdin can you look into this one? It is a nice-to-have in Gnocchi.
We need to merge this one first: #1308
just to confirm, this is to enable resource history filtering?
@Mergifyio rebase
❌ Unable to rebase: user
@Mergifyio rebase
✅ Branch has been successfully rebased
Force-pushed from 44b1f5a to fa9a23d
Force-pushed from d75edc2 to 0f2b249
Hi Rafael, nice work here, I have only a simple suggestion
Force-pushed from 273185a to 607d54f
Hello @chungg and @tobias-urdin, I guess this patch is ready for your reviews again.
I will check sometime this week, but it's probably best to get @chungg's review as well. Please rebase on the now-merged changes for SQLAlchemy 2.0 support, remove some duplicate commits that seem to be on here, squash commits, and add some clear commit messages. Thanks!
Force-pushed from bc92a75 to 69b0b7c
@tobias-urdin and @chungg, rebase and squash done!
Force-pushed from 958200c to 7648012
Hello guys,
you probably don't need me to approve. i'm just getting familiar with this again.
i guess my only other feedback would be it'd be good to have a new test rather than just making sure existing ones don't break.
Force-pushed from 282c51b to aece611
@chungg and @tobias-urdin I have rebased the patch. Can you guys check it?
Hello @tobias-urdin, @chungg, @jd, do you guys think that we need to make some more adjustments to the code so it can get merged?
There is no way I could review this TBH, it's been so long. I'm ok to approve and merge it though, as I don't want to block any improvement.
Thanks! BTW, we have been constantly improving Gnocchi for years so far, and we have more coming. Would it be possible for one of us to become a committer with merge permissions in the project here?
I don't have the time to go through this in detail or test it, but just browsing through it and the fact that the testing is green (hoping that most of what this touches is tested) makes me OK to just merge.
I do however have some remarks.
- The commit messages are very undescriptive, and it looks like this PR is addressing more things than just the optimization itself.
- The revision_start -> search_start renaming, etc.: why was that done, can you explain? Did we merge broken code with the group history PR before? If so, why is that not in a separate PR instead of being bundled into this PR, where it is unrelated to the optimization itself?
- This adds a lot of LOG.debug calls, in pretty much the entire code path. I understand that might be good for debugging, but is it really necessary that we merge all that?
Overall it's kind of hard to understand which parts of this change are just formatting cleanup and which parts fix bugs; I would hope that would be cleaner in the future.
I'm all for giving out more access; my primary goal is still maintenance and keeping the project alive, especially to make sure it doesn't block anything in the OpenStack ecosystem so the project doesn't get dropped from there (keeping up with things such as the SQLAlchemy 2.0 migration).
I do really appreciate that you're taking the effort to address larger flaws in Gnocchi, like this one; I didn't even know about it until this PR. But I would hope that a high standard in commits is still kept even if the project is in maintenance mode.
```
@@ -211,19 +211,22 @@ def add_measure(self, measure, date_map, start, stop):
         self.measures.append(measure)

     def join_sequential_groups(self, group):
         group.sort(key=lambda key: key['revision_start'])
```
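For context, a minimal sketch of what the sort shown above does, assuming each group entry is a dict carrying a 'revision_start' timestamp (the dict layout here is an assumption for illustration, not Gnocchi's actual structure):

```python
# Illustrative only: order a group's entries chronologically by
# 'revision_start' so that subsequent entries can be compared/merged
# in chronological order.
import datetime

group = [
    {"revision_start": datetime.datetime(2023, 8, 11, 2), "flavor_id": "m1"},
    {"revision_start": datetime.datetime(2023, 8, 11, 0), "flavor_id": "m1"},
    {"revision_start": datetime.datetime(2023, 8, 11, 1), "flavor_id": "m1"},
]

group.sort(key=lambda entry: entry["revision_start"])
# group is now ordered 00:00, 01:00, 02:00.
```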
Can you explain these changes?
Yes, this is done to use/group the results of the search according to the search start/end ranges. The way it was implemented before was not optimized, as it was loading all resources and only applying the start/end filtering in Python code.
I still don't understand the revision_* -> search_* changes.
I also think this PR is too broad in what it does and should be cleaned up and made more minimal. For example, the optimization to _get_history_result_mapper and the changes to the aggregate/grouping code.
The problem is that I do not understand what you do not understand.
This PR is addressing a situation with the Gnocchi aggregates API. This API can present slowness over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments.
The situation we reported happens in some cases when we have a very large number of resources in the cloud environment. The slowness happens because when the aggregates API is called, it does not apply the filters sent in the API directly in the MySQL query; it only applies some filtering. Therefore, it will bring more elements in the response than what is needed. For instance, the start/stop filters are not applied, and others as well, which makes Gnocchi manipulate the whole dataset (resources) in the MySQL database, with the filtering then executed in Python. In some situations we have seen that databases almost broke or got into a stuck/stale state.
To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away. By that, I mean applying the filters directly in the MySQL query. Therefore, we reduce the result-set size and the number of elements manipulated by the MySQL query. After applying the patch, we noticed a 5-6 fold reduction in the response time of the Gnocchi aggregates API for large-scale cloud environments.
What don't you understand?
Do you need an example of a query that we normally execute? This is one example of a query that we used to discover the reported issue:
```
/v1/aggregates?details=False&granularity=3600&start=2023-08-11T00%3A00%3A00%2B00%3A00&stop=2023-08-11T01%3A00%3A00%2B00%3A00&use_history=True&groupby=id&groupby=display_name&groupby=flavor_id&groupby=flavor_name&groupby=user_id&groupby=project_id&groupby=revision_start&groupby=availability_zone&groupby=image_ref&groupby=flavor_vcpus&groupby=flavor_ram&groupby=operating_system_name&groupby=operating_system_distro&groupby=operating_system_type&groupby=operating_system_version&groupby=mssql_version"
  -H "Content-Type: application/json" -H "Accept: application/json, */*" -H "X-Auth-Token: ${ACCESS_TOKEN_KEYSTONE}"
  -d '{"operations":["aggregate","max",["metric","dynamic_pollster.compute.services.instance.status","max"]],"search":{"and":[{"=":{"type":"instance"}},{"=":{"project_id":"<project_id>"}}]},"resource_type":"instance"}'
```
The above query is one that might be generated by CloudKitty, for instance.
As you can see, with a lot of groupbys, MySQL will have trouble processing the large dataset generated by Gnocchi queries, as they do not properly apply the filters sent in the API request.
Furthermore, can you explain to me why you think that the PR is too broad?
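To make the idea concrete, here is a rough sketch of the difference between filtering in Python after loading everything and pushing the filters into the SQL query. The model and column names (Resource, started_at, ended_at, project_id) are simplified assumptions for illustration, not Gnocchi's actual indexer code:

```python
# Rough sketch only -- not Gnocchi's indexer code. Model and column names
# are assumptions made for the illustration.
import sqlalchemy as sa
from sqlalchemy import orm

Base = orm.declarative_base()


class Resource(Base):
    __tablename__ = "resource"

    id = sa.Column(sa.String(36), primary_key=True)
    type = sa.Column(sa.String(255))
    project_id = sa.Column(sa.String(36))
    started_at = sa.Column(sa.DateTime)
    ended_at = sa.Column(sa.DateTime, nullable=True)


def find_resources_python_side(session, start, stop, project_id):
    # Pre-patch style: load every resource of the type, then filter in Python.
    rows = session.scalars(
        sa.select(Resource).where(Resource.type == "instance")).all()
    return [r for r in rows
            if r.project_id == project_id
            and r.started_at <= stop
            and (r.ended_at is None or r.ended_at >= start)]


def find_resources_sql_side(session, start, stop, project_id):
    # Post-patch style: express the same filters directly in the SQL query,
    # so the database only returns rows overlapping [start, stop].
    query = sa.select(Resource).where(
        Resource.type == "instance",
        Resource.project_id == project_id,
        Resource.started_at <= stop,
        sa.or_(Resource.ended_at.is_(None), Resource.ended_at >= start),
    )
    return session.scalars(query).all()
```

The second form is what lets the database prune the result set before Python ever sees it.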
I give you a soft-yes and will let @tobias-urdin and other interested people weigh in.
Let me start by saying that Gnocchi is very convoluted and quite complicated code-wise. The idea is simple, but the way it was coded makes it very difficult to understand what is being done, how to fix the issues we find, or how to develop the extensions that are needed. We are trying to address that by adding logs, changing/creating functions with more descriptive names, and so on. However, it is a task that is going to take much more time until we reach a state where things get better.
Regarding the addition of logs, it goes towards that situation. It is very difficult to troubleshoot Gnocchi; therefore, adding those logs will help people work with Gnocchi in the future. Furthermore, they (the logs) do not cause any harm. You can run at the INFO log level (debug=false), and then you will not see those DEBUG logs.
Those flaws that nobody knows exist, which we are fixing/patching, come from the fact that most OpenStack setups are not taking full advantage of Gnocchi (as far as we have seen). We are only facing them because we are pushing Gnocchi to its limit without cleaning/removing things (the setup) at every upgrade with a newer installation and so on. Moreover, we have a generic billing pipeline that receives data not just from OpenStack, but from other systems that compose the cloud.
Regarding the description: it is clear for people using Gnocchi who have some understanding of its inner workings. Every metric pushed to Gnocchi is created for a resource. Every time we create a resource, the created_at field is marked, and every time we update it, a revision is created for it. When we do a search via the Gnocchi search API, before this patch, the search was going through all resources in the system. However, that is costly. We can filter the resources affected by the search using the created_at/revision_start and ended_at/revision_end fields when they are provided by the user calling the API. This is the optimization that we are doing here. Gnocchi was filtering the results in Python, loading all measures, for all resources, for all timestamps. What we propose is to filter the result-set right away in the DB.
Regarding the quality standard, can you describe/show us what needs to be improved? Then we can work to address it and bring the standard to the level you expect it to be.
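As a small illustration of that created_at/revision_start and ended_at/revision_end filtering (a sketch with simplified types, not Gnocchi's actual models):

```python
# Minimal sketch, not Gnocchi code: each resource update produces a revision
# with a [revision_start, revision_end) lifetime, and a search window should
# only touch the revisions that overlap it.
import dataclasses
import datetime
from typing import Optional


@dataclasses.dataclass
class Revision:
    resource_id: str
    revision_start: datetime.datetime
    revision_end: Optional[datetime.datetime]  # None means "current revision"


def overlapping_revisions(revisions, search_start, search_end):
    """Keep only revisions whose lifetime overlaps [search_start, search_end].

    This is the predicate the patch pushes down into SQL instead of applying
    it in Python after loading every revision of every resource.
    """
    return [r for r in revisions
            if r.revision_start <= search_end
            and (r.revision_end is None or r.revision_end >= search_start)]
```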
I do not know what that means =). However, even if we do not deserve a committer role, we just need to understand what the requirements are before a merge can be done. We have more features and optimizations to push to Gnocchi, and sometimes the time it takes to get them in is quite long.
I think the root problem is that the project is barely maintained at this stage. All the people who worked on Gnocchi years ago and wrote most of that code are gone in the wild. I'm not concerned by the project's fate, maintenance, etc., at this stage, so I'm not going to block anything. As I said, I'm giving a soft yes, meaning "I think it's better to have this merged because you seem to care about the project more than anyone else, but also, I'm not smart enough to say yay that works let's merge it."
Thanks for the message! I understand what you mean. We are trying to reduce the scope for every new PR here. Gnocchi is a very interesting system. I mean, it has an interesting design and idea/concept, in which we see a lot of potential. However, code-wise it has problems that we are working to address and improve.
Hello guys,
I was revisiting this PR, and the only file that was changed that is not directly related to the patch proposed here is the following:
I can extract it if needed, but it was an opportunity to improve the documentation, and we just did it. All of the rest is related to the changeset proposed. All of the logs were needed/helped us to understand the situation, and they are being added at the DEBUG log level. Therefore, if one does not want them, it is a matter of disabling DEBUG.
Yes, if you have changes that can be extracted, do this; that'll help for sure!
Done. The documentation improvement was extracted to: #1373 |
Pull request has been modified.
Hello guys, can we merge this one? Or, do you think that we need some further improvements here? The part of the documentation was already extracted to #1373. |
As said, I'm ok with that since I don't think it'd be fair to block it just on the basis we don't have time to review it.
Thanks for your support! If there is anything else missing here, please let me know. We have more patches to open for Gnocchi, but we are doing them one by one, to avoid managing/dealing with conflicts on multiple fronts.
* Optimize Gnocchi aggregates API

  The Gnocchi aggregates API can present slowness over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments. This happens because Gnocchi does not apply filters in the MySQL queries on the resource tables; that is, start/stop filters are not applied, and others as well, which makes Gnocchi manipulate the whole dataset in the MySQL database before the filtering is executed in Python. To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away; thus, the dataset manipulated by MySQL will be drastically reduced. After applying the patch, we noticed a 5-6 fold reduction in the response time for the Gnocchi aggregates API.

* Use proper start timeframe when handling different resources together
* Address Daniel's review
* Add some extra information regarding re-aggregation with the use of use_history
* Address chungg's review
* Revert a change that does not work
* Remove documentation change
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left there, and this has a direct impact on the MySQL queries executed by the aggregates API. This patch introduces a new parameter called `metric_inactive_after`, which defines for how long a metric can go without receiving new datapoints before we consider it inactive. Then, when all metrics of a resource are in the inactive state, we can mark/consider the resource as removed.
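A tiny sketch of the inactivity idea described in that change; the helper names and evaluation logic here are assumptions for illustration, only the `metric_inactive_after` option name comes from the message above:

```python
# Hypothetical illustration of the "metric_inactive_after" idea: a metric is
# considered inactive once it has not received datapoints for longer than the
# configured threshold, and a resource whose metrics are all inactive can be
# treated as removed.
import datetime


def metric_is_inactive(last_measure_at, now, metric_inactive_after_seconds):
    threshold = datetime.timedelta(seconds=metric_inactive_after_seconds)
    return (now - last_measure_at) > threshold


def resource_looks_removed(last_measures, now, metric_inactive_after_seconds):
    # last_measures: timestamp of the newest datapoint for each metric of the
    # resource. If every metric is inactive, the resource can be marked ended.
    return all(metric_is_inactive(ts, now, metric_inactive_after_seconds)
               for ts in last_measures)
```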
The Gnocchi aggregates API can present slowness over time, due to the growing number of resources and revisions in large-scale OpenStack cloud environments. This happens because Gnocchi does not apply filters in the MySQL queries on the resource tables; that is, start/stop filters are not applied, and others as well, which makes Gnocchi manipulate the whole dataset in the MySQL database before the filtering is executed in Python.
To cope with that situation, we are proposing an optimization to improve the query executed by Gnocchi in MySQL and apply the filtering right away; thus, the dataset manipulated by MySQL will be drastically reduced. After applying the patch, we noticed a 5-6 fold reduction in the response time for the Gnocchi aggregates API.