API to list all nodes in all projects #1054

Open
naved001 opened this issue Nov 9, 2018 · 14 comments

@naved001
Contributor

naved001 commented Nov 9, 2018

I think we should have an admin-only API to list all nodes in every project. This is something we need to know.

Currently we can achieve this by running list_project_nodes on every project, but it's quite slow (depending on how many projects we have).

@zenhack Wanted to get your thoughts before I implement this.
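For illustration only (no such endpoint exists yet, and the exact shape is up for discussion), the response could map each project to its nodes, something like:

{
    "project-a": ["node-01", "node-02"],
    "project-b": ["node-03"]
}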

@zenhack
Contributor

zenhack commented Nov 12, 2018

So would this just be a variant on list all nodes where we also get a field telling us what project they're in?

Have you tried parallelizing the individual list_project_nodes calls? That should help some, maybe enough for a stopgap while we think about the longer term picture.

I worry about cases like this, where we have an obvious way to do something, but we're talking about adding a shortcut because composing existing operations is too slow. Minimalism was a big goal, but if we're not able to express the things we need in terms of the minimal operations, I'd rather step back for a moment and ask whether there's some structural change we can make to avoid this sort of scenario. I'd rather not get into a situation where we have calls for x y and z, and then also shortcuts for x-then-y, x-then-z, y-then-z, x-then-y-then-z... We should instead see if we can't find a way to provide primitives that actually compose nicely.

@naved001
Contributor Author

So would this just be a variant on list all nodes where we also get a field telling us what project they're in?

Pretty much.

Have you tried parallelizing the individual list_project_nodes calls? That should help some, maybe enough for a stopgap while we think about the longer term picture.

Good point. I tried to parallelize my requests using Python's multiprocessing module, but the results were the same (it took 11-12 seconds to return the nodes in 8 projects). Maybe I'm doing it wrong. Relevant code snippet:

import requests
import os
import ast
from pprint import pprint
from multiprocessing import Pool

user = os.environ.get('HIL_USERNAME')
password = os.environ.get('HIL_PASSWORD')
url = os.environ.get('HIL_ENDPOINT') + '/v0/'
r = requests.get(url+'projects', auth=(user, password))
projects = ast.literal_eval(r.content)

def get_nodes(project):
    # Fetch the nodes belonging to a single project; returns {project: response body}.
    api = 'project/' + project + '/nodes'
    nodes = requests.get(url + api, auth=(user, password))
    key = {}
    key[project] = nodes.content
    return key

p = Pool(processes=5)
pprint(p.map(get_nodes, projects))

I'll get back to it later, I guess, and for now just use the slower sequential version.

I worry about cases like this, where we have an obvious way to do something, but we're talking about adding a shortcut because composing existing operations is too slow. Minimalism was a big goal, but if we're not able to express the things we need in terms of the minimal operations, I'd rather step back for a moment and ask whether there's some structural change we can make to avoid this sort of scenario. I'd rather not get into a situation where we have calls for x y and z, and then also shortcuts for x-then-y, x-then-z, y-then-z, x-then-y-then-z... We should instead see if we can't find a way to provide primitives that actually compose nicely.

Good point. Let's see if I can do it with parallel calls; otherwise, we'll think about the bigger picture before adding yet another (similar) API.

@zenhack
Contributor

zenhack commented Nov 16, 2018

How many threads is the server using? The example Apache config in INSTALL.rst specifies two, which is going to limit concurrency.

@naved001
Contributor Author

I specified four threads there:

WSGIDaemonProcess hil user=hil group=hil threads=4

@zenhack
Contributor

zenhack commented Nov 16, 2018

Another thought: the performance problem we hit with CI way back could be rearing its head when we're legitimately making a ton of API calls; password auth is going to be slow. If the parallelization doesn't help enough, it might be interesting to see whether that's the bottleneck.
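A rough standalone illustration (not HIL code) of why that can dominate: assuming the auth backend verifies a deliberately slow password hash on every request, eight list_project_nodes calls pay that cost eight times.

import hashlib
import os
import time

salt = os.urandom(16)

start = time.time()
for _ in range(8):  # e.g. one list_project_nodes call per project
    # Deliberately expensive key derivation, standing in for whatever
    # password hash the real auth backend checks per request.
    hashlib.pbkdf2_hmac('sha256', b'not-a-real-password', salt, 100000)
print('8 verifications: %.2fs' % (time.time() - start))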

@zenhack
Contributor

zenhack commented Nov 17, 2018 via email

@naved001
Contributor Author

Ok -- see if cranking that number up makes it any better.

I increased the number there to 16, and also upped the number of vCPUs allocated to the HIL server to 16, and there was no change.

@zenhack
Contributor

zenhack commented Jan 17, 2019

A couple more ideas:

  • I noticed you have processes=5 in the client -- if you up that to 16 as well, does it make a difference?
  • Is it similarly problematic on a local dev instance? Does changing the auth backend (maybe to a noop, although obviously don't touch that on prod...) affect it at all?

@naved001
Contributor Author

I noticed you have processes=5 in the client -- if you up that to 16 as well, does it make a difference?

Tried it and that didn't make a difference.

@apoorvemohan suggested that we try increasing the number of processes rather than threads, and that sped things up. Concurrent requests were served in 5 seconds as opposed to 25 seconds earlier.
Why did this work, and would this be a problem?
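For reference, the change presumably amounts to adding a processes option to the mod_wsgi daemon process group, along these lines (the count here is illustrative, not the exact value used):

WSGIDaemonProcess hil user=hil group=hil processes=8 threads=4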

@zenhack
Contributor

zenhack commented Jan 18, 2019

The Python interpreter doesn't actually support multiple threads executing Python code at once; there's a single mutex that covers the entire interpreter state (the Global Interpreter Lock, or GIL). Threads can still help if the workload is IO-bound, since blocking IO calls will typically release the lock, allowing other threads to run while one is blocked on IO. But if it's a CPU-bound workload, extra threads will do nothing -- if you want CPU parallelism in Python, you need processes.
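A minimal standalone sketch of the difference (not HIL code): the same CPU-bound function mapped over a thread pool and over a process pool.

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool
import time

def burn(n):
    # Pure-Python loop: the thread running it holds the GIL the whole time.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    work = [3000000] * 8
    for label, make_pool in (('threads', ThreadPool), ('processes', Pool)):
        pool = make_pool(8)
        start = time.time()
        pool.map(burn, work)
        pool.close()
        pool.join()
        # Expect the thread pool to take about as long as a serial run,
        # and the process pool to be several times faster on a multi-core box.
        print('%s: %.2fs' % (label, time.time() - start))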

Just having the server configured with more processes sounds perfectly reasonable.

This throws a little more evidence towards the idea that the bottleneck is the auth backend -- that's all CPU-bound, whereas the queries themselves should just be hitting the database.

I'd still like to know if swapping in the null auth backend fixes it, though obviously you need to reproduce the problem somewhere other than prod first.

If so, maybe we should be looking into more performant alternatives.

@naved001
Contributor Author

naved001 commented Jan 18, 2019

Thanks for the explanation, Ian!!

I'd still like to know if swapping in the null auth backend fixes it, though obviously you need to reproduce the problem somewhere other than prod first.

Yep, I'll try that.

@naved001 naved001 reopened this Jan 18, 2019
@naved001
Contributor Author

naved001 commented Jan 18, 2019

So now that I know how to make multiple requests in parallel, I tried adding this as a CLI feature to get all nodes in all projects.

def _get_nodes(project):
    """Helper function to use with p.map"""
    key = {}
    nodes = client.project.nodes_in(project)
    key[project] = nodes
    return key


@project_node.command(name='listall')
@click.option('--json', 'jsonout', is_flag=True)
def project_node_list(jsonout):
    """List all nodes attached to a <project>"""

    projects = client.project.list()
    p = Pool(processes=len(projects))
    pprint(p.map(_get_nodes, projects))

but it returns this

None: Max retries exceeded with url: /v0/project/<one-of-the-project-from-the-list>/nodes (Caused by None)

If I sleep for about 5 seconds after getting the list of projects, it works fine. Any insights on why this is happening? The other code in this thread, which just uses requests directly, works fine.

@naved001 naved001 reopened this Jan 18, 2019
@zenhack
Contributor

zenhack commented Jan 18, 2019

Unrelated to your immediate question, but it got me looking more closely at the code so I'll bring it up:

Don't use ast.literal_eval to parse JSON; use json.loads. Python literal syntax isn't a strict superset of JSON, so certain inputs will fail that shouldn't -- chiefly those involving booleans or null, which don't share the same representation -- and there are also inputs that will parse when they should fail.

Also every time I see ast.literal_eval I almost have a heart attack until I realize that literal_eval isn't actually unsafe like eval is -- but you really have to know the API to be able to look at that and be sure it's not introducing an RCE; json.loads is more obviously tame and we use it everywhere so folks know what it does.

Also, a more minor point: instead of pprint, I'd suggest something like json.dumps(..., indent=' ', sort_keys=True); it's a little gross that we're showing Python literal syntax to the end user.
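A quick standalone illustration of both points:

import ast
import json

payload = '{"free": true, "owner": null}'  # valid JSON
print(json.loads(payload))                 # {'free': True, 'owner': None}
# ast.literal_eval(payload) raises an error: true/null aren't Python literals.

# And for output, instead of pprint:
print(json.dumps({'proj-a': ['node-1', 'node-2']}, indent='  ', sort_keys=True))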

Sorry for the lecture. To address your actual question, I'm not sure why there would be a difference in behavior between the client lib and the manual HTTP code above. Is there anything interesting in the server logs about why it's refusing connections? Do we have a max concurrent connections limit set somewhere that you're exceeding?

@naved001
Contributor Author

naved001 commented Jan 18, 2019 via email
