built-in measurements #3
I will work on porting the current netrics tests to individual python executables. I've ported the basic ping command (need to fix how it loads in custom config – currently hard-coded). The python executable can be found under …
The following list of tests should be included in the `builtin` (or stdlib) measurements:
- netrics-ping (simple ping to specified destinations)
- netrics-traceroute (traceroute to specified destinations)
- netrics-ookla (runs Ookla speed test)
- netrics-ndt7 (runs NDT7 speed test)
- netrics-lml (measures the last mile latency)
- netrics-dns-latency (measures DNS response time)
- netrics-devs (counts the number of devices)

Many of these will likely get their own issue as we develop / port them from Netrics v1.
Awesome 😎
@kyle-macmillan – Yes looks totally reasonable. Brief comments on the ping module that might be helpful:
(Re: retries: Perhaps we'll want to require measurements to explicitly request a retry rather than doing it by default. That likely makes more sense, and better matches what we were doing before. I'm not sure ping has any need of retries. But e.g. a speedtest might get a "busy" response, and request an immediate-to-soon retry. Say, by default – …)
Thanks for this great feedback! I will look into suggestions / basically agree with you on all points. A few comments/questions:
Cool 😎
I think parallelizing ping should be totally fine. I can work on that as well. I have used the multiprocessing module before. Do you have any suggestions?
So long as we're doing a subprocess of the `ping` program (rather than a Python process-internal implementation), the multiprocessing module isn't really necessary, (tho it certainly can be used) – the "multiple processes" are the parallel pings we invoke; they don't each need their own copy of our module program to manage them. (For that matter we _could_ use a thread pool; but, I'm willing to bet we don't need it here. Indeed we just have to manage a subprocess pool.)
(I will say, I'd be curious how much faster this sort of implementation actually makes the module … 🤔)
Anyway, keep `subprocess.run` in mind! But if we're parallelizing here then we might go lower-level (kind of pseudo-code but which might nearly work):
    import itertools
    import subprocess

    MAX_PROCESSES = 5  # max concurrency

    pings_complete = []  # IFF we want a cumulative rather than iterative report

    # batch arguments by max concurrency
    # (iterator so we don't have to track indices, etc.)
    target_stream = iter(targets)  # from config or whatever

    while processing := {
            subprocess.Popen(['ping', '-c', '3', target],  # ping options are placeholders
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            for target in itertools.islice(target_stream, MAX_PROCESSES)}:
        while processing:
            # wait for at least one to complete
            proc0 = next(iter(processing))
            proc0.wait()
            complete = {proc for proc in processing if proc.poll() is not None}
            processing -= complete
            # if reporting individually we could just write to stdout here; if instead it's
            # a single cumulative report we'll collect results here and write them out later
            pings_complete.extend(complete)
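(For comparison, the simpler serial shape with `subprocess.run` – placeholder targets and ping arguments, writing per-target output immediately:)

```python
import json
import subprocess

targets = ['8.8.8.8', '1.1.1.1']  # placeholder targets (from config or whatever)

for target in targets:
    # one ping subprocess at a time; options are placeholders
    result = subprocess.run(['ping', '-c', '3', target],
                            capture_output=True, text=True)
    print(json.dumps({'target': target,
                      'returncode': result.returncode,
                      'output': result.stdout}))
```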
Want to make sure I have these things straight: Upon completion, the module prints results to stdout and errors to stderr (both as JSON objects) as well as exiting with the appropriate exit code.
Right. (Neither has to be JSON but that makes sense as a default. And writes can happen at any point during execution but at the end is fine.)
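To illustrate that contract concretely, a minimal (entirely hypothetical) module might look something like the following sketch – the target and ping arguments are placeholders:

```python
import json
import subprocess
import sys


def main():
    """Hypothetical minimal module: results to stdout, errors to stderr, exit code."""
    result = subprocess.run(['ping', '-c', '1', '8.8.8.8'],  # placeholder target/options
                            capture_output=True, text=True)

    if result.returncode != 0:
        # report the error (here as JSON) on stderr and exit non-zero
        json.dump({'error': 'ping failed', 'detail': result.stderr}, sys.stderr)
        sys.exit(1)

    # report results (here as JSON) on stdout and exit zero
    json.dump({'raw': result.stdout}, sys.stdout)


if __name__ == '__main__':
    main()
```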
…
So precisely how the framework handles these module outputs and exit codes is not set in stone. But obviously this will inform how modules are written. My current thinking is:
• Exit code zero: The module ran without error and has meaningful results to publish. Publish those results written to stdout. (Anything written to stderr should almost certainly be handled as well – sent wherever that will go. This might include warnings or other non-erroneous informational logging for the user.) Any subprocess, such as ping, _may_ have exited with some non-zero code; but apparently the module interpreted this as a meaningful result – say the target host is unavailable – (and ping's exit code is none of the framework's business).
• Exit code 42 (or something): The module requests to be rerun according to whatever logic we set up for that. (Likely a configurable cadence.)
• Exit code non-zero (not 42): The module encountered an error (and should not be retried except as otherwise scheduled). Maybe a subprocess had a real error that we need to flag. Maybe the module threw an uncaught exception (which I believe will generally cause it to exit with code 1).
What I'm unsure of is what to do with results printed to stdout in the event of a non-zero exit code. This might just be a bit nuanced at this point (when we're still drafting the framework). My initial thinking was that non-zero means "we're screwed" and we shouldn't even *trust* any results written to stdout. But I wonder if that should be left up to the module – any results it wrote were real results, and that's why it wrote them – it just so happens that it _also_ had a real error that needs special reporting. (And perhaps the latter contract takes better advantage of the fact that we have these three degrees of freedom.) But I am curious what others might think and how this will play out. (And certainly this is something else that can be made configurable; though we at least want a reasonable default.)
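For concreteness, the framework-side dispatch under the above scheme might look roughly like this sketch (the retry code and the print-based handling are placeholders):

```python
RETRY_EXIT_CODE = 42  # placeholder "rerun me" code from the scheme above


def handle_completion(name, returncode, stdout, stderr):
    """Hypothetical framework-side dispatch on a completed module."""
    if stderr:
        print(f'[{name}] stderr: {stderr!r}')  # general logging channel

    if returncode == 0:
        print(f'[{name}] publishing results: {stdout!r}')  # meaningful results
        return 'publish'

    if returncode == RETRY_EXIT_CODE:
        print(f'[{name}] module requested a rerun')
        return 'retry'

    print(f'[{name}] failed with exit code {returncode}')  # flag the error
    return 'error'
```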
…
As for that "special reporting" – particularly if we are allowing stderr to be used as a general logging channel – I wonder what we might want to have, say:
• framework-level log file (defaults to ERROR only)
• receives DEBUG messages of "I will run this thing"
• … INFO messages of "this thing completed successfully with this (abbreviated) stdout and this stderr"
• …(perhaps WARN messages regarding retries)
• … ERROR messages of "this thing failed with this non-zero exit code, this (abbreviated) stdout and this stderr"
• stderr log file: basically just what modules write to stderr. perhaps use logging (built-in or otherwise) with a configurable, default level-less signature like "{datetime} {module} {message}".
…with the two of course configurable/disableable, (paths or /dev/null, and log level filters); (and using a rotating logging handler because we're nice folks).
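For instance, the stderr log file might be wired up with the built-in `logging` module roughly as follows – the path, rotation sizes and the level-less-style format are placeholders:

```python
import logging
import logging.handlers

# stderr log file: "{datetime} {module} {message}", rotated because we're nice folks
handler = logging.handlers.RotatingFileHandler(
    'netrics-stderr.log',            # placeholder path (configurable, or /dev/null)
    maxBytes=1_000_000, backupCount=5,
)
handler.setFormatter(logging.Formatter('{asctime} {name} {message}', style='{'))

stderr_log = logging.getLogger('module-stderr')
stderr_log.addHandler(handler)
stderr_log.setLevel(logging.INFO)

# e.g., upon a module's completion:
# stderr_log.getChild('ping').info(captured_stderr.strip())
```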
And since I'm getting into log file formats, I'll say that these don't extend to the stdout results. Rather, for the stdout results we should probably try to respect their serialization in an individualized measurement results file – if we receive JSON then we write a .json file somewhere for the measurement result – just as we do now. (Really we should have a default option to .gz these as well.)
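A minimal sketch of that default persistence, gzip option included (the directory layout and naming scheme here are just for illustration):

```python
import gzip
from datetime import datetime, timezone
from pathlib import Path


def persist_result(module, stdout_text, results_dir='results', compress=True):
    """Write a module's stdout to its own result file, optionally gzipped.

    (Illustrative only: the directory layout and naming scheme are placeholders.)
    """
    stamp = datetime.now(timezone.utc).strftime('%Y%m%dT%H%M%S')
    path = Path(results_dir) / f'{module}-{stamp}.json'
    path.parent.mkdir(parents=True, exist_ok=True)

    if compress:
        with gzip.open(f'{path}.gz', 'wt') as fd:
            fd.write(stdout_text)
    else:
        path.write_text(stdout_text)


# e.g.: persist_result('ping', '{"rtt_ms": 12.3}')
```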
And we _could_ do the same for what's written to stderr. I'm split on what would be the most useful. My impression is that what's written there is less likely to end up inserted into anything like a time-series database. Rather, these are really standard logs – which might end up in a fancier store (even a structured store) – but which belong in this sort of rotated file format by default.
Even that needn't mean throwing out the useful structure of JSON written to stderr. We would likely want to again detect this serialization and reproduce it, though perhaps printing something more compact and more legible to a log file:
2022-05-05T01:23:01 ping {flux_capacitor=true warp_speed=8 target="8.8.8.8"}
(I believe the above message format could nearly be achieved with something like `json.dumps(obj, separators=(' ', '='))`. Not sure how to omit the superfluous quoting of the keys. Indeed I may have just written compact TOML 😅)
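Alternatively, a tiny formatter gets us the unquoted keys – just a sketch, not a committed format:

```python
import json


def compact_struct(obj):
    """Render a flat mapping as {key=value …} with JSON-encoded values."""
    body = ' '.join(f'{key}={json.dumps(value)}' for key, value in obj.items())
    return '{' + body + '}'


# compact_struct({'flux_capacitor': True, 'warp_speed': 8, 'target': '8.8.8.8'})
# -> '{flux_capacitor=true warp_speed=8 target="8.8.8.8"}'
```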
…Or something. …If that helps 😉
Thanks for the feedback, I've opened a separate issue (#5) for "What exit code should be raised for measurements like …".
Right:
Presumably, the results for all succeeding sub-tests will be written to stdout, and persisted. (This is up in the air but it seems like that's the direction we're headed, regardless of exit code.) But if the exit code is non-zero, the framework will also flag this exceptional result. Separately, indeed, we could add support for concurrency to the controller 😄 I dunno! Should we? (We might have a configuration key like …)
I guess one important question regarding the possibility of pushing concurrency onto the framework: how should the results of concurrently-executed, parameterized tests be persisted? If they're simply persisted as unrelated tests run around the same microsecond, I'm not sure users would be happy. (Though I'm not sure – would that suffice for you?) On the other hand, is there a generic way for results to be joined that would make everyone happy? (And I say everyone because this might be a tough thing to make configurable.)

Say with the following measurements configuration:

    ping:
      parameters:
        - google.com
        - wikipedia.org
      concurrency: 2

For the above, the framework might run the `ping` module once per parameter, passing each execution an extended configuration along the lines of:

    {
      "parameter": "google.com",
      "parameters": [
        "google.com",
        "wikipedia.org"
      ]
    }

And then perhaps the result persisted for the collection of test executions could be generically:

    {
      "google.com": {
        …
      },
      "wikipedia.org": {
        …
      }
    }

That might well work and it might well be worthwhile. But I'm unsure that it's sufficiently generic for all conceivable modules. Another possibility perhaps, rather than asking people to somehow write a proper reducer function into their configuration, would be to again leverage Jinja templating, to construct the results (e.g. to recreate the above proposed default):

    ping:
      persist: |
        {
        {% for name, result in results %}
          "{{ name }}": {{ result }}{{ "" if loop.last else "," }}
        {% endfor %}
        }

(…But doesn't templating out JSON kind of stink?)
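For what it's worth, rendering such a `persist` template could be as simple as the following sketch (assuming the `jinja2` library, and that `results` is handed to the template as an iterable of `(name, serialized-result)` pairs – the helper name is hypothetical):

```python
import jinja2  # assumed third-party dependency


def render_persist(template_text, results):
    """Render a configured persist template against collected results.

    `results` is assumed to be an iterable of (name, serialized-result)
    pairs, matching the template's `for name, result in results` loop.
    """
    template = jinja2.Template(template_text, trim_blocks=True, lstrip_blocks=True)
    return template.render(results=results)


# e.g.:
# render_persist(config['persist'],
#                [(name, json.dumps(res)) for name, res in outcomes.items()])
```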
…isting …Plus additional functionality:
* complete measurement configurability (via fate/netrics measurements file)
* validation of structured configuration via schema
* pre-measurement checks of execution environment (localhost) and LAN (gateway)
* parallelization of internet ping requests against configured targets

Future work:
* "near parity" is defined by the deferral of the command-debugging zip archive to future work

Issue tracking:
* part of #3: "built-in measurements"
* resolves #12: "…common/global folder for stdlib netrics tests"
* resolves #17: "netrics-ping returns errors if one address of the list is unreachable…"
Adds both task `lml` (executable `netrics-lml`) and task alias `lml-scamper` (`netrics-lml-scamper`) for clarity. This version of the measurement is added to the default configuration but commented out as it may not be enabled by default.

For speed and niceness, this lml test will randomly select an endpoint to trace, from a default set of Google DNS as well as CloudFlare DNS. (The former is generally speedier to respond; regardless, it might be "nice" to round-robin, and for that matter to fall back in the event of response failure.)

Otherwise: Measurements' schedules in default configuration now use hashed cron expressions to *discourage* their collision (though any which *may not* run concurrently will be configured as such).

completes #13
part of #3
Adds task `lml-traceroute` (executable `netrics-lml-traceroute`). This task mirrors the `lml-scamper` version already published. It differs in that the endpoint is traced with `traceroute` and then `ping`'d; (and, traceroute output must be specially parsed, which is relatively laborious).

As with `lml-scamper`, for speed and niceness, `lml-traceroute` will randomly select an endpoint to trace, from a default set of Google DNS as well as CloudFlare DNS. (The former is generally speedier to respond; regardless, it might be "nice" to round-robin, and for that matter to fall back in the event of response failure.)

completes #25
part of #11
part of #3
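A rough sketch of that random-selection-with-fallback behavior (using plain `traceroute`, as in the `lml-traceroute` variant; the endpoint set and the bare `traceroute` invocation are placeholders):

```python
import random
import subprocess

# placeholder default endpoint set: Google DNS and Cloudflare DNS
DEFAULT_ENDPOINTS = ['8.8.8.8', '1.1.1.1']


def trace_one(endpoints=DEFAULT_ENDPOINTS):
    """Randomly select an endpoint to trace, falling back to the rest on failure."""
    for endpoint in random.sample(endpoints, len(endpoints)):
        result = subprocess.run(['traceroute', endpoint],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return endpoint, result.stdout  # output to be parsed for the last-mile hop

    return None, None  # no endpoint responded
```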
As a user of Netrics – either a "legacy" user or "new" – I expect the software to ship with a collection of built-in measurements (at least those available in the previous software). As such, existing measurement code, of potential use to us or to others, must be ported to the new Netrics framework.
This work is largely one of deletion rather than addition – measurements should be boiled down to their essential functionality, (without retry logic, etc.).
Under the Netrics framework, all measurements are simple executables.
Builtins will be installed with names prepended by `netrics-`, e.g. `netrics-ping`. This convention will make their purpose clear, have the effect of "bundling" them despite their separate existence, and these names will be given special treatment in configuration: (references to the measurement `ping` will imply the executable `netrics-ping`, whether `netrics-ping` is a built-in or not). However, builtins needn't be named this way in the codebase – (it is trivial to apply this convention upon installation) – within the codebase, `ping` is fine.
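For illustration only, the configured-name resolution described above might look something like this sketch (the helper and the `shutil.which` lookup are assumptions, not the settled scheme):

```python
import shutil


def resolve_measurement(name):
    """Map a configured measurement name to its executable.

    A reference to "ping" implies the executable "netrics-ping", whether or
    not it is a built-in; fall back to the bare name if no prefixed
    executable is found. (Hypothetical helper -- the real scheme is TBD.)
    """
    return shutil.which(f'netrics-{name}') or shutil.which(name)
```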
The "ideal" or simplest measurement is a standalone script: Python or otherwise. (It must handle its own execution via shebang, etc.) Installation will copy/link each built-in measurement script to the appropriate path and under the appropriate name.
That said, Python-based measurements are free to be more sophisticated than standalone scripts. (So long as there is a module with a single function to invoke without arguments – e.g. `main()` – then installation may trivially construct a script to invoke it.) This might allow our builtins to share code (to be "DRY"), etc. Such a package of built-in measurements might be contained within the `netrics` package (at `src/netrics/measurement/builtin/` or otherwise). (How this looks remains to be seen and is up to the implementer.)

Upon execution, the measurement's contract with the framework is as follows:
• Read its configuration (if any) from standard input.
• Do its business (if it's `ping`, then run `ping`.) If there's an exception, the return code should reflect this.
• Write its results to standard output (via `print`, `echo`, etc.) – presumably in JSON format, but the framework does not (at this point) care.***

*** (I'm really split on the best data format for this. JSON input and output make sense; and, if it's JSON then you'd just: `json.load(sys.stdin)`. Obviously this is more natural if the main measurements config is itself JSON or YAML. But, I think TOML is friendlier than YAML for such files. Regardless, the framework can send the measurement config in whatever format, and even let it be configurable. To the framework code, there's nothing exceptional about reading TOML and writing JSON; though, to human beings involved on either end, it might be a little odd.)
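For what it's worth, accepting either format on stdin really is unexceptional on the measurement side – a sketch, assuming Python 3.11+ for the standard-library `tomllib` and JSON-by-default input:

```python
import json
import sys
import tomllib  # standard library as of Python 3.11

raw = sys.stdin.read()

try:
    params = json.loads(raw)       # JSON config by default ...
except json.JSONDecodeError:
    params = tomllib.loads(raw)    # ... or fall back to parsing it as TOML

json.dump(params, sys.stdout)      # whereas whatever we write back out can still be JSON
```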