Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework toposort algorithm as an ancestor traversal #3646

Open
wants to merge 905 commits into
base: master
Choose a base branch
from

Conversation

PaulWay
Copy link
Contributor

@PaulWay PaulWay commented Jan 4, 2023

The 'toposort' function was identified as using significant amounts of processor time in real world archive processing. Looking at the code, it does a number of things that make it slow:

  • it copies the entire data dictionary first
  • it adds empty set values to that dictionary for items without a dependency
  • it traverses the entire data dictionary looking for empty set values
  • it rebuilds the data dictionary each time to remove the dependency values that have just been removed

While these operations provide a logical way of processing the data, they are wasteful of both space and processing time.

The new algorithm is a kind of inverse of that method - rather than working on items with a set of dependencies, it works on a dependency with a set of 'ancestors' that depend on that item. It can be seen as a 'data flow' model rather than a 'process flow' model: the original algorithm modelled objects (combiners, parsers, specs) calling other objects, and the new algorithm models the data flowing from the specs to the parsers to the combiners and so on.

The new algorithm starts by creating a dictionary of 'ancestors', whose keys are the objects in the value sets of the original data, and whose values are sets of the keys of the original data. It also stores the count of dependencies on each original data key (object). Before iterating, it also stores the set of 'terminators' - keys in the original data that are not ancestors of any object; these are the objects that nothing depends on. Rather than adding these to the dictionary of ancestors (and thus bulking out the ancestor dictionary), these are stored as a separate set for later.

We then iterate in much the same way as the original algorithm. We find the set of ancestors that (now) have no dependents (which is now a set difference rather than a dictionary traversal), yield that set, then decrement the dependency counts for each of the dependents of those 'free' ancestors and delete them from the dictionary of ancestors. If we found no 'free' ancestors, then we've either traversed our way to the 'terminators' set of items that nothing else depends on, or we've found circular dependencies (as in the original algorithm).

Thus we've reduced the set differences to integer subtractions, the dictionary traversal to a key set difference, and eliminated the original data copy and the set union and bulking.

In testing, this is between 1.4 to 1.5 times faster than the original algorithm over small to medium sized dependency graphs (over 100,000 iterations). In other words, this reduces the time taken by 45% to 50%.

We also update the 'toposort_flatten' function to act as a generator. This doesn't improve the test speed noticeably, but it allows for a slightly faster start to processing the first item and may allow better concurrency in multi-threaded or asynchronous implementations.

Signed-off-by: Paul Wayper paulway@redhat.com

All Pull Requests:

Check all that apply:

  • [y] Have you followed the guidelines in our Contributing document, including the instructions about commit messages?
  • [y] Is this PR to correct an issue?
  • [y] Is this PR an enhancement?

shlao and others added 30 commits March 2, 2022 13:56
* Feat: Add spec and parser for 'crictl_logs'

Signed-off-by: shlao <shlao@redhat.com>

* Updated the 'CrictlLogs' docstring

Signed-off-by: shlao <shlao@redhat.com>

* Used the Plurals

Signed-off-by: shlao <shlao@redhat.com>

* Updated the docstring

Signed-off-by: shlao <shlao@redhat.com>

* Use the singular form

Signed-off-by: shlao <shlao@redhat.com>

* Updated the docstring

Signed-off-by: shlao <shlao@redhat.com>
* fix: fix the regression bug of soscleaner IP obsfuscating

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* Update the ip obfuscating sort

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* Add tests

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* fix flake8

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
* feat: New parser for systemd_perms

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: fix ETC_SYSTEMD in test_systemd_perms.py

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: Fix file name and class name

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: add doc entry

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: update insights_archive

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: fix docs-test error

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: fix length of '-'

Signed-off-by: Xinting Li <xintli@redhat.com>

* fix: Updating test_etc_systemd() in test_ls_systemd.py

Signed-off-by: Xinting Li <xintli@redhat.com>

Co-authored-by: Xinting Li <xintli@redhat.com>
…3352)

* feat: New spec and parser to get capsules and repos which have "background" download policy

* Also reorder the classes in satellite_postgresql_query by alphabetically.

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Combiner the two querys about katello_root_repositories together.

* Also rename the spec to query the smart_proxies table

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Use "deprecated" method for the deprecated class

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Remove "-SatelliteKatelloEmptyURLRepositories" from title

* It is deprecated now

Signed-off-by: Huanhuan Li <huali@redhat.com>
* The code had an issue where it was looping over the dictionary updates
  that was being updated in another loop. So it was checking the
  beginning pkgs over and over each loop for updates which was
  un-necessary and caused the slowness. So I simplified the code down to
  the least amount of looping and variables.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
Signed-off-by: Rahul Srivastava <rasrivas@redhat.com>
* Currently core only collects /etc/sysctl.conf but none of the conf
  files in the /usr/lib/sysctl.d/ and /etc/sysctl.d/ directories.
  So I added a spec for them and added a spec for the /usr/lib/sysctl.d/
  directory as well.
* Updated the parsers to use a base class to avoid duplicate code.
* Added a new combiner for all of the conf files in the
  /usr/lib/sysctl.d/ and /etc/sysctl.d/ directories along with the
  /etc/sysctl.conf file.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
Signed-off-by: Huanhuan Li <huali@redhat.com>
* feat: New parsers for IBM proc files

- /proc/powerpc/lparcfg
- /proc/device-tree/openprom/ibm,fw-vernum_encoded

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* Fix tests

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* adjust order of specs

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
* New parser for /usr/bin/hexdump -C /dev/cpu_dma_latency command

Signed-off-by: Akshay Ghodake <aghodake@redhat.com>

* Updated command from hexdump to od

Signed-off-by: Akshay Ghodake <aghodake@redhat.com>
* fix: Keep the results once one of them is good

* When some spec is collecting several files, we should keep the results
  once one of them is good, and only set results to empty when all of them
  fails with errors

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Simplify the returned data of 'marshal' function

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Remove useless variable

Signed-off-by: Huanhuan Li <huali@redhat.com>
* Based on the comments in bz 1759080, I switched the crontab specs to
  read the crontab files instead of running the command since it does
  user id look ups every time the command is ran. But by switching to
  simple_file the spec would be skipped if the file doesn't exist.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
* Enhance parser Grub2Config

Signed-off-by: jiazhang <jiazhang@redhat.com>

* Enhance combiner grub_conf

Signed-off-by: jiazhang <jiazhang@redhat.com>
* Using the xml element tree getiterator function has been deprecated
  for a while it seems. To get rid of the warning I updated to using
  iter wrapped in a list which is all the getiterator function did.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
…#3373)

* fix: Enhance "PCSStatus" to make it compatible with new output format

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Simplify the data structure for PCSStatus

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Remove the prefix "*" when do comparation

Signed-off-by: Huanhuan Li <huali@redhat.com>
* Any file under /proc/net/bonding/ is a valid bond interface. The bond
  interface names don't have to start with bond, so change the glob to
  collect all file under the directory.
* Updated the bond_dynamic_lb spec to glob all interface names,
  since the /sys/class/net/*/bonding/ directory only exist for bond
  devices.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
* Enhance combiner grub_conf_blscfg

Signed-off-by: jiazhang <jiazhang@redhat.com>

* Update grub_conf

Signed-off-by: jiazhang <jiazhang@redhat.com>

* Fix flake8 issue

Signed-off-by: jiazhang <jiazhang@redhat.com>

* Update test

Signed-off-by: jiazhang <jiazhang@redhat.com>

* Remove RHEL check

Signed-off-by: jiazhang <jiazhang@redhat.com>
* SOS Report no longer collects ps auxwww and ps auxwwwm, it removed the
  auxwww since auxwwwm basically shows the same thing with threads added.
* Adding this spec caused a ValueError in the ps combiner due to the
  thread entries containing a - in the pid column. So I updated the
  combiner to check to make sure the pid column entry is a digit before
  continuing.
* Fix some grammar and flake8 issue.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
Account for cases where IP addresses can appear at the end of a line in
the space-preserving netstat obfuscation function _sub_ip_netstat.

Resolves: RHBZ#2029997
Signed-off-by: Link Dupont <link@sub-pop.net>
Due to application workload or misbehavior the semaphores count might increase
significantly. Additionally, the current rules expects semids from selected
owners namely 'root', 'apache', and 'oracle'.

Closes-Bug: #2071058

Signed-off-by: Sachin Patil <psachin@redhat.com>
Signed-off-by: Xinting Li <xintli@redhat.com>

Co-authored-by: Xinting Li <xintli@redhat.com>
This patch replaces all the string occurrences of “cloud.redhat.com”
with “console.redhat.com”.

Resolved: rhbz#2006386

Signed-off-by: Gael Chamoulaud (Strider) <gchamoul@redhat.com>
* Exclude some Specs from obfuscation

Add a list of Specs that are excluded from the IP address obfuscation. This is to prevent invalid obfuscation of four-part package versions that are indistinguishable from IPv4 addresses. The excluded Specs are known to never contain IP addesses in their output.

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Fix cyclic imports

Test _blacklisted_files

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Clean up the blacklisted_files test

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Test IP obfuscation blacklist

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Test mocked clean_report

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Fix test for Python 2.6

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Rename obfuscation blacklist to exempted obfuscations

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Add function comment

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Add function comment

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Rename exempt to excluded

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

Fix formatting

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

* Move excluded Specs to SOSCleaner

The module architecture causes problems when the constants module is imported from the SOSCleaner. Also, the SOSCleaner is intended to be a standalone module.

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

* Add more Specs to the exclude list

Provided by Michael Mraka <mmraka@redhat.com>.

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

* Make excluded Specs a list

For semantics.

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

* Fix excluded Specs test

It makes no sense to replicate the Specs list in the test. Kept the test, only made it simpler. Now it checks that the list is not empty and that its items are valid Spec names.

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>

* fixup! Fix excluded Specs test

Signed-off-by: Štěpán Tomsa <stomsa@redhat.com>
In old version of the client, the machine-id could be a non-hyphen
UUID. Upload still works but insights-client-results.service fails with
a `Error: failed to download advisor report.` error message.

This patch fixes this issue by actually returning a UUID4 object with
hyphens. Moreover, it also adds a unit test to check non-hyphen UUID
from old client.

Resolves: rhbz#1998560

Signed-off-by: Gael Chamoulaud (Strider) <gchamoul@redhat.com>
* fix: Update the spec "du_dirs" to filterable

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Move the "du_dirs_list" datasource to the datasource directory

Signed-off-by: Huanhuan Li <huali@redhat.com>

* Rename to "dir_list"

* Also raise SkipComponent if there are no filters

Signed-off-by: Huanhuan Li <huali@redhat.com>
* Some unused files have syntax error, as a result, it won't be hit
  even there is a hit for the used files.

Signed-off-by: Huanhuan Li <huali@redhat.com>
* Parser was using the wrong spec and should be using yum_updates

Signed-off-by: Bob Fahr <20520336+bfahr@users.noreply.github.com>
xiangce and others added 5 commits January 12, 2023 11:19
* fix: Do not log Parsers' Traceback during collection

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* fix doc error by renaming

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* resolve rebase conflict
* revert blank line

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
…arser (#3645)

* feat: New datasource "sys_fs_cgroup_uniq_memory_swappiness" and its parser

Signed-off-by: Qin Ping <piqin@redhat.com>

* Updated the datasource and its parser per the review comments.

Signed-off-by: Qin Ping <piqin@redhat.com>

* Fix the typo in docstring.

Signed-off-by: Qin Ping <piqin@redhat.com>
* feat: new RHEL component to identify "Red Hat Enterprise Linux" host

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* add it to pro-loaded

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* add more parsers and refine logic
* seems it's not need to be preloaded

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* tiny update
* move it to combiners

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* Update as per feedback
* update comments

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* fix py26
* to combiner
* Remove SubscriptionManagerID from the candidates

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* update comment
* Refine the logic

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* Update logic per discussion and rename to OSRelease
* one more case for coverage

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* fix doc error
* removed docstring line

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* update per feedback and resolve rebase conflicts

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* make points a local variable instead of a member
* installed_packages as well

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* expanded the tuples in rpms list
* more comments for pkgs

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
… firstly (#3634)" (#3652)

This reverts commit ed29640.

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
@PaulWay PaulWay requested a review from xiangce January 13, 2023 00:55
@xiangce
Copy link
Contributor

xiangce commented Jan 13, 2023

@PaulWay - The same as #3604, this PR changes a very basic level method, it also needs to be tested and compared to check the performance improvement. However, in another PR #3649, @csams added and exposed a new dr.run_comments method which allows users to order components and cache the results, once the corresponding change is also done in the insights-core-messaging project, I think it would significantly reduce the calling of the toposort_flatten. Could you please wait for a while? Sorry for the late reply.

xiangce and others added 20 commits January 17, 2023 13:42
)

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
* fix: check the content first in class InstalledRpms

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: remove the get_active_lines()

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: update the importing to full path and raise the exception when content is empty

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: remove empty line as some testing files are with empty line

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: skip empty line in the loop

Signed-off-by: Chenlizhong <lichen@redhat.com>
…y into combiner rhel_for_edge (#3657)

* Add rpm_ostree check into combiner rhel_for_edge

Signed-off-by: jiazhang <jiazhang@redhat.com>

* Fix typo

Signed-off-by: jiazhang <jiazhang@redhat.com>
* Moved the remaining exceptions that were defined in various other
  modules to the new separate exceptions module. That way all exceptions
  are defined in their own module.
* Updated all imports to reference the new path.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
* feat: add JbossRuntimeVersions parser

Signed-off-by: Chen lizhong <lichen@redhat.com>

* fix: implement iterator and get for JbossRuntimeVersions class

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: fix the error in python2

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: use Context class directly instead of tests/context_wrap

Signed-off-by: Chenlizhong <lichen@redhat.com>

* fix: use nametuple and private function to parse the line of version

Signed-off-by: Chenlizhong <lichen@redhat.com>

* resort the importings

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
* fix: add '-d 0' to yum_repolist spec

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* change to '-d 2'

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
…s) (#3665)

Signed-off-by: Mark Huth <mhuth@redhat.com>

Signed-off-by: Mark Huth <mhuth@redhat.com>
* Print log and traceback of unexpected error

Signed-off-by: Mark Huth <mhuth@redhat.com>
Signed-off-by: XiaoXue Wang <xiaoxwan@redhat.com>
* Added the --show-skips cli arg to trigger SkipComponents to be
  captured in the broker for troubleshooting purposes.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
* Mark SkipException deprecated to help avoid confusion as the dirty
  parser difference isn't really needed.
* Replaced all internal references with SkipComponent.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
* feat: add CloudInstance and OSRelease to canonical_facts

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>

* insert value only when match, and add tests
* enable AWSInstanceIdDoc only
* fix: provider_type should be cloud_instance.provider
* make the new items must exist: None by default
* fix the tests
* remove is_rhel

Signed-off-by: Xiangce Liu <xiangceliu@redhat.com>
The default environment for "simple_command" includes LC_ALL=C, which
means ASCII (and not unicode); in case the language of the system is set
to a non-English locale that uses unicode characters (e.g. non-Latin1
languages), then subscription-manager will fail to output that with
encoding issues (e.g. [1]).

As a simple solution, enforce an unicode English locale for
subscription-manager, which should avoid those encoding issues.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2074745

Signed-off-by: Pino Toscano <ptoscano@redhat.com>
…ules (#3670)

* Enhance datasource kernel_module_filters to check the loaded modules
* Update the datasource test accordingly
* Add lsmod into insights.collect.default_manifest module, raise a SkipComponent when no modules are loaded.
* Update the preload parser name

Signed-off-by: jiazhang <jiazhang@redhat.com>
* Trying to make the spec name more specific, I'm working adding tags to
  the sos project it was requested for a more specific spec/tag name.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
* Capture the name of blacklisted specs, and dump them to a file in the
  archive.
* Added parser and spec to parse the blacklisted specs and use in rules.

Signed-off-by: Ryan Blakley <rblakley@redhat.com>
@candlepin-bot
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.