Add Job.cached_statepoint for fast read access to state point (#975)
* Make additional use of the statepoint cache.

  `_StatePointDict` takes significant time to initialize, even when the statepoint dict is known. Adjust `Job` initialization to make more use of the statepoint cache and initialize `_StatePointDict` only when `Job.statepoint` is accessed. Provide a faster path for cached *read* access to the statepoint dict via the new property `Job.statepoint_dict`.

  One side effect of this change is that some warnings are now deferred to `statepoint` access that were previously issued during `Job.__init__` (see changes in tests/).

  There are additional opportunities to use the cached statepoint dict in `Project.groupby` that this commit does not address.

* Make JobsCursor.__len__ and .__contains__ O(1).

  Cache the ids matching the job filter. This enables O(1) cost for `__len__` and `__contains__`, as users would expect. In some use-cases, signac-flow repeatedly calls `__contains__` on a `JobsCursor`.

  The side effect of this change is that modifications to the workspace will not be reflected in existing `JobsCursor` instances. This behavior was not previously documented in the user API.

* Add validate_statepoint argument to Job.init()

  `with job`, `Job.document`, and `Job.stores` call `init()` because they require that the job directory exists. Prior to this change, `init()` also forced a load of the `_StatePointDict`. These methods now call `init(validate_statepoint=False)`, which exits early when the job directory exists.

  This change provides a reasonable performance boost (5x on NVMe, more on network storage). There may be more room for improvement, as there are currently 2N stat calls in this use-case:

  ```python
  for job in project:
      with job:
          pass
  ```

* Rename statepoint_dict to statepoint_mapping.

  deepcopy is unexpectedly expensive. Refactor the earlier commit to deepcopy only user-provided statepoint dicts. Statepoints from the cache are passed to the user read-only via `MappingProxyType`.

* Read the cache from disk in `open_job`.

  `open_job` uses the statepoint cache to improve performance. Read the cache from disk in `open_job` (if it has not already been read). This provides consistently high performance in cases where `open_job` is called before any other method that may have triggered `_get_statepoint`.

* Restore cache miss logger level to debug.

  Users may find the messages too verbose. At the same time, users might never realize that they should run `signac update-cache` without this message...

* Instantiate Job by id directly when iterating over ids.

  `open_job` is a user-facing function and performs error checking on the id. This check involves a stat call to verify that the job directory exists. When `Project` is looping over ids from `_get_job_ids`, the directory is known to exist (subject to race conditions). `stat` calls are expensive, especially on networked filesystems. Instantiating `Job` directly bypasses this check in `open_job`.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Add statepoint_mapping test.

* Pass information about the Job directory's existence from Project to Job.

  This allows `Job` to avoid some `stat` calls, which greatly improves performance on networked file systems.

* Populate _statepoint_mapping in additional code paths.

  These missed opportunities to pre-populate `_statepoint_mapping` triggered slow code paths.

* Increase test coverage.

* Update change log.

* Rename statepoint_mapping to cached_statepoint.

  Also attempt to fix sphinx-doc links.

* Doc fixes.

* Update code comments

* Use cached_statepoint in to_dataframe.

* Restore iteration order.

  Python set has a randomized iteration order. Preserve the original iteration order with a list and convert to a set only for `__contains__` checks.

* Validate cached_statepoint when read from disk.

  This has the added benefit of validating all statepoints that are added to the cache. I needed to add a validate argument to `update_cache` because one of the unit tests relies on adding invalid statepoints to the cache.

* Use cached_statepoint in groupby.

* Remove validate argument from update_cache.

  Locally catch the `JobsCorruptedError` and ignore it in the test that needs to.

* Write state point as two words in doc strings

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Corwin Kerr <cbkerr@umich.edu>
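
The read path described above (deepcopy only user-provided dicts, hand out cached data through `MappingProxyType`, and build the expensive wrapper lazily) can be sketched roughly as follows. This is an illustration only, not signac's implementation: `SketchJob`, its `from_cache` flag, and the plain `dict` standing in for `_StatePointDict` are hypothetical.

```python
import copy
from types import MappingProxyType


class SketchJob:
    """Hypothetical stand-in for a job; NOT signac's actual Job class."""

    def __init__(self, statepoint, from_cache=False):
        # Deep-copy only user-provided dicts; entries that come from the
        # cache are assumed to be private already, so the copy is skipped.
        self._statepoint_mapping = (
            statepoint if from_cache else copy.deepcopy(statepoint)
        )
        # Placeholder for the expensive mutable wrapper, built on demand.
        self._wrapper = None

    @property
    def cached_statepoint(self):
        # Cheap read-only view: no copy is made, and item assignment on
        # the returned proxy raises TypeError.
        return MappingProxyType(self._statepoint_mapping)

    @property
    def statepoint(self):
        # The costly wrapper (here just a dict, as an assumption) is
        # created only when this property is first accessed.
        if self._wrapper is None:
            self._wrapper = dict(self._statepoint_mapping)
        return self._wrapper
```

Reading `job.cached_statepoint["a"]` in a loop over many jobs never touches the lazy wrapper, which is the kind of saving the commit is after.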
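
The `JobsCursor` change (cache the matching ids, keep a list for iteration order, and use a set only for membership checks) might look roughly like the sketch below. `SketchCursor` and the `find_ids` callable are hypothetical names for illustration; the real cursor works with jobs and filters rather than bare id strings.

```python
class SketchCursor:
    """Hypothetical cursor over job ids; NOT signac's JobsCursor."""

    def __init__(self, find_ids):
        # find_ids is an assumed callable returning the ids that match
        # the filter; it is invoked at most once per cursor.
        self._find_ids = find_ids
        self._id_list = None  # preserves the original iteration order
        self._id_set = None   # used only for O(1) membership checks

    def _load(self):
        # Populate both caches on first use.
        if self._id_list is None:
            self._id_list = list(self._find_ids())
            self._id_set = set(self._id_list)

    def __len__(self):
        self._load()
        return len(self._id_list)

    def __contains__(self, job_id):
        self._load()
        return job_id in self._id_set

    def __iter__(self):
        self._load()
        return iter(self._id_list)
```

As the commit message notes, the trade-off is staleness: once the ids are cached, later changes to the workspace are not visible through an existing cursor.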