Statistical independence of pseudo-random numbers #1139
The comment in wlandau/crew#113 (comment) describes a solid approach to guarantee statistical independence. However, it's tough for reproducibility in `targets`. …

(FYI @shikokuchuo)

Each target already gets a "seed", which is just an integer from `digest::digest2int()`. These …

Better yet, what if I can map a target's name string directly to a unique deterministic number of calls to `parallel::nextRNGStream()`? …

This is really tricky because L'Ecuyer RNG states are recursive, so a change in the seed of one target could invalidate a bunch of seemingly unrelated targets (causing them to rerun). So unless I can pin down independent RNG states deterministically, there is no way to do this in a way that preserves reproducibility. …

Maybe L'Ecuyer's original paper has advice about how to manually generate seeds: https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159. Then maybe I could use …

Just from a glance, I wouldn't trust myself to implement any part of https://pubsonline.informs.org/doi/pdf/10.1287/opre.47.1.159, or to manually advance streams an arbitrary number of steps.
I think … What this will look like:

Plan for the main implementation: …

Other tasks: …
Taking this idea - does it work if you simply store a hash table of the targets (names as keys) and assign each of them to an integer value? The order doesn't matter. If new targets are added, they are simply appended to the end of the hash table, adding to the integer sequence. Then when it is actually run, you initialise the seed …

If each successive target added gets the next integer in the sequence, then different runs of the pipeline could give different integers to the same targets. The alphabetical order of target names, as well as the topological order in the dependency graph, are race conditions. Without a way to lock each target name to a unique deterministic stream, independently of all the other targets, it seems like things would fall back to #1139 (comment).

I'm suggesting that the hash map is stored as metadata. So assuming the target names are unique, then it is a case of looking up the value, and if the key doesn't exist, creating one. That way each target name will always be associated with the same stream (as the table is immutable, but appendable).

I see. I do already plan to store the state of the stream for each target in the metadata, but I would rather not make guarantees about being able to recover it. Users can clear some or all of the metadata at any time.

Right. I think that's equivalent, but just storing the integer value may be not only more efficient but also more reproducible: if additional targets are added, they can easily be appended and the correct streams generated. Whereas if only the streams are stored, you would also need to separately mark the last-used stream in order to generate the next one, etc. But just a suggestion based on your earlier observation - I don't profess to know all the intricacies involved.
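For illustration, the append-only table idea above might be sketched like this (hypothetical helper names; this is not `targets` code, just a minimal mock-up using base R and the `parallel` package):

```r
# Illustration of the proposal (hypothetical helpers, not targets code):
# keep an append-only map from target name to stream index, then advance
# the L'Ecuyer stream that many times to recover the target's seed.
RNGkind("L'Ecuyer-CMRG")
set.seed(1L)
first_seed <- .Random.seed

stream_index <- new.env() # append-only name -> integer map

index_for <- function(name) {
  if (is.null(stream_index[[name]])) {
    stream_index[[name]] <- length(ls(stream_index)) # next integer
  }
  stream_index[[name]]
}

seed_for <- function(name) {
  seed <- first_seed
  for (i in seq_len(index_for(name))) {
    seed <- parallel::nextRNGStream(seed)
  }
  seed
}

# The same name always maps back to the same stream, regardless of
# the order in which other names are added later:
identical(seed_for("task_b"), seed_for("task_b")) # TRUE
```

As the thread notes, this only stays reproducible while the name-to-integer table survives, which is exactly the metadata concern raised below.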
FYI, sections 6 and 7 of the parallel package vignette (on RNG and load-balancing) are a very clear and concise summary: `vignette("parallel", package = "parallel")`
Thanks for the reference, I find it much clearer than the L'Ecuyer papers I skimmed yesterday. From Section 6 ("Random-number generation"):
From the above, it seems like …

(3) in particular cuts RNG sequences short, which makes overlap less likely. On top of all this, yesterday I had a really tough time trying to use … In the short term, I think it would be much better if I write a whole chapter in https://books.ropensci.org/targets about pseudo-random number generation. The default behavior is worth explaining in depth. In addition, I can offer a workaround: set … And I wonder if there is some way I can make it easy for …

For (2), I am wondering how best to simulate forward from the seed. (Maybe …)
Agree, shouldn't be a 'crisis' situation for targets. The streams implementation in R was based on L'Ecuyer 2002 - so the 'state of the art' is still 20 years old! Highlighting this topic in the book is a sensible approach - and it would help to disseminate the ideas - it is completely by luck that I found the (very helpful) vignette! If you do end up developing tests, I just refer you to the end of the examples:

```r
sum(duplicated(runif(1e6))) # around 110 for default generator
```

as a good starting point.
I wrote https://books.ropensci.org/targets/random.html, which was a long time coming anyway. Unless there is a non-sequential, non-recursive way to generate widely-spaced seeds for base R, this is all I can do about the issue of statistical independence in `targets`.
DavisVaughan/furrr#265 (comment) spells out the biggest reason why recursively-defined RNG streams are not feasible in `targets`.
Your diagrams seem to make it clearer - when you add 'model', as long as you know (keep track of) what the last stream is, you can assign it the next stream - and keep going as new targets are added. I'm missing the reason for why the streams have to follow any particular order. The guarantee is just that they don't overlap. Just thinking through a bit more, it does get messy potentially, but you could assign each target a monotonically increasing integer sequence, which would remain immutable even if targets get deleted. You'd probably want to actually store the seed for each target. But this would mean, at least theoretically, that you could reproduce with just the initial seed (and enough …)
As you mentioned here and in #1139 (comment), this would rely on the metadata for reproducibility, and that metadata could be deleted. In fact, it is common for users to start over from scratch with …
Thanks for the explanation @wlandau, it does all make sense. What I was missing was the bit above. If targets is meant to be stateless, then I would agree that RNG streams don't work. What I was thinking of was effectively preserving the state after the first and each subsequent run. So you could literally repeat the experiment (with the recorded state) and therefore match the previous result. But that obviously precludes starting from scratch, as per the above.
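The "repeat with the recorded state" idea above can be sketched in base R (illustration only; this is not how `targets` behaves):

```r
# Illustration of replaying a recorded L'Ecuyer stream state
# (hypothetical workflow, not a targets feature).
RNGkind("L'Ecuyer-CMRG")
set.seed(123L)
recorded_state <- .Random.seed # persist this, e.g. as metadata

draw <- runif(3L)

# Later, restore the recorded state to reproduce the same draws:
.Random.seed <- recorded_state
identical(runif(3L), draw) # TRUE
```

This matches a previous result exactly, but only as long as the recorded state (metadata) still exists, which is the sticking point discussed above.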
@HenrikBengtsson, I think we can continue part of our discussion from DavisVaughan/furrr#265 here. We were talking about why recursively defined L'Ecuyer streams are not feasible for `targets`.

Regarding DavisVaughan/furrr#265 (comment): if only it were that simple. A target is very different from an ordinary task because we need to know whether it is up to date. If a target's RNG stream changes from one run to the next, it is not up to date, which means the target needs to rerun. DavisVaughan/furrr#265 (comment) attempts to explain how that can happen in a non-sequential DAG. Maybe your example from DavisVaughan/furrr#265 (comment) would be easier because it uses independent tasks. Say we start with 3 tasks. Here are the seeds:

```r
RNGkind("L'Ecuyer-CMRG")
tasks <- c("task_a", "task_b", "task_c")
first_seed <- .Random.seed
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
  seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 3
#> $ task_a: int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#> $ task_b: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#> $ task_c: int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439
```

Now suppose we insert a new task between `task_a` and `task_b`:

```r
tasks <- c("task_a", "task_new", "task_b", "task_c")
seeds <- list()
seeds[[tasks[1L]]] <- first_seed
for (index in seq(2L, length(tasks))) {
  seeds[[tasks[index]]] <- parallel::nextRNGStream(seeds[[index - 1L]])
}
str(seeds)
#> List of 4
#> $ task_a  : int [1:7] 10407 -511557822 1908736475 498441056 -1632547871 1622487086 -726144297
#> $ task_new: int [1:7] 10407 149605160 279562417 922798074 -295944740 -2095294832 757822457
#> $ task_b  : int [1:7] 10407 972714371 1811980947 954238892 2133507052 965840020 1200680439
#> $ task_c  : int [1:7] 10407 -1556413327 -444875998 37935421 -1864914985 -341121962 -1229046830
```

In ordinary parallel computing, this is not a problem. You simply run the computation and note the seed that each task ran with. But in `targets`, the streams of `task_b` and `task_c` have shifted, so those targets are no longer up to date.
In the pipeline's dependency graph, these tasks are all independent:

```mermaid
graph LR
  task_a
  task_new
  task_b
  task_c
```

But if we were to use recursive L'Ecuyer streams, that would induce dependency relationships we don't want:

```mermaid
graph LR
  task_a --> task_new
  task_new --> task_b
  task_b --> task_c
```
In terms of how this plays out in `targets`, consider this pipeline:

```r
# _targets.R file:
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)
```

We inspect the graph and correctly see that the tasks are independent (there are no edges).

```r
# R console:
library(targets)
tar_mermaid()
```

```mermaid
graph LR
  task_a
  task_b
  task_c
```

When we run the pipeline, the tasks run in topological order (with optional parallelism using integration with `crew`).

```r
# R console
tar_make()
#> ▶ start target task_a
#> ● built target task_a [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]
```

```r
# R console
tar_read(task_c)
#> [1] 0.3329109
tar_meta(task_c, seed)
#> # A tibble: 1 × 2
#>   name         seed
#>   <chr>       <int>
#> 1 task_c -439103021
```

Next time we run the pipeline, all our tasks are up to date. That's good because nothing changed since the last run.

```r
# R console:
tar_make()
#> ✔ skip target task_a
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ✔ skip pipeline [0.593 seconds]
```

To insert a new task, we modify `_targets.R`:

```r
# _targets.R file
library(targets)
list(
  tar_target(task_a, runif(1)),
  tar_target(task_a_new, runif(1)),
  tar_target(task_b, runif(1)),
  tar_target(task_c, runif(1))
)
```

The new task `task_a_new` appears in the graph, still independent of the others:

```r
# R console
tar_mermaid()
```

```mermaid
graph LR
  task_a
  task_a_new
  task_b
  task_c
```

When we run the pipeline, only `task_a_new` runs:

```r
# R console:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ✔ skip target task_b
#> ✔ skip target task_c
#> ▶ end pipeline [0.458 seconds]
```

However: if `targets` had used recursive L'Ecuyer streams, the downstream targets would have received new seeds and rerun:

```r
# What *would* have happened with recursive L'Ecuyer streams for each target:
tar_make()
#> ✔ skip target task_a
#> ▶ start target task_a_new
#> ● built target task_a_new [0 seconds]
#> ▶ start target task_b
#> ● built target task_b [0 seconds]
#> ▶ start target task_c
#> ● built target task_c [0 seconds]
#> ▶ end pipeline [0.476 seconds]
```
Hmmm.... this might also slow down `targets`:

```r
library(digest)
library(microbenchmark)
name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5"),
  sha256 = digest(name, algo = "sha256"),
  sha512 = digest(name, algo = "sha512"),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq      mean  median      uq      max neval
#>     int   507   730.5   1105.23   811.5  1069.0    17599   100
#>     md5 17237 18389.0 133541.58 21435.0 24110.5 10555007   100
#>  sha256 19013 20269.5  24911.84 22746.0 24862.5   152663   100
#>  sha512 15633 17178.0  27703.10 19355.5 21550.5   406789   100
#>      mt 11186 13200.5  30920.48 13951.5 16907.0   475125   100
```

Created on 2023-10-12 with reprex v2.0.2

I know the …
We don't need serialization, and that makes things 2x faster.

```r
library(digest)
library(microbenchmark)
name <- "a"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq    max neval
#>     int   493   635.5   894.52   693.0   817.0  19359   100
#>     md5 10345 10700.0 13325.78 10904.5 11450.5 234411   100
#>  sha256 12306 12677.0 13282.54 13033.5 13580.5  29117   100
#>  sha512  9043  9541.5 10088.13  9954.0 10444.5  13210   100
#>      mt 11100 12095.5 12707.84 12509.0 12898.5  29362   100

name <- "longlonglonglonglonglonglonglonglonglonglonglonglonglonglonglong"
microbenchmark(
  int = digest2int(name),
  md5 = digest(name, algo = "md5", serialize = FALSE),
  sha256 = digest(name, algo = "sha256", serialize = FALSE),
  sha512 = digest(name, algo = "sha512", serialize = FALSE),
  mt = {
    set.seed(utf8ToInt(name))
    sample.int(n = .Machine$integer.max, size = 1L)
  }
)
#> Unit: nanoseconds
#>    expr   min      lq     mean  median      uq   max neval
#>     int   558   681.0   777.75   747.5   807.5  3081   100
#>     md5 10466 10792.5 11365.36 11285.0 11840.0 13044   100
#>  sha256 12660 13078.5 13749.33 13440.0 14168.5 20253   100
#>  sha512  9021  9592.5 10440.70  9834.5 10391.0 48685   100
#>      mt 11662 12613.5 13318.57 13152.0 13564.0 33029   100
```

Created on 2023-10-12 with reprex v2.0.2
From profiling #1168, the performance hit should be less than 1% in most pipelines, and it is not likely to increase above 3.5% even with …
On a Mac it's a bit more overhead: 2% with `tar_cue(file = TRUE)`, 5.8% with `tar_cue(file = FALSE)`. But still not terrible, and certainly not a bottleneck. The pipeline I profiled was the one below (lots of quick small targets), and I ran …

```r
library(targets)
library(tarchetypes)
tar_option_set(cue = tar_cue(file = FALSE))
list(
  tar_target(x, 1:5000),
  tar_target(ordinary_name, x, pattern = map(x))
)
```
How so? Ubuntu focal has NNG 1.5.2, which supports …

You do realise that …

But it's very good you've implemented a 'correct' solution in any case. Also, instead of hashing twice, have you considered just letting the value wrap around? The hash produces a hex value, which can be translated into an integer. If this overflows an R integer, then you treat it as if it were a C unsigned int, which cannot overflow but simply wraps around (defined in the standard), so int max + 1 wraps to 0.
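For illustration, the wrap-around conversion described above could look like this in base R (my sketch with a hypothetical helper name, not an implementation from targets or secretbase):

```r
# Illustration (sketch only): interpret the low 32 bits of a hex digest
# as a wrapping C unsigned int, then shift into R's signed integer range.
hex_to_int32 <- function(hex) {
  low <- substring(hex, nchar(hex) - 7L)       # last 8 hex digits = 32 bits
  digits <- strtoi(strsplit(low, "")[[1L]], base = 16L)
  value <- sum(digits * 16 ^ (7:0))            # unsigned value in [0, 2^32)
  if (value > 2^31 - 1) value <- value - 2^32  # wrap into signed range
  as.integer(value)
}
hex_to_int32("ffffffff") # -1
hex_to_int32("7fffffff") # 2147483647
```

Taking only the low bits is exactly the modular-arithmetic concern raised in the reply below, so this is a sketch of the suggestion rather than an endorsement of it.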
A few months ago, @MilesMcBain reported having trouble with Ubuntu 18.04: #1069. Even with the impetus to upgrade, I wonder how many users still run these older OSes.
Is SHA1 strong enough for this?
I thought about it, but it feels like modular arithmetic, which essentially cuts off the most significant bits of the hash value. I don't know what cryptographic properties we lose if we just take part of the hash, so I thought it would be more rigorous to supply the entire hash to `digest::digest2int()`.
Hmm... turns out …
I clarified on the original thread, and it seems like a non-issue.
It's slightly weakened for certain kinds of attacks - but an irrelevance here? Of course, SHA-256/512 are also available if you'd rather not have to consider this.
Yes, I think you're right it's not a solution. But I'm also not sure what effect the re-hashing would have. Have you considered an XOF, something like SHAKE256 or KangarooTwelve?
That's interesting. I don't know of a convenient, trusted implementation of an XOF in R. I trust …
I put this together in a spare afternoon: https://shikokuchuo.net/secretbase/ This takes a quarter of the runtime of the current double-hashing implementation. Moreover, SHAKE256 is more correct and fits precisely the use case here.
Where can I buy those? :p
@wlandau {secretbase} is now established on CRAN. This provides the solution that targets is seeking: reproducible random seeds that are safe for parallel processes. Namely, ones that can provide truly 'random' random seeds to R's RNG that satisfy the requirement per Chapter 4 of doi:10.1016/j.matcom.2016.05.00 - the section "A single RNG with a 'random' seed for each stream."

Appreciate this is some time on from the discussion above, but at that time I did not have an answer to your ask (above) for a trusted and convenient XOF implementation in R. The SHA-3 algorithms only landed in Mbed TLS at a later date. As I've wrapped their (validated) algorithm with an interface that allows hashing directly to an integer, I think this satisfies both trustworthy and convenient.

Performance gains I've mentioned before - this should reverse the slowdowns that you observed in your previous benchmarking. Also, it's a pretty conservative 9kb package, with no dependencies at the R or C level, and it compiles under C99.

I'd be happy to provide a PR to get this started if you'd like. It would be up to you, of course, whether you want to invalidate existing seeds or not, but I think it makes sense for new analyses to benefit from these at least.
Wow, this is fantastic! Thanks so much for …

To clarify: if I set …

The speed gain is incredible:

```r
microbenchmark::microbenchmark(
  secretbase = secretbase::sha3("target_name", bits = 32L, convert = NA),
  double_hash = {
    digest::digest2int(
      x = digest::digest(
        object = "target_name",
        algo = "sha512",
        serialize = FALSE,
        file = FALSE,
        seed = 0L
      )
    )
  }
)
#> Unit: nanoseconds
#>         expr  min   lq    mean median   uq   max neval cld
#>   secretbase  656  697  815.08    738  779  6396   100  a
#>  double_hash 4879 5043 5368.13   5125 5289 22468   100   b
```

For hashes of output data, …
Implemented in …
I wonder if this would be relevant for …
I've followed this discussion at a distance, and I should say I'm not up to date with all the comments herein, so my comment/question might already have been discussed.

I see two types of RNG reproducibility when working with DAGs: there's the reproducibility of (i) the "final", fixed DAG, and (ii) the DAG in flux. If you consider the final DAG that exists when a pipeline has stabilized, then you can always identify a unique, linear path for walking the DAG, which means you could use parallel RNGs such as L'Ecuyer. For the in-flux DAG, you need an ad hoc solution, for instance the one you've already proposed here, i.e. generate a good-enough RNG seed based on a fixed property of the node.

My proposal would be to provide the option to work with both. The first version is important to ensure statistically sound RNGs. With the second approach, there will always be the concern that the RNGs are not statistically sound. Not sure what the default should be, but I can imagine that a user develops their pipeline using "in-flux-DAG" RNGs but, at the end, wipes the cache and regenerates everything with "fixed-DAG" RNGs.
Unfortunately, a "fixed" DAG is not a feasible concept for `targets`. …

Switching to a "fixed" DAG would generate a different set of seeds and thus a different set of results, which would invalidate the current conclusion at the very moment it is being finalized. And in practice, I think users would switch between fixed and flux a lot during failed or premature attempts to finalize a project, which would cause a lot of painful rework.

From https://www.sciencedirect.com/science/article/pii/S0378475416300829?via%3Dihub:

…
It will be the SHAKE256 XOF which is used. The SHA-3 standard doi:10.6028/NIST.FIPS.202 defines the four SHA-3 hashes (224, 256, 384 and 512) and two XOFs - SHAKE128 and SHAKE256. They are all standardised instances of the same underlying Keccak algorithm. SHAKE256 provides up to 256 bits of security (although limited by our output length).

Yes, SHA-3 is inherently slower even vs SHA-2, and the intention is not to replace the usage there. Having said that, I think with SHA1 I can get to a timing that's within 1.5x that of xxhash64. The place it will come in useful is in big-data / memory-constrained situations, as my solution in …
So it's not as if a SHA-3 hash is computed and then the SHAKE256 XOF is applied post hoc? The whole thing from beginning to end is just the SHAKE256 XOF, and it shares the underlying algorithm of SHA-3?

Halving the memory footprint sounds really compelling! I will have to think about it more: a 50% reduction in memory usage at the price of a 50% increase in runtime…
Yes, that's correct. SHAKE256 is one of the SHA-3 options, it's not applied on top.
No free lunch in this instance! :)
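To make the hash-to-seed usage discussed in this exchange concrete, a minimal sketch (assuming the secretbase package is installed; `"task_a"` is a made-up target name):

```r
# Illustration (assumes the secretbase package): hash a target name
# directly to a 32-bit integer seed. Per the discussion above, a
# nonstandard 32-bit output from secretbase::sha3() uses the SHAKE256
# XOF rather than one of the fixed-length SHA-3 hashes.
secretbase::sha3("task_a", bits = 32L, convert = NA)
```

The same name always yields the same integer, with no dependence on any other target, which is the property the thread is after.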
`targets` has two major challenges with pseudo-random number generation: (1) reproducibility and (2) statistical independence.

For years, `targets` has controlled (1) in a simplistic way: set a deterministic seed for each target, where each target seed depends on the target name and an overarching global seed:

targets/R/utils_digest.R, lines 35 to 42 in 116f1ce

But this approach may violate (2). Different targets have different pseudo-random number generator states, and there is nothing to prevent the PRNG sequence of one target from overlapping with that of a different target.

This problem is not unique to `targets`; it is a general issue with parallel computing. C.f. wlandau/crew#113.