Portability issue in hashes #1244

wlandau · 2024-02-23T17:28:33Z

@shikokuchuo discovered an issue with how targets is using digest (either that or a bug in digest, depending on which behaviors were intended). C.f. shikokuchuo/secretbase#5 (comment).

targets uses serialization version 3 in order to benefit from ALTREP:

targets/R/utils_digest.R

Lines 33 to 42 in 0625aa2

    
           digest_obj64 <- function(object, ...) { 
        
             vdigest64( 
        
               object = list(object), 
        
               serialize = TRUE, 
        
               serializeVersion = 3L, 
        
               file = FALSE, 
        
               seed = 0L, 
        
               ... 
        
             ) 
        
           }

targets/R/utils_digest.R

Line 46 in 0625aa2

vdigest64 <- digest::getVDigest(algo = "xxhash64")

In digest version 0.6.34, the default argument skip = "auto" does not appear to skip any of the extra headers introduced by serialization version 3. In other words, these extra serialization headers are hashed together with the contents of the R object. One of these extra headers contains the locale, which is brittle and depends on the user's computing environment. @shikokuchuo confirmed this:

$ LANG="C" R -e 'targets:::digest_obj64(NULL)'
#> "02f794ac27515874"

$ R -e 'targets:::digest_obj64(NULL)'
#> "da7e5646cbdfced7"

The upshot is that hashes in targets are currently not portable: if you migrate a project from one machine to another, or if you change the locale of your R session, your pipeline may no longer be up to date.

This portability issue needs to be corrected. But unfortunately, a completely back-compatible patch is impossible because:

Different people may have different locales, so setting a constant locale would preserve some pipelines but disrupt others.
There may be other aspects of the extra serialization headers which are equally brittle and impossible to predict or control.

Switching to serialization version 2 would fix the problem. However, I prefer not to head in that direction because I want to take this opportunity to solve multiple problems at once. Those other problems are:

targets uses both secretbase and digest, which is cumbersome as far as dependencies are concerned because these packages overlap a lot. And targets needs secretbase for Statistical independence of pseudo-random numbers #1139.
secretbase appears to be faster than digest, at least in my own cursory testing, and it does not need the same cumbersome vectorization tricks to achieve low overhead (see benchmarks in xxHash64 implementation (clean) shikokuchuo/secretbase#5).

So my current plan is to switch to secretbase::xxh64() (c.f. shikokuchuo/secretbase#5) and take time to think about the 32-bit case.

The text was updated successfully, but these errors were encountered:

wlandau · 2024-02-23T17:38:42Z

A couple minor notes:

It may be time for targets version 2.0.0. According to semantic versioning best practice, a change in major version should be used for breaking changes. It is debatable whether changing hashes actually a breaking change since the code will still work, but there have been so many improvements since 1.0.0 that there is a case to bump the major version.
tar_make(), tar_make_future(), and tar_make_clustermq() should prompt the user with a dialogue explaining the change in hashes and giving them the opportunity to downgrade targets to keep the pipeline up to date.

wlandau · 2024-02-23T17:48:25Z

Also:

Today I confirmed empirically that the inputs supplied to vdigest64(), vdigest64_file(), and vdigest32() always have length 1. In other words, targets always supplies scalars to these functions. That means targets can safely discard the vectorization features in functions like digest_obj64().
As I state in xxHash64 implementation (clean) shikokuchuo/secretbase#5 (comment), I will take this opportunity to move targets away from 32-bit hashes.

noamross · 2024-03-08T16:44:42Z

I have a couple of questions related to this:

Would switching hash implementations help for targets objects, given targets hashes the files as stored on-disk, and the serialized files on-disk have the R version and locale information stored in their headers?
Do we know if the header issue applies to qs or any other serialization format, and do those need to be handled differently?
- arrow's parquet files written from R, for instance, contain in their metatdata R-serialized copies of the data frame attributes (https://github.com/apache/arrow/blob/9ca7d787402c715ee84c1bb21cfca0e54ae2f12d/r/R/metadata.R#L19-L33), so this includes R version and locale information and hashes are not portable across machines.

wlandau · 2024-03-08T20:08:51Z

I'm not sure if this would help with local files created by the targets in the pipeline. RDS is the only file format where I would expect headers to be in a place for secretbase to detect. Hash stability seems like an issue important to qs, although I have not tested the outcomes on different machines.

shikokuchuo · 2024-03-08T20:14:39Z

This helps for in-memory R objects only - these are always implicitly serialized before they can be hashed.

For files, these are hashed as-is - it's a binary blob. So if the same file is moved to a different machine, it will hash the same. The actual serialization method used to generate the files will determine if the same files are produced on different machines. I hope that makes sense.

noamross · 2024-03-08T20:28:29Z

This makes sense, but an area of ambiguity is RDS targets, not generated by format="file" but regular R objects saved in the local or remote store. If they are generated by two machines, identical as R objects but under different locales, can they be moved between machines and reused because they have the same hash? Can hashing the target as saved on-disk involve skipping the header if we are aware that it is RDS format?

shikokuchuo · 2024-03-08T20:33:00Z

As far as I'm aware that shouldn't be a problem. You should be able to move RDS files to a different machine with different R version and locale and get identical R objects when loaded (assuming both support R serialization version 3 i.e. R >= 3.5).

noamross · 2024-03-08T20:47:25Z

Ah, but while the R objects loaded are identical, I believe targets is hashing the RDS files after writing and before reading.

shikokuchuo · 2024-03-08T21:07:06Z

If the same object, identical on 2 different computers with different locales are both saved as RDS files and these files are hashed, then the hashes will be different.

The R language only guarantees that a round trip serialization and unserialization gives you identical objects regardless of where this is done. Unfortunately it does not guarantee hash stability.

shikokuchuo · 2024-03-08T21:34:37Z

I'm not sure if this would help with local files created by the targets in the pipeline. RDS is the only file format where I would expect headers to be in a place for secretbase to detect. Hash stability seems like an issue important to qs, although I have not tested the outcomes on different machines.

@wlandau so you're clear on this, the underlying functions called for files and objects are completely separate - files don't go through the R serialization mechanism, so {secretbase} wouldn't 'detect' anything if hashing a file.

It might be possible to stream unserialize hash an RDS file (if we know it's an RDS file), but I'm not certain and it'd be quite some effort!

shikokuchuo · 2024-03-11T09:29:43Z

@wlandau in case you were wondering, the PR shikokuchuo/secretbase#5 was ultimately not merged. After further research, I think it would be worth the while to implement another hash, given we'll be breaking anyway. My original motivation for trying XXH64 was to make it non-breaking if you remember.

wlandau · 2024-03-13T15:06:56Z

Thanks @shikokuchuo for chiming in here.

@noamross, I believe files moved from one machine to another should still have the same hashes. After #1244 is fixed, a pipeline moved to a different machine should stay up to date. If the contents of e.g. arrow files are dependent on locale, that only comes into play when a target reruns for other reasons. The hash of new file will be different, but that in itself will not have caused the target to rerun in the first place.

noamross · 2024-03-14T13:33:23Z

@wlandau Thanks. I see that moving files and and metadata together from one machine to another should maintain the state of the pipeline. I'm trying to understand (and if possible, expand) the conditions under which different systems can produce byte-identical R objects and be able to compare or share them (for purposes related to the extensions/experiments I discuss in #1232 (comment)). It seems it will probably require loading and re-hashing an object in memory, but I was curious if it is possible on objects serialized to disk.

shikokuchuo · 2024-03-14T18:57:20Z

@noamross as in my earlier response to @wlandau I think theoretically it would be possible to special case RDS files and pass these to our hash function attached to R unserialize (as opposed to serialize which is what we do for R in-memory objects), and skip the headers in the process. I don't know in practice if there would be any implementation obstacles. But this is a lot of effort and unless this is somehow more broadly useful, I don't think anyone will implement such functionality.

shikokuchuo · 2024-03-14T18:59:37Z

As in if this is of primary concern, then probably save the files as some other invariant format.

noamross · 2024-03-14T19:12:57Z

Indeed, I was just thinking I should test how this all ends up working with qs (which I think has built-in xxhash).

shikokuchuo · 2024-03-16T18:11:28Z

Reverting back to the issue, shikokuchuo/secretbase#6 is now ready. This implements SipHash (specifically SipHash-1-3 which is a highly-performant variant).

This is a higher quality hash than the non-cryptographic hashes such as 'xxhash', and has some security guarantees whilst still being around the same speed. Technically it is not a cryptographic hash, but a pseudorandom function. Whilst we do not need the security guarantees here (and we are using it with the fixed reference key to maximise performance), the collision resistance is guaranteed vs. something like 'xxhash' where trivial collisions have been found and the quality of the hash has been questioned.

It has also seen wide adoption e.g. as the default hash for newer Python, in Rust and various other languages. I'll invite you to try it out before merging. Feel free to post comments on the PR itself.

shikokuchuo · 2024-03-17T17:45:00Z

@wlandau SipHash is now merged into the main branch of secretbase as I'm happy with its quality, and after testing against the reference implementation.

wlandau · 2024-03-18T12:28:58Z

Awesome! I will benchmark it later this week.

wlandau · 2024-04-02T19:23:57Z

As we discussed: given its deliberate focus on small data, as well as shikokuchuo/secretbase#8, I am having second thoughts about SipHash for targets.

@shikokuchuo, for secretbase, do you have plans for alternative hashes designed to be fast for large data? Otherwise, to solve this immediate issue, I am considering keeping digest and downgrading to serialization version 2.

wlandau · 2024-04-02T19:29:21Z

Or, as long as targets needs two different hashing packages anyway, I might try xxhashlite as @njtierney suggested in #1212.

shikokuchuo · 2024-04-03T11:13:37Z

I'm still quite jetlagged - I forgot I was meant to be implementing SipHash anyway. From your benchmarking, the total time for file hashing is much lower than for in-memory hashing.

I still think SipHash is a good choice. It is a fast hash and I am not aware of another good quality hash that is faster.
Edit: see shikokuchuo/secretbase#8 (reply in thread)

wlandau · 2024-04-04T21:00:48Z

I forgot I implemented this, but targets calculates hashes on parallel workers if storage = "worker" in tar_option_set() or tar_target(). So that makes me willing to accept a slightly slower hash. And our conversation today helped convince me that performance shouldn't nonlinearly plummet at some unknown threshold beyond 1GB.

Before switching targets to SipHash, I think I just need to read more about the quality of xxhash vs SipHash vs cryptographic hashes. I don't take this decision lightly, and after the switch, I hope to never change the hash in targets ever again.

wlandau · 2024-04-05T13:51:06Z

From a closer look at https://eprint.iacr.org/2012/351.pdf, I see comments like:

However, essentially all standardized MACs and state-of-the-art MACs are
optimized for long messages, not for short messages. Measuring long-message
performance hides the overheads caused by large MAC keys, MAC initialization,
large MAC block sizes, and MAC finalization.

When the authors claim SipHash is "optimized for short inputs", I think they mean it reduces this overhead. It does not necessarily mean the hash is slow for large inputs, which was originally my main concern.

But also:

We comment that SipHash is not meant to be, and (obviously) is not, collision-resistant

Maybe the authors are comparing SipHash with cryptographic hash functions, maybe not. So I am not sure how much better SipHash-1-3 is than xxhash64 with respect to collision resistance. (Although I would not expect xxhash64 to be better).

shikokuchuo · 2024-04-05T16:26:27Z

Your guess is as good as mine as to that throw-away comment. But it is designed to guard against generated collisions which are trivial for the weaker hashes https://www.youtube.com/watch?v=Vdrab3sB7MU That's why it's been adopted as the hash used by Python, Rust etc. By definition if it is a true PRF as claimed (indistinguishable from a random uniform generator), then it should not be any worse at collisions than other hashes.

… the hashing functions from digest::digest() to secretbase::siphash13().

wlandau added the type: bug label Feb 23, 2024

wlandau self-assigned this Feb 23, 2024

wlandau changed the title ~~Lack of portability in hashes~~ Portability issue in hashes Feb 23, 2024

wlandau mentioned this issue Feb 23, 2024

xxHash64 implementation (clean) shikokuchuo/secretbase#5

Merged

wlandau mentioned this issue Apr 5, 2024

Switch to secretbase::siphash13() #1262

Merged

2 tasks

wlandau-lilly pushed a commit to ropensci/tarchetypes that referenced this issue Apr 5, 2024

To align with ropensci/targets#1244 and ropensci/targets#1262, switch…

9de10e6

… the hashing functions from digest::digest() to secretbase::siphash13().

wlandau-lilly added a commit to ropensci/jagstargets that referenced this issue Apr 5, 2024

To align with ropensci/targets#1244 and ropensci/targets#1262, switch…

07ac340

… the hashing functions from digest::digest() to secretbase::siphash13().

wlandau-lilly added a commit to ropensci/stantargets that referenced this issue Apr 5, 2024

To align with ropensci/targets#1244 and ropensci/targets#1262, switch…

bbdda1b

… the hashing functions from digest::digest() to secretbase::siphash13().

wlandau closed this as completed in 98490c1 Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Portability issue in hashes #1244

Portability issue in hashes #1244

wlandau commented Feb 23, 2024 •

edited

Loading

wlandau commented Feb 23, 2024

wlandau commented Feb 23, 2024 •

edited

Loading

noamross commented Mar 8, 2024 •

edited

Loading

wlandau commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024

noamross commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024

noamross commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024 •

edited

Loading

shikokuchuo commented Mar 11, 2024 •

edited

Loading

wlandau commented Mar 13, 2024

noamross commented Mar 14, 2024 •

edited

Loading

shikokuchuo commented Mar 14, 2024

shikokuchuo commented Mar 14, 2024

noamross commented Mar 14, 2024

shikokuchuo commented Mar 16, 2024 •

edited

Loading

shikokuchuo commented Mar 17, 2024

wlandau commented Mar 18, 2024

wlandau commented Apr 2, 2024

wlandau commented Apr 2, 2024

shikokuchuo commented Apr 3, 2024 •

edited

Loading

wlandau commented Apr 4, 2024

wlandau commented Apr 5, 2024 •

edited

Loading

shikokuchuo commented Apr 5, 2024

Portability issue in hashes #1244

Portability issue in hashes #1244

Comments

wlandau commented Feb 23, 2024 • edited Loading

wlandau commented Feb 23, 2024

wlandau commented Feb 23, 2024 • edited Loading

noamross commented Mar 8, 2024 • edited Loading

wlandau commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024

noamross commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024

noamross commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024

shikokuchuo commented Mar 8, 2024 • edited Loading

shikokuchuo commented Mar 11, 2024 • edited Loading

wlandau commented Mar 13, 2024

noamross commented Mar 14, 2024 • edited Loading

shikokuchuo commented Mar 14, 2024

shikokuchuo commented Mar 14, 2024

noamross commented Mar 14, 2024

shikokuchuo commented Mar 16, 2024 • edited Loading

shikokuchuo commented Mar 17, 2024

wlandau commented Mar 18, 2024

wlandau commented Apr 2, 2024

wlandau commented Apr 2, 2024

shikokuchuo commented Apr 3, 2024 • edited Loading

wlandau commented Apr 4, 2024

wlandau commented Apr 5, 2024 • edited Loading

shikokuchuo commented Apr 5, 2024

wlandau commented Feb 23, 2024 •

edited

Loading

wlandau commented Feb 23, 2024 •

edited

Loading

noamross commented Mar 8, 2024 •

edited

Loading

shikokuchuo commented Mar 8, 2024 •

edited

Loading

shikokuchuo commented Mar 11, 2024 •

edited

Loading

noamross commented Mar 14, 2024 •

edited

Loading

shikokuchuo commented Mar 16, 2024 •

edited

Loading

shikokuchuo commented Apr 3, 2024 •

edited

Loading

wlandau commented Apr 5, 2024 •

edited

Loading