Useless LCG hash function in LZ4 compression
Created by: jeberger
The implementation of the LCG hash function used for LZ4 compression is buggy, which renders the hash useless: it produces exactly the same compressed data as using `self.get_batch_at_cursor() % DICTIONARY_SIZE`. Since `DICTIONARY_SIZE` is a power of two, reducing the LCG output modulo `DICTIONARY_SIZE` keeps only its low bits, and an odd multiplier merely permutes those low bits, so the LCG changes which bucket each input lands in but never which inputs collide.
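For the record, here is a minimal standalone sketch of why the current scheme is useless. The names (`buggy_hash`, `plain_modulo`) are hypothetical, and since the buggy code isn't quoted here I'm assuming it reduces the LCG output modulo `DICTIONARY_SIZE`; the multiplier is the one from the fix below, but any odd multiplier behaves the same way:

```rust
const DICTIONARY_SIZE: usize = 1 << 12;

// Assumed shape of the buggy hash: one LCG step, then modulo a power of
// two. Modulo 2^12, an odd multiplier is a bijection on the low 12 bits.
fn buggy_hash(x: u32) -> usize {
    x.wrapping_mul(134775813).wrapping_add(1) as usize % DICTIONARY_SIZE
}

fn plain_modulo(x: u32) -> usize {
    x as usize % DICTIONARY_SIZE
}

fn main() {
    // Two inputs collide under the LCG exactly when they collide under
    // plain modulo, so the compressor's dictionary behaves identically.
    for x in 0u32..(1 << 16) {
        for &delta in &[1u32, 0x1000, 0xFFFF_0000] {
            let y = x.wrapping_add(delta);
            assert_eq!(
                buggy_hash(x) == buggy_hash(y),
                plain_modulo(x) == plain_modulo(y)
            );
        }
    }
    println!("the LCG never changes which inputs collide");
}
```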
It can be fixed by replacing it with:

```rust
const DICTIONARY_SHIFT: usize = 12;
const DICTIONARY_SIZE: usize = 1 << DICTIONARY_SHIFT;
…
// Take the index from the high bits of the 32-bit product rather than
// the low bits, so that every input bit influences the bucket:
self.get_batch_at_cursor().wrapping_mul(134775813).wrapping_add(1) as usize >> (32 - DICTIONARY_SHIFT)
```
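As a quick sanity check (again a hypothetical standalone sketch, with the fixed expression wrapped in a free function `fixed_hash`): the index now comes from the top `DICTIONARY_SHIFT` bits of the 32-bit product, which depend on every input bit, so inputs that the old hash always lumped together spread out:

```rust
const DICTIONARY_SHIFT: usize = 12;

// The fixed expression, verbatim, wrapped in a function for testing.
fn fixed_hash(x: u32) -> usize {
    x.wrapping_mul(134775813).wrapping_add(1) as usize >> (32 - DICTIONARY_SHIFT)
}

fn main() {
    // These two inputs share their low 12 bits, so the buggy hash (and
    // plain modulo) put them in the same bucket; the fixed hash does not.
    let (a, b) = (0x0000_1234u32, 0xABCD_1234u32);
    assert_ne!(fixed_hash(a), fixed_hash(b));
    println!("{:#x} -> {}, {:#x} -> {}", a, fixed_hash(a), b, fixed_hash(b));
}
```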
This fix brings a 10% better compression ratio, theoretically at no cost.
A couple of notes, however:
- On the asm side, the fix replaces an `and` instruction with a `shrd` instruction, so it should not have any impact on the speed of the hash function.
- Unit benchmarks show that the hash function does indeed have the same performance with or without the fix (roughly 500ps per hash on my machine).
- Integration benchmarks show that compression is slightly slower overall with the fix. Profiling with cachegrind suggests that this is due to there being twice as many data cache misses, even though there are slightly fewer instructions executed and fewer data accesses overall.
- I haven't checked the impact on decompression, although I suspect that it will be slightly slower too, due to a similar effect on cache misses.
- I don't know how this would impact the speed of tfs, and whether the reduced data transfer times thanks to the improved compression are enough to compensate for the extra compression time. However, if you decide to opt for faster compression at the cost of a larger compressed size, then you should probably drop the LCG completely and go for `self.get_batch_at_cursor() % DICTIONARY_SIZE` instead.