Arithmetic coding of a single bit preserves the ordering of encoded bits if CDF(1) > CDF(0). If a byte's bits are encoded from the highest bit to the lowest, arithmetic coding (even with a dynamic model) will preserve the ordering of individual bytes.
In the end, arithmetic coding preserves the ordering of encoded strings. Thus, comparison operations can be performed on the compressed representation of strings (and on big-endian representations of integers, and even of floating-point values) without decompressing the data until the decoded strings are actually needed.
Another view: memcmp compares strings as if they were mantissas in base 256. "hi!" is 'h'*(1/256) + 'i'*(1/256)^2 + '!'*(1/256)^3 + 0*(1/256)^4, with zeroes continuing to infinity. Arithmetic coding represents encoded strings as mantissas in base 2; range coding can use other bases, such as 256.
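A minimal, exact sketch of both views (my own illustration in Python, not anything from the thread), using Fraction to avoid the bit-twiddling of a real coder; the skewed byte model is an arbitrary assumption:

```python
from fractions import Fraction

def base256_mantissa(s: bytes) -> Fraction:
    # "hi!" -> 'h'*(1/256) + 'i'*(1/256)^2 + '!'*(1/256)^3
    return sum(Fraction(b, 256 ** (i + 1)) for i, b in enumerate(s))

# Hypothetical skewed model: lowercase letters get wider intervals, so
# the coder actually compresses letter-heavy input. What matters for
# ordering is only that the cumulative counts are monotone in the byte.
freq = [4 if 97 <= b <= 122 else 1 for b in range(256)]
total, cum = sum(freq), [0]
for f in freq:
    cum.append(cum[-1] + f)  # cum[b] = total frequency of bytes below b

def arith_encode(s: bytes) -> Fraction:
    low, width = Fraction(0), Fraction(1)
    for b in s:
        low += width * Fraction(cum[b], total)  # CDF of symbols below b
        width *= Fraction(freq[b], total)       # probability of b itself
    return low  # any point in [low, low + width) identifies `s`

for x, y in [(b"hello", b"hi!"), (b"abc", b"abd"), (b"ab", b"abc")]:
    assert (x < y) == (base256_mantissa(x) < base256_mantissa(y))
    assert (x < y) == (arith_encode(x) < arith_encode(y))
```

One caveat: a string and its extension by zero bytes encode to the same point under both views, so a practical order-preserving coder needs an explicit end-of-string symbol.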
Then there is the eternal conversation about whether one should encrypt and then compress, or compress and then encrypt.
Encrypted data will not compress well, because encryption needs to remove patterns, and patterns are exactly what compression exploits.
If you compress and then encrypt, yes, you can leak information through file sizes, but there isn't really a way around this. Encryption and compression are fundamentally at odds with each other.
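A quick empirical check of the first point (a throwaway Python snippet, with os.urandom standing in for ciphertext):

```python
import os
import zlib

# Bytes that look random, as good ciphertext should, do not compress;
# deflate can only add framing overhead on top of them.
data = os.urandom(1 << 20)           # 1 MiB stand-in for ciphertext
out = zlib.compress(data, 9)
print(len(out) - len(data))          # a small positive number: it grew
```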
The biggest risk is if you are compressing secrets alongside attacker-controlled data. Then a whole host of attacks on the secret becomes possible.
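This is the CRIME/BREACH family of attacks. A toy sketch of the underlying length oracle (my own illustration; the secret, framing, and guesses are made up):

```python
import zlib

# When attacker-controlled input is compressed together with a secret,
# a guess that overlaps the secret produces a longer LZ match and hence
# a shorter compressed output.
secret = b"sessionid=8f41cc2e9a7b3d55"

def response_len(attacker_data: bytes) -> int:
    # stand-in for a server that compresses secret + attacker input
    return len(zlib.compress(secret + b"&echo=" + attacker_data, 9))

right = response_len(b"sessionid=8f41")  # long overlap with the secret
wrong = response_len(b"sessionid=qqqq")  # overlap stops after "sessionid="
print(right, wrong)  # the matching guess typically compresses shorter
```

Extending the guess one byte at a time lets the attacker recover the secret from observed lengths alone.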
Compress then encrypt is not an option, because your encryption is broken if its output can be compressed at all. Mathematically, it is a near certainty that compression would increase the file size when given an encrypted input.
You mistyped "compress then encrypt".
Your argument correctly explains why "encrypt then compress" is not an option: in that order, compression will do nothing except waste time and energy.
On the other hand "compress then encrypt" is more secure then simple encryption, because even a weak encryption method may be difficult to break when applied only to compressed data, because this use is similar to encrypting random numbers, i.e. the statistical properties of the plaintext have already been hidden.
The only disadvantage of "compress then encrypt" appears in the less frequent cases where you care more about defeating traffic analysis of the amount of data sent than about saving resources, and where you will pad your useful data with irrelevant junk anyway, to hide its real length.
If the data you send is highly compressible, then even when you want to hide its length, it may be advantageous to compress it first and then pad it with junk before encryption.
Thus you might, for example, compress the data to one eighth of its size, then double its length by padding, and still send less data than without compression, while also masking the true length.
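A minimal sketch of that compress-then-pad-then-encrypt pipeline (my assumptions: a 4 KiB padding bucket, a 4-byte length header, and the actual cipher left out as a stub):

```python
import os
import struct
import zlib

# Compress -> frame with true length -> pad to a fixed-size bucket.
# The result is what you would hand to the cipher; encryption itself
# is out of scope here.
BUCKET = 4096  # assumed padding granularity

def compress_and_pad(plaintext: bytes) -> bytes:
    body = zlib.compress(plaintext, 9)
    framed = struct.pack(">I", len(body)) + body  # record compressed length
    pad_len = -len(framed) % BUCKET               # round up to the bucket
    return framed + os.urandom(pad_len)           # irrelevant junk padding

def unpad_and_decompress(padded: bytes) -> bytes:
    (n,) = struct.unpack(">I", padded[:4])
    return zlib.decompress(padded[4:4 + n])

msg = b"highly compressible " * 500      # 10000 bytes of plaintext
boxed = compress_and_pad(msg)            # fits in one 4096-byte bucket
assert unpad_and_decompress(boxed) == msg
assert len(boxed) < len(msg)             # smaller than the raw plaintext
```

Rounding to fixed-size buckets leaks only the bucket count rather than the exact length; the bucket size trades padding overhead against granularity.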
If you are saying that good encryption ought to make the result incompressible, then I agree with you.
Really interesting.
I was trying to implement a compression algorithm selection heuristic in some file-format code I am developing. I found it too hard to reason about, so I basically gave up on it.
It feels like this blog post is getting there, but there could be a more detailed set of equations that actually calculates this from some other parameters.
Making the code completely flexible and doing a full-load production test with the desired parameters to find the best tuning is an option, but that is also very difficult (one rough shape for such a heuristic is sketched below).
I also read this previously, which I find similar:
https://rocksdb.org/blog/2021/12/29/ribbon-filter.html
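For what it's worth, one possible shape for the selection heuristic discussed above (my own sketch; the candidates, sample size, and speed/ratio weighting are arbitrary assumptions): trial-compress a small sample with each candidate and score compression ratio against measured time.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical candidates; any codec with a compress(bytes) -> bytes
# shape would do.
CANDIDATES = {
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def pick_codec(payload: bytes, sample_size: int = 1 << 16,
               speed_weight: float = 1e-7) -> str:
    """Trial-compress a sample and score ratio plus a time penalty."""
    sample = payload[:sample_size]
    best, best_score = "store", 1.0  # baseline: no compression at all
    for name, fn in CANDIDATES.items():
        t0 = time.perf_counter()
        ratio = len(fn(sample)) / max(len(sample), 1)
        elapsed = time.perf_counter() - t0
        # speed_weight converts projected CPU time into "equivalent
        # bytes"; it is the tuning knob, not a measured constant.
        score = ratio + speed_weight * elapsed * len(payload)
        if score < best_score:
            best, best_score = name, score
    return best

print(pick_codec(b"some repetitive file-format payload " * 4096))
```

In practice the weight would be tuned from the deployment's real CPU cost versus storage or bandwidth cost, which is exactly the hard part the comment describes.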
Your SSL cert is invalid.
Unless the user just updated it, the current cert has been valid for the last 12 days. Maybe you're getting MITM'ed or don't have the root in your trust store?
Could be. I was in an educational environment at the time of access.