Arithmetic coding of a single bit preserves the ordering of encoded bits if CDF(1) > CDF(0). If a byte's bits are encoded from the highest bit to the lowest, arithmetic coding (even with a dynamic model) will preserve the ordering of individual bytes.
In the end, arithmetic coding preserves the ordering of encoded strings. Thus, comparison operations can be performed on the compressed representation of strings (and on big-endian representations of integers, and even of floating-point values) without decompressing the data until the decoded strings are actually needed.
Another view: memcmp compares strings as if they were mantissas in base 256. "hi!" is 'h'*(1/256) + 'i'*(1/256)^2 + '!'*(1/256)^3 + 0*(1/256)^4, with zeroes continuing to infinity. Arithmetic coding represents encoded strings as mantissas in base 2; range coding can use other bases, such as 256.
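A minimal, exact sketch of both views (my own illustration in Python, not anything from the thread), using Fraction to avoid the bit-twiddling of a real coder; the skewed byte model is an arbitrary assumption:

```python
from fractions import Fraction

def base256_mantissa(s: bytes) -> Fraction:
    # "hi!" -> 'h'*(1/256) + 'i'*(1/256)^2 + '!'*(1/256)^3
    return sum(Fraction(b, 256 ** (i + 1)) for i, b in enumerate(s))

# Hypothetical skewed model: lowercase letters get wider intervals, so
# the coder actually compresses letter-heavy input. What matters for
# ordering is only that the cumulative counts are monotone in the byte.
freq = [4 if 97 <= b <= 122 else 1 for b in range(256)]
total, cum = sum(freq), [0]
for f in freq:
    cum.append(cum[-1] + f)  # cum[b] = total frequency of bytes below b

def arith_encode(s: bytes) -> Fraction:
    low, width = Fraction(0), Fraction(1)
    for b in s:
        low += width * Fraction(cum[b], total)  # CDF of symbols below b
        width *= Fraction(freq[b], total)       # probability of b itself
    return low  # any point in [low, low + width) identifies `s`

for x, y in [(b"hello", b"hi!"), (b"abc", b"abd"), (b"ab", b"abc")]:
    assert (x < y) == (base256_mantissa(x) < base256_mantissa(y))
    assert (x < y) == (arith_encode(x) < arith_encode(y))
```

One caveat: a string and its extension by zero bytes encode to the same point under both views, so a practical order-preserving coder needs an explicit end-of-string symbol.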
Then there is the eternal conversation about whether one should encrypt and then compress, or compress and then encrypt.
Encrypted data will not compress well, because encryption needs to remove patterns, and patterns are exactly what compression exploits.
If you compress and then encrypt, yes, you can leak information through file sizes, but there isn't really a way around this. Encryption and compression are fundamentally at odds with each other.
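A quick empirical check of the first point (a throwaway Python snippet, with os.urandom standing in for ciphertext):

```python
import os
import zlib

# Bytes that look random, as good ciphertext should, do not compress;
# deflate can only add framing overhead on top of them.
data = os.urandom(1 << 20)           # 1 MiB stand-in for ciphertext
out = zlib.compress(data, 9)
print(len(out) - len(data))          # a small positive number: it grew
```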
The biggest risk is if you are compressing secrets alongside attacker-controlled data. Then a whole host of attacks on the secret becomes possible.
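This is the CRIME/BREACH family of attacks. A toy sketch of the underlying length oracle (my own illustration; the secret, framing, and guesses are made up):

```python
import zlib

# When attacker-controlled input is compressed together with a secret,
# a guess that overlaps the secret produces a longer LZ match and hence
# a shorter compressed output.
secret = b"sessionid=8f41cc2e9a7b3d55"

def response_len(attacker_data: bytes) -> int:
    # stand-in for a server that compresses secret + attacker input
    return len(zlib.compress(secret + b"&echo=" + attacker_data, 9))

right = response_len(b"sessionid=8f41")  # long overlap with the secret
wrong = response_len(b"sessionid=qqqq")  # overlap stops after "sessionid="
print(right, wrong)  # the matching guess typically compresses shorter
```

Extending the guess one byte at a time lets the attacker recover the secret from observed lengths alone.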
Compress then encrypt is not an option, because your encryption is broken if its output can be compressed at all. Mathematically, it is a near certainty that compression would increase the file size when given an encrypted input.
You mistyped "compress then encrypt".
Your argument correctly explains why "encrypt then compress" is not an option: in that order, compression will do nothing except waste time and energy.
On the other hand "compress then encrypt" is more secure then simple encryption, because even a weak encryption method may be difficult to break when applied only to compressed data, because this use is similar to encrypting random numbers, i.e. the statistical properties of the plaintext have already been hidden.
The only disadvantage of "compress then encrypt" appears in the less frequent cases where you care more about defeating traffic analysis of the amount of data sent than about saving resources, and where you will pad your useful data with irrelevant junk anyway, to hide its real length.
If the data you send is highly compressible, then even when you want to hide its length, it may be advantageous to compress it first and then pad it with junk before encryption.
Thus you might, for example, compress the data to one eighth of its size, then double its length by padding, and still send less data than without compression, while also masking the true length.
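A minimal sketch of that compress-then-pad-then-encrypt pipeline (my assumptions: a 4 KiB padding bucket, a 4-byte length header, and the actual cipher left out as a stub):

```python
import os
import struct
import zlib

# Compress -> frame with true length -> pad to a fixed-size bucket.
# The result is what you would hand to the cipher; encryption itself
# is out of scope here.
BUCKET = 4096  # assumed padding granularity

def compress_and_pad(plaintext: bytes) -> bytes:
    body = zlib.compress(plaintext, 9)
    framed = struct.pack(">I", len(body)) + body  # record compressed length
    pad_len = -len(framed) % BUCKET               # round up to the bucket
    return framed + os.urandom(pad_len)           # irrelevant junk padding

def unpad_and_decompress(padded: bytes) -> bytes:
    (n,) = struct.unpack(">I", padded[:4])
    return zlib.decompress(padded[4:4 + n])

msg = b"highly compressible " * 500      # 10000 bytes of plaintext
boxed = compress_and_pad(msg)            # fits in one 4096-byte bucket
assert unpad_and_decompress(boxed) == msg
assert len(boxed) < len(msg)             # smaller than the raw plaintext
```

Rounding to fixed-size buckets leaks only the bucket count rather than the exact length; the bucket size trades padding overhead against granularity.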
If you are saying that good encryption ought to make the result incompressible, then I agree with you.
Really interesting.
I was trying to implement a compression algorithm selection heuristic in some file-format code I am developing. I found it too hard to reason about, so I basically gave up on it.
It feels like this blog post is getting there, but there could be a more detailed set of equations that actually calculates this from some other parameters.
Making the code completely flexible and doing a full-load production test with the desired parameters to find the best tuning is an option, but that is also very difficult (one rough shape for such a heuristic is sketched below).
I also read this previously, which I find similar:
https://rocksdb.org/blog/2021/12/29/ribbon-filter.html
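For what it's worth, one possible shape for the selection heuristic discussed above (my own sketch; the candidates, sample size, and speed/ratio weighting are arbitrary assumptions): trial-compress a small sample with each candidate and score compression ratio against measured time.

```python
import bz2
import lzma
import time
import zlib

# Hypothetical candidates; any codec with a compress(bytes) -> bytes
# shape would do.
CANDIDATES = {
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def pick_codec(payload: bytes, sample_size: int = 1 << 16,
               speed_weight: float = 1e-7) -> str:
    """Trial-compress a sample and score ratio plus a time penalty."""
    sample = payload[:sample_size]
    best, best_score = "store", 1.0  # baseline: no compression at all
    for name, fn in CANDIDATES.items():
        t0 = time.perf_counter()
        ratio = len(fn(sample)) / max(len(sample), 1)
        elapsed = time.perf_counter() - t0
        # speed_weight converts projected CPU time into "equivalent
        # bytes"; it is the tuning knob, not a measured constant.
        score = ratio + speed_weight * elapsed * len(payload)
        if score < best_score:
            best, best_score = name, score
    return best

print(pick_codec(b"some repetitive file-format payload " * 4096))
```

In practice the weight would be tuned from the deployment's real CPU cost versus storage or bandwidth cost, which is exactly the hard part the comment describes.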
Your SSL cert is invalid.
Unless the user just updated it, the current cert has been valid for the last 12 days. Maybe you're getting MITM'ed or don't have the root in your trust store?
Could be. I was in an educational environment at the time of access.