
Poor compression & quality for difficult-to-compress data #87

@lindstro

Description


I am doing some compression studies that involve difficult-to-compress (even incompressible) data. Consider the chaotic data generated by the logistic map xᵢ₊₁ = 4 xᵢ (1 − xᵢ):

#include <cstdio>

int main()
{
  // iterate the logistic map and write 256^3 raw doubles to stdout
  // (redirect to a file, e.g. input.bin)
  double x = 1. / 3;
  for (int i = 0; i < 256 * 256 * 256; i++) {
    fwrite(&x, sizeof(x), 1, stdout);
    x = 4 * x * (1 - x);
  }
  return 0;
}

We wouldn't expect this data to compress at all, but the inherent randomness at least suggests a predictable relationship between (RMS) error, E, and rate, R. Let σ = 1/√8 denote the standard deviation of the input data and define the accuracy gain as

α = log₂(σ / E) - R.

Each one-bit increase in rate, R, should then halve E, so that α is essentially constant. The limiting behavior differs slightly as R → 0 or E → 0, but over a large range α ought to be constant.
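
To make the measurement concrete, here is a minimal sketch of how α could be computed for a single run; the file names input.bin (original) and output.bin (decompressed), and passing the compressed size in bytes as the first command-line argument, are assumptions of this sketch, not part of SZ's tooling.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char* argv[])
{
  const std::size_t n = 256ull * 256 * 256;
  std::vector<double> orig(n), recon(n);

  // assumed file names: original doubles and decompressed doubles
  FILE* fo = std::fopen("input.bin", "rb");
  FILE* fr = std::fopen("output.bin", "rb");
  if (!fo || !fr ||
      std::fread(orig.data(), sizeof(double), n, fo) != n ||
      std::fread(recon.data(), sizeof(double), n, fr) != n)
    return 1;
  std::fclose(fo);
  std::fclose(fr);

  // RMS error E
  double sse = 0;
  for (std::size_t i = 0; i < n; i++) {
    double d = orig[i] - recon[i];
    sse += d * d;
  }
  double E = std::sqrt(sse / n);

  // rate R in bits per value, from the compressed size in bytes (argv[1])
  double compressed_bytes = argc > 1 ? std::atof(argv[1]) : 0;
  double R = 8 * compressed_bytes / n;

  const double sigma = 1 / std::sqrt(8.0); // stddev of the logistic-map data
  double alpha = std::log2(sigma / E) - R;
  std::printf("E = %g  R = %g  alpha = %g\n", E, R, alpha);
  return 0;
}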

Below is a plot of α(R) for SZ 2.1.12.3 and other compressors applied to the above data interpreted as a 3D array of size 256 × 256 × 256. Here SZ's absolute error tolerance mode was used: sz -d -3 256 256 256 -M ABS -A tolerance -i input.bin -z output.sz. The tolerance was halved for each subsequent data point, starting with tolerance = 1.
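
For reference, the sweep itself can be scripted; below is a rough sketch (assuming the sz binary is on the PATH and input.bin holds the generated data) that halves the tolerance 16 times, mirroring the command above. The decompression and error-measurement step is omitted.

#include <cstdio>
#include <cstdlib>

int main()
{
  double tolerance = 1;
  for (int k = 0; k <= 16; k++) {
    char cmd[256];
    // same command as above, with the current tolerance substituted
    std::snprintf(cmd, sizeof(cmd),
                  "sz -d -3 256 256 256 -M ABS -A %.17g -i input.bin -z output.sz",
                  tolerance);
    if (std::system(cmd) != 0)
      return 1;
    // ... decompress output.sz and record (R, E) here ...
    tolerance /= 2; // halve the tolerance for the next data point
  }
  return 0;
}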

The plot suggests an odd relationship between R and E, with very poor compression observed for small tolerances. For instance, for tolerances 2⁻¹³, 2⁻¹⁴, 2⁻¹⁵, and 2⁻¹⁶, the corresponding rates are 13.9, 15.3, 18.2, and 30.8 bits/value, whereas we would expect R to increase by only one bit per halving of the tolerance. Is this perhaps a bug in SZ? Similar behavior is observed for other difficult-to-compress data sets (see rballester/tthresh#7).

[Figure: accuracy gain α(R) for SZ 2.1.12.3 and other compressors on the logistic-map data]
