diff --git a/_posts/2019-12-01-rle-array.markdown b/_posts/2019-12-01-rle-array.markdown
index 8e73acb..f61be73 100644
--- a/_posts/2019-12-01-rle-array.markdown
+++ b/_posts/2019-12-01-rle-array.markdown
@@ -80,13 +80,13 @@ Run-length encoding is a simple yet powerful technique. Instead of storing array
 so called "runs" --- consecutive elements of the array where the same value is stored. For each run, it then just keeps
 its value and length:
 
-![run-length encoding, step 1](/assets/images/2019-12-01-rle-array/rle_array1.png)
+![run-length encoding, step 1](/assets/images/2019-12-01-rle-array/rle_array1.svg)
 
 Pandas requires us to be able to do quick [random access](https://en.wikipedia.org/wiki/Random_access), e.g. for sorting
 and group-by operations. Instead of the actual run-lengths we store the end positions of each run (this is the
 cumulative sum of the lengths):
 
-![run-length encoding, step 2](/assets/images/2019-12-01-rle-array/rle_array2.png)
+![run-length encoding, step 2](/assets/images/2019-12-01-rle-array/rle_array2.svg)
 
 This way, we can use [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm) to implement random access.
 
@@ -119,7 +119,7 @@ created as followed:
 
 The whole setup can also be visualized:
 
-![cube](/assets/images/2019-12-01-rle-array/cube.png)
+![cube](/assets/images/2019-12-01-rle-array/cube.svg)
 
 You can generate the same data using
 [`rle_array.testing.generate_test_dataframe`](https://jdasoftwaregroup.github.io/rle-array/_rst/rle_array.testing.html#rle_array.testing.generate_test_dataframe).
@@ -162,7 +162,7 @@ encouraged to try these and others.
 Dictionary encoding replaces the actual payload data with a mapping. The trick is that mapped values can often be more
 memory-efficient, especially when the original data is very long (e.g. for strings) and are repeated multiple times:
 
-![dictionary encoding memory layout](/assets/images/2019-12-01-rle-array/dictionary_encoding.png)
+![dictionary encoding memory layout](/assets/images/2019-12-01-rle-array/dictionary_encoding.svg)
 
 This is what [Pandas Categoricals](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) implement. For
 data-at-rest, this is implemented by
@@ -187,7 +187,7 @@ This distinction between semantics and data size is also made by the
 
 Here is how this looks like in memory (for [big endian machines](https://en.wikipedia.org/wiki/Endianness)):
 
-![data types memory layout](/assets/images/2019-12-01-rle-array/data_types.png)
+![data types memory layout](/assets/images/2019-12-01-rle-array/data_types.svg)
 
 In this example, we can easily use 16 bits per element instead of 64, resulting in a 75% memory reduction.
 
@@ -198,7 +198,7 @@ noticeable exceptions due to the lacking hardware support on most CPUs), it also
 ### Bit-packing
 Bit-packing is similar to [Data Types](#data-types), but allows to create types with non-standard width:
 
-![bit packing memory layout](/assets/images/2019-12-01-rle-array/bit_packing.png)
+![bit packing memory layout](/assets/images/2019-12-01-rle-array/bit_packing.svg)
 
 The advantage is that you can save even more memory, but it comes with heavy performance penalties, since CPUs
 cannot read unaligned data that efficiently. In some cases however, it can be even faster due to the saved memory
@@ -212,7 +212,7 @@ Often we find columns in our DataFrames where information only occurs for a very
 is often more efficient to explicitly store and look-up these few cases --- e.g. by using a
 [HashTable](https://en.wikipedia.org/wiki/Hash_table) --- than using a simple array:
 
-![sparse data memory layout](/assets/images/2019-12-01-rle-array/sparse_data.png)
+![sparse data memory layout](/assets/images/2019-12-01-rle-array/sparse_data.svg)
 
 This is what [Pandas SparseArray](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) implements. Note that
 the default value does not need to be `0`, but can be an arbitrary element. One downside of sparse arrays is that
diff --git a/assets/css/main.scss b/assets/css/main.scss
index 433e2e1..acd2bbb 100644
--- a/assets/css/main.scss
+++ b/assets/css/main.scss
@@ -77,6 +77,10 @@ h6
   padding-right: 0;
 }
 
+.page__content img[src$=".svg"] {
+  width: 80%;
+}
+
 .page__footer {
   background-color: $primary-color;
 
diff --git a/assets/images/2019-12-01-rle-array.graffle b/assets/images/2019-12-01-rle-array.graffle
index aaeb89b..d3036dd 100644
Binary files a/assets/images/2019-12-01-rle-array.graffle and b/assets/images/2019-12-01-rle-array.graffle differ
diff --git a/assets/images/2019-12-01-rle-array/bit_packing.png b/assets/images/2019-12-01-rle-array/bit_packing.png
deleted file mode 100644
index 6fb1a36..0000000
Binary files a/assets/images/2019-12-01-rle-array/bit_packing.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/bit_packing.svg b/assets/images/2019-12-01-rle-array/bit_packing.svg
new file mode 100644
index 0000000..7a4f3c3
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/bit_packing.svg
@@ -0,0 +1,480 @@
[480 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "bit_packing"; layer "Layer 1"; labels "Values", "uint8", "uint3"; individual bit cells (0/1) and value cells.]
diff --git a/assets/images/2019-12-01-rle-array/cube.png b/assets/images/2019-12-01-rle-array/cube.png
deleted file mode 100644
index 16e7c5b..0000000
Binary files a/assets/images/2019-12-01-rle-array/cube.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/cube.svg b/assets/images/2019-12-01-rle-array/cube.svg
new file mode 100644
index 0000000..0e33828
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/cube.svg
@@ -0,0 +1,89 @@
[89 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "cube"; layer "Layer 1"; labels "dim 0", "dim 1", "dim 2", "const 0×1", "const 0×2", "const 1×2", "const 0×1×2".]
diff --git a/assets/images/2019-12-01-rle-array/data_types.png b/assets/images/2019-12-01-rle-array/data_types.png
deleted file mode 100644
index d12833f..0000000
Binary files a/assets/images/2019-12-01-rle-array/data_types.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/data_types.svg b/assets/images/2019-12-01-rle-array/data_types.svg
new file mode 100644
index 0000000..95f572b
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/data_types.svg
@@ -0,0 +1,444 @@
[444 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "data_types"; layer "Layer 1"; labels "Values", "uint64", "uint16"; value cells 17, 194, 848, 643, 221, 398 and their hexadecimal byte cells.]
diff --git a/assets/images/2019-12-01-rle-array/dictionary_encoding.png b/assets/images/2019-12-01-rle-array/dictionary_encoding.png
deleted file mode 100644
index 131caab..0000000
Binary files a/assets/images/2019-12-01-rle-array/dictionary_encoding.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/dictionary_encoding.svg b/assets/images/2019-12-01-rle-array/dictionary_encoding.svg
new file mode 100644
index 0000000..c9dcc81
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/dictionary_encoding.svg
@@ -0,0 +1,174 @@
[174 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "dictionary_encoding"; layer "Layer 1"; labels "Values", "encoded", "mapping"; string cells “Berlin”, “Dresden”, “Hamburg”, “Munich” and integer code cells 1 to 4.]
diff --git a/assets/images/2019-12-01-rle-array/rle_array1.png b/assets/images/2019-12-01-rle-array/rle_array1.png
deleted file mode 100644
index c313ec7..0000000
Binary files a/assets/images/2019-12-01-rle-array/rle_array1.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/rle_array1.svg b/assets/images/2019-12-01-rle-array/rle_array1.svg
new file mode 100644
index 0000000..28538f1
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/rle_array1.svg
@@ -0,0 +1,168 @@
[168 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "rle_array1"; layer "Layer 1"; labels "Values", "runs", "run-lengths"; value cells a, x, c; run-length cells 3x, 1x, 2x, 2x.]
diff --git a/assets/images/2019-12-01-rle-array/rle_array2.png b/assets/images/2019-12-01-rle-array/rle_array2.png
deleted file mode 100644
index 184ed2f..0000000
Binary files a/assets/images/2019-12-01-rle-array/rle_array2.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/rle_array2.svg b/assets/images/2019-12-01-rle-array/rle_array2.svg
new file mode 100644
index 0000000..52f3183
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/rle_array2.svg
@@ -0,0 +1,192 @@
[192 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "rle_array2"; layer "Layer 1"; labels "Values", "run-lengths", "offsets"; value cells a, x, c; run-length cells 3x, 1x, 2x, 2x; offset cells 3, 4, 6, 8.]
diff --git a/assets/images/2019-12-01-rle-array/sparse_data.png b/assets/images/2019-12-01-rle-array/sparse_data.png
deleted file mode 100644
index 917a3cb..0000000
Binary files a/assets/images/2019-12-01-rle-array/sparse_data.png and /dev/null differ
diff --git a/assets/images/2019-12-01-rle-array/sparse_data.svg b/assets/images/2019-12-01-rle-array/sparse_data.svg
new file mode 100644
index 0000000..03734ca
--- /dev/null
+++ b/assets/images/2019-12-01-rle-array/sparse_data.svg
@@ -0,0 +1,115 @@
[115 added lines of SVG markup not recoverable. Surviving text: Produced by OmniGraffle 7.16, 2019-12-18 15:47:17 +0000; canvas "sparse_data"; layer "Layer 1"; labels "Values", "mapping", "index 0", "index 4", "default"; value cells 0, 2, 3.]
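
Reviewer note: the post text touched by the first hunk describes storing the end position of each run (the cumulative sum of the run lengths) so that random access can use binary search. A minimal sketch of that idea follows, for readers checking the new `rle_array2.svg` figure against the prose. The function names `rle_encode` and `rle_get` are hypothetical and are not taken from the `rle_array` package, and the sample data is an assumption chosen so that the end positions come out as 3, 4, 6, 8, the offsets that survive in the figure text.

```python
import bisect
from itertools import groupby


def rle_encode(values):
    """Compress a sequence into (run values, run end positions)."""
    run_values, run_ends = [], []
    position = 0
    for value, group in groupby(values):
        position += sum(1 for _ in group)  # length of this run
        run_values.append(value)
        run_ends.append(position)  # cumulative sum of the run lengths
    return run_values, run_ends


def rle_get(run_values, run_ends, index):
    """Random access via binary search over the run end positions."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= index < length:
        raise IndexError(index)
    # Find the first run whose end position is strictly greater than index.
    return run_values[bisect.bisect_right(run_ends, index)]


# Hypothetical sample data with run lengths 3, 1, 2, 2.
data = ["a", "a", "a", "x", "c", "c", "a", "a"]
run_values, run_ends = rle_encode(data)
decoded = [rle_get(run_values, run_ends, i) for i in range(len(data))]
```

Each lookup costs O(log r) for r runs, which is the trade-off the post alludes to: plain run lengths would need a linear scan, whereas end positions are already sorted and can be searched directly.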