6 changes: 6 additions & 0 deletions evobench-evaluator/README.md
@@ -0,0 +1,6 @@
XXX Todo

* [usage](docs/usage.md)
* [overview](docs/overview.md)
* [hacking](docs/hacking.md)

29 changes: 16 additions & 13 deletions evobench-evaluator/docs/hacking.md
@@ -6,19 +6,22 @@ Also see the [overview](overview.md).

### Style / details

* Types with names ending in "Opts" (or also "Opt" XX) are generally
(XX?) precursor types (at least if a sister type without the "Opts"
suffix exists): used for configuration or command line options, but
translated before use.

* Using `Arc` for the parts that come from the config or are derived
from it during load time, as that process is quite a bit convoluted,
and worse, there's config file reload, too. It might still be
feasible to use references instead, but so what. But, trying to use
`clone_arc()` (from `src/utillib/arc.rs`) consistently whenever an
`Arc` is cloned, for clarity and easy searching when interested
where it happens. Please keep this up.

* Types with names ending in "Opts" (or "Opt", if they only contain a
single option) are types directly taking options from humans, either
via the command line (`Clap`) or config files (`serde`).

Sometimes they are used by the application as is. Sometimes they are
verified and translated before use; the types they are translated to
do *not* use names ending in "Opts" (but rather, generally,
"Options").

Types holding configuration that is generated by the program
generally (or at least often) use names ending in "Options", but
never "Opts".

* When using `Arc` (e.g. for some parts that come from the config,
which has convoluted lifetimes due to reloading of the config at
runtime), use `clone_arc()` (from `src/utillib/arc.rs`) to clone it,
for clarity and easy searching when interested in where it happens.

## Specifics

308 changes: 308 additions & 0 deletions evobench-evaluator/docs/internals/evaluator/index.md
@@ -0,0 +1,308 @@
# How `evobench-evaluator` works internally

## Statistics levels

1. The benchmark log file resulting from a benchmarking run is
processed into a statistics set called "single" (for "single
run"). Probe timings are collected into a tree so that for each
dynamic location of a probe within the runtime call graph, a path
(like a backtrace, but containing probe names rather than function
names) can be derived. Timings are collected and represented with
statistical values (count, sum, average, standard deviation, median,
percentiles) as a row in the Excel file: per location within each
thread (optionally), per location across threads, and per probe
irrespective of its location in the call graph. For flamegraphs,
only the path-based representation is used.

2. If there is an interest in detecting performance deviations,
multiple benchmarking runs (e.g. 5 or 10) should be executed for a
single combination of commit id of the target project and
benchmarking invocation parameters (directory within target
project, command and arguments, and environment variables if any),
so that statistical significance for a deviation can be
calculated. `evobench-evaluator` is run with the `summary`
subcommand to calculate this second statistical level: statistics
over one chosen statistical value from each benchmarking run (for
example, take the *median* value of each probe location from each
run, then calculate the count, sum, average, standard deviation,
median and percentiles over *those* medians).

3. Then, given benchmarking logs for multiple commit ids (with
multiple runs each), a trend can be derived or graphed, or a
performance deviation can be calculated and reported. This third
level is not implemented yet (but much has already been prepared for
it).
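
To make the second level concrete, here is a minimal, self-contained
sketch of the idea (illustrative code only, not the evaluator's actual
types; `Stats` and `basic_stats` are made-up names, and only a few of
the statistical values are computed):

```rust
/// Minimal descriptive statistics over a slice of values (illustrative only;
/// assumes non-empty input).
#[derive(Debug)]
struct Stats {
    count: usize,
    sum: f64,
    average: f64,
    median: f64,
}

fn basic_stats(values: &[f64]) -> Stats {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let count = sorted.len();
    let sum: f64 = sorted.iter().sum();
    let median = if count % 2 == 1 {
        sorted[count / 2]
    } else {
        (sorted[count / 2 - 1] + sorted[count / 2]) / 2.0
    };
    Stats { count, sum, average: sum / count as f64, median }
}

fn main() {
    // Level 1: per-run timings (e.g. real time in ns) for one probe location.
    let runs: Vec<Vec<f64>> = vec![
        vec![100.0, 110.0, 95.0, 105.0],
        vec![120.0, 115.0, 118.0, 122.0],
        vec![101.0, 99.0, 103.0, 100.0],
    ];
    // Pick one statistical value per run -- here the median.
    let medians: Vec<f64> = runs.iter().map(|r| basic_stats(r).median).collect();
    // Level 2: statistics *over* the per-run medians.
    println!("{:?}", basic_stats(&medians));
}
```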

## Types

### `options.rs`

The evaluator translates benchmarking logs to Excel or flamegraph
files (and, in the FUTURE, to caches, graphs, perhaps reports).

It can translate to both of those output types in the same run: the
paths are specified in `OutputOpts`. They are given as options on the
same level (via `#[clap(flatten)]` from the
[clap](https://crates.io/crates/clap) command line parser crate) as
the parameters for the evaluation, which are in `EvaluationOpts`.

`OutputOpts` is checked and converted to `CheckedOutputOptions` before
use, which wraps an `OutputVariants`, a parameterized type that holds
the Excel and flamegraph variants of data throughout the pipeline.
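
The general shape of such a "variants" container can be sketched
roughly as follows (illustrative only; the field and method names are
assumptions, not the actual `OutputVariants` definition):

```rust
// Rough sketch: one optional slot per requested output kind, carried
// through the whole pipeline.
#[derive(Debug)]
struct OutputVariantsSketch<T> {
    excel: Option<T>,
    flamegraph: Option<T>,
}

impl<T> OutputVariantsSketch<T> {
    /// Apply one pipeline step to every requested output, preserving
    /// which outputs were asked for.
    fn map<U>(self, mut f: impl FnMut(T) -> U) -> OutputVariantsSketch<U> {
        OutputVariantsSketch {
            excel: self.excel.map(&mut f),
            flamegraph: self.flamegraph.map(&mut f),
        }
    }
}

fn main() {
    let paths = OutputVariantsSketch {
        excel: Some("out.xlsx".to_string()),
        flamegraph: None,
    };
    // E.g. a later step that derives something from each requested output:
    let labels = paths.map(|p| format!("writing {p}"));
    println!("{labels:?}");
}
```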

#### StatsField

When summarizing data (i.e. level 2 or 3 as described in [Statistics
levels](#statistics-levels) above), but also when generating
flamegraphs, a decision has to be made about which statistical value
to build the higher-level statistical evaluation over. The selection
of the field is represented by the `stats::StatsField<TILE_COUNT>`
enum type; the type parameter is an integer giving how many tiles are
used in the statistics (currently `evobench-evaluator` uses 101
everywhere: percentiles, 0..100 inclusive). To be usable as a command
line option, it implements `FromStr`, i.e. it can be created from a
string (like "average", "stdev", "10").

This field is used in the types `evaluator::options::FlameFieldOpt`
(choice of field for the flamegraph output),
`evaluator::options::FieldSelectorDimension3Opt` (choice of field for
the level 2 statistics (summary)), and
`evaluator::options::FieldSelectorDimension4Opt` (choice of field for
the unfinished level 3 statistics). The point of these wrapper types
is to hold both the help text and the default value for `clap`, as
well as to disambiguate the option usage in the code.
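
As an illustration of the general idea behind `StatsField` (not its
actual definition; the variants and parsing rules shown here are
assumptions), such a field selector could look like this:

```rust
use std::str::FromStr;

// Illustrative sketch: a selector for "which statistical value to use",
// where a bare integer selects one of the TILE_COUNT tiles (percentiles).
#[derive(Debug, Clone, Copy)]
enum StatsFieldSketch<const TILE_COUNT: usize> {
    Average,
    StdDev,
    Tile(usize), // e.g. "10" -> the 10th percentile when TILE_COUNT == 101
}

impl<const TILE_COUNT: usize> FromStr for StatsFieldSketch<TILE_COUNT> {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "average" => Ok(Self::Average),
            "stdev" => Ok(Self::StdDev),
            _ => {
                let n: usize = s.parse().map_err(|_| format!("unknown field: {s}"))?;
                if n < TILE_COUNT {
                    Ok(Self::Tile(n))
                } else {
                    Err(format!("tile {} out of range 0..{}", n, TILE_COUNT))
                }
            }
        }
    }
}

fn main() {
    // `FromStr` makes the type usable as a command line option value.
    let field: StatsFieldSketch<101> = "10".parse().unwrap();
    println!("{field:?}");
}
```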

## Processing chain

### 1. Parsing and tree building

This part of the processing is done by the code in [evaluator/data/](../../../src/evaluator/data/mod.rs).

1. Parsing:

The benchmarking log files are currently in an NLJSON
(newline-delimited JSON) based format, with version and context
information at the beginning, optionally zstd compressed. The log
lines are parsed into a
vector of
[`LogMessage`](../../../src/evaluator/data/log_message.rs), which
contain [`Timing`](../../../src/evaluator/data/log_message.rs)
records for probes, held by a
[`LogData`](../../../src/evaluator/data/log_data.rs) instance.

Note that `Timing` records contain just a single absolute data
point (but for multiple different kinds of values, e.g. real time,
cpu time etc.); the cost only becomes known by later pairing up the
`Timing` records for the start (logged from the object constructor)
and the end (logged from the object destructor) of the same scope
(identified by the scope name, which must be unique!) and taking the
difference. This design (calculating the difference during
evaluation rather than during recording) was chosen to keep the cost
of logging low; additionally, the absolute timings could potentially
allow for further evaluations (e.g. end of scope to end of parent
scope) or event correlations (not currently done). A simplified
sketch of this pairing follows after this list.

2. Tree building:

Then `LogMessage` entries for probes (more precisely, references
to their `Timing` parts, with the timings for the scope start and
end for each probe paired up) are collected into a
[`LogDataTree`](../../../src/evaluator/data/log_data_tree.rs). Both
the `LogData` and the derived `LogDataTree` are bundled in a
[`LogDataAndTree`](../../../src/evaluator/data/log_data_and_tree.rs)
instance.

[evaluator/data/log_data_tree.rs (`path_string()` on `Span`)](../../../src/evaluator/data/log_data_tree.rs)
also contains the code to turn a location in the tree into a path
("probe-span backtrace").

### 2. Path index, calculating statistics, collection into tables

#### Path index

The `LogDataAndTree` structure from the previous step contains all the
original, individual `Timing` records, two per logging probe
encounter (`EVOBENCH_SCOPE_EVERY` probes only log once for every n
encounters): one for the start, and one for when the scope ends and
the destructor runs. The tree just holds them together according to
the dynamic context (thread, then call context). This detail data now
needs to be condensed into descriptive statistics.

There are multiple ways in which the tree could be condensed:

- One might wish to know the total cost of a particular scope,
irrespective of its dynamic context (i.e. regardless of where it was
called from).

- Or one might wish to know the total cost of a particular scope *in a
particular calling context*. In that case,

- one might also care about which thread that context (call path)
was executed on,
- or one might just want to know the total cost of the same call
path across all threads.

The tree in `LogDataAndTree` has the most precise location
information. Some of that location information needs to be ignored for
collecting the `Timing` entries for the statistics, depending on
which of the interests listed above applies.

In each case, a human-readable description of what the statistics were
calculated over (the location, or overlaid locations, in the tree) is
needed. A path string with separators and a few more features is
chosen for this; the `evaluator::data::Span::path_string` method
produces those strings. (For performance reasons, it generates them
into a mutable string reference, and for that reason there is no
custom type definition for those path strings.) This method takes a
`PathStringOptions` value to specify the details of how the path
should be generated, e.g. whether the thread should be mentioned or
not, etc. The same location (represented by an
`evaluator::data::Span`) could produce such different path strings as:

1. With the specific thread (threads are numbered in order of new
thread ids in timings occurring in the log, starting from 0):

N:thread00 > main|main > sum_of_fibs|all > sum_of_fibs n=22 > sum_of_fibs|body > main|fib > fib|fib

2. Union across all threads:

A:thread > main|main > sum_of_fibs|all > sum_of_fibs n=22 > sum_of_fibs|body > main|fib > fib|fib

3. The same path in reverse order:

AR:fib|fib < main|fib < sum_of_fibs|body < sum_of_fibs n=22 < sum_of_fibs|all < main|main < thread

4. Or ignoring location altogether (only showing the probe name, not the location):

fib|fib

Path 1 will represent the fewest data points since it is the most
specific; paths 2 and 3 (which represent the same data points) possibly
represent more points, since those paths potentially cover multiple
threads; path 4 represents the most data points.

So, to collect the data points: for each point (again, represented by
an `evaluator::data::Span`), the path is calculated according to a
chosen `PathStringOptions` value, the path is then used as the key
into a hash map, and a reference to that `Span` is added to a vector
held in that map. Afterwards, for each entry in the map, the
statistics over its vector can be calculated. This map is wrapped in
the `evaluator::IndexByCallPath` type, and the indexing happens in the
`evaluator::IndexByCallPath::from_logdataindex` method.
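
A rough sketch of this indexing step (made-up types; the real code
keys `Span` references by paths produced with `PathStringOptions`):

```rust
use std::collections::HashMap;

// Illustrative stand-in for a data point that already knows its call path.
struct Point {
    path: String, // e.g. "A:thread > main|main > fib|fib"
    real_ns: u64, // the timing value for this occurrence
}

fn main() {
    let points = vec![
        Point { path: "A:thread > main|main > fib|fib".into(), real_ns: 700 },
        Point { path: "A:thread > main|main > fib|fib".into(), real_ns: 750 },
        Point { path: "A:thread > main|main".into(), real_ns: 2_500 },
    ];

    // Group the points by path -- the counterpart of `IndexByCallPath`.
    let mut index: HashMap<&str, Vec<&Point>> = HashMap::new();
    for p in &points {
        index.entry(p.path.as_str()).or_default().push(p);
    }

    // Per-path statistics (here just count and sum, for brevity).
    for (path, group) in &index {
        let sum: u64 = group.iter().map(|p| p.real_ns).sum();
        println!("{path}: n={} sum={sum} ns", group.len());
    }
}
```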

Some of the parameters for generating the paths can be chosen via
command line arguments to `evobench-evaluator` (for the `single` or
`summary` subcommands). But for Excel output, multiple indexing passes
are done with different `PathStringOptions` values to fill the
`IndexByCallPath` with entries for different use cases at once: the
resulting Excel sheets are "multi use" in this regard; the path
formats are chosen so that the generated paths remain unambiguous
across those cases.

#### Calculating statistics

Each path in `evaluator::IndexByCallPath` thus maps to the vector of
spans of timings for that path. Statistics are calculated for each of
those vectors, separately for each of the fields in the timings that
the user is (explicitly or implicitly) interested in. (There are 2
dimensions of statistical output here: the paths are one dimension,
the field is the second dimension, although the latter has a
statically fixed selection of values: "real time", "cpu time" etc.)

Remember, the `Timing` records contain all the kinds of timings that
are collected: real time, cpu time, system time, multiple kinds of
context switches, and more. Some are not currently generated on macOS;
thus, the values currently extracted are just the real, cpu and
system times, and a sum of all kinds of context switches.

For each of those timing kinds, separate statistics are
calculated. For Excel output, the statistics for all timing kinds are
integrated into the same file as separate worksheets. For flamegraphs,
a separate SVG file is generated for each kind, adding the timing kind
name to the file name (like `single-real time.svg`, `single-cpu
time.svg`, etc.).
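
As a tiny illustration of "one statistics per timing kind" (made-up
field names, not the real `Timing` type):

```rust
// Illustrative sketch: one record holds several kinds of values; a separate
// aggregate is computed per kind. In the real evaluator each kind becomes
// its own worksheet (Excel) or its own SVG file (flamegraphs).
struct TimingDelta {
    real_ns: u64,
    cpu_ns: u64,
    sys_ns: u64,
    ctx_switches: u64,
}

fn main() {
    let deltas = vec![
        TimingDelta { real_ns: 1200, cpu_ns: 1100, sys_ns: 50, ctx_switches: 2 },
        TimingDelta { real_ns: 1300, cpu_ns: 1000, sys_ns: 70, ctx_switches: 3 },
    ];

    // One (name, extractor) pair per timing kind.
    let kinds: [(&str, fn(&TimingDelta) -> u64); 4] = [
        ("real time", |d| d.real_ns),
        ("cpu time", |d| d.cpu_ns),
        ("system time", |d| d.sys_ns),
        ("context switches", |d| d.ctx_switches),
    ];

    for (name, get) in kinds {
        let total: u64 = deltas.iter().map(get).sum();
        println!("{name}: total = {total}");
    }
}
```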

The `evaluator::AllFieldsTable` struct has the job of holding the statistics for all 4 timing kinds.

The `evaluator::AllFieldsTableWithOutputPathOrBase` struct bundles
that with an output path (XXX: what exactly is the logic with
`is_final_file`?). Those instances are specific to one of the output
formats (Excel, flamegraphs); the program evaluates separate ones
because the path syntax needs to be different for flamegraphs (to
follow the format required by the
[inferno](https://crates.io/crates/inferno) crate), and also because
for flamegraphs only one kind of path is generated (and also
influenced by flamegraph-specific user options?).

The `evaluator::AllOutputsAllFieldsTable` struct bundles the separate
`evaluator::AllFieldsTableWithOutputPathOrBase` instances for all
requested output formats.

The 3 structs above (`evaluator::AllFieldsTable` /
`evaluator::AllFieldsTableWithOutputPathOrBase` /
`evaluator::AllOutputsAllFieldsTable`) are type-parameterized with a
`<Kind: AllFieldsTableKind>` type. The current such types (implementors
of `evaluator::AllFieldsTableKind`) are `SingleRunStats`,
`SummaryStats` and `TrendStats`; they are currently all empty marker
types, used just to mark the structs so as to clarify what kind of
statistical results they hold.
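
The marker-type pattern itself, sketched with made-up definitions (the
marker names mirror the text, but the actual struct contents differ):

```rust
use std::marker::PhantomData;

// Illustrative sketch: the marker carries no data, it only records at the
// type level what kind of statistics a table holds, so the kinds cannot be
// mixed up by accident.
trait AllFieldsTableKindSketch {}

struct SingleRunStats;
struct SummaryStats;
struct TrendStats;

impl AllFieldsTableKindSketch for SingleRunStats {}
impl AllFieldsTableKindSketch for SummaryStats {}
impl AllFieldsTableKindSketch for TrendStats {}

struct AllFieldsTableSketch<Kind: AllFieldsTableKindSketch> {
    rows: Vec<String>, // stand-in for the real per-field tables
    _kind: PhantomData<Kind>,
}

fn write_summary(_table: &AllFieldsTableSketch<SummaryStats>) {
    // Only accepts summary tables; passing a single-run table is a type error.
}

fn main() {
    let summary: AllFieldsTableSketch<SummaryStats> = AllFieldsTableSketch {
        rows: vec!["fib|fib".into()],
        _kind: PhantomData,
    };
    write_summary(&summary);
    println!("rows: {}", summary.rows.len());
}
```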

The `evaluator::AllOutputsAllFieldsTable` instance is then written to
files via its `write_to_files` method.

<!-- `evaluator::AllFieldsTable::from_log_data_tree` -->

XXXWRONG a single choice is taken, and can be specified on
the command line via `evaluator::options::FlameFieldOpt`, which was
mentioned in the "StatsField" section above; although currently the
program still evaluates the statistics for all 4 kinds first and only
then picks the chosen one for the flamegraphs.



The data structure to hold the 4 kinds of data points:

`AllOutputsAllFieldsTable<Kind: AllFieldsTableKind>`


XXX move this OUT of src/evaluator/data/?! And anyway, why do I do it in a slightly kitschy way here?

Where to? `AllOutputsAllFieldsTable::from_log_data_tree` is the next step -- which is in `src/evaluator/`

3. Path index:

After building the tree, an index over all paths is created.
XXX (which types, and why? What is different from the tree directly?)

### 2.

### X.

`AllOutputsAllFieldsTable<Kind: AllFieldsTableKind>`


### X. Creating the outputs

`StatsField<TILE_COUNT>`


#### Excel



#### Flamegraphs

The [inferno](https://crates.io/crates/inferno) library used for
generating the flamegraphs requires a format where a parent scope's
timing numbers do not include the numbers of its child scopes. This is
unlike the Excel files, where a parent scope is shown with the whole
cost for that scope regardless of which child scopes there may be,
which is both more natural when child scopes can be added to or
removed from the project over time, and helpful because the child
scopes are not immediately visible when reading the Excel file (they
are on different rows in the sheet). The function
[`fix_tree`](../../../src/evaluator/all_outputs_all_fields_table.rs)
converts from the child-inclusive to this child-exclusive format.
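
A minimal sketch of that conversion (made-up node type, not the real
`fix_tree`), emitting inferno-style collapsed-stack lines where each
frame carries only its own, child-exclusive cost:

```rust
// Illustrative sketch: each node's own ("self") cost is its inclusive cost
// minus the inclusive costs of its direct children. Assumes children never
// exceed their parent.
struct Node {
    name: &'static str,
    inclusive_ns: u64,
    children: Vec<Node>,
}

fn print_exclusive(node: &Node, prefix: &str) {
    let children_total: u64 = node.children.iter().map(|c| c.inclusive_ns).sum();
    let exclusive = node.inclusive_ns - children_total;
    // Collapsed-stack line: "frame;frame;frame value"
    let path = if prefix.is_empty() {
        node.name.to_string()
    } else {
        format!("{prefix};{}", node.name)
    };
    println!("{path} {exclusive}");
    for child in &node.children {
        print_exclusive(child, &path);
    }
}

fn main() {
    let tree = Node {
        name: "main|main",
        inclusive_ns: 2_500,
        children: vec![Node { name: "fib|fib", inclusive_ns: 700, children: vec![] }],
    };
    print_exclusive(&tree, "");
    // Prints:
    //   main|main 1800
    //   main|main;fib|fib 700
}
```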

The processing is as follows:

1. First, the same code path as for Excel is used to generate an
`AllOutputsAllFieldsTable<_>`. XXX
35 changes: 35 additions & 0 deletions evobench-evaluator/docs/internals/index.md
@@ -0,0 +1,35 @@
# Source directory overview

## `bin` subdirectory

The source files representing program binaries.

These are the user-relevant programs:

* [`bin/evobench-evaluator.rs`](../../src/bin/evobench-evaluator.rs): produces human-readable outputs from benchmarking log files; does not know where to place files (needs explicit paths), and does not know about running benchmarks
* [`bin/evobench-run.rs`](../../src/bin/evobench-run.rs): runs benchmarking jobs, i.e. produces benchmarking log files in a structured and automatic way (i.e. offers a service plus tools to change and query the service status); calls `evobench-evaluator` to turn them into human-readable outputs.

Other programs (not normally in use, feel free to ignore):

* [`bin/jobqueue.rs`](../../src/bin/jobqueue.rs): a general purpose program to work with queues (just an application of the `key_val_fs` module, perhaps generally useful?)
* [`bin/trying-git.rs`](../../src/bin/trying-git.rs): a program to play with git graphs, mostly to verify the workings of the `git` module.

## Other subdirectories

* [`serde/`](../../src/serde/mod.rs): custom types in config files and other places with user interaction via text
* [`key_val_fs/`](../../src/key_val_fs/mod.rs): a simple key-value database via files, and a queue implementation on top
* [`stats/`](../../src/stats/mod.rs): simple statistics, keeping track of the unit (ns, us, counts) via the type system (see the sketch after this list)
* [`tables/`](../../src/tables/mod.rs): tabular output for Excel; works with [`stats/`](../../src/stats/mod.rs), keeping track of the unit (ns, us, counts) via the type system
* [`evaluator/`](../../src/evaluator/mod.rs): the meat of the `evobench-evaluator` tool
* [`run/`](../../src/run/mod.rs): the meat of the `evobench-run` tool
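
As a small illustration of what "keeping track of the unit via the
type system" means (a minimal sketch, not the actual types in
`stats/`):

```rust
use std::ops::Add;

// Newtype wrappers: a nanosecond value and a plain count cannot be mixed up.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Nanoseconds(u64);

#[derive(Debug, Clone, Copy, PartialEq)]
struct Count(u64);

impl Add for Nanoseconds {
    type Output = Nanoseconds;
    fn add(self, other: Self) -> Self {
        Nanoseconds(self.0 + other.0)
    }
}

fn main() {
    let a = Nanoseconds(1_200);
    let b = Nanoseconds(800);
    println!("{:?}", a + b); // Nanoseconds(2000)
    println!("{:?}", Count(3));
    // let wrong = a + Count(3); // does not compile: the units differ
}
```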

(There are some more, utilities without group documentation:
[`date_and_time/`](../../src/date_and_time/mod.rs),
[`utillib/`](../../src/utillib/mod.rs),
[`io_utils/`](../../src/io_utils/mod.rs).)

## Tool internals documentation

* [evobench-evaluator](evaluator/index.md)

* [evobench-run](runner/index.md)
2 changes: 2 additions & 0 deletions evobench-evaluator/docs/internals/runner/index.md
@@ -0,0 +1,2 @@
# How `evobench-run` works internally
