Wunderbar: Robust parser for .wandb log files

Robust Python parser for W&B's .wandb binary structured log format.

Implemented in Python.
- Dependencies only on protobuf and the protobuf schema from the wandb SDK.
- No dependency on any other internals of the wandb SDK.
- No requirement to authenticate with or sync data to the W&B cloud.
The output of the parser is a stream of pure Python objects (dictionaries) rather than protobuf objects.
The parsers can partially recover from formatting errors in the .wandb files (which could be caused by data corruption, interrupted writes, or a buggy version of the wandb SDK).
Simple, mostly-'functional'-style implementation with highly-localised state management. Should be easy to understand and build from if you need more features, or to port to other languages if you need more speed.

Quick start

Install:

pip install git+https://github.com/matomatical/wunderbar.git

Command-line interface (similar to wandb sync --view):

wunderbar path/to/example-run.wandb           # print list of log records
wunderbar --peek path/to/example-run.wandb    # print overview of each record
wunderbar --verbose path/to/example-run.wandb # print everything
wunderbar --help                              # more options

Library (for example, to extract all logged metrics):

import wunderbar

PATH = 'path/to/example-run.wandb'

records = wunderbar.parse_filepath(path=PATH)
for record in records:
    print(f"Record {record.number} ({record.type})")
    if record.type == "history": # a call to wandb.log(data, step=step)
        step: int = record.data["step"]
        data: dict = record.data["item"]
        print(f"{len(data)} metrics logged at {step=}:")
        for metric, value in data.items():
            print(f"* {metric}: {value}")
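If you would rather collect the metrics than print them, something like the following sketch may be a useful starting point. It relies only on the record.type and record.data fields used in the example above. The helper name collect_history is hypothetical (not part of wunderbar), and the records here are stand-in objects so the snippet runs without a .wandb file.

```python
from collections import defaultdict
from types import SimpleNamespace


def collect_history(records):
    """Group logged values by metric name as lists of (step, value) pairs.

    Hypothetical helper: assumes each record has .type and .data shaped
    like the "history" records in the quick-start example.
    """
    series = defaultdict(list)
    for record in records:
        if record.type != "history":
            continue  # skip config, stats, output, and other record types
        step = record.data["step"]
        for metric, value in record.data["item"].items():
            series[metric].append((step, value))
    return dict(series)


# Stand-in records for illustration; in practice, pass the generator
# returned by wunderbar.parse_filepath(path=PATH).
fake_records = [
    SimpleNamespace(number=0, type="run", data={}),
    SimpleNamespace(number=1, type="history", data={"step": 0, "item": {"loss": 1.0}}),
    SimpleNamespace(number=2, type="history", data={"step": 1, "item": {"loss": 0.5, "acc": 0.9}}),
]
series = collect_history(fake_records)
```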

API overview

Types:

LogRecord(number: int, type: RecordType, data: dict, ...): an entry in the .wandb log.
RecordType: the type of the entry.
- "run": Various metadata.
- "config": Set or change the run configuration.
- "history": A call to wandb.log(data, step=step).
- "files": A file was added.
- "stats": A sample of system statistics.
- "output_raw": Printed to stdout.
- ... Plus several more.
Corruption(note: str): Some un-parse-able binary content, with a brief justification of the problem (e.g. checksum failed).
- TODO: Several sub-types of corruption.

Commonly-used functions:

Parsing functions:
- parse_filepath(path: str | pathlib.Path) -> Generator[LogRecord] parses a file at a given path.
- parse_file(file: typing.IO[bytes]) -> Generator[LogRecord] parses an already-open file-like object.
- parse_data(data: bytes) -> Generator[LogRecord] parses data already in memory.
The same three functions are also available with the suffix _with_corruption and return type -> Generator[LogRecord | Corruption].
- purify(g: Generator[LogRecord | Corruption]) -> Generator[LogRecord] filters out corruption.
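Conceptually, filtering out corruption is a simple generator filter. The sketch below shows the idea with stand-in classes; the names LogRecordStandIn, CorruptionStandIn and purify_sketch are hypothetical, and this is not necessarily how the library's purify is written.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class LogRecordStandIn:
    """Stand-in for LogRecord (only the documented fields)."""
    number: int
    type: str
    data: dict


@dataclass
class CorruptionStandIn:
    """Stand-in for Corruption."""
    note: str


def purify_sketch(
    items: Iterable[Union[LogRecordStandIn, CorruptionStandIn]],
) -> Iterator[LogRecordStandIn]:
    """Yield only the successfully parsed records, dropping corruption."""
    for item in items:
        if isinstance(item, LogRecordStandIn):
            yield item


mixed = [
    LogRecordStandIn(0, "run", {}),
    CorruptionStandIn("checksum failed"),
    LogRecordStandIn(1, "history", {"step": 0, "item": {"loss": 1.0}}),
]
clean = list(purify_sketch(mixed))
```

Working with the _with_corruption variants directly is the way to go if you want to report or count the corrupt spans rather than silently skip them.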

See code for full details.

TODO: document code.

About the .wandb file format and the W&B SDK

The .wandb files that this library is designed to parse are included with every wandb run folder. They store experiment configuration information and metrics in a custom binary 'structured log' format.

In brief, each fact, file, system statistic, experiment metric or other thing logged by W&B is encoded via protobuf and then the data from each logging event is appended to the binary file in a format akin to the log files from LevelDB.

In slightly more detail, the log format has two conceptual layers:

W&B LevelDB-like log format. At a low level, the log is structured using a robust storage format that is a variant of the LevelDB log format. This means the file is a sequence of 32 KiB 'blocks', and each block contains a sequence of 'chunks' containing individual log items together with a 7-byte header storing their size and a checksum.

Actually, if a log item would straddle a block boundary, it's broken up into a sequence of partial chunks, so that each block always starts with the start of a chunk. This system allows safely recovering from the next block boundary in the event of encountering corrupt data while reading (the Python reader in the W&B codebase doesn't support this kind of error recovery, but the newer W&B core Golang reader does).

This is essentially the description of the LevelDB log format itself, but in W&B's case, there are some small differences from the LevelDB log format, namely the choice of checksum algorithm and the inclusion of an additional 7-byte file header at the beginning of the first block.
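As a rough illustration of the chunk layout described above, the following sketch splits one block into its chunks. It assumes the standard LevelDB header layout (a 4-byte little-endian checksum, a 2-byte little-endian length, then a 1-byte chunk type), assumes the 7-byte file header has already been stripped from the first block, and does not verify checksums. The name iter_chunks is hypothetical and is not part of this library.

```python
import struct

BLOCK_SIZE = 32 * 1024  # 32 KiB blocks, as described above
HEADER_SIZE = 7         # 4-byte checksum + 2-byte length + 1-byte chunk type


def iter_chunks(block: bytes):
    """Yield (checksum, chunk_type, payload) for each chunk in one block.

    Assumption: header fields are little-endian, as in the LevelDB log
    format. Checksums are returned but not verified in this sketch.
    """
    offset = 0
    # Fewer than HEADER_SIZE bytes at the end of a block are padding.
    while offset + HEADER_SIZE <= len(block):
        checksum, length, chunk_type = struct.unpack_from("<IHB", block, offset)
        if checksum == 0 and length == 0 and chunk_type == 0:
            break  # an all-zero header: treat the rest as padding
        start = offset + HEADER_SIZE
        end = start + length
        if end > len(block):
            raise ValueError("chunk length overruns the block")
        yield checksum, chunk_type, block[start:end]
        offset = end


# A tiny synthetic 'block': two chunks followed by 5 bytes of zero padding.
# The checksums here are dummy values, since this sketch does not check them.
demo_block = (
    struct.pack("<IHB", 0, 3, 1) + b"abc"
    + struct.pack("<IHB", 0, 2, 1) + b"de"
    + bytes(5)
)
demo_chunks = list(iter_chunks(demo_block))
```

A full reader would also verify each checksum and stitch partial chunks back into complete log items using the chunk type.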

W&B Protobuf record format. The contents of the log items are in this case binary records serialised with protobuf using the schema from the wandb SDK (linked below).

The schema includes different record types for various notable aspects of a running experiment, including most of the stuff stored in other files in the run folder (config, metadata, any strings written to stderr/stdout), samples of system statistics, environment telemetry (thanks!), and, of course, all data logged explicitly with wandb.log.

In the latter case, each dictionary logged is stored as a list of key/value pairs with the keys encoded as strings and the values encoded as JSON strings. This means that if you were to actually inspect the bytes of the .wandb file, you'd see your metrics in plain text, interspersed with binary separators from protobuf (and occasionally interrupted by a LevelDB header if the record straddles a block boundary).
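To make the key/value encoding concrete, here is a minimal sketch of turning such pairs back into a dictionary. The (key, JSON string) tuples are a simplified stand-in for the decoded protobuf items, and decode_history_items is a hypothetical name.

```python
import json


def decode_history_items(pairs):
    """Build a dict from (key, value_json) pairs.

    Each value is a JSON string, so numbers, strings, lists and nested
    objects all round-trip through json.loads.
    """
    return {key: json.loads(value_json) for key, value_json in pairs}


# Stand-in for the key/value pairs of one logged dictionary.
example_pairs = [("loss", "0.25"), ("epoch", "3"), ("note", '"warmup"')]
decoded = decode_history_items(example_pairs)
```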

The W&B SDK includes the following code related to this format:

1. The protobuf schema used for encoding log items (https://github.com/wandb/wandb/blob/main/wandb/proto/wandb_internal.proto).
2. Code used for writing .wandb databases during an experiment, in both the old Python backend (https://github.com/wandb/wandb/blob/main/wandb/sdk/internal/datastore.py) and the new "core" (Go) backend (https://github.com/wandb/wandb/blob/main/core/pkg/leveldb/record.go).
3. The same code also supports reading the binary logs, which is done during cloud sync. The wandb CLI also supports printing a string rendering of the contents of the database to stdout via wandb sync --view --verbose. Note that the Python backend reader does not support recovering data from partially-corrupted (or partially-improperly-written) .wandb files.

This library is a Python replacement for (3) that draws on (1) but with an independent implementation of a decoder for the LevelDB log format that is more resilient to errors, and produces pure-Python output objects.

Response to a historical bug in the W&B core `.wandb` log writer

Another feature of this library is that it is resilient to a historical bug in the W&B core backend prior to wandb version 0.17.6. Prior to this fix, the new Golang .wandb writer failed to account for the 7-byte file header in computing the 32KiB block boundaries. As a result, when faced with writing data that would straddle a block boundary, the writer would make the following decisions:

1. With 7 or more bytes remaining before the next 32KiB block boundary, the broken writer would split the record into multiple parts, as expected, but the first part would be sized so as to end 7 bytes into the next 32KiB block.
2. With fewer than 7 bytes remaining before the next 32KiB block boundary, the broken writer would write a small record with a 7-byte header and a small amount of data, ending within the first 7 bytes of the next block, when the expected behaviour would be to pad to the block boundary with zeros.
3. When fewer than 7 bytes past the start of a new block, the broken writer would pad to the 7-byte mark with zeros, when the expected behaviour would be to write the next chunk immediately.

The W&B SDK's reader actually doesn't check whether chunk lengths fit inside the current block, so if (1) were the only issue, then the W&B library would be able to read these logs without any issue. However, issues (2) and (3) cause problems for the standard readers, and any logs that happen to display these symptoms of the bug would be unreadable to them. The standard error recovery protocol is also invalidated by (1) and (2) as in most cases the attempt to resume reading from a next block will fail due to the presence of the last 7 bytes of a chunk at the beginning of a block boundary where a chunk header is expected. As a result, it is impossible to extract the data in these .wandb logs with the W&B SDK.
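The boundary arithmetic behind the three cases above can be sketched as follows. This is one reading of the description (the broken writer measures blocks from just after the file header, so its boundaries sit 7 bytes late), not a transcription of the W&B code; both function names are hypothetical.

```python
BLOCK_SIZE = 32 * 1024  # 32 KiB
FILE_HEADER_SIZE = 7    # extra W&B file header at the start of the first block


def space_left_correct(file_offset: int) -> int:
    """Bytes before the next true block boundary (multiples of 32 KiB
    measured from the very start of the file)."""
    return BLOCK_SIZE - file_offset % BLOCK_SIZE


def space_left_buggy(file_offset: int) -> int:
    """Bytes the broken writer would believe are left, assuming it
    measures blocks from just after the 7-byte file header."""
    return BLOCK_SIZE - (file_offset - FILE_HEADER_SIZE) % BLOCK_SIZE


# 3 bytes before a true boundary: a correct writer sees too little room
# for a 7-byte header and pads, but the broken view sees 10 bytes and
# writes a small chunk that ends 7 bytes into the next block (case 2).
near_boundary = BLOCK_SIZE - 3
correct_view = space_left_correct(near_boundary)
buggy_view = space_left_buggy(near_boundary)

# 2 bytes past a true boundary: the broken view sees only 5 bytes left
# in 'its' block, so it pads up to the 7-byte mark (case 3).
just_past = BLOCK_SIZE + 2
buggy_view_past = space_left_buggy(just_past)
```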

The parse methods in this library include an optional Boolean flag, exclude_header_from_first_block, which, if set to True, will correctly parse logs generated with this broken writer by anticipating its mistakes.

You might be able to tell if your logs were written by this broken writer (and therefore whether to parse them with this flag) by checking the version of the SDK and the backend used to generate them. However, if you don't have easy access to this information, you can just try parsing the file once with the flag and once without, and see which option recovers more data.
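The try-both approach might look something like the sketch below. It takes the parsing function as an argument so that the snippet is self-contained; in practice you would pass wunderbar.parse_filepath. It assumes the flag can be passed as a keyword argument, and it uses the number of recovered records as a simple proxy for "more data". The helper names are hypothetical.

```python
from typing import Callable, Iterable


def count_records(parse: Callable[..., Iterable], path: str, flag: bool) -> int:
    """Count the records recovered with the flag set to the given value."""
    records = parse(path=path, exclude_header_from_first_block=flag)
    return sum(1 for _ in records)


def should_use_flag(parse: Callable[..., Iterable], path: str) -> bool:
    """Return True if parsing with the flag recovers strictly more records."""
    without_flag = count_records(parse, path, False)
    with_flag = count_records(parse, path, True)
    return with_flag > without_flag


# Intended usage (requires wunderbar and a real file):
#     use_flag = should_use_flag(wunderbar.parse_filepath, PATH)

# Stand-in parser for illustration: pretends the flag recovers more records.
def fake_parse(path, exclude_header_from_first_block=False):
    yield from range(5 if exclude_header_from_first_block else 2)


decision = should_use_flag(fake_parse, "path/to/example-run.wandb")
```

Ties default to parsing without the flag, since that is the format written by correct versions of the SDK.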

Roadmap

Planned features:

Verification and testing:

Type annotations and comprehensive type-checking with mypy.
Test parsing on a large log without errors.
- Fix off-by-one error causing the problem.
Test error recovery on a large log with a mysterious padding error.
- Trace the cause to a historical bug in wandb core
Automatic unit tests for the individual layer parsers.
Automatic integration tests for end-to-end parsers (Generate some small (<1MB) logs with wandb SDK, including core and legacy backends, including buggy version of core (0.17.5); compare output with wandb SDK's readers).

Performance:

Speed up with Rust extension (check out pyo3).
Performance benchmarking:
- wunderbar Python parsers.
- wandb-legacy Python parsers.
- wandb-core Golang parsers.
- wunderbar Rust parsers.

Documentation:

Brief README.
Document the format.
Document the historic variations in the format.
Docstrings in the code.
Generate a (single-page?) API reference.
