Collection of CSV/TAB sample files for testing #21

@wavexx

@firecat53, @scls19fr

Random issue here: on a related project I'm working on at work™, I got the chance to adapt some existing code I wrote for parsing large text files. The API I designed relies on a dumb/simple streaming approach which never actually loads any part of the file into memory (not even cell contents).

It allows you, for example, to seek to an arbitrary position in a text file, align to line/cell boundaries, and start reading either forward or backward from there. I'm using it mostly to preview large genetic data files (10+ GB in size nowadays), as an "enhanced od(1)". The tool I built on top of it is crap, though: it was done in a rush in less than a day, so there's nothing worth saving.
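To make the "seek, then align" idea concrete, here is a minimal Python sketch. It is not the code described above: the names `align_to_line_start` and `cells_at` are made up for illustration, and it assumes a single-byte-compatible encoding, `\n` or `\r\n` line endings, and no quoted fields containing newlines or delimiters.

```python
import io


def align_to_line_start(f, offset, chunk=4096):
    """Return the byte offset where the line containing `offset` begins.

    Scans backward in fixed-size chunks, so only `chunk` bytes are held
    in memory at a time regardless of file size.
    """
    pos = offset
    while pos > 0:
        start = max(0, pos - chunk)
        f.seek(start)
        data = f.read(pos - start)
        idx = data.rfind(b"\n")
        if idx != -1:
            # The line starts right after the previous newline.
            return start + idx + 1
        pos = start
    return 0  # reached the beginning of the file


def cells_at(f, offset, delimiter=b","):
    """Return (line_start, cells) for the line containing `offset`.

    Naive split: quoting and escaping are deliberately ignored here.
    """
    start = align_to_line_start(f, offset)
    f.seek(start)
    line = f.readline().rstrip(b"\r\n")
    return start, line.split(delimiter)


sample = io.BytesIO(b"a,b\nc,d\ne,f\n")
start, cells = cells_at(sample, 5)
```

The same functions work on a file opened with `open(path, "rb")`. Handling quoted cells properly would need the scan to track quote state, which this sketch does not attempt.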

My idea is as follows: clean up the CSV reading code a bit and make it part of the tabview-common series of projects. Build a Cython module for it, exposing an API ideally identical to the "csv" module, so that it can be swapped in under the hood (for those who need the extra speed) without creating additional dependencies. For instance, the separator/format detection is actually much more robust than CSV_sniffer and doesn't suffer from the Python 2/3 universal newlines issues.
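A rough sketch of what "swapped in under the hood" could look like on the caller's side, assuming the accelerated module really does mirror the "csv" API. The module name `tabview_fastcsv` is purely hypothetical; no such package is referenced in this issue.

```python
import importlib


def load_csv_backend(preferred="tabview_fastcsv"):
    """Return the preferred csv-compatible module, or the stdlib csv module.

    `preferred` is a hypothetical module name used only for illustration.
    """
    try:
        return importlib.import_module(preferred)
    except ImportError:
        import csv
        return csv


# Callers use the same API whichever backend was found.
backend = load_csv_backend("no_such_fast_csv_module")
rows = list(backend.reader(["a,b", "1,2"]))
```

Since the fallback is the standard library, the fast module stays an optional extra rather than a hard dependency.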

The main issue is that the existing code doesn't handle all the possible CSV escaping/quoting/crappy-format issues. I'd like to build a small set of tabular text files (10x10 cells at most) that exercise all the possible text format issues, including broken output from Excel, mysql, and whatnot, different line endings, files with/without the UTF-8 BOM, several encodings, different header styles, etc.
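As a starting point, the well-formed variants (delimiter, line ending, encoding, BOM) can be generated from one small table. This is only a sketch: the file names and the `render`/`build_samples` helpers are invented here, and it covers only what Python's own "csv" writer produces. The broken outputs from real tools would still have to be collected by hand.

```python
import codecs
import csv
import io

# A tiny table with a non-ASCII character, an embedded quote,
# an embedded delimiter, and an embedded newline.
ROWS = [
    ["id", "name", "note"],
    ["1", "Zoë", 'said "hi"'],
    ["2", "a,b", "two\nlines"],
]


def render(rows, delimiter=",", lineterminator="\r\n",
           encoding="utf-8", bom=False):
    """Serialize `rows` to bytes with the given format options."""
    buf = io.StringIO(newline="")  # no newline translation on write
    writer = csv.writer(buf, delimiter=delimiter,
                        lineterminator=lineterminator)
    writer.writerows(rows)
    data = buf.getvalue().encode(encoding)
    return codecs.BOM_UTF8 + data if bom else data


def build_samples(rows):
    """Return a {filename: bytes} mapping of format variants."""
    return {
        "comma_crlf_utf8.csv": render(rows),
        "comma_lf_utf8.csv": render(rows, lineterminator="\n"),
        "comma_crlf_utf8_bom.csv": render(rows, bom=True),
        "tab_lf_latin1.tsv": render(rows, delimiter="\t",
                                    lineterminator="\n",
                                    encoding="latin-1"),
    }


samples = build_samples(ROWS)
```

Writing each entry of `samples` to disk in binary mode gives a reproducible baseline that any parser in the test suite should read back as `ROWS`.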

I couldn't find any project that has something similar. For instance, I expected to find some demo files in libcsv, fccp, xsv, or Python's csv library code, but found nothing.

It would be nice if you could help: if you come across a different text file format, just save a chunk of it somewhere so we can build a proper test suite for all the tabview projects.
