[C++] CSV reader: add a default column type (or sentinel mapping) to avoid per-column enumeration #47502

@dxdc

Description

Motivation / use-case

Many users need to coerce all columns of a CSV to the same Arrow type—most commonly string() to keep raw text—when the schema is unknown or very wide.

Today the API only permits either:

  • passing an explicit column_types={"colA": pa.string(), …} map, or
  • letting the reader infer per-column types.

That forces callers to a) know every header in advance and b) enumerate them, which is painful for dynamic files.
The limitation was raised in ARROW-5811.

Current docs confirm no built-in way exists beyond the explicit map.
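Until such an option exists, the enumeration can be automated by reading the header row before handing the file to the CSV reader. The following is a minimal sketch of that workaround, not part of the proposal: header_names and read_all_as_string are hypothetical helper names, and the code assumes a UTF-8 file whose first row is the header.

```python
import csv


def header_names(path, delimiter=","):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f, delimiter=delimiter))


def read_all_as_string(path):
    """Read a CSV with every column forced to string (requires pyarrow)."""
    # Imported lazily so header_names() stays usable without pyarrow.
    import pyarrow as pa
    import pyarrow.csv as pcsv

    # Enumerate every header so that no column falls back to type inference.
    types = {name: pa.string() for name in header_names(path)}
    opts = pcsv.ConvertOptions(column_types=types)
    return pcsv.read_csv(path, convert_options=opts)
```

The cost of this approach is that the file is opened twice and the header parsing has to be kept in sync with whatever ParseOptions the real read uses, which is the boilerplate the proposal aims to remove.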


Proposed change

Option A – sentinel entry in column_types

Honor a magic key (e.g. "*", "__default__", or a constant kWildcardColumn) inside ConvertOptions.column_types.
Lookup order in MakeConversionSchema() becomes:

  1. exact match in column_types
  2. sentinel key
  3. current fallback (type inference)
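The lookup order above can be sketched in a few lines of Python. This is an illustration of the cascade only, not the C++ implementation: resolve_type is a hypothetical name, plain strings stand in for Arrow DataType objects, and None stands for "fall back to inference".

```python
SENTINEL = "*"  # hypothetical wildcard key; "__default__" would work the same way


def resolve_type(name, column_types, sentinel=SENTINEL):
    """Return the requested type for `name`, or None to mean 'infer'."""
    if name in column_types:      # 1. exact match in column_types
        return column_types[name]
    if sentinel in column_types:  # 2. sentinel key
        return column_types[sentinel]
    return None                   # 3. current fallback (type inference)


column_types = {"*": "string", "id": "int64"}
resolved = {name: resolve_type(name, column_types) for name in ["id", "name"]}
# resolved == {"id": "int64", "name": "string"}
```

One consequence visible in the sketch: a CSV that really contains a column named "*" would hit the exact-match branch with the sentinel entry, so the sentinel string has to be chosen so that it is unlikely to collide with real headers.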

Option B – new field default_column_type

Add std::shared_ptr<DataType> default_column_type = nullptr to ConvertOptions.
If non-null, columns not listed in column_types are converted with that type.
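A rough Python model of Option B, under the same simplifications (strings in place of DataType, None in place of nullptr). ConvertOptionsSketch and conversion_plan are hypothetical names used only for illustration; they are not PyArrow or Arrow C++ APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ConvertOptionsSketch:
    """Hypothetical stand-in for ConvertOptions; not the real class."""
    column_types: Dict[str, str] = field(default_factory=dict)
    default_column_type: Optional[str] = None  # None plays the role of nullptr


def conversion_plan(headers: List[str], opts: ConvertOptionsSketch) -> Dict[str, Optional[str]]:
    """Map each header to its requested type; None means 'infer'."""
    return {h: opts.column_types.get(h, opts.default_column_type) for h in headers}


opts = ConvertOptionsSketch(column_types={"id": "int64"}, default_column_type="string")
plan = conversion_plan(["id", "name", "city"], opts)
# plan == {"id": "int64", "name": "string", "city": "string"}
```

Because the default lives in its own field, no header name is reserved, and leaving default_column_type unset reproduces today's behaviour for unlisted columns.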

Both approaches are backwards-compatible; Option B is explicit and avoids magic strings, while Option A is a one-line API addition.


Python examples

import pyarrow as pa
import pyarrow.csv as pcsv

# Both snippets illustrate the proposed behaviour, which current PyArrow does not implement.

# Option A (sentinel): "*" supplies the type for every column not listed by name
opts = pcsv.ConvertOptions(column_types={"*": pa.string(), "id": pa.int64()})
tbl = pcsv.read_csv("data.csv", convert_options=opts)

# Option B (explicit field)
opts = pcsv.ConvertOptions(
    default_column_type=pa.string(),  # NEW
    column_types={"id": pa.int64()},  # explicit override
)
tbl = pcsv.read_csv("data.csv", convert_options=opts)

Affected code (C++ path overview)

Layer: Public API
  File: cpp/src/arrow/csv/options.h
  Change summary:
    • Add std::shared_ptr<DataType> default_column_type; to struct ConvertOptions (Option B), or define static const std::string kWildcardColumn = "__default__"; (Option A).
    • Document the new knob in the Doxygen comment.
  Notes: Keeps the setting user-visible.

  File: cpp/src/arrow/csv/options.cc
  Change summary:
    • In ConvertOptions::Defaults(), initialise opts.default_column_type = nullptr;.
    • Extend ConvertOptions::Validate() to raise Status::Invalid for an illegal dtype or duplicate sentinel.
  Notes: Ensures default behaviour remains unchanged.

Layer: Core logic
  File: cpp/src/arrow/csv/reader.cc, inside MakeConversionSchema()
  Change summary: Replace the existing two-branch decision with a three-branch cascade:
    1. explicit mapping
    2. default_column_type / sentinel
    3. infer type (legacy path)
  Notes: ~10 LOC patch; confined to one lambda.

Layer: Unit tests (C++)
  File: cpp/src/arrow/csv/options_test.cc (new)
  Change summary: Add three cases:
    • default only: every column gets that type.
    • default + explicit overrides: explicit wins.
    • default == nullptr: legacy inference.
  Notes: Guards against regressions.

Layer: Python binding
  File: python/pyarrow/_csv.pyx (Cython)
  Change summary:
    • Expose default_column_type keyword (accept None or DataType).
    • Map to/from the underlying C++ field.
  Notes: Maintains PyArrow feature parity.

  File: python/pyarrow/tests/test_csv.py
  Change summary: Mirror the three C++ test scenarios.
  Notes: Confirms binding wiring.

Layer: Documentation
  File(s): docs/source/cpp/csv.rst, docs/source/python/csv.rst
  Change summary: Add one bullet and a quick example for the new option.
  Notes: Makes the feature discoverable.

Layer: Other bindings (optional)
  File(s): R, GLib, Rust wrappers
  Change summary: Add the field/property if those wrappers already expose ConvertOptions.
  Notes: Can be staged separately.

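Build system: No new CMake or Meson targets are required for the library itself; the dataset and file-CSV paths automatically inherit the updated ConvertOptions. The new options_test.cc would still need to be listed among the existing CSV test sources.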


Cross-language bindings checklist

Language            File / area                              Binding note
Python (pyarrow)    _csv.pyx                                 add default_column_type kwarg, mapping None to nullptr
R (arrow::r::csv)   r/src/                                   mirror the field in convert_options() constructor
GLib                glib/arrow-gio/csv-options.cpp           expose property default-column-type
Rust                arrow-csv crate                          add default_column_type: Option<DataType>
Java / JNI          none (CSV reader lives in C++ backend)   no change

These additions are mechanical once the C++ core is in place.


Component(s)

C++
