Description
Motivation / use-case
Many users need to coerce all columns of a CSV to the same Arrow type—most commonly string() to keep raw text—when the schema is unknown or very wide.
Today the API only permits either:
- passing an explicit column_types={"colA": pa.string(), …} map, or
- letting the reader infer per-column types.
That forces callers to a) know every header in advance and b) enumerate them, which is painful for dynamic files.
The limitation was raised in ARROW-5811.
Current docs confirm no built-in way exists beyond the explicit map.
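To make the pain point concrete, here is a minimal sketch of the workaround callers need today: read the header row first, then build the full map by hand. The helper name `uniform_column_types` is hypothetical, and the type values are plain placeholder strings so the snippet runs without pyarrow installed.

```python
import csv
import io


def uniform_column_types(csv_file, default_type, overrides=None):
    """Build a column_types-style dict covering every header in csv_file.

    csv_file is any text-mode file object positioned at the header row.
    """
    header = next(csv.reader(csv_file))      # first row holds the column names
    mapping = {name: default_type for name in header}
    mapping.update(overrides or {})          # explicit entries win over the default
    return mapping


# Placeholder strings stand in for Arrow types such as pa.string() / pa.int64().
sample = io.StringIO("id,name,zip\n1,Ada,02139\n")
types = uniform_column_types(sample, "string", {"id": "int64"})
print(types)
```

With pyarrow available, the same helper could be called with `pa.string()` and `{"id": pa.int64()}`, and the result passed as `pcsv.ConvertOptions(column_types=types)`. The cost is an extra pass over the header before the real read, which is the boilerplate this proposal aims to remove.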
Proposed change
Option A – sentinel entry in column_types
Honor a magic key (e.g. "*", "__default__", or a constant kWildcardColumn) inside ConvertOptions.column_types.
Lookup order in MakeConversionSchema() becomes:
- exact match in column_types
- sentinel key
- current fallback (type inference)
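The lookup order above can be sketched in a few lines of Python. This is only an illustration of the proposed cascade, not the actual C++ logic in MakeConversionSchema(); the function name `resolve_type` and the choice of `"*"` as sentinel are assumptions, and type values are placeholder strings.

```python
WILDCARD = "*"  # hypothetical sentinel key; the proposal leaves the exact spelling open


def resolve_type(name, column_types, sentinel=WILDCARD):
    """Return the requested type for a column, or None to mean 'infer it'."""
    if name in column_types:         # 1. exact match in column_types
        return column_types[name]
    if sentinel in column_types:     # 2. sentinel key supplies the default
        return column_types[sentinel]
    return None                      # 3. current fallback: type inference


opts = {"*": "string", "id": "int64"}
print(resolve_type("id", opts), resolve_type("name", opts))
```

One consequence worth noting: a CSV column that is literally named like the sentinel cannot be given its own type independently of the default, which is the usual drawback of a magic key.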
Option B – new field default_column_type
Add std::shared_ptr<DataType> default_column_type = nullptr to ConvertOptions.
If non-null, columns not listed in column_types are converted with that type.
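As a rough model of Option B, the sketch below uses a small dataclass as a stand-in for ConvertOptions. The class `ConvertOptionsSketch` and its `type_for` method are hypothetical and exist only to show the intended semantics: `None` plays the role of `nullptr`, and type values are placeholder strings.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ConvertOptionsSketch:
    """Hypothetical stand-in for ConvertOptions with the proposed field."""

    column_types: Dict[str, Any] = field(default_factory=dict)
    default_column_type: Optional[Any] = None  # None mirrors nullptr: keep inferring

    def type_for(self, name: str) -> Optional[Any]:
        """Return the type to convert `name` with, or None to infer it."""
        if name in self.column_types:          # explicit entry always wins
            return self.column_types[name]
        return self.default_column_type        # may be None, meaning legacy inference


opts = ConvertOptionsSketch(column_types={"id": "int64"}, default_column_type="string")
print(opts.type_for("id"), opts.type_for("name"))
```

Because the default lives in its own field, every string remains a valid column name, and leaving the field unset reproduces today's behaviour.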
Both approaches are backwards-compatible; Option B is explicit and avoids magic strings, while Option A is a one-line API addition.
Python examples
import pyarrow as pa, pyarrow.csv as pcsv
# Option A (sentinel)
opts = pcsv.ConvertOptions(column_types={"*": pa.string(), "id": pa.int64()})
tbl = pcsv.read_csv("data.csv", convert_options=opts)
# Option B (explicit field)
opts = pcsv.ConvertOptions(
    default_column_type=pa.string(),  # NEW
    column_types={"id": pa.int64()},  # explicit override
)
tbl = pcsv.read_csv("data.csv", convert_options=opts)
Affected code (C++ path overview)
| Layer | File(s) | Change summary | Notes |
|---|---|---|---|
| Public API | cpp/src/arrow/csv/options.h | • Add std::shared_ptr<DataType> default_column_type; to struct ConvertOptions (Option B) or define static const std::string kWildcardColumn = "__default__"; (Option A). • Document the new knob in the Doxygen comment. | Keeps the setting user-visible. |
| | cpp/src/arrow/csv/options.cc | • In ConvertOptions::Defaults(), initialise opts.default_column_type = nullptr. • Extend ConvertOptions::Validate() to raise Status::Invalid for an illegal dtype or duplicate sentinel. | Ensures default behaviour remains unchanged. |
| Core logic | cpp/src/arrow/csv/reader.cc, inside MakeConversionSchema() | Replace the existing two-branch decision with a three-branch cascade: 1. explicit mapping → 2. default_column_type / sentinel → 3. infer type (legacy path). | ~10 LOC patch; confined to one lambda. |
| Unit tests (C++) | cpp/src/arrow/csv/options_test.cc (new) | Add three cases: • default only: every column gets that type. • default + explicit overrides: explicit wins. • default == nullptr: legacy inference. | Guards against regressions. |
| Python binding | python/pyarrow/_csv.pyx (Cython) | • Expose default_column_type keyword (accept None or DataType). • Map to/from the underlying C++ field. | Maintains PyArrow feature parity. |
| | python/pyarrow/tests/test_csv.py | Mirror the three C++ test scenarios. | Confirms binding wiring. |
| Documentation | docs/source/cpp/csv.rst, docs/source/python/csv.rst | Add one bullet and a quick example for the new option. | Makes the feature discoverable. |
| Other bindings (optional) | R, GLib, Rust wrappers | Add the field/property if those wrappers already expose ConvertOptions. | Can be staged separately. |
Build system: No CMake or Meson tweaks are required; the dataset/file-CSV paths automatically inherit the updated ConvertOptions.
Cross-language bindings checklist
| Language | File / area | Binding note |
|---|---|---|
| Python (pyarrow) | _csv.pyx | add default_column_type kwarg with None ⇒ nullptr |
| R (arrow::r::csv) | r/src/ | mirror the field in convert_options() constructor |
| GLib | glib/arrow-gio/csv-options.cpp | expose property default-column-type |
| Rust | arrow-csv crate | add default_column_type: Option<DataType> |
| Java / JNI | none (CSV reader lives in C++ backend) | no change |
These additions are mechanical once the C++ core is in place.
Component(s)
C++