Automatically load dataset as pandas #1251

@mfeurer

This issue is a proposal that we (1) load datasets as pandas DataFrames by default and (2) rewrite the dataset loader to be pandas-based by default, converting to numpy only if the user requests a numpy array.

The reasons for this proposal are:

  1. pandas is much more stable than it was a few years ago when we started this project, and it can now also handle strings properly (see Proposal: Use pandas str type for str datasets #1107).
  2. pandas can properly encode categorical columns, which can make it easier for projects building on OpenML-Python to handle these categories.
  3. We will use parquet in the background to store files anyway, which has to be interfaced with pandas.
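
To make the proposal concrete, here is a rough sketch of the pandas-first behaviour with conversion to numpy on request. It uses only pandas and numpy and is not the OpenML-Python API: the helper `to_requested_format`, its `dataset_format` parameter, and the toy columns are all hypothetical names chosen for illustration. It also assumes the frame holds only numeric and categorical columns.

```python
import numpy as np
import pandas as pd


def to_requested_format(frame: pd.DataFrame, dataset_format: str = "dataframe"):
    """Hypothetical helper: keep pandas by default, convert to numpy on request."""
    if dataset_format == "dataframe":
        # Default path: hand back the DataFrame with its dtypes intact.
        return frame
    if dataset_format == "array":
        encoded = frame.copy()
        for name in encoded.columns:
            if isinstance(encoded[name].dtype, pd.CategoricalDtype):
                # Replace each category with its integer code (-1 marks missing).
                encoded[name] = encoded[name].cat.codes
        # Assumption: every remaining column is numeric.
        return encoded.to_numpy(dtype=np.float64)
    raise ValueError(f"unknown dataset_format: {dataset_format!r}")


# Toy data standing in for a loaded dataset.
frame = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.3],
        "species": pd.Categorical(["setosa", "setosa", "virginica"]),
    }
)

as_frame = to_requested_format(frame)
as_array = to_requested_format(frame, dataset_format="array")
```

In the default case the categorical column keeps its category labels, which is the benefit described in reason 2; the numpy path loses them and only carries integer codes.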
