Create TypeInfo, add type inference unit tests #142

AJChapman · 2019-12-18T03:14:56Z

I was trying to figure out how Frames' type inference works. I have to move on to something else for now, but some of what I've done may be useful so I'm opening this pull request to contribute it.

One of my changes was to replace `Either (String -> Q [Dec]) Type` with a new data type: `TypeInfo`.

The other change is to add some type inference unit tests. They all pass, although the behaviour they expect is not what I would like it to be. I have ideas for a more general type inference mechanism, but no time to implement it at this stage.

Replaces `Either (String -> Q [Dec]) Type` with a new data type: `TypeInfo`. Also adds unit tests to explore what type inference currently does.

acowley · 2019-12-18T14:26:03Z

I like this idea, thank you! I'm going to look at it more closely before merging.

Are the tests you don't like the ones that take, e.g., `1.0` to `Int`? I think we added that at some point because folks had data coming from languages that represent all numbers that way. You'd then have a column that used numbers as a kind of enum (e.g. `1.0`, `2.0`, and `3.0`). Since we look at a prefix of the column rather than just one value, it seemed reasonably safe to assume that if we saw nothing but zeros after the decimal point, the textual representation was a quirk and those numbers could be treated as `Int`. Another option would be to require a preprocessing step on the user's part, but the silent inference that's in place now has never prompted any reported issues.
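To make the rule described above concrete, here is a minimal sketch of prefix-based numeric inference. It is not Frames' implementation: the names `NumCol`, `classify`, and `inferPrefix` are hypothetical, cells are plain `String`s, and the prefix length is passed in as a parameter.

```haskell
import Data.Char (isDigit)

-- | Hypothetical result type for this sketch (not a Frames type).
data NumCol = IntCol | DoubleCol | NotNumeric
  deriving (Eq, Show)

-- | Classify one cell. Digits only, or digits followed by an all-zero
-- fractional part, count as Int-like; other decimals count as Double-like.
classify :: String -> NumCol
classify s =
  case span isDigit (dropSign s) of
    ([], _)  -> NotNumeric
    (_, "")  -> IntCol
    (_, '.' : frac)
      | null frac         -> NotNumeric
      | all (== '0') frac -> IntCol
      | all isDigit frac  -> DoubleCol
    _        -> NotNumeric
  where
    dropSign ('-' : rest) = rest
    dropSign other        = other

-- | Look only at the first n cells. Any non-numeric cell rules out a
-- numeric column; any non-zero fractional part forces Double.
inferPrefix :: Int -> [String] -> NumCol
inferPrefix n cells =
  case map classify (take n cells) of
    [] -> NotNumeric
    cs | NotNumeric `elem` cs -> NotNumeric
       | DoubleCol `elem` cs  -> DoubleCol
       | otherwise            -> IntCol
```

Under these assumptions, `inferPrefix 100 ["1.0", "2.0", "3.0"]` yields `IntCol`, while a single `"2.5"` inside the prefix yields `DoubleCol`. A non-zero fraction that first appears after the prefix would go unnoticed, which is the trade-off being weighed in the comment.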

AJChapman · 2019-12-18T23:34:04Z

No, I was OK with `1.0` being `Int`. It was when I added the custom data type (`ZipT`, from one of the examples) that things got weird. `ZipT` accepts five-character strings, so when you add it to your universe of types, the value `"False"` suddenly switches from definitely `Bool` to an uncertain type (I forget which). This may be fair, because "False" really could be a postcode. But `["False", "True"]` is also uncertain: it falls back to `Text` instead of realising that it should be `Bool`.
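The ambiguity can be reproduced in a few lines. This is only an illustration with hypothetical names (`Candidate`, `boolC`, `zipC`, `fitsColumn`); in particular `zipC` is a stand-in that accepts any five-character string, as described above, and is not the `ZipT` definition from the Frames examples.

```haskell
-- | A candidate type: a display name plus a predicate saying whether a
-- cell parses as that type. Illustrative only, not Frames' API.
type Candidate = (String, String -> Bool)

boolC :: Candidate
boolC = ("Bool", (`elem` ["True", "False"]))

-- | Stand-in for ZipT: accepts any five-character string.
zipC :: Candidate
zipC = ("ZipT", \s -> length s == 5)

-- | Names of the candidates that accept every cell in the column.
fitsColumn :: [Candidate] -> [String] -> [String]
fitsColumn cands cells = [name | (name, ok) <- cands, all ok cells]
```

With only `boolC` in the universe, `fitsColumn [boolC] ["False"]` gives `["Bool"]`. Adding `zipC` makes the single value ambiguous, since `"False"` has five characters. For the column `["False", "True"]`, though, `"True"` has only four characters, so in this sketch checking the whole column against each candidate leaves `Bool` as the only fit.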

For really thorough type inference I'd like to see it test each column for a fit against each candidate type, decide which types fit, and then choose the fitting type with the smallest cardinality (the smallest number of values in that type). So `Int` would trump `Double` for `1.0` because `Int` is smaller than `Double`. Similarly, `Bool` would trump `Int`, which would trump `Text`. In addition, it would keep track of values that don't parse for a type, and then at the end decide what to do with them. If there are hundreds of different unknown values then the type doesn't fit. But if there are only one or two (e.g. `""` and `"N/A"`), then maybe they are sentinel values, and the column should be a `Maybe _`. Or if none of the values parse but there are only five distinct values, then create a new categorical type for that column.
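A rough sketch of that proposal follows, under several assumptions that are mine rather than the thread's: all names are hypothetical, cells are plain `String`s, an integer rank stands in for cardinality, the sentinel and category limits are parameters, and ties are broken by preferring fewer unparsed values before smaller rank.

```haskell
import Data.List (nub, sortOn)
import Data.Maybe (isJust)
import Text.Read (readMaybe)

-- | Hypothetical candidate: name, rank (a stand-in for cardinality,
-- smaller wins), and a predicate for "this cell parses".
data CandidateType = CandidateType
  { ctName   :: String
  , ctRank   :: Int
  , ctParses :: String -> Bool
  }

data Inferred
  = Plain String             -- every value parsed
  | Optional String [String] -- Maybe-like: type name plus sentinel values
  | Categorical [String]     -- new enumeration over the distinct values
  | FallbackText
  deriving (Eq, Show)

-- | Accepts "1" and "1.0" but not "1.5", so Int can win over Double.
wholeNumber :: String -> Bool
wholeNumber s = case readMaybe s :: Maybe Double of
  Just d  -> not (isNaN d || isInfinite d)
             && d == fromIntegral (round d :: Integer)
  Nothing -> False

candidates :: [CandidateType]
candidates =
  [ CandidateType "Bool"   0 (`elem` ["True", "False"])
  , CandidateType "Int"    1 wholeNumber
  , CandidateType "Double" 2 (\s -> isJust (readMaybe s :: Maybe Double))
  ]

inferColumn :: Int -> Int -> [String] -> Inferred
inferColumn maxSentinels maxCategories cells =
  case sortOn score fits of
    ((c, []) : _)     -> Plain (ctName c)
    ((c, misses) : _) -> Optional (ctName c) misses
    []
      | not (null distinct) && length distinct <= maxCategories
                  -> Categorical distinct
      | otherwise -> FallbackText
  where
    distinct    = nub cells
    missesFor c = filter (not . ctParses c) distinct
    -- A candidate fits if few distinct values fail and at least one parses.
    fits = [ (c, ms) | c <- candidates
                     , let ms = missesFor c
                     , length ms <= maxSentinels
                     , length ms < length distinct ]
    score (c, ms) = (length ms, ctRank c)
```

For example, `inferColumn 2 5 ["1.0", "2.0", "3.0"]` picks `Plain "Int"` over `Double`, `inferColumn 2 5 ["1", "2", "", "N/A", "3"]` gives `Optional "Int" ["", "N/A"]`, and `inferColumn 2 5 ["red", "green", "blue", "red"]` gives `Categorical ["red", "green", "blue"]`. The thresholds of 2 and 5 simply mirror the numbers mentioned in the comment.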

Create TypeInfo, add type inference unit tests

b45c431

acowley force-pushed the master branch 3 times, most recently from 50157fc to aeca953 March 23, 2021 15:04

acowley deleted the branch acowley:master October 22, 2023 18:29

acowley closed this Oct 22, 2023

acowley reopened this Oct 23, 2023
