@AJChapman

I was trying to figure out how Frames' type inference works. I have to move on to something else for now, but some of what I've done may be useful so I'm opening this pull request to contribute it.

One of my changes was to replace `Either (String -> Q [Dec]) Type` with a new data type: `TypeInfo`.

The other change is to add some type inference unit tests. They all pass, although the behaviour they expect is not what I would like it to be. I have ideas for a more general type inference mechanism, but no time to implement it at this stage.

Replaces `Either (String -> Q [Dec]) Type` with a new data type: `TypeInfo`.
Also adds unit tests to explore what type inference currently does.
@acowley
Owner

acowley commented Dec 18, 2019

I like this idea, thank you! I'm going to look at it more closely before merging.

Are the tests you don't like the ones that take, e.g., `1.0` to `Int`? I think we added that at some point because folks had data coming from languages that represent all numbers that way. You'd then have a column that used numbers as a kind of enum (e.g. `1.0`, `2.0`, and `3.0`). Since we look at a prefix of the column, rather than just one number, it seemed vaguely safe to assume that, if we saw nothing other than a zero after the decimal point, the textual representation was a quirk and those numbers could be treated as `Int`. Another option would be to require a preprocessing step on the user's part, but the silent inference that's in place now has never prompted any reported issues.
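To make the rule described above concrete, here is a minimal, self-contained sketch of prefix-based inference that treats integer-valued decimals as `Int`. It is not Frames' actual code: `NumType`, `looksIntegral`, `isDouble`, and `inferNumeric` are hypothetical names invented for this illustration, and the prefix length is passed in explicitly as an assumption.

```haskell
import Data.Char (isDigit)
import Data.Maybe (isJust)
import Text.Read (readMaybe)

-- Hypothetical result type for this sketch (not a Frames definition).
data NumType = IntCol | DoubleCol | NotNumeric
  deriving (Eq, Show)

-- True for text such as "3", "3.0" or "-2.00": digits, optionally followed
-- by a decimal point and nothing but zeros.
looksIntegral :: String -> Bool
looksIntegral s =
  case break (== '.') (dropSign s) of
    (whole, "")       -> nonEmptyDigits whole
    (whole, _ : frac) -> nonEmptyDigits whole
                           && not (null frac)
                           && all (== '0') frac
  where
    dropSign ('-' : rest) = rest
    dropSign other        = other
    nonEmptyDigits ds     = not (null ds) && all isDigit ds

-- True when the text parses as a Double using the standard Read instance.
isDouble :: String -> Bool
isDouble s = isJust (readMaybe s :: Maybe Double)

-- Classify a column by inspecting only its first n values.
inferNumeric :: Int -> [String] -> NumType
inferNumeric n column
  | null prefix              = NotNumeric
  | all looksIntegral prefix = IntCol
  | all isDouble prefix      = DoubleCol
  | otherwise                = NotNumeric
  where
    prefix = take n column
```

Under these assumptions, `inferNumeric 100 ["1.0", "2.0", "3.0"]` gives `IntCol`. The sketch also shows the risk of looking only at a prefix: `inferNumeric 2 ["1.0", "2.0", "3.5"]` still gives `IntCol`, because `3.5` lies outside the inspected prefix.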

@AJChapman
Author

No, I was OK with `1.0` being `Int`. It was when I added the custom data type (`ZipT`, from one of the examples) that things got weird. `ZipT` accepts five-character strings, so when you add it to your universe of types, the value `"False"` suddenly switches from definitely `Bool` to an uncertain type (I forget which). This may be fair, because `"False"` really could be a postcode. But `["False", "True"]` is also uncertain: it falls back to `Text` instead of realising that it should be `Bool`.

For really thorough type inference I'd like to see it test each column for a fit against each candidate type, decide which types fit, and then choose the fitting type with the smallest cardinality (the smallest number of values in that type). So `Int` would trump `Double` for `1.0` because `Int` is smaller than `Double`. Similarly, `Bool` would trump `Int`, which would trump `Text`. In addition, it would keep track of the values that don't parse for a type, and decide at the end what to do with them. If there are hundreds of different unknown values, then the type doesn't fit. But if there are only one or two (e.g. `""` and `"N/A"`), then maybe they are sentinel values, and the column should be a `Maybe _`. Or, if none of the values parse but there are only five distinct values, it could create a new categorical type for that column.
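The proposal above could be sketched roughly as follows. This is an illustration only, not an implementation against Frames: every name here (`Inferred`, `candidates`, `inferColumn`, the `is*` predicates) is hypothetical, the thresholds of two sentinel values and five categories are taken loosely from the examples in the comment, and the position of the five-character `ZipT`-like type in the cardinality ordering is an assumption.

```haskell
import Data.Char (toLower)
import Data.List (nub)
import Data.Maybe (isJust)
import Text.Read (readMaybe)

-- Hypothetical outcome of inference for one column.
data Inferred
  = Plain String              -- every value fits the named type
  | Optional String [String]  -- a Maybe column, plus its sentinel values
  | Categorical [String]      -- a new enumeration of the distinct values
  | Fallback                  -- give up and use Text
  deriving (Eq, Show)

isBool, isInt, isDouble, isZip :: String -> Bool
isBool s   = map toLower s `elem` ["true", "false"]
isInt s    = case break (== '.') s of        -- accepts "1" and "1.0"
  (whole, "")       -> intPart whole
  (whole, _ : frac) -> intPart whole && not (null frac) && all (== '0') frac
  where intPart w = isJust (readMaybe w :: Maybe Int)
isDouble s = isJust (readMaybe s :: Maybe Double)
isZip s    = length s == 5                   -- stand-in for a ZipT-like type

-- Candidates in ascending order of ASSUMED cardinality; earlier wins.
candidates :: [(String, String -> Bool)]
candidates =
  [("Bool", isBool), ("Int", isInt), ("Double", isDouble), ("ZipT", isZip)]

maxSentinels, maxCategories :: Int
maxSentinels  = 2   -- assumed threshold for "one or two" odd values
maxCategories = 5   -- assumed threshold for a categorical column

inferColumn :: [String] -> Inferred
inferColumn values
  | null distinct = Fallback
  | otherwise =
      case (exact, nearly) of
        ((name, _) : _, _)       -> Plain name
        ([], (name, misses) : _) -> Optional name misses
        _ | length distinct <= maxCategories -> Categorical distinct
          | otherwise                        -> Fallback
  where
    distinct = nub values
    -- For each candidate, the distinct values that fail to parse.
    scored = [ (name, filter (not . fits) distinct) | (name, fits) <- candidates ]
    exact  = [ c | c@(_, misses) <- scored, null misses ]
    nearly = [ c | c@(_, misses) <- scored
                 , length misses <= maxSentinels
                 , length misses < length distinct ]  -- at least one value fits
```

With these assumptions, `inferColumn ["False", "True"]` yields `Plain "Bool"` even though the five-character type is in the candidate list, because `"True"` does not fit it; `inferColumn ["1", "2", "N/A", ""]` yields `Optional "Int" ["N/A", ""]`. The sentinel rule is deliberately crude: a real version would probably weigh how often each unparsed value occurs rather than only counting distinct values.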

@acowley acowley force-pushed the master branch 3 times, most recently from 50157fc to aeca953 Compare March 23, 2021 15:04
@acowley acowley deleted the branch acowley:master October 22, 2023 18:29
@acowley acowley closed this Oct 22, 2023
@acowley acowley reopened this Oct 23, 2023