Description
General Information:
- OS: Ubuntu 22.04
- Python version: 3.10.12
- Library version: 0.10.9
Describe the bug:
I have a parquet file with a column, org_number, that should be treated as text but is being profiled as an int.
Pandas info() reports it as object:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26943 entries, 0 to 26942
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
<snip>
 2   org_number  26943 non-null  object
<snip>
Pandas describe() doesn't show any numeric statistics (min, max, stddev, etc.) for this column, which is correct.
The output from the profiler:
{
    "column_name": "org_number",
    "data_type": "int",
    "categorical": false,
    "order": "random",
    "samples": "['01321', '07618', '08257', '02321', '09123']",
    "statistics": {
        "min": 1.0,
        "max": 105121.0,
        "mode": "[6781.24]",
        "median": 6573.749,
        "sum": 220034705.0,
        "mean": 8166.6743,
        "variance": 150256092.2856,
        "stddev": 12257.8992,
        "skewness": 5.3242,
        "kurtosis": 30.6063,
        "histogram": {
            "bin_edges": "[ 1. , 363.48275862, ... , 104758.51724138, 105121. ]",
            "bin_counts": "[ 259., 539., 126., 1006., 2057., ... , 0., 0., 0., 0., 7.]"
        },
        "quantiles": {
            "0": 3350.0226,
            "1": 6573.749,
            "2": 8726.115
        },
        "median_abs_deviation": 2195.6598,
        "num_zeros": 0,
        "num_negatives": 0,
        "times": {
            "min": 0.0001,
            "max": 0.0001,
            "sum": 0.0001,
            "variance": 0.0002,
            "skewness": 0.0046,
            "kurtosis": 0.0046,
            "histogram_and_quantiles": 0.0042,
            "num_zeros": 0.0002,
            "num_negatives": 0.0001
        },
        "unique_count": 1367,
        "unique_ratio": 0.0507,
        "sample_size": 26943,
        "null_count": 0,
        "null_types": [],
        "null_types_index": {},
        "data_type_representation": {
            "datetime": 0.0,
            "int": 1.0,
            "float": 1.0,
            "string": 1.0
        }
    }
},
To Reproduce:
The code I'm using:
import json

import dataprofiler as dp
import pandas as pd

# filename is the path to the parquet file (not included in this report)
data = dp.Data(filename)
profile_options = dp.ProfilerOptions()

# Cross-check the column dtypes as pandas sees them
df = pd.read_parquet(filename)
df.info()  # info() prints directly and returns None

profile_options.set({
    "structured_options.data_labeler.is_enabled": False,
    "unstructured_options.data_labeler.is_enabled": False,
    "structured_options.correlation.is_enabled": False,
    "structured_options.multiprocess.is_enabled": True,
    "structured_options.chi2_homogeneity.is_enabled": False,
    "structured_options.category.max_sample_size_to_check_stop_condition": 1,
    "structured_options.category.stop_condition_unique_value_ratio": 0.001,
    "structured_options.sampling_ratio": 1.0,
    "structured_options.null_replication_metrics.is_enabled": False
})

profile = dp.Profiler(data, options=profile_options)
human_readable_report = profile.report(report_options={"output_format": "pretty"})

# Write the report out as indented JSON
with open("reportfile.json", "w") as outfile:
    outfile.write(json.dumps(human_readable_report, indent=4))
I can't provide the raw data, but I can test things. The data is interesting in that every value parses as an integer, but many of the entries have leading zeros, as you can see in the samples.
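To make the "almost integer" point concrete, here is a minimal sketch of a round-trip check: a value is flagged when it parses as an integer but does not come back as the same text. This is only an illustration of why an int type loses information for this column, not how DataProfiler infers types. The helper names `round_trips_as_int` and `lossy_int_values` are hypothetical, and the final sample value "105121" is an assumed example of an entry without a leading zero.

```python
def round_trips_as_int(value: str) -> bool:
    """Return True if value survives str -> int -> str unchanged."""
    try:
        return str(int(value)) == value
    except ValueError:
        # Not parseable as an integer at all
        return False


def lossy_int_values(values):
    """Collect values that parse as integers but change when cast back to text."""
    lossy = []
    for v in values:
        try:
            int(v)
        except ValueError:
            continue  # non-numeric text is not relevant to this check
        if not round_trips_as_int(v):
            lossy.append(v)
    return lossy


# First five values are the samples from the profiler report above;
# "105121" is a hypothetical entry with no leading zero.
samples = ["01321", "07618", "08257", "02321", "09123", "105121"]
print(lossy_int_values(samples))
# -> ['01321', '07618', '08257', '02321', '09123']
```

Any column where this list is non-empty cannot be stored as int without changing its values, which is the argument for treating org_number as string/text.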
Expected behavior:
I would expect the type to be string/text.