
Inconsistent general statistics for 'vmcategory' in AzurePublicDatasetV2 #38

@clevilll

Description


Hi,
I was exploring AzurePublicDatasetV2 and checked the accompanying notebook Azure 2019 Public Dataset V2 - Trace Analysis.ipynb, including its General Statistics section. There, after reading vmtable.csv and applying a slight transformation, two additional columns, 'corehour' and 'lifetime', are derived from the other features and added to trace_dataframe.

With minor updates I could reproduce the results shown there:

TraceLegend = "Azure 2019 - Public Dataset V2"
AzureLegend = "Azure 2019"

def CategoryPlot(df, azure2019):
    # original notebook code, updated for recent pandas versions
    # (e.g. DataFrame.drop(label, 1) with a positional axis no longer works)
    dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    dataset['percentage'] = dataset['corehour']/dataset['corehour'].sum() * 100
    del dataset['corehour']
    dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
    result = pd.concat([dataset,azure2019], axis=1, ignore_index=False, sort=True)
    del result['vmcategory']
    result = result.T
    ax = result.plot.bar(stacked=True, title='VM Category Distribution', color=['lightskyblue', 'orange', '0.75'], ylim=(0,100))
    ax.set_ylabel('% of core hours')
    ax.set_xticklabels([TraceLegend, AzureLegend], rotation=0)
    ax.legend(["Delay-insensitive", "Interactive", "Unknown"], loc='upper center', title='Categories', bbox_to_anchor=(0.5, -0.10), ncol=3, fontsize=10.5)

CategoryPlot(trace_dataframe, category_azure2019)

[figure: stacked bar chart of the VM category distribution produced by CategoryPlot]
In short, this approach yields the following for "Azure 2019 - Public Dataset V2" and "Azure 2019":

dataset = pd.DataFrame(trace_dataframe.groupby('vmcategory')['corehour'].sum().rename('corehour'))
dataset['percentage'] = dataset['corehour']/dataset['corehour'].sum() * 100
del dataset['corehour']
dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
print(dataset.to_markdown(tablefmt="grid"))
+----+-------------------+--------------+
|    | vmcategory        |   percentage |
+====+===================+==============+
|  0 | Delay-insensitive |     58.4396  |
+----+-------------------+--------------+
|  1 | Interactive       |     32.5105  |
+----+-------------------+--------------+
|  2 | Unknown           |      9.04996 |
+----+-------------------+--------------+
azure2019 = category_azure2019
print(azure2019.to_markdown(tablefmt="grid"))
+----+-------------------+---------+
|    | vmcategory        |   value |
+====+===================+=========+
|  0 | Delay-insensitive |      51 |
+----+-------------------+---------+
|  1 | Interactive       |      43 |
+----+-------------------+---------+
|  2 | Unknown           |       6 |
+----+-------------------+---------+

So far we have reproduced the results of the provided notebook. However, when you read vmtable.csv directly and process only the column of interest, 'vmcategory', you get a different result!

import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

#data_path = 'https://azurecloudpublicdataset2.z19.web.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
data_path  = 'https://azurepublicdatasettraces.blob.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'

headers   = ['vmid','subscriptionid','deploymentid','vmcreated', 'vmdeleted', 'maxcpu', 'avgcpu', 'p95maxcpu',
         'vmcategory', 'vmcorecountbucket', 'vmmemorybucket']
data = pd.read_csv(data_path, header=None, index_col=False,names=headers,delimiter=',')
#data.head(10)

#slice data
df= data[['vmcategory', 'vmid']]

#count occurrences of each value in the 'vmcategory' column
counts = df.vmcategory.value_counts()

#same counts as a fraction of the total
percs = df.vmcategory.value_counts(normalize=True)

#and as a formatted percentage string
perc = df.vmcategory.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

#concatenate results into one DataFrame
tf = pd.concat([counts,percs,perc], axis=1, keys=['count', 'percentage', '%'])
print(tf.to_markdown(tablefmt="grid"))

+-------------------+---------+--------------+-------+
| vmcategory        |   count |   percentage | %     |
+===================+=========+==============+=======+
| Unknown           | 2457455 |    0.911672  | 91.2% |
+-------------------+---------+--------------+-------+
| Delay-insensitive |  159615 |    0.0592143 | 5.9%  |
+-------------------+---------+--------------+-------+
| Interactive       |   78478 |    0.0291139 | 2.9%  |
+-------------------+---------+--------------+-------+


The observation is that the latter shows the Unknown class covering 91% of all VMs, while the first approach attributes a 58% share to the Delay-insensitive class!

  • Which one is correct? Can someone explain why, instead of applying df.vmcategory.value_counts(normalize=True) directly to the column of interest 'vmcategory', the data provider used df.groupby('vmcategory')['corehour'].sum()?
  • Can someone shed light on these inconsistent outputs from domain knowledge of cloud data, and explain the role 'corehour' plays in the VM category statistics of the first approach?
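To make the difference between the two aggregations concrete, here is a toy sketch with invented data (the numbers are hypothetical, not from the trace): many short, small "Unknown" VMs against a few long-running "Interactive" VMs. Counting VMs and weighting by core hours give opposite pictures on exactly this kind of data.

```python
import pandas as pd

# hypothetical toy data: 8 small short-lived Unknown VMs, 2 big Interactive VMs
toy = pd.DataFrame({
    'vmcategory': ['Unknown'] * 8 + ['Interactive'] * 2,
    'corehour':   [1.0] * 8 + [100.0] * 2,
})

# second approach: share of the VM *population* (one vote per VM)
by_count = toy['vmcategory'].value_counts(normalize=True).mul(100)

# first (notebook) approach: share of consumed *core hours* (usage-weighted)
by_corehour = toy.groupby('vmcategory')['corehour'].sum()
by_corehour = by_corehour / by_corehour.sum() * 100

print(by_count)     # Unknown 80%, Interactive 20%
print(by_corehour)  # Interactive ~96.2%, Unknown ~3.8%
```

So both numbers can be "correct": one describes how many VMs fall in each category, the other how much compute they consume.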

PS: if one is interested in how 'corehour' is calculated from 'lifetime' and 'vmcorecountbucket' and how they are related:

#Compute VM Lifetime based on VM Created and VM Deleted timestamps and transform to Hour
trace_dataframe['lifetime'] = np.maximum((trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated']), 300) / 3600

#Compute VM corehour
trace_dataframe['corehour'] = trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket']
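A quick sanity check of these two lines on a hypothetical two-row frame (assuming, as in vmtable.csv, that the timestamps are in seconds): lifetimes are floored at 300 s (5 minutes) and converted to hours, then multiplied by the core bucket.

```python
import numpy as np
import pandas as pd

# hypothetical VMs: one alive 7200 s (2 h), one alive only 60 s; both 2-core bucket
vm = pd.DataFrame({'vmcreated': [0, 0],
                   'vmdeleted': [7200, 60],
                   'vmcorecountbucket': [2, 2]})

# lifetime in hours, floored at 300 seconds
vm['lifetime'] = np.maximum(vm['vmdeleted'] - vm['vmcreated'], 300) / 3600

# corehour = hours alive * cores in the bucket
vm['corehour'] = vm['lifetime'] * vm['vmcorecountbucket']

print(vm[['lifetime', 'corehour']])
# lifetime: 2.0 h and 300/3600 ≈ 0.083 h; corehour: 4.0 and ≈ 0.167
```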
