
Inconsistent general statistics for 'vmcategory' in AzurePublicDatasetV2 #38

@clevilll

Description


Hi,
I was exploring AzurePublicDatasetV2 and checked the accompanying notebook Azure 2019 Public Dataset V2 - Trace Analysis.ipynb, including its General Statistics section. There, after reading vmtable.csv and applying a slight transformation, two additional columns, 'corehour' and 'lifetime', are derived from the other features and added to trace_dataframe.

With minor updates I could reproduce the results shown there:

TraceLegend = "Azure 2019 - Public Dataset V2"
AzureLegend = "Azure 2019"

def CategoryPlot(df, azure2019):
    # original notebook code, updated for recent pandas versions
    # (e.g. DataFrame.drop(label, 1) with a positional axis no longer works)
    dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    dataset['percentage'] = dataset['corehour']/dataset['corehour'].sum() * 100
    del dataset['corehour']
    dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
    result = pd.concat([dataset,azure2019], axis=1, ignore_index=False, sort=True)
    del result['vmcategory']
    result = result.T
    ax = result.plot.bar(stacked=True, title='VM Category Distribution', color=['lightskyblue', 'orange', '0.75'], ylim=(0,100))
    ax.set_ylabel('% of core hours')
    ax.set_xticklabels([TraceLegend, AzureLegend], rotation=0)
    ax.legend(["Delay-insensitive", "Interactive", "Unknown"], loc='upper center', title='Categories', bbox_to_anchor=(0.5, -0.10), ncol=3, fontsize=10.5)

CategoryPlot(trace_dataframe, category_azure2019)

[figure: stacked bar chart of the VM category distribution produced by CategoryPlot]
In short, this approach yields the following for "Azure 2019 - Public Dataset V2" and "Azure 2019":

dataset = pd.DataFrame(trace_dataframe.groupby('vmcategory')['corehour'].sum().rename('corehour'))
dataset['percentage'] = dataset['corehour']/dataset['corehour'].sum() * 100
del dataset['corehour']
dataset = dataset.reset_index().rename({'index': 'vmcategory'}, axis='columns')
print(dataset.to_markdown(tablefmt="grid"))
+----+-------------------+--------------+
|    | vmcategory        |   percentage |
+====+===================+==============+
|  0 | Delay-insensitive |     58.4396  |
+----+-------------------+--------------+
|  1 | Interactive       |     32.5105  |
+----+-------------------+--------------+
|  2 | Unknown           |      9.04996 |
+----+-------------------+--------------+
azure2019 = category_azure2019
print(azure2019.to_markdown(tablefmt="grid"))
+----+-------------------+---------+
|    | vmcategory        |   value |
+====+===================+=========+
|  0 | Delay-insensitive |      51 |
+----+-------------------+---------+
|  1 | Interactive       |      43 |
+----+-------------------+---------+
|  2 | Unknown           |       6 |
+----+-------------------+---------+

So far we have reproduced the results of the provided notebook. However, when you read vmtable.csv directly and process only the column of interest, 'vmcategory', you get a different result!

import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

#data_path = 'https://azurecloudpublicdataset2.z19.web.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
data_path  = 'https://azurepublicdatasettraces.blob.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'

headers   = ['vmid','subscriptionid','deploymentid','vmcreated', 'vmdeleted', 'maxcpu', 'avgcpu', 'p95maxcpu',
         'vmcategory', 'vmcorecountbucket', 'vmmemorybucket']
data = pd.read_csv(data_path, header=None, index_col=False,names=headers,delimiter=',')
#data.head(10)

#slice data
df= data[['vmcategory', 'vmid']]

#count occurrences of each value in the 'vmcategory' column
counts = df.vmcategory.value_counts()

#same counts as a fraction of the total
percs = df.vmcategory.value_counts(normalize=True)

#and as a formatted percentage string
perc = df.vmcategory.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'

#concatenate results into one DataFrame
tf = pd.concat([counts,percs,perc], axis=1, keys=['count', 'percentage', '%'])
print(tf.to_markdown(tablefmt="grid"))

+-------------------+---------+--------------+-------+
| vmcategory        |   count |   percentage | %     |
+===================+=========+==============+=======+
| Unknown           | 2457455 |    0.911672  | 91.2% |
+-------------------+---------+--------------+-------+
| Delay-insensitive |  159615 |    0.0592143 | 5.9%  |
+-------------------+---------+--------------+-------+
| Interactive       |   78478 |    0.0291139 | 2.9%  |
+-------------------+---------+--------------+-------+


The observation is that the latter shows the Unknown class covering 91% of all VMs, while the first approach attributes a 58% share to the Delay-insensitive class!

  • Which one is correct? Can someone explain why, instead of applying df.vmcategory.value_counts(normalize=True) directly to the column of interest 'vmcategory', the data provider used df.groupby('vmcategory')['corehour'].sum()?
  • Can someone shed light on these inconsistent outputs from domain knowledge of cloud data, and explain the role 'corehour' plays in the VM category statistics of the first approach?
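To make the difference between the two aggregations concrete, here is a toy sketch with invented data (the numbers are hypothetical, not from the trace): many short, small "Unknown" VMs against a few long-running "Interactive" VMs. Counting VMs and weighting by core hours give opposite pictures on exactly this kind of data.

```python
import pandas as pd

# hypothetical toy data: 8 small short-lived Unknown VMs, 2 big Interactive VMs
toy = pd.DataFrame({
    'vmcategory': ['Unknown'] * 8 + ['Interactive'] * 2,
    'corehour':   [1.0] * 8 + [100.0] * 2,
})

# second approach: share of the VM *population* (one vote per VM)
by_count = toy['vmcategory'].value_counts(normalize=True).mul(100)

# first (notebook) approach: share of consumed *core hours* (usage-weighted)
by_corehour = toy.groupby('vmcategory')['corehour'].sum()
by_corehour = by_corehour / by_corehour.sum() * 100

print(by_count)     # Unknown 80%, Interactive 20%
print(by_corehour)  # Interactive ~96.2%, Unknown ~3.8%
```

So both numbers can be "correct": one describes how many VMs fall in each category, the other how much compute they consume.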

PS: if one is interested in how 'corehour' is calculated from 'lifetime' and 'vmcorecountbucket' and how they are related:

#Compute VM Lifetime based on VM Created and VM Deleted timestamps and transform to Hour
trace_dataframe['lifetime'] = np.maximum((trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated']), 300) / 3600

#Compute VM corehour
trace_dataframe['corehour'] = trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket']
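A quick sanity check of these two lines on a hypothetical two-row frame (assuming, as in vmtable.csv, that the timestamps are in seconds): lifetimes are floored at 300 s (5 minutes) and converted to hours, then multiplied by the core bucket.

```python
import numpy as np
import pandas as pd

# hypothetical VMs: one alive 7200 s (2 h), one alive only 60 s; both 2-core bucket
vm = pd.DataFrame({'vmcreated': [0, 0],
                   'vmdeleted': [7200, 60],
                   'vmcorecountbucket': [2, 2]})

# lifetime in hours, floored at 300 seconds
vm['lifetime'] = np.maximum(vm['vmdeleted'] - vm['vmcreated'], 300) / 3600

# corehour = hours alive * cores in the bucket
vm['corehour'] = vm['lifetime'] * vm['vmcorecountbucket']

print(vm[['lifetime', 'corehour']])
# lifetime: 2.0 h and 300/3600 ≈ 0.083 h; corehour: 4.0 and ≈ 0.167
```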
