Description
Hi,
I was exploring AzurePublicDatasetV2 and went through the accompanying notebook, Azure 2019 Public Dataset V2 - Trace Analysis.ipynb, in particular the General Statistics section. There, after reading vmtable.csv and applying a slight transformation, two extra columns derived from the existing features, 'corehour' and 'lifetime', are added to trace_dataframe.
I could reproduce the results shown there with minor updates:
TraceLegend = "Azure 2019 - Public Dataset V2"
AzureLegend = "Azure 2019"
def CategoryPlot(df, azure2019):
    # Core-hour share per VM category in the trace
    # (the original notebook built 'result' with a row-wise pd.concat;
    #  the column-wise version below is an equivalent minor update)
    dataset = pd.DataFrame(df.groupby('vmcategory')['corehour'].sum().rename('corehour'))
    dataset['percentage'] = dataset['corehour'] / dataset['corehour'].sum() * 100
    del dataset['corehour']
    dataset = dataset.reset_index()
    # Put the trace distribution next to the published Azure 2019 distribution
    result = pd.concat([dataset, azure2019], axis=1, ignore_index=False, sort=True)
    del result['vmcategory']
    result = result.T
    # Stacked bar chart: one bar per source, stacked by VM category
    ax = result.plot.bar(stacked=True, title='VM Category Distribution',
                         color=['lightskyblue', 'orange', '0.75'], ylim=(0, 100))
    ax.set_ylabel('% of core hours')
    ax.set_xticklabels([TraceLegend, AzureLegend], rotation=0)
    ax.legend(["Delay-insensitive", "Interactive", "Unknown"], loc='upper center',
              title='Categories', bbox_to_anchor=(0.5, -0.10), ncol=3, fontsize=10.5)
CategoryPlot(trace_dataframe, category_azure2019)
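Here, trace_dataframe and category_azure2019 come from the notebook's setup and are not redefined above. For readers without the notebook at hand, a minimal reconstruction of category_azure2019 that is consistent with the "Azure 2019" table printed further below would be the following (the notebook itself may define it differently, e.g. as a Series):

# Hypothetical reconstruction of the notebook's reference distribution
# (values taken from the "Azure 2019" table shown below)
category_azure2019 = pd.DataFrame({
    'vmcategory': ['Delay-insensitive', 'Interactive', 'Unknown'],
    'value': [51, 43, 6],
})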
In short, this way gives the following for "Azure 2019 - Public Dataset V2" and for "Azure 2019":
dataset = pd.DataFrame(trace_dataframe.groupby('vmcategory')['corehour'].sum().rename('corehour'))
dataset['percentage'] = dataset['corehour'] / dataset['corehour'].sum() * 100
del dataset['corehour']
dataset = dataset.reset_index()
print(dataset.to_markdown(tablefmt="grid"))
+----+-------------------+--------------+
| | vmcategory | percentage |
+====+===================+==============+
| 0 | Delay-insensitive | 58.4396 |
+----+-------------------+--------------+
| 1 | Interactive | 32.5105 |
+----+-------------------+--------------+
| 2 | Unknown | 9.04996 |
+----+-------------------+--------------+

azure2019 = category_azure2019
print(azure2019.to_markdown(tablefmt="grid"))
+----+-------------------+---------+
| | vmcategory | value |
+====+===================+=========+
| 0 | Delay-insensitive | 51 |
+----+-------------------+---------+
| 1 | Interactive | 43 |
+----+-------------------+---------+
| 2 | Unknown | 6 |
+----+-------------------+---------+

So far this reproduces the results of the provided notebook. However, if you read vmtable.csv directly and work only on the column of interest, 'vmcategory', you get a different result:
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline
#data_path = 'https://azurecloudpublicdataset2.z19.web.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
data_path = 'https://azurepublicdatasettraces.blob.core.windows.net/azurepublicdatasetv2/trace_data/vmtable/vmtable.csv.gz'
headers = ['vmid','subscriptionid','deploymentid','vmcreated', 'vmdeleted', 'maxcpu', 'avgcpu', 'p95maxcpu',
'vmcategory', 'vmcorecountbucket', 'vmmemorybucket']
data = pd.read_csv(data_path, header=None, index_col=False,names=headers,delimiter=',')
#data.head(10)
#slice data
df= data[['vmcategory', 'vmid']]
#count occurrence of each value in 'vmcategory' column
counts = df.vmcategory.value_counts()
#count occurrence of each value in 'vmcategory' column as percentage of total
percs = df.vmcategory.value_counts(normalize=True)
#count occurrence of each value in 'vmcategory' column, formatted as percentage strings
perc = df.vmcategory.value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
#concatenate results into one DataFrame
tf = pd.concat([counts,percs,perc], axis=1, keys=['count', 'percentage', '%'])
print(tf.to_markdown(tablefmt="grid"))
+-------------------+---------+--------------+-------+
| vmcategory | count | percentage | % |
+===================+=========+==============+=======+
| Unknown | 2457455 | 0.911672 | 91.2% |
+-------------------+---------+--------------+-------+
| Delay-insensitive | 159615 | 0.0592143 | 5.9% |
+-------------------+---------+--------------+-------+
| Interactive | 78478 | 0.0291139 | 2.9% |
+-------------------+---------+--------------+-------+

The observation is that the latter approach indicates the Unknown class makes up about 91% of all VMs, while the first approach shows the Delay-insensitive class with roughly a 58% share!
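For intuition, here is a toy sketch (made-up numbers, not taken from the trace) showing how a per-VM count and a core-hour-weighted share of the very same VMs can rank the categories completely differently:

import pandas as pd

# Toy data: many short-lived, small 'Unknown' VMs and one long-lived, large 'Delay-insensitive' VM
toy = pd.DataFrame({
    'vmcategory': ['Unknown'] * 9 + ['Delay-insensitive'],
    'corehour':   [1] * 9 + [91],   # hypothetical core hours per VM
})

# Share of VMs (what value_counts measures): Unknown dominates with 90%
print(toy['vmcategory'].value_counts(normalize=True) * 100)

# Share of core hours (what the notebook plots): Delay-insensitive dominates with 91%
print(toy.groupby('vmcategory')['corehour'].sum() / toy['corehour'].sum() * 100)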
- Which one is correct? Can someone explain why the data provider computed df.groupby('vmcategory')['corehour'].sum().rename('corehour') instead of using df.vmcategory.value_counts(normalize=True) directly on the column of interest, 'vmcategory'?
- Can someone shed light, from domain knowledge of cloud data, on these inconsistent outputs and on what the role of 'corehour' is in the VM categorization/classification of the first approach?
PS: in case one is interested in how 'corehour' is calculated from 'lifetime' and 'vmcorecountbucket' and how they are related:
#Compute VM lifetime from the vmcreated/vmdeleted timestamps (seconds), floor it at 300 s (5 minutes) and convert to hours
trace_dataframe['lifetime'] = np.maximum((trace_dataframe['vmdeleted'] - trace_dataframe['vmcreated']), 300) / 3600
#Compute VM core hours: lifetime in hours times the core count bucket
trace_dataframe['corehour'] = trace_dataframe['lifetime'] * trace_dataframe['vmcorecountbucket']
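For completeness, the same core-hour weighting can also be applied to the frame read directly from vmtable.csv above. This is only a sketch: it assumes the imports and the data variable from the direct-read snippet, and it simply drops any non-numeric core-count bucket label (the notebook's "slight transformation" may handle such values differently, so the exact numbers need not match):

# Sketch: weight each VM by its core hours instead of counting every VM once
data['lifetime'] = np.maximum(data['vmdeleted'] - data['vmcreated'], 300) / 3600
# pd.to_numeric turns non-numeric bucket labels (if any) into NaN, which the sum then ignores
data['corehour'] = data['lifetime'] * pd.to_numeric(data['vmcorecountbucket'], errors='coerce')
print((data.groupby('vmcategory')['corehour'].sum() / data['corehour'].sum() * 100).round(2))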