Unexpected data in Blob Dataset 2020 #50

@darkryder

Description

The schema of the 2020 Blob Dataset presents AnonFunctionInvocationId and AnonAppName as unique IDs.

However, the same AnonFunctionInvocationId sometimes appears under more than one AnonAppName. For example:

full_df[full_df['AnonFunctionInvocationId'] == 1967128581]
| Timestamp | AnonRegion | AnonUserId | AnonAppName | AnonFunctionInvocationId | AnonBlobName | BlobType | AnonBlobETag | BlobBytes | Read | Write | Datetime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1606814873193 | q2d | 1209884869 | 01qqaww4 | 1967128581 | 1wx5dgohq1kiwjum | BlockBlob/text/plain; charset=utf-8 | f1x5p2nqh6 | 28.0 | True | False | 2020-12-01 09:27:53.193 |
| 1607004493391 | q2d | 1209884869 | j2alqt8s | 1967128581 | 1wx5dgohq1kiwjum | BlockBlob/text/plain; charset=utf-8 | w5mohi6523 | 28.0 | True | False | 2020-12-03 14:08:13.391 |

This seems to be a recurring pattern for this user. Other affected AnonFunctionInvocationId values include 830734703, 440926898, and 900464655.

This leads me to believe that the cause is unlikely to be accidental collisions between prefixes of hashed IDs. Is there any way to explain this discrepancy, other than the data potentially being unclean?
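To quantify how widespread the discrepancy is, the check above can be generalized to every invocation ID. The following is a minimal sketch, not a definitive analysis: it assumes the trace is loaded into a pandas DataFrame with the column names shown in the excerpt, and that `Timestamp` is in epoch milliseconds (which is consistent with the `Datetime` column). The function name `find_multi_app_invocations` and the third row of the toy frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def find_multi_app_invocations(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per invocation ID that appears under more than one app name."""
    per_invocation = df.groupby("AnonFunctionInvocationId").agg(
        n_apps=("AnonAppName", "nunique"),    # distinct app names per invocation ID
        n_users=("AnonUserId", "nunique"),    # distinct users per invocation ID
        first_seen=("Timestamp", "min"),
        last_seen=("Timestamp", "max"),
    )
    conflicts = per_invocation[per_invocation["n_apps"] > 1].copy()
    # Assumes Timestamp is epoch milliseconds; convert the span to hours.
    conflicts["span_hours"] = (
        conflicts["last_seen"] - conflicts["first_seen"]
    ) / 3_600_000
    return conflicts.reset_index()


# Toy frame: the two rows from the excerpt plus one hypothetical clean row.
sample = pd.DataFrame(
    {
        "Timestamp": [1606814873193, 1607004493391, 1606814873500],
        "AnonUserId": [1209884869, 1209884869, 42],
        "AnonAppName": ["01qqaww4", "j2alqt8s", "appx"],
        "AnonFunctionInvocationId": [1967128581, 1967128581, 7],
    }
)
conflicts = find_multi_app_invocations(sample)
print(conflicts)
```

On the full dataset, `find_multi_app_invocations(full_df)` would list every affected invocation ID. If `n_users` stays at 1 and the conflicts cluster under a few users, that would point toward ID reuse or an anonymization artifact rather than random hash collisions, although the sketch alone cannot establish the cause.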
