Cannot reproduce the evaluation score of HellaSwag, WiC #37

@rycont

I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from those reported in the paper and in the model card on Hugging Face.

Environment

  • Few-shot examples: 5
  • Model: EleutherAI/polyglot-ko-1.3b
  • Metric: macro F1 score
  • Computing: Colab / GPU(T4) Instance
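Since the comparison is on macro F1, here is a minimal sketch of how that metric relates to plain accuracy. It assumes the standard definition (unweighted mean of per-class F1); the evaluation harness's own implementation may differ in details. The labels and predictions are hypothetical toy data, not KoBEST outputs.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed standard definition)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: the model over-predicts class 0.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

When predictions are skewed toward one class, macro F1 can fall below accuracy, so the two numbers are not interchangeable when comparing against a reported score.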

Here is the notebook I used for testing:
https://colab.research.google.com/drive/1lyQQisuB5JzuGk72haSdxXfXP20q4YGr?usp=sharing

1. WiC

The paper reports a score of 0.486, but I got only 0.4541.

  • In the paper

| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.489  | 0.486  | 0.506   | 0.487   |

  • In my test

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

| Task       | Version | Metric   | Value  | Stderr   |
|------------|---------|----------|--------|----------|
| kobest_wic | 0       | acc      | 0.4952 | ± 0.0141 |
|            |         | macro_f1 | 0.4541 | ± 0.0138 |

2. HellaSwag

The paper reports a score of 0.526, but I got only 0.3984.

  • In the paper

| params | 0-shot | 5-shot | 10-shot | 50-shot |
|--------|--------|--------|---------|---------|
| 1.3B   | 0.525  | 0.526  | 0.528   | 0.543   |

  • In my test

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

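| Task             | Version | Metric   | Value  | Stderr   |
|------------------|---------|----------|--------|----------|
| kobest_hellaswag | 0       | acc      | 0.4020 | ± 0.0219 |
|                  |         | acc_norm | 0.5280 | ± 0.0223 |
|                  |         | macro_f1 | 0.3984 | ± 0.0218 |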

I also found a Wandb report, "Polyglot-Ko: Open-Source Korean Autoregressive Language Model", which lists a HellaSwag score identical to my result, 0.3984.

| params | n=0    | n=5    | n=10  | n=50   |
|--------|--------|--------|-------|--------|
| 1.3B   | 0.4013 | 0.3984 | 0.417 | 0.4416 |
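To make the comparison concrete, the sketch below checks which of the three HellaSwag metrics from my 5-shot run is numerically closest to each reference value. All numbers are copied from the tables in this issue; the helper name `closest_metric` is my own. Numeric proximity alone does not establish which metric the paper actually reported, so this is only a diagnostic hint.

```python
# HellaSwag, 5-shot, EleutherAI/polyglot-ko-1.3b (values from this issue).
my_run = {"acc": 0.4020, "acc_norm": 0.5280, "macro_f1": 0.3984}
references = {"paper": 0.526, "wandb_report": 0.3984}


def closest_metric(target, metrics):
    """Return (metric name, absolute gap) for the metric nearest to target."""
    name = min(metrics, key=lambda k: abs(metrics[k] - target))
    return name, abs(metrics[name] - target)


for source, value in references.items():
    name, gap = closest_metric(value, my_run)
    print(f"{source}: {value} is closest to {name} (|diff| = {gap:.4f})")
```

With these inputs, the paper's 0.526 lands nearest to `acc_norm` (0.5280), while the Wandb value coincides with `macro_f1`.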

Other models

I see similar discrepancies with kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5.

  • kakaobrain/kogpt
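    Note that I tested kakaobrain/kogpt with the Int8-quantized model. Percentages are relative to the paper's score.

| Task      | In the paper (FP16) | In my test (Int8) | In the Wandb Report |
|-----------|---------------------|-------------------|---------------------|
| CoPA      | 0.7287              | 0.7277 (↓0.14%)   | 0.7287              |
| HellaSwag | 0.5833              | 0.4560 (↓21.82%)  | 0.456               |
| BoolQ     | 0.5981              | 0.6015 (↑0.56%)   | -                   |
| WiC       | 0.4775              | 0.3706 (↓22.38%)  | -                   |
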
  • skt/ko-gpt-trinity-1.2B-v0.5

| Task      | In the paper | In my test | In the Wandb Report |
|-----------|--------------|------------|---------------------|
| WiC       | 0.4313       | 0.3953     | -                   |
| HellaSwag | 0.5272       | 0.400      | 0.4                 |
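For anyone who wants to recheck the gaps, here is a small sketch that computes the relative change of my scores against the paper's scores. The numbers are copied from the two tables above; the function name `relative_change_pct` is my own. Results may differ from hand-computed figures in the last digit depending on rounding versus truncation.

```python
def relative_change_pct(reference, measured):
    """Signed change of measured relative to reference, in percent."""
    return (measured - reference) / reference * 100


# (paper score, my score) per task, taken from the tables in this issue.
results = {
    "kakaobrain/kogpt": {
        "CoPA": (0.7287, 0.7277),
        "HellaSwag": (0.5833, 0.4560),
        "BoolQ": (0.5981, 0.6015),
        "WiC": (0.4775, 0.3706),
    },
    "skt/ko-gpt-trinity-1.2B-v0.5": {
        "WiC": (0.4313, 0.3953),
        "HellaSwag": (0.5272, 0.400),
    },
}

for model, tasks in results.items():
    print(model)
    for task, (paper, mine) in tasks.items():
        change = relative_change_pct(paper, mine)
        print(f"  {task}: {paper} -> {mine} ({change:+.2f}%)")
```

CoPA and BoolQ move by well under 1%, while HellaSwag and WiC drop by a much larger margin for both models.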
