Description
I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from both the paper and the Hugging Face model card.
Environment
- Few-shot examples: 5
- Model: EleutherAI/polyglot-ko-1.3b
- Metric: macro F1 score
- Compute: Colab GPU (T4) instance
Here is the notebook I used for testing:
https://colab.research.google.com/drive/1lyQQisuB5JzuGk72haSdxXfXP20q4YGr?usp=sharing
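For reference, the settings above map onto an lm-evaluation-harness invocation roughly as follows. This is a sketch: the entry point and flag names are inferred from the result headers in this report and may differ on other harness branches.

```python
# Sketch of the harness invocation (flag names inferred from the
# "hf-causal-experimental (pretrained=..., num_fewshot: 5, batch_size: 8)"
# headers in the results below; "main.py" is the harness CLI entry point).
import subprocess

args = [
    "python", "main.py",
    "--model", "hf-causal-experimental",
    "--model_args", "pretrained=EleutherAI/polyglot-ko-1.3b",
    "--tasks", "kobest_wic,kobest_hellaswag",
    "--num_fewshot", "5",
    "--batch_size", "8",
]
print(" ".join(args))
# subprocess.run(args, check=True)  # actually running this needs a GPU and the model download
```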
1. WiC
The paper reports 0.486, but I got only 0.4541.
- In the paper
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|---|---|---|---|---|
| 1.3B | 0.489 | 0.486 | 0.506 | 0.487 |
- In my test
hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| kobest_wic | 0 | acc | 0.4952 | ± | 0.0141 |
| | | macro_f1 | 0.4541 | ± | 0.0138 |
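A side note on why acc and macro_f1 diverge in the table above: macro F1 averages the per-class F1 scores, so a model that over-predicts one class on a binary task like WiC can sit near 0.5 accuracy while macro F1 falls below it. A minimal sketch with toy labels (not the real eval data):

```python
# Accuracy vs. macro F1 on a toy binary task: predictions heavily
# biased toward class 0 keep accuracy reasonable but drag macro F1 down.
def per_class_f1(gold, pred, cls):
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

gold = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # over-predicts class 0

acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
macro_f1 = (per_class_f1(gold, pred, 0) + per_class_f1(gold, pred, 1)) / 2
print(acc, round(macro_f1, 4))  # → 0.6 0.5238
```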
2. HellaSwag
The paper reports 0.526, but I got only 0.3984.
- In the paper
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|---|---|---|---|---|
| 1.3B | 0.525 | 0.526 | 0.528 | 0.543 |
- In my test
hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| kobest_hellaswag | 0 | acc | 0.4020 | ± | 0.0219 |
| | | acc_norm | 0.5280 | ± | 0.0223 |
| | | macro_f1 | 0.3984 | ± | 0.0218 |
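Incidentally, acc_norm here (0.5280) is very close to the paper's 5-shot HellaSwag number (0.526), which may mean the paper reports normalized accuracy rather than macro F1. As I understand the harness, acc picks the choice with the highest raw log-likelihood, while acc_norm first divides each log-likelihood by the continuation's byte length, which can flip the pick toward longer answers. A toy sketch (made-up numbers, not real model outputs):

```python
# acc vs. acc_norm scoring for one multiple-choice item: raw log-likelihood
# favors the short continuation, byte-length normalization favors the long one.
continuations = ["짧은 답", "훨씬 더 길고 자세한 답변"]
loglikelihoods = [-12.0, -20.0]  # raw sums; longer texts accumulate more loss

acc_pick = max(range(2), key=lambda i: loglikelihoods[i])
byte_lens = [len(c.encode("utf-8")) for c in continuations]
acc_norm_pick = max(range(2), key=lambda i: loglikelihoods[i] / byte_lens[i])
print(acc_pick, acc_norm_pick)  # → 0 1
```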
I also found a Wandb report, Polyglot-Ko: Open-Source Korean Autoregressive Language Model, which lists a 5-shot HellaSwag score identical to my result, 0.3984:
| params | n=0 | n=5 | n=10 | n=50 |
|---|---|---|---|---|
| 1.3B | 0.4013 | 0.3984 | 0.417 | 0.4416 |
3. Other models
There are also discrepancies for kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5.
- kakaobrain/kogpt
Note that I tested kakaobrain/kogpt with an INT8-quantized model.
| | In the paper (FP16) | In my test (INT8) | In the Wandb report |
|---|---|---|---|
| COPA | 0.7287 | 0.7277 (↓0.01%) | 0.7287 |
| HellaSwag | 0.5833 | 0.4560 (↓21.82%) | 0.456 |
| BoolQ | 0.5981 | 0.6015 (↑0.56%) | - |
| WiC | 0.4775 | 0.3706 (↓22.38%) | - |
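Some INT8 drop is to be expected on its own: 8-bit quantization rounds every weight to one of 256 levels, and the accumulated rounding error can shift close log-likelihoods between answer choices. A toy round-trip sketch of symmetric INT8 quantization (not kogpt's actual quantizer):

```python
# Symmetric INT8 round trip on a toy weight row: scale to [-127, 127],
# round to integers, scale back, and measure the worst-case error.
weights = [0.013, -0.204, 0.550, -0.001, 0.330]
scale = max(abs(w) for w in weights) / 127
q = [round(w / scale) for w in weights]          # quantized int8 values
deq = [v * scale for v in q]                     # dequantized approximation
max_err = max(abs(w - d) for w, d in zip(weights, deq))
print(q, round(max_err, 5))  # → [3, -47, 127, 0, 76] 0.001
```

Small weights near zero (here -0.001) collapse to the same level, which is one reason tasks decided by narrow log-likelihood margins can degrade more than others.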
- skt/ko-gpt-trinity-1.2B-v0.5
| | In the paper | In my test | In the Wandb report |
|---|---|---|---|
| WiC | 0.4313 | 0.3953 | - |
| HellaSwag | 0.5272 | 0.400 | 0.4 |