Description
🐛 Bug
Hi!
I am using Unbabel/XCOMET-XL for reference-based MT evaluation and observed a negative score. Is this expected?
I am aware from Issue #38 that older COMET models were trained on unbounded z-scores. However, the XCOMET paper explicitly states: "we employ min-max scaling on our DA corpus to set its range of scores to [0, 1]".
If the model architecture does not include a final sigmoid layer or hard clipping, I assume it is possible for edge cases to produce values slightly below 0 or above 1. Is this the case? If so, is clipping the scores into [0, 1] a reasonable way to enforce the documented range?
To Reproduce
from comet import download_model, load_from_checkpoint

# Download the XCOMET-XL checkpoint from the Hugging Face Hub and load it
model_name = "Unbabel/XCOMET-XL"
model_path = download_model(model_name)
model = load_from_checkpoint(model_path)
src = ["There were protests worldwide, several criminal prosecutions, and the leaders of the governments of Iceland and Pakistan both resigned."]
mt = ["Було проведено протести по всьому світу, кілька кримінальних переслідувань, а лідери урядів Ісландії та Пакистану both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both"]
ref = ["У світі відбулися протести, кілька кримінальних переслідувань, а обидва лідери урядів Ісландії та Пакистану пішли у відставку."]
# inference
data = [{"src": s, "mt": t, "ref": r} for s, t, r in zip(src, mt, ref)]
scores = model.predict(data, batch_size=1, gpus=0, accelerator="cpu")
print(scores)  # "scores": -0.0025453418493270874
assert scores["scores"][0] < 0.0, scores
Expected behaviour
I expected the score to lie in the range [0, 1].
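For reference, the workaround I am considering is simply clamping the output. This is a minimal sketch, assuming clipping is acceptable; clip_score is a hypothetical helper, not part of the comet API:

```python
def clip_score(score: float) -> float:
    """Clamp a raw XCOMET segment score into the documented [0, 1] range.

    Hypothetical post-processing step, not an official COMET API.
    """
    return min(max(score, 0.0), 1.0)

# The slightly negative score from the example above becomes exactly 0.0:
print(clip_score(-0.0025453418493270874))  # 0.0
```

This leaves in-range scores untouched and only affects the boundary overshoots, but I would like to confirm this does not hide a different underlying problem.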
Environment
OS: Linux (Ubuntu 24.04.3 LTS)
Packaging: comet installed through pip inside a conda environment
Version: comet 2.2.7 (pypi), python 3.11.9 (conda)