
Negative score Unbabel/XCOMET-XL #260

@cgr71ii

Description


🐛 Bug

Hi!

I am using Unbabel/XCOMET-XL for reference-based MT evaluation and observed a negative score. Is this expected?

I am aware from Issue #38 that older COMET models were trained on unbounded z-scores. However, the XCOMET paper explicitly states: "we employ min-max scaling on our DA corpus to set its range of scores to [0, 1]".
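For context, min-max scaling of the training labels guarantees a [0, 1] range only for the *targets*, not for the regressor's outputs at inference time. An illustrative sketch (the corpus statistics here are made up, not the paper's actual values):

```python
# Illustrative min-max scaling as described in the XCOMET paper.
# lo/hi would be the minimum and maximum scores observed in the DA corpus;
# the values used below are invented for demonstration only.
def min_max_scale(x: float, lo: float, hi: float) -> float:
    return (x - lo) / (hi - lo)

print(min_max_scale(0.0, -2.0, 2.0))  # → 0.5
```

A model trained to regress onto such scaled targets can still extrapolate slightly outside [0, 1] on unusual inputs.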

If the model architecture does not include a final sigmoid or hard-clipping layer, I assume it is possible for edge cases to produce values slightly below 0 or above 1. Is that the case? If so, is clipping a reasonable way to force scores into the range [0, 1]?
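If clipping does turn out to be the recommended workaround, a minimal post-processing sketch (a hypothetical helper, not part of the comet API) could be:

```python
# Hypothetical post-processing: clamp each segment-level score into [0, 1].
# Assumes clipping is an acceptable fix; awaiting confirmation from the maintainers.
def clip_scores(scores: list[float]) -> list[float]:
    return [min(max(s, 0.0), 1.0) for s in scores]

print(clip_scores([-0.0025453418493270874, 0.5, 1.01]))  # → [0.0, 0.5, 1.0]
```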

To Reproduce

```python
from comet import download_model, load_from_checkpoint

model_name = "Unbabel/XCOMET-XL"
model_path = download_model(model_name)
model = load_from_checkpoint(model_path)

src = ["There were protests worldwide, several criminal prosecutions, and the leaders of the governments of Iceland and Pakistan both resigned."]
# Degenerate MT hypothesis with a long repetition loop:
mt = ["Було проведено протести по всьому світу, кілька кримінальних переслідувань, а лідери урядів Ісландії та Пакистану both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both both"]
ref = ["У світі відбулися протести, кілька кримінальних переслідувань, а обидва лідери урядів Ісландії та Пакистану пішли у відставку."]

# inference
data = [{"src": s, "mt": t, "ref": r} for s, t, r in zip(src, mt, ref)]
scores = model.predict(data, batch_size=1, gpus=0, accelerator="cpu")

print(scores)  # "scores": -0.0025453418493270874

assert scores["scores"][0] < 0.0, scores
```

Expected behaviour

I expected to obtain a value for the score in the range [0, 1].

Environment

OS: Linux (Ubuntu 24.04.3 LTS)
Packaging: comet installed through pip inside a conda environment
Version: comet 2.2.7 (pypi), python 3.11.9 (conda)
