Good job! My question is: why use a different class token for each stage when **only the final class token is used for classification**? https://github.com/microsoft/CvT/blob/34d1af94c95442b19fb9470e0c9dd5ee11be2024/lib/models/cls_cvt.py#L607