CLIPper is a reimplementation of CLIPseg, optimized for multi-class segmentation. The previous implementations, both in the original CLIPseg repo and on Hugging Face, require an image input for each text input. If you want to segment multiple classes in the same image, the image has to be encoded again for every text input.
CLIPper fixes this by encoding the image only once, then encoding each text input
and decoding once per text input. The image encoder accounts for the bulk of the inference time,
so running it only once gives speedups that grow as you add classes, as shown in the plot.
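
To make the difference concrete, here is a toy sketch of the two strategies. Everything in it is a hypothetical stand-in (`encode_image`, `encode_text`, `decode`, `segment_per_prompt`, `segment_shared` are illustrative names, not the CLIPper or CLIPseg API), and the "encoders" are trivial arithmetic rather than real networks. It only counts how often the expensive image encoder runs.

```python
from typing import Dict, List

# Call counter for the (pretend) expensive image encoder.
calls = {"image_encoder": 0}


def encode_image(image: List[float]) -> List[float]:
    """Hypothetical stand-in for the expensive image encoder."""
    calls["image_encoder"] += 1
    return [2.0 * px for px in image]


def encode_text(prompt: str) -> float:
    """Hypothetical stand-in for the cheap text encoder."""
    return float(len(prompt))


def decode(image_features: List[float], text_feature: float) -> List[float]:
    """Hypothetical stand-in for the decoder: one 'mask' per prompt."""
    return [f * text_feature for f in image_features]


def segment_per_prompt(image: List[float], prompts: List[str]) -> Dict[str, List[float]]:
    """Re-encode the image for every prompt (one image pass per class)."""
    return {p: decode(encode_image(image), encode_text(p)) for p in prompts}


def segment_shared(image: List[float], prompts: List[str]) -> Dict[str, List[float]]:
    """Encode the image once and reuse its features for every prompt."""
    features = encode_image(image)
    return {p: decode(features, encode_text(p)) for p in prompts}


if __name__ == "__main__":
    image = [0.1, 0.5, 0.9]
    prompts = ["cat", "dog", "grass"]

    calls["image_encoder"] = 0
    segment_per_prompt(image, prompts)
    print("per-prompt image encodes:", calls["image_encoder"])

    calls["image_encoder"] = 0
    segment_shared(image, prompts)
    print("shared image encodes:", calls["image_encoder"])
```

Both functions return the same masks. The only difference is that the per-prompt version runs the image encoder once per class, while the shared version runs it once in total.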

I hope to contribute this back to Hugging Face, if they'll take it. I'm also working on a C++ implementation.
To build the model in C++, use:
cd clipper
cmake -S . -B build
cd build
cmake --build .
sudo cmake --install . --prefix /usr/local