O-J1 (Collaborator) commented Mar 19, 2025

Given the ongoing complaints/requests about captioning, I decided it was time for a comprehensive (and final) overhaul. The PR is currently marked as a draft, since I fully expect Johnny to point out numerous issues or questionable decisions that deserve attention. To users reading this: OT will never be as focused as dedicated captioning tools, and this is the limit of what it will offer.

Additions:

  1. Visual redesign
  2. New captioning models added: Moondream2, WD EVA02 v3, WD Swinv2 v3, and JoyTag. Removed Blip1 due to it being garbage. Consider upgrading Blip2 to Blip3 once released, if feasible.
  3. Integrated Moondream+SAM2.1 for a grounded SAM style approach. Moondream performed notably better for object detection in my tests (though this is somewhat subjective). Additionally, its repository was significantly easier to work with compared to GS.
  4. Added common and practical image operations
  5. Implemented useful caption management operations
  6. The window now resizes properly, including the image
  7. Added undo, redo, clear current caption, and a save button, each with a corresponding shortcut, along with other shortcut improvements.
  8. Multi-line caption support added (fixes previous issues with losing multi-line input)
  9. Fixed bugs related to samples handling
  10. File list now includes filtering capabilities
  11. Enabled opening of the file browser directly at the current directory for all platforms (previously Windows-only)
  12. Added JXL support via a Rust-based Pillow plugin, as the PIL team does not seem to be moving to support it anytime soon. JXL offers lossless JPEG transcoding and significant space savings.
  13. A cursor indicating whether you have brush or fill on!
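
For anyone curious how item 11 can work across platforms, here is a minimal sketch. It is not the code in this PR: the function names `file_browser_command` and `open_in_file_browser` are hypothetical, and it assumes that Linux desktops provide `xdg-open`.

```python
import os
import subprocess
import sys


def file_browser_command(directory: str, platform_name: str = sys.platform) -> list[str]:
    """Return the command that opens `directory` in the platform's file browser."""
    path = os.path.abspath(directory)
    if platform_name.startswith("win"):
        # Windows: Explorer accepts a directory path as its argument.
        return ["explorer", path]
    if platform_name == "darwin":
        # macOS: `open` shows a directory in Finder.
        return ["open", path]
    # Assumption: anything else is a Linux/BSD desktop with xdg-open installed.
    return ["xdg-open", path]


def open_in_file_browser(directory: str) -> None:
    """Launch the file browser without waiting for it to exit."""
    subprocess.Popen(file_browser_command(directory))
```

Keeping the command selection separate from the `subprocess.Popen` call means the platform branching can be checked without actually spawning a file browser.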

Initially, this PR was also supposed to include a samples rework, but the effort involved in just reaching the current stage was beyond my expectations. I've self-reviewed it as best I can, but after looking at it for so long, I'm certain I've become blind to some things.

If someone knows a well-tested, lightweight-ish photoreal replacement for Blip2, I am open to outright replacing it, but you have to provide lots of examples (preferably a peer-reviewed paper).

P.S. After this update, aside from major model improvements or truly groundbreaking developments (not incremental tweaks), I personally won't be addressing further data tool requests, and based on Nero's recent comments, I doubt he will either.

O-J1 added 30 commits March 5, 2025 20:12
…or Caption model too. (To work more reliably)
O-J1 (Collaborator, Author) commented Jun 30, 2025

It's absolutely not perfect, but it works satisfactorily now. Marking ready for review. Many changes and refactors will probably have to happen, but I am committed to getting this merged at some point.

O-J1 marked this pull request as ready for review June 30, 2025 12:37
O-J1 added this to the Maxwell/Pascal sunset milestone Oct 14, 2025
dxqb marked this pull request as draft October 15, 2025 04:03
dxqb linked an issue Oct 24, 2025 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

[Feat]: JXL support
[Bug]: Additional lines in a caption text get deleted when creating a mask.

5 participants