Inquiry regarding caption confounding issue and data category

Thank you very much for your work and your tremendous contributions to the community.

After reviewing the data samples provided on [Hugging Face](https://huggingface.co/datasets/vector-institute/open-pmc-18m), I noticed that a significant number of samples [[1](https://huggingface.co/datasets/vector-institute/open-pmc-18m/viewer/default/train?row=14)] [[2](https://huggingface.co/datasets/vector-institute/open-pmc-18m/viewer/default/train?row=16)] [[3](https://huggingface.co/datasets/vector-institute/open-pmc-18m/viewer/default/train?row=50)] still exhibit caption confounding. The authors state that this problem was resolved using ChatGPT, but judging from these samples the fix appears to be only partially effective. How should we address this? Is the version we are reviewing the wrong one, or are additional post-processing steps needed?

Additionally, the authors report the number of samples in each data category in their paper [[Fig 4a](https://arxiv.org/pdf/2506.02738)]. However, the current version of the samples does not contain a “category” field. How were these statistics computed? Is there a way to quickly extract the data for a specific category, for example all images and corresponding captions in the radiology category?
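To make the request concrete, here is a minimal sketch of the kind of filtering I have in mind. It assumes a hypothetical `category` field on each sample, which the current release does not provide, and the `caption` key and toy records are placeholders for illustration only. The helper `filter_by_category` is my own name, not part of any library.

```python
from typing import Dict, Iterable, Iterator


def filter_by_category(
    samples: Iterable[Dict],
    category: str,
    field: str = "category",  # hypothetical field name; not in the current release
) -> Iterator[Dict]:
    """Yield samples whose `field` matches `category`, ignoring case and outer whitespace."""
    wanted = category.strip().lower()
    for sample in samples:
        value = sample.get(field)
        # Skip samples that lack the field or carry a non-string label.
        if isinstance(value, str) and value.strip().lower() == wanted:
            yield sample


# Toy records standing in for dataset rows (placeholders, not real data).
toy_samples = [
    {"caption": "Axial CT of the chest.", "category": "radiology"},
    {"caption": "H&E stained section.", "category": "microscopy"},
    {"caption": "Caption without a label."},
]

radiology = list(filter_by_category(toy_samples, "Radiology"))
```

Because the helper accepts any iterable of dictionaries, it could in principle be applied to a streamed dataset as well, provided a category label were available per sample.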

Thank you again for the great work. I'm looking forward to your reply.

Inquiry regarding caption confounding issue and data category #37