Question: Clarification needed on required columns for pretrain.py input #134

@newntch

Hello, thank you for sharing the project.

I’m currently trying to reproduce your project using the MIMIC-IV Demo data.
I followed the instructions and successfully completed the data preprocessing step, resulting in a patient sequence in .parquet format.

Now I’m attempting to pretrain the model by running the pretrain.py script, but execution fails with KeyError: 'type_tokens'.

It seems the model requires the input columns ["concept_ids"], ["type_ids"], ["time_stamps"], ["ages"], ["visit_orders"], and ["visit_segments"].
From what I understand, these can be added by setting the additional_token_types argument when initializing PretrainDataset like so:

PretrainDataset(
    data=pre_train,
    tokenizer=tokenizer,
    max_len=args.max_len,
    mask_prob=args.mask_prob,
    additional_token_types=['type_ids', 'ages', 'time_stamps', 'visit_orders', 'visit_segments'],
    padding_side=args.padding_side,
)

However, the patient sequence obtained from preprocessing does not contain the necessary columns like ['type_tokens', 'age_tokens', 'time_tokens', 'position_tokens', 'visit_tokens'], which causes issues when running pretrain.py.
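
To make the mismatch concrete, here is a minimal sketch of the column check, assuming the expected names are the five listed above. The helper missing_columns and the EXPECTED_COLUMNS list are hypothetical names used only for illustration; they are not part of the project, and the list is my guess at what PretrainDataset looks up, not a confirmed schema.

```python
# Column names taken from the KeyError and the list above. This is an
# assumption about what PretrainDataset expects, not a confirmed schema.
EXPECTED_COLUMNS = [
    "type_tokens",
    "age_tokens",
    "time_tokens",
    "position_tokens",
    "visit_tokens",
]


def missing_columns(available, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from `available`, in order."""
    present = set(available)
    return [name for name in expected if name not in present]


# Columns in my preprocessed patient sequence. With pandas they could be read
# via list(pd.read_parquet(path).columns), where `path` is the parquet file.
available = ["subject_id", "code"]
print(missing_columns(available))
```

With only subject_id and code available, every expected column is reported as missing, which matches the KeyError on the first one that gets accessed.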

I’d like to ask for more information so I can resolve this and proceed further:

  1. What columns should the patient sequence contain after preprocessing?
    From my current result, it only includes ['subject_id', 'code'].

  2. Is there an additional processing step required for the patient sequence before running pretrain.py?
    If so, could you provide details or code for that step?

  3. Do you have more detailed instructions or examples covering the entire process, from preprocessing through pretraining to fine-tuning?
