Hello, thank you for sharing the project.
I’m currently trying to reproduce your project using the MIMIC-IV Demo data.
I followed the instructions and successfully completed the data preprocessing step, resulting in a patient sequence in .parquet format.
Now I’m trying to pretrain the model by running the pretrain.py script, but execution fails with: KeyError: 'type_tokens'
It seems the model requires input columns: ["concept_ids"], ["type_ids"], ["time_stamps"], ["ages"], ["visit_orders"], and ["visit_segments"].
From what I understand, these can be added by setting the additional_token_types argument when initializing PretrainDataset like so:
PretrainDataset(
    data=pre_train,
    tokenizer=tokenizer,
    max_len=args.max_len,
    mask_prob=args.mask_prob,
    additional_token_types=['type_ids', 'ages', 'time_stamps', 'visit_orders', 'visit_segments'],
    padding_side=args.padding_side,
)

However, the patient sequence obtained from preprocessing does not contain the necessary columns like ['type_tokens', 'age_tokens', 'time_tokens', 'position_tokens', 'visit_tokens'], which causes issues when running pretrain.py.
I’d like to ask for more information so I can resolve this and proceed further:
- What columns should the patient sequence contain after preprocessing?
  From my current result, it only includes ['subject_id', 'code'].
- Is there an additional processing step required for the patient sequence before running pretrain.py?
  If so, could you provide details or code for that step?
- Do you have more detailed instructions or examples covering the entire process, from preprocessing through pretraining and fine-tuning?
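
In case it helps to reproduce the problem, here is a minimal sketch of how the column gap can be checked. The helper `missing_columns` and the constant `EXPECTED_TOKEN_COLUMNS` are hypothetical names used only for illustration and are not part of the project. The expected column names are simply the ones listed above, so they may not match what the dataset class actually requires.

```python
# Hypothetical check, not part of the project: compare the columns of the
# preprocessed patient sequence against the token columns named above.

# Assumption: these are the columns the dataset looks up, taken from the
# KeyError and from the list in this issue.
EXPECTED_TOKEN_COLUMNS = [
    "type_tokens",
    "age_tokens",
    "time_tokens",
    "position_tokens",
    "visit_tokens",
]


def missing_columns(present, expected=EXPECTED_TOKEN_COLUMNS):
    """Return the expected column names that are absent from `present`,
    keeping the order of `expected`."""
    present_set = set(present)
    return [name for name in expected if name not in present_set]


# Columns found in my preprocessed patient sequence. With pandas these could
# be read from the file via pd.read_parquet(path).columns.
present = ["subject_id", "code"]

missing = missing_columns(present)
print("Missing columns:", missing)
```

With the two columns I currently have, every expected token column is reported as missing, which is consistent with the KeyError on 'type_tokens'.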