Description
Hello,
I am planning to work with the UK Biobank WES final release and want to start by generating a MatrixTable (MT), as this is required for performing a series of quality control (QC) steps on the data.
However, the dataset is very large (~15 TB for 500k samples), and none of my attempts to generate a single MT has completed; the longest run lasted 9 hours before it was terminated.
I would appreciate guidance on:
- Best practices for working with WES data at this scale
- Recommended instance type and number of nodes for generating the MTs and for the subsequent QC
- Whether it is better to split the data by chromosome, or to use some other strategy, to make MT generation feasible
- Any advice on resource configuration in Hail / RAP for this scale of data
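
To make the per-chromosome idea concrete, here is a minimal sketch of what I have in mind, not a tested pipeline. The file patterns, the output paths, and the helper names `plan_jobs` and `write_chromosome_mt` are hypothetical placeholders; I am also assuming the pVCFs are bgzipped and aligned to GRCh38. Only `hl.import_vcf` and `MatrixTable.write` are actual Hail calls.

```python
from typing import List, NamedTuple, Sequence


class ChromJob(NamedTuple):
    chrom: str      # chromosome label, e.g. "1" or "X"
    vcf_glob: str   # glob matching all pVCF blocks for that chromosome
    mt_path: str    # where the per-chromosome MatrixTable is written


CHROMS = [str(i) for i in range(1, 23)] + ["X", "Y"]


def plan_jobs(vcf_pattern: str, mt_pattern: str,
              chroms: Sequence[str] = CHROMS) -> List[ChromJob]:
    """Expand "{chrom}" in both patterns into one job per chromosome."""
    return [
        ChromJob(c, vcf_pattern.format(chrom=c), mt_pattern.format(chrom=c))
        for c in chroms
    ]


def write_chromosome_mt(job: ChromJob) -> None:
    """Import one chromosome's pVCF blocks and write them as a MatrixTable."""
    import hail as hl  # imported lazily so planning works without Hail

    mt = hl.import_vcf(
        job.vcf_glob,
        force_bgz=True,                 # assumption: .vcf.gz files are bgzipped
        reference_genome="GRCh38",      # assumption: data are on GRCh38
        array_elements_required=False,  # tolerate missing array entries
    )
    mt.write(job.mt_path, overwrite=True)


# Hypothetical patterns; the real paths on RAP will differ.
jobs = plan_jobs("/data/wes_c{chrom}_b*.vcf.gz", "/out/wes_chr{chrom}.mt")
```

Each job could then be run separately, for example `write_chromosome_mt(jobs[0])` for chromosome 1, so that a failure on one chromosome does not discard the work done on the others.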
My goal is to generate MTs that can be used for:
- Sample QC
- Variant QC
- Genotype QC
Thank you very much in advance for any insights or practical advice — any recommendations from those who have successfully worked with the 500k WES final release would be very helpful.