Working with UK Biobank WES final release #48

@marise0

Hello,
I am planning to work with the UK Biobank WES final release and want to start by generating a MatrixTable (MT), as this is required for performing a series of quality control (QC) steps on the data.
However, the dataset is very large (~15 TB for 500k samples), and none of my attempts to generate a single MT has completed successfully; the longest run lasted 9 hours before it terminated.
I would appreciate guidance on:
- Best practices for working with WES data at this scale
- Recommended instance type and number of nodes for generating MTs and for the later QC steps
- Whether it is better to split the data by chromosome, or to use some other strategy, to make MT generation feasible
- Resource configuration in Hail / RAP for this scale of data
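
To make the per-chromosome option concrete, here is a minimal sketch of what it might look like: group the pVCF files by chromosome, then import and write each group as its own MT. The file-name pattern (`ukb23157_c<chrom>_b<block>_v1.vcf.gz`), the helper names, and the `array_elements_required=False` setting are assumptions for illustration, not something confirmed for this release, and this has not been run at full scale.

```python
import re
from collections import defaultdict

# Assumed naming scheme: ukb23157_c<chrom>_b<block>_v1.vcf.gz
# (hypothetical; adjust the pattern to the actual file names).
PVCF_NAME = re.compile(r"_c(?P<chrom>[0-9]+|X|Y)_b(?P<block>[0-9]+)_")


def group_by_chromosome(paths):
    """Group pVCF paths by chromosome, ordering each group by block index."""
    groups = defaultdict(list)
    for path in paths:
        match = PVCF_NAME.search(path)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        groups[match.group("chrom")].append((int(match.group("block")), path))
    # Sort numerically by block so that b2 comes before b10.
    return {chrom: [p for _, p in sorted(items)] for chrom, items in groups.items()}


def write_chromosome_mt(chrom, paths, out_dir):
    """Import one chromosome's pVCF blocks and write a per-chromosome MT."""
    import hail as hl  # imported lazily so the grouping helper works without Hail

    mt = hl.import_vcf(
        paths,
        force_bgz=True,                 # treat .vcf.gz inputs as block-gzipped
        reference_genome="GRCh38",
        array_elements_required=False,  # assumption: some array fields have missing entries
    )
    mt.write(f"{out_dir}/chr{chrom}.mt", overwrite=True)
```

Each chromosome could then be submitted as a separate job, so that a failure only costs the work for that one chromosome rather than the whole run.
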
My goal is to generate MTs that can be used for:
- Sample QC
- Variant QC
- Genotype QC

Thank you very much in advance for any insights or practical advice. Recommendations from anyone who has successfully worked with the 500k WES final release would be especially helpful.
