Skip to content

Question about what happens if hapcount/dosage files contain people who are not in your phenotype file #50

@jtb324

Description

@jtb324

Issue (Really a Question):

What is the behavior of tractor when you have people in the hapcount/dosage files who are not in the phenotype file? I can't find anywhere in the documentation that says whether tractor drops these people out or it assumes they are controls, or if the behavior is undefined in the design of tractor. tldr of what I think happens: from my understanding of the code it seems like the actually vector going into the regression function call will contain all individuals in the hapcount/dosage files but the covariates will be filled with NAs for those individuals and then the lm or glm function seems to silently ignore individuals who are missing covariates. So basically the regressions acts the same as if those individuals were never included in the analysis but the calculated AF and LA proportions are off because it uses the number of individuals in the hapcount/dosage file. The "My understanding of the code" part of this issue explains everything in more detail with relevant line number

Context:

The lab I am at is running a tractor analysis in a cohort of ~26,000 individuals. Someone else in my lab had already performed the local ancestry inference using RFmix and then performed the unkinking step using a larger cohort of ~35,000 individuals. This part took a fair amount of time for this individual to do and we would like to not repeat this work if we can avoid it.

My understanding of the code:
I see that in the run_tractor.R script, there is a block of code from lines 273-281 that drops individuals who are in the phenotype file but are not in the genotype file. I don't see any functionality that would ensure that all individuals in the genotype file are in the phenotype file.

If it were the situation that there are individuals in the hapcount/dosage file that are not in the phenotype file then it seems like there would be 2 issues (based on my understanding of the code):

  1. The ID vector generated at 285 would have the incorrect IDs

I think when the left join is used like ID = left_join(sampleID_hapdos, df_phe) and sampleID_hapdos has individuals who are not in df_phe, it will return a data.table for with IDs for everyone in the sampleID_hapdos and it will fill in the missing covariates for thes individuals as NA. (See attached image where I created 2 fake data.tables of different dimensions and joined them.)

Image

I think these NAs are then silently dropped in the latter "lm" or "glm" call because that is the default behavior of the R library. This behavior isn't necessary an issue but it might be undefined to users what is happening.

  1. in the matrix formation at line 313 (see code block below) to not have the appropriate dimensions and then would affect the AF and LAprop calculation
# lines 313-326
LAG_ = sapply(data, function(dt) dt[i, 6:ncol(dt), with = FALSE]) %>% 
              as.numeric() %>%
              matrix(nrow=nrow(df_phe), ncol=length(data))
            
colnames(LAG_) = c(LAcolnames, Gcolnames)

AF            = as.numeric(colSums(LAG_[,Gcolnames])/colSums(LAG_[,LAcolnames]))
LAprop        = as.numeric(colSums(LAG_[,LAcolnames])/sum(LAG_[,LAcolnames]))

# Sum of Local Ancestry equals 1, so exclude one LA term
LAG_          = LAG_[,c(LAcolnames[1:(length(LAcolnames) - 1)], Gcolnames)]
coef_rownames = paste0("LAG_", colnames(LAG_))
LA_rownames   = paste0("LAG_", LAcolnames[1:(length(LAcolnames) - 1)])
G_rownames    = paste0("LAG_", Gcolnames)

Since AF and LAprop are calculated using dimensions from the hapcount/dosage files, it seems like these values may be inflated in this specific situation.

Is my understanding of how the code is behaving correct or am I missing something?

Advice on running this?

There are really only 2 ways forward that I can think of:

  1. I can filter the hapcount and dosage files to the correct individuals. This fixes the code issues 1 and 2. I am not certain if this introduces a different bias because the unkinking steps were done with everyone

  2. I can filter the output from RFmix to have only these individuals. I would prefer not to have to do this because it is going to involve parsing somewhere on the order of 2TB of data. If this is the better option then it is what it is

Do you have any advice or are my observations of the code's behavior in this edge case correct?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions