hiu253 edited this page Mar 10, 2015 · 11 revisions

#Group members

Taylor Jaraczewski, Haixiang Liu, Erkin Otles

#Abstract

The cancer biology research community has long debated how accurately cell lines represent the tumors they are derived from. To assess the fidelity of cancer cell lines, many researchers have used hierarchical clustering to show the genomic differences between cell lines and primary tumor samples. While some researchers have found cancer cell lines to be a useful model [1], others have shown that many cell lines are poor representations of their respective primary samples [2]. One of the primary differences between these conflicting findings is the clustering pipeline that each study uses: not only does the clustering algorithm itself differ, but so do the preprocessing steps used to “prepare” the data. This project proposes to evaluate a number of different clustering algorithms in order to assess and validate the assumptions built into each.

#Methods

To investigate the relationship between analysis pipeline and clustering results, we will build a two-step pipeline with modular components. The first step will perform the preprocessing that is typically applied to gene expression data. The second step will house the clustering.

##Preprocessing

For the preprocessing module we would like to implement at least two methods:

- Individual Data Set Normalization
- Joint Data Set Normalization

##Clustering

For the clustering module we would like to investigate three different methods:

- Hierarchical Clustering
- K-Means Clustering
- EM Clustering
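
The difference between the two preprocessing options can be sketched on toy data. The sketch below assumes that "normalization" means a per-gene z-score, which the proposal does not specify; the function names and the expression values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Z-score each gene (column) across the given samples (rows)."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 for constant genes to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]

def normalize_individually(datasets):
    """Normalize each data set on its own, then stack the samples."""
    return [row for rows in datasets for row in zscore_columns(rows)]

def normalize_jointly(datasets):
    """Stack the samples of all data sets first, then normalize them together."""
    stacked = [row for rows in datasets for row in rows]
    return zscore_columns(stacked)

# Toy data (hypothetical): two genes per sample; the cell lines carry a
# constant offset of +10 relative to the tumors.
tumors = [[1.0, 2.0], [3.0, 4.0]]
cell_lines = [[11.0, 12.0], [13.0, 14.0]]

individual = normalize_individually([tumors, cell_lines])
joint = normalize_jointly([tumors, cell_lines])
```

In this toy case, individual normalization removes the data-set-level offset, so the first tumor and the first cell line end up with identical profiles. Joint normalization keeps the offset, so any downstream clustering would mostly separate samples by data set of origin. Which behaviour is preferable is exactly the kind of assumption the pipeline is meant to test.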

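Comparing the outputs of different pipeline configurations requires some measure of agreement between two clusterings. One simple option is the Rand index, the fraction of sample pairs that two clusterings treat the same way. It is used here as an assumption for illustration; the proposal does not name a comparison metric. The label lists below are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two samples
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same samples")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical cluster assignments for five samples from two pipelines.
pipeline_1 = [0, 0, 1, 1, 1]
pipeline_2 = ["x", "x", "y", "y", "x"]
score = rand_index(pipeline_1, pipeline_2)
```

Because only pairwise co-membership is compared, the score does not depend on how each pipeline names its clusters. A value of 1.0 means the two pipelines produced the same partition of the samples.
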
#Anticipated results

After running the algorithms presented in the Methods section, we will assess each one by examining its clustering output. We will then perturb each algorithm by changing its individual assumptions one at a time; we expect that some of these changes will alter the clustered output. Two primary questions will be asked: 1) does the clustered output, and by extension the apparent suitability of cell lines as tumor models, depend on the specific pipeline used, and 2) can the output of certain pipelines be modulated by changing some of their assumptions?

#Related work

In [1], the Cancer Cell Line Encyclopedia (CCLE) is used to predict anticancer drug sensitivity.

In [2], TCGA, CCLE, and HGSOC data are combined in a clustering analysis to compare the genomic profiles of cell lines with those of tumour samples.

[4] reports the analysis of 489 clinically annotated stage-II–IV HGS-OvCa samples and corresponding normal DNA using several analysis methods.

#Division of labor

Each group member will implement one method, and the results will then be compared and analyzed.

How will you ensure all group members contribute equally to the project?

#References

[1] Barretina, Jordi, et al. "The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity." Nature 483.7391 (2012): 603-607.

[2] Domcke, Silvia, et al. "Evaluating cell lines as tumour models by comparison of genomic profiles." Nature communications 4 (2013).

[3] Leek, Jeffrey T., et al. "The sva package for removing batch effects and other unwanted variation in high-throughput experiments." Bioinformatics 28.6 (2012): 882-883.

[4] Cancer Genome Atlas Research Network. "Integrated genomic analyses of ovarian carcinoma." Nature 474.7353 (2011): 609-615.
