simkin-bioinformatics/summarize_TE

Inputs

  • genome fasta for the organism of interest
  • TE fasta for the TE consensus
  • a bigbed file used for annotation of the TE consensus hub
  • a config file where the user designates which parameters to use

Outputs

  • summary files containing genomic coordinates of each region of the genome that is aligned with a TE
  • summary files containing the total regions of the genome and total nucleotides that correspond to full TE and to partial TE fragments
  • a fasta file of every complete TE that was found in the genome
  • a folder of files that enable visualization of the results in the UCSC Genome Browser
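As a rough illustration of the full versus partial summary described above, the sketch below tallies regions and nucleotides for one TE. The function name `summarize_regions`, the `(start, end)` interval format, and the `full_fraction` cutoff are all hypothetical; the pipeline's actual criterion for calling a copy "full" is not stated here and may differ.

```python
def summarize_regions(regions, te_length, full_fraction=0.95):
    """Tally aligned regions for one TE as full copies or partial fragments.

    regions: list of (start, end) half-open genomic intervals (assumed format).
    te_length: length of the TE consensus in nucleotides.
    full_fraction: hypothetical cutoff; a region at least this fraction of
        the consensus length is counted as a full copy.
    """
    summary = {
        "full": {"regions": 0, "nucleotides": 0},
        "partial": {"regions": 0, "nucleotides": 0},
    }
    for start, end in regions:
        length = end - start
        kind = "full" if length >= full_fraction * te_length else "partial"
        summary[kind]["regions"] += 1
        summary[kind]["nucleotides"] += length
    return summary


# Made-up example: one 5000 nt copy and one 1200 nt fragment of a 5000 nt TE
example = summarize_regions([(100, 5100), (7000, 8200)], te_length=5000)
print(example)
```

The cutoff is exposed as a parameter because what counts as a "complete" TE is a judgment call that would normally live in the config file.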

Dependencies

Using the Pipeline

  • open the summarize_TE_config.yaml file and edit the parameters there
  • open terminal and cd into the folder that has the summarize_TE.smk file
  • activate the conda snakemake environment and run the command below:
      snakemake -s summarize_TE.smk --use-conda --cores 4

Uploading to the UCSC browser

  • open the folder you created while using the pipeline and locate the ucsc_browser folder
  • copy the entire ucsc_browser folder into the public folder where your UCSC browser files are stored
  • visit the UCSC Genome Browser website
  • click on "My Data" then "Connected Hubs"
  • add URLs pointing to the assembly hub, the track hub, and the annotation hub from the ucsc_browser folder you just copied

Important note about annotation file

  • we created a bigBed annotation file from the Dfam EMBL file that can be imported into the UCSC browser
  • the bigBed coordinates were altered to be relative to each TE instead of the entire genome
  • this annotation hub provides another track in the UCSC viewer that shows internal regions, LTRs, and any annotated protein-coding genes

Multiple summaries

If a collection of genomes has been assembled in a single directory, this pipeline can analyze all of them at once.

  • the output is a collection of summaries (total regions, etc.) for each genome
  • the combined summary merges all of the genomes into one file, with a single number representing the presence of each transposon
    • I'd like this number to equal the total number of matching nucleotides divided by the number of nucleotides the total regions would contain if every region were a full copy
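
The proposed presence number can be sketched as a simple ratio. This is only an illustration of the idea, not the pipeline's implementation; the function name `transposon_presence` and its parameter names are hypothetical, and the example numbers are made up.

```python
def transposon_presence(matching_nt, region_count, full_te_length):
    """Fraction of the nucleotides expected if every region were a full copy.

    matching_nt: total nucleotides in the genome that match the TE.
    region_count: total number of regions aligned to the TE.
    full_te_length: length of a full copy of the TE in nucleotides.
    """
    expected_nt = region_count * full_te_length
    if expected_nt == 0:
        # No regions found (or zero-length consensus): report no presence.
        return 0.0
    return matching_nt / expected_nt


# Made-up example: 3 regions of a 5000 nt TE with 9000 matching nucleotides
print(transposon_presence(9000, 3, 5000))  # 0.6
```

A value of 1.0 would mean every region is a full copy, while values near 0 would indicate that the regions are mostly short fragments.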
