Inputs

a genome FASTA file for the organism of interest
a TE FASTA file containing the TE consensus sequence
a bigBed file used to annotate the TE consensus hub
a config file in which the user sets the parameters to use

Outputs

summary files containing the genomic coordinates of each region of the genome that aligns to a TE
summary files containing the total number of regions and the total number of nucleotides that correspond to full TEs and to partial TE fragments
a FASTA file of every complete TE found in the genome
a folder of files that enable visualization of the results in the UCSC Genome Browser
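
The full-versus-partial totals in the second summary can be pictured with a minimal Python sketch. The hit layout (start and end on the TE consensus), the `min_fraction` cutoff, and every name below are assumptions made for illustration; the pipeline's own scripts may define a "full" TE differently.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One aligned region in 0-based, half-open consensus coordinates (assumed layout)."""
    te_start: int
    te_end: int


def tally_hits(hits: Iterable[Hit], consensus_length: int, min_fraction: float = 1.0) -> dict:
    """Count regions and nucleotides for full versus partial TE matches.

    A hit counts as "full" when it covers at least min_fraction of the
    consensus. This cutoff is a hypothetical parameter, not one taken
    from the pipeline's config file.
    """
    totals = {"full_regions": 0, "full_nt": 0, "partial_regions": 0, "partial_nt": 0}
    for hit in hits:
        length = hit.te_end - hit.te_start
        kind = "full" if length >= min_fraction * consensus_length else "partial"
        totals[f"{kind}_regions"] += 1
        totals[f"{kind}_nt"] += length
    return totals


# Three made-up hits against a 5000 bp consensus: one full copy, two fragments.
example = tally_hits([Hit(0, 5000), Hit(0, 2500), Hit(4000, 5000)], consensus_length=5000)
print(example)
```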

Dependencies

a conda environment with snakemake installed (instructions for installing at link below)

https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

Using the Pipeline

open the summarize_TE_config.yaml file and edit the parameters there
open a terminal and cd into the folder that contains the summarize_TE.smk file
activate the conda snakemake environment and run the command below:
    snakemake -s summarize_TE.smk --use-conda --cores 4
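
For scripted runs, the same command can be assembled in Python. This is only a sketch: `build_command` and `run_pipeline` are hypothetical helpers, not part of this repository, and the optional `-n` flag is Snakemake's dry-run option, which lists the planned jobs without executing them.

```python
import subprocess


def build_command(snakefile="summarize_TE.smk", cores=4, dry_run=False):
    """Return the snakemake invocation from this README as an argument list."""
    command = ["snakemake", "-s", snakefile, "--use-conda", "--cores", str(cores)]
    if dry_run:
        command.append("-n")  # dry run: show what would be executed, run nothing
    return command


def run_pipeline(**kwargs):
    """Run the pipeline; assumes the conda snakemake environment is already active."""
    return subprocess.run(build_command(**kwargs), check=True)


# Print the dry-run command instead of executing it.
print(" ".join(build_command(dry_run=True)))
```

A dry run is a cheap way to check the edited config before committing four cores to a full run.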

Uploading to the UCSC browser

open the output folder created by the pipeline and locate the ucsc_browser folder
copy the entire ucsc_browser folder into the public folder where your UCSC browser files are stored
visit the UCSC Genome Browser website
click on "My Data" then "Connected Hubs"
add URLs pointing to the assembly hub, the track hub, and the annotation hub in the ucsc_browser folder you just copied

Important note about annotation file

we created a bigBed annotation file from the Dfam EMBL file; it can be imported into the UCSC browser
the bigBed coordinates were altered to be relative to each TE instead of the entire genome
this annotation hub provides another track in the UCSC viewer that shows internal regions, LTRs, and any annotated protein-coding genes

Multiple summaries

if a collection of genome assemblies has been gathered in a single directory, this pipeline can analyze all of them at once
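
As a rough illustration of the directory-based mode, the sketch below lists the genome FASTA files in one folder. The accepted extensions and the `find_genomes` name are assumptions; the pipeline may discover its inputs in another way, for example through the config file.

```python
from pathlib import Path

# Assumed FASTA extensions; adjust to match how your genomes are named.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def find_genomes(directory):
    """Return the FASTA files directly inside `directory`, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES
    )


# List candidate genomes in the current working directory.
for genome in find_genomes("."):
    print(genome.name)
```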

the output is a collection of summaries (total regions, total nucleotides, etc.) for each genome
the combined summary merges all of the genomes into one file, with a single number representing the presence of each transposon
- I'd like this number to equal the total number of matching nucleotides divided by the number of nucleotides the matched regions would contain if every region were a full copy
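
The proposed number can be written as matched nucleotides divided by (number of regions times full TE length). The Python sketch below is one reading of that definition, not existing pipeline code; `presence_score` and its argument names are hypothetical.

```python
def presence_score(matched_nt: int, region_count: int, full_te_length: int) -> float:
    """Fraction of the nucleotides that would be present if every region were a full copy.

    matched_nt      total nucleotides that align to the TE across all regions
    region_count    number of regions that align to the TE
    full_te_length  length of one full copy (the consensus length)
    """
    denominator = region_count * full_te_length
    if denominator == 0:
        return 0.0  # no regions found, so the transposon is treated as absent
    return matched_nt / denominator


# Made-up example: three regions matching 5000, 2500, and 1000 nt of a 5000 bp TE.
score = presence_score(matched_nt=8500, region_count=3, full_te_length=5000)
print(round(score, 3))  # 0.567
```

Under this definition a genome containing only full copies scores 1.0, and the score falls toward 0 as the matches become more fragmentary.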


simkin-bioinformatics/summarize_TE
