- genome fasta for the organism of interest
- TE fasta for the TE consensus
- a bigBed file used to annotate the TE consensus hub
- a config file where the user designates which parameters to use
- summary files containing genomic coordinates of each region of the genome that is aligned with a TE
- summary files containing the total number of genomic regions and the total number of nucleotides that correspond to full TEs and to partial TE fragments
- a fasta file of every complete TE that was found in the genome
- a folder of files that enable visualization of the results in the UCSC Genome Browser
- a conda environment with snakemake installed (installation instructions at the link below)
https://snakemake.readthedocs.io/en/stable/getting_started/installation.html
- open the summarize_TE_config.yaml file and edit the parameters there
- open terminal and cd into the folder that has the summarize_TE.smk file
- activate the conda snakemake environment and run the command below:
    snakemake -s summarize_TE.smk --use-conda --cores 4
- open the folder you created while using the pipeline and locate the ucsc_browser folder
- copy the entire ucsc_browser folder into the public folder where your UCSC browser files are stored
- visit the UCSC Genome Browser website
- click on "My Data" then "Connected Hubs"
- add URLs pointing to the assembly hub, the track hub, and the annotation hub from the ucsc_browser folder you just copied
- we created a bigBed annotation file from the Dfam EMBL file; it can be imported into the UCSC browser
- the bigBed coordinates were altered so that they are relative to each TE rather than to the entire genome
- this annotation hub provides another track in the UCSC viewer that shows internal regions, LTRs, and any annotated protein-coding genes
if a collection of genomes has been assembled in a single directory, this pipeline can analyze all of them at once
- the output is a collection of summaries (total regions, etc.) for each genome
- the combined summary merges all of the genomes into one file, with a single number representing the presence of each transposon
- I'd like this number to equal the total number of matching nucleotides divided by the number of nucleotides that would be present if every matching region were a full copy
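
The proposed number can be sketched in a few lines of Python. This is only an illustration, not part of the pipeline: the function name and its parameters are hypothetical, and it assumes that the length of a "full copy" is the length of the TE consensus sequence.

```python
def te_presence_score(matched_nt, n_regions, consensus_len):
    """Matching nucleotides divided by the nucleotides expected
    if every matching region were a full-length copy.

    Assumption: a full copy has the length of the TE consensus.
    """
    expected_nt = n_regions * consensus_len
    if expected_nt == 0:
        # No regions (or an empty consensus): define the score as 0
        return 0.0
    return matched_nt / expected_nt


# Hypothetical example: 3 regions match a 5,000 nt consensus and
# together cover 5,000 + 2,500 + 1,500 = 9,000 matching nucleotides.
score = te_presence_score(9000, 3, 5000)
print(score)  # 0.6
```

A score of 1.0 would mean that every region is a full copy, while values near 0 would mean that the TE is present mostly as short fragments.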