Pedigree Painter (pepa)

pepa is a visualization tool for genomics data that shows which parts, and how much, of each parental genome the offspring have inherited. The pipeline generates data optimized for ggplot2, so users can create their own plots from it. The R scripts in the pipeline focus on graphics; they are optimized for tens of offspring but have been tested on up to 250 samples with 30 chromosomes each.

More information can be found in the article https://www.biorxiv.org/content/10.1101/2024.12.18.629215v1

Examples of the inputs and outputs are in the Test folder.

Description

The script performs the following tasks:
1. Processes VCF files for SNP clustering and ancestry identification.
2. Refines clusters and makes files suitable for ancestry-based visualizations.
3. Optionally computes gene-based ancestry statistics.

Installing `pepa` via Conda

The pipelines are written in Bash, which makes them easy to modify and run locally. However, it is recommended to use the program through Conda: the Conda installation checks for all the prerequisites and has been tested for both clustering and visualization. The usage below assumes you are running within a Conda environment. If you don’t already have Conda installed, download and install it first (e.g. Miniconda or Anaconda).

The scripts use a mix of Bash, Python, and R. The Python scripts do not generate plots and use only basic modules, so they do not require the installation of any extra code. The output of all analyses is made suitable for ggplot2, and all plots are produced by R scripts, which require only three packages.

Installation Steps

Create a Conda Environment (Optional): To keep dependencies isolated, create a dedicated environment:
```
conda create -n pepa_env python=3.10
conda activate pepa_env
```
Install pepa: Run the following command to install pepa:
```
conda install mitopozzi::pepa
```
Verify Installation: Ensure the tool is installed successfully:
```
pepa-paint -h
```

Usage

Multiple commands are available in pepa, but the main pipeline can be used with the command below.

pepa-paint -i ListVCF.txt -o Results -1 Parent1.vcf -2 Parent2.vcf -c 1000 -G NCBIannotation.gtf -C 1

Required Flags

| Flag | Description |
|------|-------------|
| `-i` | Specify a file with a list of VCF files (one VCF per sample). |
| `-o` | Specify the base name for output files. |
| `-1` | Specify the first parental VCF file for comparison (Blue). |
| `-2` | Specify the second parental VCF file for comparison (Red). |
| `-c` | Specify the clustering size for SNP regions (e.g., 100). |

Optional Flags

| Flag | Description |
|------|-------------|
| `-I` | Specify a tabulated file (generated by `pepa-table`). |
| `-G` | Specify a GTF file for annotation conversion (it will be converted to .anno). |
| `-A` | Specify an annotation file (.anno). |
| `-C` | Generate and plot chromosome-wide ancestry percentages (default: inactive). |
| `-h` | Display the help message and usage instructions. |
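
Before launching the main pipeline, it can help to confirm that every VCF named in the list file passed to `-i` actually exists. The following is a minimal sketch, not part of pepa: `check_vcf_list` is a hypothetical helper, and it assumes the list file holds one VCF path per line, which the flag description suggests but does not state.

```shell
# Hypothetical helper, not part of pepa.
# Assumption: the list file holds one VCF path per line.
check_vcf_list() {
    list_file="$1"
    missing=0
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r vcf || [ -n "$vcf" ]; do
        [ -z "$vcf" ] && continue            # skip blank lines
        if [ ! -f "$vcf" ]; then
            echo "missing: $vcf" >&2         # report each path that is not a file
            missing=$((missing + 1))
        fi
    done < "$list_file"
    [ "$missing" -eq 0 ]                     # succeed only if nothing is missing
}

# Example use (commented out so the snippet runs without pepa installed):
# check_vcf_list ListVCF.txt && pepa-paint -i ListVCF.txt -o Results -1 Parent1.vcf -2 Parent2.vcf -c 1000
```

Chaining the check with `&&` means the pipeline only starts when all listed files are present.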

Other possible commands are below:

Perform all analyses without plotting anything. The output file <basename>_Clustered.csv is suitable for plotting in ggplot2.

pepa-base -i ListVCF.txt -o Results -1 Parent1.vcf -2 Parent2.vcf -c 1000

Perform only the conversion from VCF files to the comparison table (similar to VCFtools). The output file <basename>_Tabulated.csv is suitable for visual inspection of the data.

pepa-table -i ListVCF.txt -o Results -1 Parent1.vcf -2 Parent2.vcf -c 1000

Convert a GTF file into a .anno file. This is a more readable genome annotation format; you can see an example (S. pombe nuclear genome) in the Examples folder.

pepa-gtf -I NCBIannotation.gtf -O Results

This is a utility script that splits multi-sample VCF files (BG-zipped) into separate, single-sample VCF files. Many other tools can perform this action, but this one is especially suited for computers with limited resources. The flag -b specifies how many samples to process at the same time: smaller values of -b suit low-memory computers but make the code much slower, while larger values suit computers with more resources. The output of this command can be used for pepa-paint by listing the resulting files, e.g. by running ls *.vcf > ListVCF.txt

pepa-split -I MultiSampleVCFfile.vcf.gz -b 20
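
The list file expected by `pepa-paint -i` can also be built with a small helper. The following is a minimal sketch under stated assumptions: the single-sample files end in `.vcf` and sit together in one directory (the README does not say where `pepa-split` writes them), and `make_vcf_list` is a hypothetical name, not a pepa command.

```shell
# Hypothetical helper, not part of pepa.
# Assumption: single-sample VCFs end in .vcf and live in one directory.
make_vcf_list() {
    vcf_dir="$1"
    out_file="$2"
    : > "$out_file"                          # start from an empty list
    for vcf in "$vcf_dir"/*.vcf; do
        [ -f "$vcf" ] || continue            # the glob matched nothing
        printf '%s\n' "$vcf" >> "$out_file"  # one path per line
    done
}

# Example: make_vcf_list . ListVCF.txt
```

Because the glob is `*.vcf`, the compressed multi-sample input (`.vcf.gz`) is left out of the list.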

📁 `Test` Folder

The folder contains:

Example_Input/: Sample VCFs, annotation, and list file used as input.
Example_Output/: Expected output files, including:
- Clustered results (*_Clustered.csv): these show the ancestry blocks detected with a given SNP clustering size. You can try different -c values to tune resolution. Examples can be found in the figure Fig_Clustered, with -c values of 2, 10, 100, 1000, 2000, and 5000.
- Summary tables and transformed CSVs.
- Genome painting plots (.png) and barplots (.pdf).
Folders like Example10/, Example50/, and Example100/ show how pepa paints chromosomes across different numbers of individuals.

You can use these examples to explore and understand pepa’s output format and parameter effects.
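
To compare several clustering sizes, the `pepa-base` command can be repeated with different `-c` values. The sketch below is a dry run that only prints the commands. The function name `sweep_cluster_sizes` and the `Results_c<size>` output base names are assumptions made for illustration, and the input file names are the placeholders used in the Usage section.

```shell
# Hypothetical helper, not part of pepa.
# Dry run: print one pepa-base command per clustering size given as an argument.
sweep_cluster_sizes() {
    for c in "$@"; do
        # Assumption: a separate output base name per size keeps results apart
        printf '%s\n' "pepa-base -i ListVCF.txt -o Results_c${c} -1 Parent1.vcf -2 Parent2.vcf -c ${c}"
    done
}

sweep_cluster_sizes 10 100 1000
```

After reviewing the printed commands, piping the output into `sh` (for example `sweep_cluster_sizes 10 100 1000 | sh`) would execute them inside the Conda environment.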
