Generates a samplesheet compatible with TSO500 LocalApp analysis.
Running a LocalApp analysis requires a samplesheet in a specific format, consistent with the performed sequencing, to guide the analysis. Such a samplesheet can be created manually by taking the template samplesheet file provided by Illumina and filling in the run-specific information for every sequencing run in Excel or a similar tool. This script attempts to automate samplesheet generation as much as possible. It takes the sequencing run information as a simple tab-separated table and generates a samplesheet that fulfils the LocalApp requirements.
- python3>=3.9
- pandas==2.2.0
Names of the input parameters defined for the script and their possible values are listed in the table below.
| parameter | description | type | default |
|---|---|---|---|
| --run-id | ID of the sequencing run for which the samplesheet is generated. | str | |
| --index-type | Type of the used index. Supported values are dual and simple. | str | dual |
| --index-length | Number of nucleotides in the used index. Supported values are 8 and 10. | int | |
| --investigator-name | Value of the Investigator Name field in the samplesheet. It is preferred for the string to be in the form name (inpred_node). The string cannot contain a comma. | str | '' |
| --experiment-name | Value of the Experiment Name field in the samplesheet. It is preferred for the string to be a space separated list of all the studies from which the run samples are. The string cannot contain a comma. | str | '' |
| --input-info-file | Absolute path to an input info file. | str | |
| --read-length-1 | Length of the sequenced forward reads. This value will be filled in the Reads section of the samplesheet. | int | 101 |
| --read-length-2 | Length of the sequenced reverse reads. This value will be filled in the Reads section of the samplesheet. | int | 101 |
| --adapter-read-1 | Nucleotide adapter sequence of read1. This value will be filled in the Settings section of the samplesheet. | str | |
| --adapter-read-2 | Nucleotide adapter sequence of read2. This value will be filled in the Settings section of the samplesheet. | str | |
| --adapter-behavior | This value will be filled in the Settings section of the samplesheet and passed as an input parameter to BCL convert. Supported values are trim and mask. For more info about BCL convert, see the BCL convert user guide. | str | trim |
| --minimum-trimmed-read-length | This value will be filled in the Settings section of the samplesheet and passed as an input parameter to BCL convert. | int | 35 |
| --mask-short-reads | This value will be filled in the Settings section of the samplesheet and passed as an input parameter to BCL convert. | int | 22 |
| --override-cycles | This value will be filled in the Settings section of the samplesheet and passed as an input parameter to BCL convert. | str | |
| --samplesheet-version | Version in which the samplesheet is generated. Only v1 is implemented now. | str | v1 |
| --help | Print the help message and exit. | | |
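
To make the table concrete, here is a minimal sketch of how a subset of these parameters could be declared with Python's `argparse`. It is not the script's actual code: the helper `no_comma`, the function `build_parser`, and the choice to mark `--run-id` and `--index-length` as required (they have no default in the table) are assumptions made for illustration.

```python
import argparse


def no_comma(value: str) -> str:
    """Reject strings that contain a comma (hypothetical helper)."""
    if "," in value:
        raise argparse.ArgumentTypeError("value cannot contain a comma")
    return value


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser covering only a subset of the documented parameters.
    parser = argparse.ArgumentParser(description="samplesheet parameter sketch")
    # Assumption: parameters without a default in the table are required.
    parser.add_argument("--run-id", type=str, required=True)
    parser.add_argument("--index-type", type=str, choices=["dual", "simple"], default="dual")
    parser.add_argument("--index-length", type=int, choices=[8, 10], required=True)
    parser.add_argument("--investigator-name", type=no_comma, default="")
    parser.add_argument("--experiment-name", type=no_comma, default="")
    parser.add_argument("--read-length-1", type=int, default=101)
    parser.add_argument("--read-length-2", type=int, default=101)
    parser.add_argument("--adapter-behavior", type=str, choices=["trim", "mask"], default="trim")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--run-id", "191206_NB501498_0174_AHWCNMBGXC", "--index-type", "simple", "--index-length", "8"]
)
print(args.index_type, args.index_length, args.read_length_1)
```

Declaring the supported values through `choices` and the comma restriction through a `type` callable lets `argparse` reject invalid input before any samplesheet is written.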
The format of the input info file (--input-info-file) is described below.
The file is expected to contain tab separated values (.tsv file). Rows starting with the # character are ignored (treated as comments). The first non-comment row is treated as the header containing column names.
The required columns of the file are: sample_id, molecule, run_id, barcode, index (at least one of the columns barcode and index has to be non-empty).
| column | description |
|---|---|
| sample_id | ID of a sample. |
| molecule | DNA or RNA, choose the appropriate one. |
| run_id | ID of the sequencing run in which the sample was sequenced. |
| barcode | One of the Index_ID values from the `indexes/TSO500*_dual_*.tsv` files or one of the I7_Index_ID values from the `indexes/TSO500*simple*.tsv` file. The value should correspond to the indexes used for sequencing of the sample. Alternatively, the field can be left empty or contain the string NA. |
| index | DNA sequence of the forward index, taken from the column index of one of the `indexes/*.tsv` files. Alternatively, the field can be left empty or contain the string NA. |
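
The format rules above can be sketched with the standard library as follows. This is an illustration only, not the script's own parser (which may use pandas). The function names, the sample IDs, and the index sequences in the example are invented. The sketch also assumes that the "at least one of barcode and index" rule applies to each row.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_id", "molecule", "run_id", "barcode", "index"]


def is_empty(value):
    """Treat a missing value, an empty string and the string NA as empty."""
    return value is None or value.strip() in ("", "NA")


def read_info_file(handle):
    """Parse a tab separated info file, skipping rows that start with '#'."""
    # Drop comment rows and blank rows; the first remaining row is the header.
    lines = [line for line in handle if not line.startswith("#") and line.strip()]
    reader = csv.DictReader(lines, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = list(reader)
    for row in rows:
        # Assumption: the barcode/index rule is checked per row.
        if is_empty(row["barcode"]) and is_empty(row["index"]):
            raise ValueError(f"sample {row['sample_id']}: barcode and index are both empty")
    return rows


# Invented example content; real files use IDs and indexes from the sequencing run.
example = io.StringIO(
    "# comment row, ignored\n"
    "sample_id\tmolecule\trun_id\tbarcode\tindex\n"
    "sampleA\tDNA\t191206_NB501498_0174_AHWCNMBGXC\tNA\tACGTACGT\n"
    "sampleB\tRNA\t191206_NB501498_0174_AHWCNMBGXC\t\tTGCATGCA\n"
)
rows = read_info_file(example)
print([row["sample_id"] for row in rows])
```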
# nextseq instrument, dual indexes, index length 8
--index-type dual \
--index-length 8 \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
--adapter-read-2 "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 35 \
--override-cycles "U7N1Y93;I8;I8;U7N1Y93"
# novaseq instrument, dual indexes, index length 10
--index-type dual \
--index-length 10 \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC" \
--adapter-read-2 "CTGTCTCTTATACACATCTGACGCTGCCGACGA" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 35 \
--override-cycles "U7N1Y93;I10;I10;U7N1Y93"
# nextseq instrument, simple indexes, index length 8, legacy
--index-type simple \
--index-length 8 \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
--adapter-read-2 "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 22 \
--override-cycles "U7N1Y93;I8;I8;U7N1Y93"
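
In the three settings above, the cycle counts in each segment of the --override-cycles string add up to the matching read or index length (7 + 1 + 93 = 101 for both reads). A minimal sketch of that sanity check follows. The function names are hypothetical, and the assumed segment order (read 1, index 1, index 2, read 2) is inferred from the examples rather than taken from the script.

```python
import re


def cycles_per_read(override_cycles: str) -> list[int]:
    """Sum the cycle counts of every ';'-separated segment."""
    totals = []
    for segment in override_cycles.split(";"):
        # Each element is a letter (e.g. U, N, Y, I) followed by a cycle count.
        parts = re.findall(r"([A-Z])(\d+)", segment)
        totals.append(sum(int(count) for _, count in parts))
    return totals


def matches_run(override_cycles: str, read_length_1: int, read_length_2: int, index_length: int) -> bool:
    """Check the string against the read and index lengths (assumed segment order)."""
    expected = [read_length_1, index_length, index_length, read_length_2]
    return cycles_per_read(override_cycles) == expected


print(cycles_per_read("U7N1Y93;I8;I8;U7N1Y93"))  # [101, 8, 8, 101]
print(matches_run("U7N1Y93;I10;I10;U7N1Y93", 101, 101, 10))  # True
```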
python3 samplesheet_generator.py [parameters]...
docker run --rm -it \
-v /path_input_file:/container_path_input_file \
inpred/samplesheet_generator:latest bash
singularity run \
-B /path_input_file:/container_path_input_file \
docker://inpred/samplesheet_generator:latest bash
apptainer run \
-B /path_input_file:/container_path_input_file \
docker://inpred/samplesheet_generator:latest bash
Test data are located in the test subfolder of the repository. The input info file is named infoFile.tsv and the expected output is stored in samplesheet.tsv.
The script is tested with data from a specific sequencing run. The run consists of artificial samples, including AcroMetrix samples. The sequencing was performed on a NextSeq instrument, with the legacy parameter settings and file formats.
# ${GITHUB_REPOSITORY_LOCAL_PATH} is an absolute path
# to the samplesheet_generator repository on the local computer.
# define the testRunID value
testRunID="191206_NB501498_0174_AHWCNMBGXC"
# execute samplesheet_generator
python3 samplesheet_generator.py \
--run-id ${testRunID} \
--index-type simple \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
--adapter-read-2 "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 22 \
--override-cycles "U7N1Y93;I8;I8;U7N1Y93" \
--samplesheet-version "v1"
# ${GITHUB_REPOSITORY_LOCAL_PATH} is an absolute path
# to the samplesheet_generator repository on the local computer.
# ${INFO_INPUT_FILE_CONTAINER} is an absolute path to the input info file
# in the container.
docker run --rm -it \
-v ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
inpred/samplesheet_generator:latest bash
Docker> testRunID="191206_NB501498_0174_AHWCNMBGXC"
Docker> python3 samplesheet_generator.py \
--run-id ${testRunID} \
--index-type simple \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
--adapter-read-2 "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 22 \
--override-cycles "U7N1Y93;I8;I8;U7N1Y93" \
--samplesheet-version "v1"
singularity run \
-B ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
docker://inpred/samplesheet_generator:latest bash
Singularity> testRunID="191206_NB501498_0174_AHWCNMBGXC"
Singularity> python3 samplesheet_generator.py \
--run-id ${testRunID} \
--index-type simple \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
--adapter-read-2 "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 22 \
--override-cycles "U7N1Y93;I8;I8;U7N1Y93" \
--samplesheet-version "v1"
apptainer run \
-B ${GITHUB_REPOSITORY_LOCAL_PATH}/test/infoFile.tsv:${INFO_INPUT_FILE_CONTAINER} \
docker://inpred/samplesheet_generator:latest bash
Apptainer> testRunID="191206_NB501498_0174_AHWCNMBGXC"
Apptainer> python3 samplesheet_generator.py \
--run-id ${testRunID} \
--index-type simple \
--index-length 8 \
--investigator-name "Name (InPreD node)" \
--experiment-name "OUS pathology test run" \
--input-info-file ${INFO_INPUT_FILE_CONTAINER} \
--read-length-1 101 \
--read-length-2 101 \
--adapter-read-1 "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" \
--adapter-read-2 "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" \
--adapter-behavior "trim" \
--minimum-trimmed-read-length 35 \
--mask-short-reads 22 \
--override-cycles "U7N1Y93;I8;I8;U7N1Y93" \
--samplesheet-version "v1"
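
After running any of the test commands above, the result can be compared with the expected output in test/samplesheet.tsv. The sketch below is one way to do that. It assumes the generated samplesheet has been saved to a file; how the script delivers its output is not described here. The function name `samplesheet_diff` and the demo strings are hypothetical.

```python
import difflib


def samplesheet_diff(generated_text: str, expected_text: str) -> list[str]:
    """Return unified diff lines; an empty list means the contents match."""
    # Ignore trailing whitespace and line-ending differences.
    generated = [line.rstrip() for line in generated_text.splitlines()]
    expected = [line.rstrip() for line in expected_text.splitlines()]
    return list(difflib.unified_diff(expected, generated, "expected", "generated", lineterm=""))


# With real files (paths are assumptions), the texts could be loaded like this:
#   from pathlib import Path
#   generated_text = Path("generated_samplesheet.csv").read_text()
#   expected_text = Path("test/samplesheet.tsv").read_text()
demo_generated = "line one\nline two  \n"
demo_expected = "line one\r\nline two\r\n"
for line in samplesheet_diff(demo_generated, demo_expected):
    print(line)
```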