ADRes: a pipeline to detect molecular markers of Antimalarial Drug Resistance
adres.sh is a BASH-scripted pipeline that can be used to identify SNPs in specific codons of Plasmodium falciparum genes that are associated with antimalarial drug resistance, from Sanger sequencing data.
Currently, the following genes and codons are supported:
Gene | Codons |
---|---|
pfmdr1 | 86, 184, 1034, 1042, 1246 |
pfcrt | 72, 73, 74, 75, 76 |
dhps | 436, 437, 540, 581, 613 |
dhfr | 51, 59, 108, 164 |
This allows detection of molecular markers of resistance to chloroquine, sulphadoxine-pyrimethamine, and artemisinin derivatives. Given ABI Sanger sequencing trace files of a whole gene (or of regions spanning the codons of interest) from several samples, the pipeline outputs a CSV-formatted file listing each sample name, the codons, and their corresponding amino acids.
This pipeline is a combination of existing tools (such as BWA, sam2fasta.py, abifpy, SAMtools) and custom scripts, and is made up of the following steps:
- Base calling and quality control,
- Alignment of the filtered sequences to the coding sequence of the respective reference gene,
- SAM alignment filtering and conversion to FASTA format,
- Parsing of the FASTA alignment to output codons and their corresponding amino acids in CSV format.
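The final step can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function name `codon_calls` is hypothetical, and it assumes each sequence in the FASTA alignment has already been padded to reference coordinates, so that index 0 is the first base of the coding sequence. Triplets that are not in the standard genetic code (for example, those containing gaps or N) are reported as `X`.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_calls(aligned_cds, codons):
    """Return (codon_number, triplet, amino_acid) for each 1-based codon number.

    Assumes aligned_cds is in reference CDS coordinates (hypothetical input).
    """
    calls = []
    for n in codons:
        triplet = aligned_cds[3 * (n - 1):3 * n].upper()
        calls.append((n, triplet, CODON_TABLE.get(triplet, "X")))
    return calls


# A read covering only pfcrt codons 72-76, padded with gaps upstream.
read = "-" * (3 * 71) + "TGTGTAATGAATAAA"
for number, triplet, aa in codon_calls(read, [72, 73, 74, 75, 76]):
    print(number, triplet, aa)
```

Building the codon table from the 64-character amino acid string avoids typing 64 dictionary entries by hand.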
Installation
Usage
bash adres.sh <directory_of_ab1_files> <reference_gene_coding_sequence> <gene> [quality_cutoff]
where:
<directory_of_ab1_files> is the directory containing ab1 trace files of a single gene (pfcrt, pfmdr1, dhps, or dhfr)
<reference_gene_coding_sequence> is the path to the reference coding sequence of the respective gene
<gene> is pfcrt, pfmdr1, dhps, or dhfr
[quality_cutoff] is the Phred quality threshold for trimming bases.
This argument is OPTIONAL. The default value is 10 (Q10).
Legal values range from 10 to 60.
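To make the effect of the cutoff concrete, here is a small Python sketch. The README does not describe the trimming algorithm that adres.sh uses, so the end-trimming strategy below is an assumption, and both function names are hypothetical. Only the default (10) and the legal range (10 to 60) come from the description above.

```python
def parse_quality_cutoff(arg=None, default=10, low=10, high=60):
    """Validate the optional cutoff argument; fall back to Q10 when omitted."""
    if arg is None or arg == "":
        return default
    value = int(arg)
    if not low <= value <= high:
        raise ValueError(f"quality_cutoff must be between {low} and {high}")
    return value


def trim_low_quality_ends(bases, quals, cutoff):
    """Drop bases from both ends until a base with Phred >= cutoff is reached.

    Assumed strategy for illustration; the real pipeline may trim differently.
    """
    start, end = 0, len(bases)
    while start < end and quals[start] < cutoff:
        start += 1
    while end > start and quals[end - 1] < cutoff:
        end -= 1
    return bases[start:end], quals[start:end]


cutoff = parse_quality_cutoff()  # no argument given, so Q10 is used
print(trim_low_quality_ends("ACGTAC", [5, 12, 30, 40, 9, 3], cutoff))
```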
Example command
bash adres.sh ~/pfcrt_ab1_seq/ ~/anti_mdr_snps/pfcrt_pf3D7_cds.fasta pfcrt
Output
The primary output file is named in this format: gene_dd_mm_yy.csv, and is stored in <directory_of_ab1_files>.
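The naming convention can be reproduced as in the following Python sketch, which may help when locating the output from another script. The helper name `output_csv_path` is hypothetical, and it assumes that dd, mm and yy are the zero-padded day, month and two-digit year of the run date.

```python
from datetime import date
from pathlib import Path


def output_csv_path(ab1_dir, gene, run_date=None):
    """Build <ab1_dir>/<gene>_dd_mm_yy.csv for the given (or today's) date."""
    run_date = run_date or date.today()
    return Path(ab1_dir) / f"{gene}_{run_date.strftime('%d_%m_%y')}.csv"


# A pfcrt run on 30 June 2015 gives the file name pfcrt_30_06_15.csv.
print(output_csv_path("pfcrt_ab1_seq", "pfcrt", date(2015, 6, 30)).name)
```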
The output for the example command above could look like this:
Sample | Codon_72 | Codon_72_aa | Codon_73 | Codon_73_aa | Codon_74 | Codon_74_aa | Codon_75 | Codon_75_aa | Codon_76 | Codon_76_aa |
---|---|---|---|---|---|---|---|---|---|---|
89C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
90C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
92C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
93C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
94C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
97C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
99C_CRT | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
PfCRT_ref | TGT | C | GTA | V | ATG | M | AAT | N | AAA | K |
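A table like this can be condensed into one amino acid haplotype per sample, which makes it quick to compare samples against the reference row. The Python sketch below is illustrative only: the function name `haplotypes` is hypothetical, and it assumes the CSV is comma-delimited with a header row using the same column names as the table above.

```python
import csv
import io


def haplotypes(csv_text):
    """Map each sample name to its concatenated amino acid calls."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Amino acid columns are assumed to end in "_aa", as in the table above.
    aa_columns = [name for name in reader.fieldnames if name.endswith("_aa")]
    return {
        row["Sample"]: "".join(row[col] for col in aa_columns)
        for row in reader
    }


example = (
    "Sample,Codon_72,Codon_72_aa,Codon_73,Codon_73_aa,Codon_74,Codon_74_aa,"
    "Codon_75,Codon_75_aa,Codon_76,Codon_76_aa\n"
    "89C_CRT,TGT,C,GTA,V,ATG,M,AAT,N,AAA,K\n"
    "PfCRT_ref,TGT,C,GTA,V,ATG,M,AAT,N,AAA,K\n"
)
calls = haplotypes(example)
reference = calls.pop("PfCRT_ref")
# Samples whose haplotype differs from the reference row.
print([name for name, hap in calls.items() if hap != reference])
```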
Example data
The archived file containing the source code for this pipeline also contains some example_data.
The example_data directory contains:
- example_ab1: This directory contains 50 ab1 trace files obtained from sequencing regions of pfcrt spanning codons 72 - 76 from 50 P. falciparum isolates. This directory can be used as <directory_of_ab1_files>. It also contains 3 intermediate files (pfcrt_30_06_15.fasta, pfcrt_30_06_15.sam, pfcrt_30_06_15.fastq) and the main output file (pfcrt_30_06_15.csv). Typically, <directory_of_ab1_files> will not contain any intermediate files or output file until you have run the pipeline successfully.
- pfcrt_pf3d7_cds.fasta: The reference coding sequence for pfcrt [PlasmoDB: PF3D7_0709000]. This file can be used as <reference_gene_coding_sequence>. The reference was originally downloaded from plasmodb.org. Remove any spaces from the header of the reference FASTA file. Modify the header to look like this header for the pfcrt reference: >pfcrt_pf3d7_CDS_1275bp
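One way to clean the header of a downloaded reference is sketched below in Python. Replacing each run of whitespace with an underscore is an assumption, chosen because the example header uses underscores. The function name `normalize_fasta_headers` and the input header are hypothetical, and the final header text should still be checked by hand.

```python
import re


def normalize_fasta_headers(fasta_text):
    """Replace whitespace in every FASTA header line with underscores."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">" + re.sub(r"\s+", "_", line[1:].strip())
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical header with spaces; sequence lines are left untouched.
print(normalize_fasta_headers(">pfcrt pf3d7 CDS 1275bp\nATGAAA\n"), end="")
```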