ADRes: a pipeline to detect molecular markers of Antimalarial Drug Resistance

adres.sh is a BASH-scripted pipeline that can be used to identify SNPs in specific codons of Plasmodium falciparum genes that are associated with antimalarial drug resistance, from Sanger sequencing data. Currently, the following genes and codons are supported:

Gene	Codons
pfmdr1	86, 184, 1034, 1042, 1246
pfcrt	72, 73, 74, 75, 76
dhps	436, 437, 540, 581, 613
dhfr	51, 59, 108, 164

This allows the detection of molecular markers of resistance to chloroquine, sulphadoxine-pyrimethamine, and artemisinin derivatives. Given ABI Sanger sequencing trace files of whole genes (or of regions spanning the codons of interest) from several samples, the pipeline outputs a CSV-formatted file listing each sample name, the codons, and their corresponding amino acids.

This pipeline is a combination of existing tools (such as BWA, sam2fasta.py, abifpy, SAMtools) and custom scripts, and is made up of the following steps:

Base calling and quality control,
Alignment of the filtered sequences to the coding sequence of the respective reference gene,
SAM alignment filtering and conversion to FASTA format,
Parsing of the FASTA alignment to output codons and their corresponding amino acids in CSV format.
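
The last step, reading a codon out of an aligned sequence and translating it, can be sketched as follows. This is a minimal illustration and not the pipeline's actual parser: it assumes the aligned sequence is padded so that its first character corresponds to the first base of the reference coding sequence, and the function names codon_at and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_at(aligned_seq, codon_number):
    """Return the codon at a 1-based codon number of a CDS-aligned sequence."""
    start = (codon_number - 1) * 3
    return aligned_seq[start:start + 3].upper()


def translate(codon):
    """Translate one codon; gaps or ambiguous bases give 'X'."""
    return CODON_TABLE.get(codon, "X")


# Toy sequence (assumption): 71 unknown codons followed by pfcrt codons 72-76.
toy_seq = "N" * (71 * 3) + "TGTGTAATGAATAAA"
calls = [(n, codon_at(toy_seq, n), translate(codon_at(toy_seq, n)))
         for n in (72, 73, 74, 75, 76)]
```

Returning 'X' for any codon that is not in the table keeps gaps and low-quality bases visible in the result instead of silently dropping the sample.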

Installation

Download the source .zip or .tar.gz file and extract it
Run the pipeline from the ADRes directory

Usage

     bash adres.sh <directory_of_ab1_files> <reference_gene_coding_sequence> <gene> [quality_cutoff]

where:

    <directory_of_ab1_files> is the directory containing ab1 trace files of a single gene (pfcrt, pfmdr1, dhps, or dhfr)
    <reference_gene_coding_sequence> is the path to the reference coding sequence of the respective gene
    <gene> is pfcrt, pfmdr1, dhps, or dhfr
    [quality_cutoff] is the Phred quality threshold for trimming bases. 
                    This argument is OPTIONAL. Default value is 10 (Q10).
                    Legal values range from 10 to 60.
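
The constraints on <gene> and [quality_cutoff] described above can be expressed as a small check. This is a sketch of the stated rules only, not code taken from adres.sh; the function name check_arguments is hypothetical.

```python
SUPPORTED_GENES = ("pfcrt", "pfmdr1", "dhps", "dhfr")


def check_arguments(gene, quality_cutoff=10):
    """Validate the gene name and the optional Phred quality cutoff."""
    if gene not in SUPPORTED_GENES:
        raise ValueError(
            "gene must be one of: " + ", ".join(SUPPORTED_GENES)
        )
    cutoff = int(quality_cutoff)  # command-line arguments arrive as strings
    if not 10 <= cutoff <= 60:
        raise ValueError("quality_cutoff must be between 10 and 60")
    return gene, cutoff
```

Calling check_arguments("pfcrt") returns the gene with the default cutoff of 10, while an unsupported gene or a cutoff outside 10 to 60 raises ValueError.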

Example command

    bash adres.sh ~/pfcrt_ab1_seq/ ~/anti_mdr_snps/pfcrt_pf3D7_cds.fasta pfcrt

Output

The primary output file is named in the format gene_dd_mm_yy.csv and is stored in <directory_of_ab1_files>. The output for the example command above could look like this:

Sample	Codon_72	Codon_72_aa	Codon_73	Codon_73_aa	Codon_74	Codon_74_aa	Codon_75	Codon_75_aa	Codon_76	Codon_76_aa
89C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
90C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
92C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
93C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
94C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
97C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
99C_CRT	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
PfCRT_ref	TGT	C	GTA	V	ATG	M	AAT	N	AAA	K
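
A table like the one above can be summarised by joining the amino acid columns of each row into a haplotype string and comparing it with the reference row. The sketch below is an illustration under assumptions: the delimiter is taken to be a comma (pass delimiter="\t" if the file is tab-separated), the function name haplotypes is hypothetical, and the two rows are copied from the example output.

```python
import csv
import io


def haplotypes(csv_text, delimiter=","):
    """Map each sample name to its amino acids, joined in column order."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    aa_columns = [name for name in reader.fieldnames if name.endswith("_aa")]
    return {
        row["Sample"]: "".join(row[col] for col in aa_columns)
        for row in reader
    }


example = (
    "Sample,Codon_72,Codon_72_aa,Codon_73,Codon_73_aa,Codon_74,Codon_74_aa,"
    "Codon_75,Codon_75_aa,Codon_76,Codon_76_aa\n"
    "89C_CRT,TGT,C,GTA,V,ATG,M,AAT,N,AAA,K\n"
    "PfCRT_ref,TGT,C,GTA,V,ATG,M,AAT,N,AAA,K\n"
)

result = haplotypes(example)
reference = result["PfCRT_ref"]
# Samples whose haplotype differs from the reference row.
variants = {name: hap for name, hap in result.items() if hap != reference}
```

With the two example rows, both haplotypes are identical to the reference, so variants is empty.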

Example data

The archived file containing the source code for this pipeline also contains some example_data. The example_data directory contains:

example_ab1 : This directory contains 50 ab1 trace files obtained from sequencing regions of pfcrt spanning codons 72-76 from 50 P. falciparum isolates. This directory can be used as <directory_of_ab1_files>. It also contains 3 intermediate files (pfcrt_30_06_15.fasta, pfcrt_30_06_15.sam, pfcrt_30_06_15.fastq) and the main output file (pfcrt_30_06_15.csv). Typically, <directory_of_ab1_files> will not contain any intermediate files or output file until you have run the pipeline successfully.
pfcrt_pf3d7_cds.fasta : The reference coding sequence for pfcrt [PlasmoDB: PF3D7_0709000]. This file can be used as <reference_gene_coding_sequence>. The reference was originally downloaded from plasmodb.org. Remove any spaces from the header of the reference FASTA file, so that it looks like this header for the pfcrt reference: >pfcrt_pf3d7_CDS_1275bp
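
One way to remove the spaces from a reference header is sketched below. It is an illustration only: replacing each run of whitespace with an underscore is an assumption chosen to match the example header, the function name clean_fasta_headers is hypothetical, and the input header is made up rather than copied from PlasmoDB.

```python
import re


def clean_fasta_headers(fasta_text):
    """Replace whitespace in FASTA header lines with underscores."""
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Only header lines are changed; sequence lines pass through.
            line = ">" + re.sub(r"\s+", "_", line[1:].strip())
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Made-up header and a truncated sequence, for illustration only.
raw = ">pfcrt pf3d7 CDS 1275bp\nATGAAA\n"
fixed = clean_fasta_headers(raw)
```

Here fixed starts with the header >pfcrt_pf3d7_CDS_1275bp and the sequence line is unchanged.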