Generating Error Profiles ------------------------- Mapping the data ^^^^^^^^^^^^^^^^ Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%. We now use minimap2 to align the ONT read sets to the reference:: cd ~/workdir mkdir ~/workdir/map_to_ref minimap2 -a -x map-ont data/Reference.fna basecall/basecall.fastq.gz > ~/workdir/map_to_ref/nanoporeToRef.sam For the illumina reads we will use another aligner, as this one is more suited for this kind of data. But before we can do so, we need to create an index structure on the reference:: bwa index ~/workdir/data/Reference.fna bwa mem -t 14 ~/workdir/data/Reference.fna ~/workdir/data/illumina/Illumina_R1.fastq.gz ~/workdir/data/illumina/Illumina_R2.fastq.gz > ~/workdir/map_to_ref/illumina.bwa.sam Inferring error profiles using samtools ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ After mapping the reads on the reference Genome, we can infer various statistics as e.g., number of succesful aligned reads and bases, or number of mismatches and indels, and so on. For this you could easily use the tool collection **samtools**, which offers a range of simple CLI modules all operating on mapping output (SAM and BAM format). We will use the ``stats`` module now:: samtools stats -d -@ 14 ~/workdir/map_to_ref/nanoporeToRef.sam > ~/workdir/map_to_ref/nanoporeToRef.sam.stats samtools stats -d -@ 14 ~/workdir/map_to_ref/illumina.bwa.sam > ~/workdir/map_to_ref/illumina.bwa.sam.stats We can inspect these results now by simply view at the top 40 lines of the output:: head -n 40 ~/workdir/map_to_ref/nanoporeToRef.sam.stats head -n 40 ~/workdir/map_to_ref/illumina.bwa.sam.stats Enhanced mapping statistics ^^^^^^^^^^^^^^^^^^^^^^^^^^^ To get a more in depth info on the actual accuracy of the data at hand, including the genome coverage, we're going to use a more comprehensive and interactive software comparable to FastQC which is called **Qualimap**. First, we convert the SAM files into BAM format and sort them:: cd ~/workdir samtools view -@ 4 -bS ~/workdir/map_to_ref/nanoporeToRef.sam | samtools sort - -@ 8 -o ~/workdir/map_to_ref/nanoporeToRef.sorted.bam samtools view -@ 4 -bS ~/workdir/map_to_ref/illumina.bwa.sam | samtools sort - -@ 8 -o ~/workdir/map_to_ref/illumina.bwa.sorted.bam Then we can run **qualimap** on those BAM files now:: qualimap bamqc -bam ~/workdir/map_to_ref/nanoporeToRef.sorted.bam -nw 5000 -nt 14 -c -outdir ~/workdir/map_to_ref/nanopore.quali qualimap bamqc -bam ~/workdir/map_to_ref/illumina.bwa.sorted.bam -nw 5000 -nt 14 -c -outdir ~/workdir/map_to_ref/illumina.quali Qualimap can also be run interactively. References ^^^^^^^^^^ **minimap2** https://github.com/lh3/minimap2 **BWA** http://bio-bwa.sourceforge.net/ **Samtools** http://samtools.sourceforge.net/ **QualiMap** http://qualimap.bioinfo.cipf.es/doc_html/index.html