DNA SEQUENCING: ILLUMINA WORKFLOW PART-4 (SEQUENCING BIOINFORMATICS)

4/26/2018

ARTICLE BY MR. EZRA SALAMAT

Modern DNA sequencing techniques are driven in part by complex machinery and electronics. This enables sequencing instruments to generate data that would have otherwise be impossible to obtain using manual methods. However, the data generated by these instruments are too large and complex for humans to manually process... necessitating the use of other machines and dedicated software capable of reading and processing these data.

Photo: a modern interface socket that connects some modern CPU (microprocessors) to the computer motherboard. Complex circuitry like this are found in computers that are used to process raw sequencing data. [Photo by: Qurren (talk), with IXY Digital 10 compact digital camera. - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8427874]

The previous three chapters discussed the wet-lab workflow of DNA sequencing. Upon completion of the actual sequencing process, sequencing platforms like the Illumina MiSeq can produce an output of up to 15Gb worth of data, amounting to up to 25 million reads. One might think that the immediate output of such sequencing process would be a whole genome sequence such as this:

...when in fact the actual output actually looks more of something like this instead:

Now the actual output doesn't look anything remotely useful at all... and in fact it isn't... at least in its current form. The Illumina platform outputs the individual sequences of all several millions worth of reads, each only 150-300bp long (depending on the kit used). These sequences need to be processed into a form usable for scientist... and that's where bioinformatics tools come in.

If you remember in Chapter 2, we mentioned that the Illumina sequencing platform can only sequence short stretches of DNA. So instead of sequencing the entire length of the genome from one end to another, one base at a time, Illumina reads millions of smaller fragments in parallel, which can then be assembled later.

OVERVIEW OF QUALITY SCORES

The Illumina platform outputs its data in FastQ Format (sample shown below). Each read (red box) starts with an @ symbol, and is composed of four lines (parts):

Read Label : contains information about the read
Read Sequence : actual nucleotide sequence
Quality Score Identifier line: a plus (+) sign that indicates that the scores are listed below it
Quality Score : indicates the quality of each base in the read

Image: FastQ File

As you can see in the image above, the Quality Scores are encoded in single characters (a number, letter, or symbol), each representing a quality score based on the ASCII table. The Illumina ASCII-Quality Score code table is shown below:

Image from: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm

Here, a symbol corresponds to a Q-score, which is a compact expression of error probability. For instance, the symbol "?" has a corresponding Q-score of 30. We can then determine the error probability using the formula:

Q(A) =-10 log10(P(~A))

where:
Q(A), expresses the probability that A is not true
P(~A) is the estimated probability of an assertion A being wrong.

We can also use pre-computed tables, such as the one below:

Based on the table above, a Q-score of 30 has an Error Probability of 0.001- that is, the probability of an incorrect base call is 1 in 1,000 (or 0.1%).

Let's look at an actual example below:

In the FastQ file above, the boxed nucleotide sequence "T" has a quality symbol of "5". Looking back at the tables above, "5" corresponds to a Quality score of 20, which means that it has an error probability of 0.01 (1%). This means that for that position, the machine is 99% sure that the nucleotide is a "T".

SEQUENCE QUALITY CONTROL

Converting raw reads to an assembled genome requires a pipeline - or series of processes that filters, sorts, and assembles the reads. The first step is to separate the good reads from the bad ones, based on the overall quality score of each read. A threshold is set, and all reads above a certain threshold (e.g. 10% error rate) are discarded. A dedicated bioinformatics program called "AfterQC" is used to sort the good and bad reads. Below is an example of a FastQ file sorted into good and bad reads. In the images below, 1,034,328 reads were sorted into 509,328 good reads and 525,000 bad reads.

Good reads

Bad reads

SEQUENCE ASSEMBLY (SIMPLIFIED OVERVIEW)

Only the good reads will be used in the next step, which is the assembly step. There are several programs available, each using its own algorithms (i.e. set of mathematical processes/formula), but we will look into a program called SPAdes (St. Petersburg genome assembler) , shown below:

Image: Initialization page of SPAdes. Like most Bioinformatics programs, SPAdes lack a dedicated UI (User Interface) and runs using command lines.

SPAdes aligns the reads and determines the most likely arrangement in which they are interconnected, using a complex algorithm (i.e. it employs multisized de Bruijn graph). You can learn more about SPAdes at the Center for Algorithmic Biotechnology webpage.

Image: SPAdes output files (left), and QUAST-generated SPAdes report (right)

SPAdes generates longer sequences called contigs, scaffolds,and assemblies by joining reads together - visible as fasta files above. A report above (generated by the QUAST tool) indicates that there were four >50,000 bp contigs successfully, eight >25000 bp contigs successfully assembled, and so on.

Image: Simplified diagram of Genome Assembly

The image above shows a simplified diagram of how the assembly was done. Short contigs are assembled into longer scaffolds, which are then assembled further into assemblies. In most cases, the initial assemblies may correspond to a considerable portion of the whole genome sequence (e.g. 98% of the whole genome), but it may still contain gaps (as shown above) that may need to be resolved by another sequencing method (such as capillary sequencing. Once the gaps has been sequenced, the gap sequence is aligned to the initial assembly, resulting to the final assembly (i.e. whole genome; sample shown below).

Image: portion of an E. coli genome sequence

The Image below shows how all the 500,000+ good reads (multicolored dots) reads map into the final assembly, which shows the immense amount of computational power needed to perform such a large scale assembly of short read sequences.

Image: Read Map generated by TABLET: a graphical viewer for next generation sequence assemblies and alignments

ANNOTATION

After genome assembly, a scientist may wish to label the genome in a process called annotation. Genome annotation is the process of identifying/predicting the locations of genes and coding regions in a genome and determining their respective functions. One such Bioinformatics Program designed for Annotation is called the GLIMMER (Gene Locator and Interpolated Markov ModelER). GLIMMER uses interpolated Markov models (IMMs) to identify coding regions, and distinguish them from non-coding DNA. A sample of GLIMMER result is shown below. Here, the image shows results from NODE 1, containing a total of 43 Open Reading Frames (ORFs), all of which are potential gene-coding regions. The first column shows the ORF name, while the second and third column shows the end and start positions of the ORFs, respectively.

Image: GLIMMER ORF List Output

To give meaning to the sequences, the ORFs can be subjected to the Basic Local Alignment Search Tool (BLAST). Here, each ORF is compared to millions of characterized sequences found in a sequence database, such as that of the National Center for Biotechnology Information's (NCBI) GenBank. This is done in order to find a match, and subsequently the likely identity and function of each ORF.

Image: BLAST search generated using Local BLAST

In the sample above, the sample ORF# 00260 (underlined) was identified as a Nicotinamide riboside transporter protein. As indicated in the other BLAST results, some of the ORFs correspond to a protein-coding gene, while some are uncharacterized, probably because either they are yet to be discovered, or that the prediction for that ORF was incorrect.

Further characterization of the genome can be made, like Genome BLAST, Variant Calling, Multiple Sequence Alignment, Phylogeny Determination, etc, but that's subject to another article.

And that concludes the 4-part Series on the Illumina DNA-sequencing Workflow! If you're interested in doing further readings in DNA sequencing topics, you can visit the reference links in the end of each chapter.

References:

BABRAHAM Bioinformatics: FastQC. Retrieved from: https://www.bioinformatics.babraham.ac.uk/projects/ fastqc/
Center for Algorithmic Biotechnology: SPAdes. Retrieved from: http://cab.spbu.ru/software/spades/
FastQ File. Retrieved from: http://support.illumina.com/content/dam/illumina-support/help/BaseSpaceHelp_v2/ Content/Vault/Informatics/Sequencing_Analysis/BS/swSEQ_mBS_FASTQFiles.htm
GLIMMER. Retrieved from: https://ccb.jhu.edu/software/glimmer/
Illumina Quality Scores. Retrieved from: https://support.illumina.com/help/BaseSpace_OLH_009008 /Content/Source/Informatics/BS/QualityScores_swBS.htm
Quality Score encoding. Retrieved from: https://support.illumina.com/help/ BaseSpace_OLH_009008/Content/ Source/Informatics/BS/QualityScoreEncoding_swBS.htm#
The James Hutton Institute: Tablet. Retrieved from: https://ics.hutton.ac.uk/tablet/

2 Comments

Educational Articles