
A common strategy for genome sequencing is an approach called “whole genome shotgun”. We should recall that all DNA sequencing methods can decode only a limited number of DNA bases (the “letters” of the sequence) at a time.

With the “genome shotgun” approach we take several copies of the genome, randomly shear the DNA into smaller fragments, and then sequence those fragments. The assembly step requires dedicated software that compares the resulting sequences (called reads) and merges the overlapping ones into a longer consensus sequence called a contig.
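To make the idea concrete, here is a minimal sketch of the two steps on a toy “genome” made of English text. The function names (`shear`, `overlap`, `greedy_assemble`) and the parameters are our own choices for illustration. The shearing here uses fixed-size windows rather than random breaks, and the greedy merging is a simplification: real assemblers use more robust strategies and must also handle sequencing errors and both DNA strands.

```python
def shear(genome, read_len, step):
    """Cut the genome into overlapping reads (fixed windows; real shearing is random)."""
    return [genome[i:i + read_len] for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap until none is left."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing overlaps any more: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


genome = "the quick brown fox jumps"
reads = shear(genome, read_len=10, step=5)
print(reads)
print(greedy_assemble(reads))
```

With this repeat-free sentence, the reads collapse back into a single contig identical to the original text.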

In theory we should be able to reconstruct the original DNA sequence (e.g. a chromosome), but the presence of repeated sequences makes this difficult.

The rule of thumb is that you can resolve repeats shorter than the average read length. For example, if your machine produces reads ~500 bp long, you will experience “breaks” in your contigs wherever there are repeated regions longer than 500 bp.
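As a rough illustration of this rule of thumb, the sketch below finds the longest repeated stretch in a short toy sequence and compares it with the read length. The helper names (`longest_repeat`, `repeats_resolvable`) and the example sequence are made up for this example, and the brute-force search is only practical for very short sequences.

```python
def longest_repeat(seq):
    """Return the longest substring that occurs at least twice in seq (brute force)."""
    for size in range(len(seq) - 1, 0, -1):
        seen = set()
        for start in range(len(seq) - size + 1):
            piece = seq[start:start + size]
            if piece in seen:
                return piece
            seen.add(piece)
    return ""


def repeats_resolvable(seq, read_len):
    """Rule of thumb: repeats are resolvable only if they are shorter than the reads."""
    return len(longest_repeat(seq)) < read_len


# Toy sequence containing the 8-base repeat "ACGTTTGA" twice
toy = "ACGTTTGA" + "CCGT" + "ACGTTTGA" + "GG"
print(longest_repeat(toy))
print(repeats_resolvable(toy, read_len=10))  # reads longer than the repeat
print(repeats_resolvable(toy, read_len=6))   # reads shorter than the repeat
```

A read longer than the repeat can span it and reach the unique sequence on both sides, which is what lets the assembler place each copy correctly. Shorter reads falling inside the repeat look identical regardless of which copy they came from.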

I prepared this picture as an example, using an English sentence as the “reference genome”:

de novo assembly