3.4 Data analysis

After the sequencing is completed, the machine returns DNA fragment sequences (‘reads’) that can be analysed on a computer.

Can you recall the two types of sequencing machine in common use, classified by read length, and how they differ?

Answer

Short-read sequencing generates reads of 50–500 base pairs; long-read sequencing can generate reads of thousands and up to 1 million base pairs.

How do the sequences generated by these two systems compare with regard to quality?

Answer

Short-read sequences are more accurate, but the less accurate long reads are better for resolving complex regions of the genome.

Before further analysis, a quality control (QC) step ensures that only high-quality sequence data is used. An example of an automatically generated QC report from an ONT sequencing run is shown in Figure 8 below.

Figure 8 Sequencing run summary from an ONT report.

Figure 8 shows part of a QC report that ONT generated about 19 hours into a 72-hour sequencing run. The full report is much longer; this extract is intended only to give you a sense of how an ONT run report looks. At this point, the system had produced approximately 60 gigabytes (GB) of raw data, and 99.99% of the sequence had been successfully read as bases (reads called), with only a tiny percentage flagged as failed. Because this is only a partial run, the total would typically be higher if the run were allowed to complete the full 72 hours. ONT provides detailed, real-time QC to help users monitor yield, quality and run progress at any time.
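To make the idea of a QC filter concrete, here is a minimal sketch in Python. It assumes the reads come with standard FASTQ quality strings (Phred scores encoded with an offset of 33). The pass threshold of 9, the function names and the toy reads are assumptions chosen for illustration; they are not ONT's actual settings or software.

```python
import math

def mean_quality(quality_string, offset=33):
    """Mean Phred quality of one read, averaged via error probabilities."""
    # Each character encodes a Phred score Q as chr(Q + offset).
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)

def split_reads(reads, min_quality=9.0):
    """Split (name, sequence, quality_string) tuples into pass and fail lists."""
    passed, failed = [], []
    for read in reads:
        target = passed if mean_quality(read[2]) >= min_quality else failed
        target.append(read)
    return passed, failed

# Toy reads (hypothetical): 'I' encodes Q40, '#' encodes Q2.
toy_reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),
    ("read2", "ACGTACGT", "########"),
]
passed, failed = split_reads(toy_reads)
pass_fraction = len(passed) / len(toy_reads)
```

In this toy example, read1 passes and read2 fails, so half of the reads would go forward to analysis. The score is averaged through error probabilities rather than by averaging the Phred values directly, because Phred scores are logarithmic.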

Once the high-quality data is ready, it is analysed using specialised software. This can be done by trained bioinformaticians, but there are also free, cloud-based platforms available that make analysis more accessible, especially for smaller labs or public health institutions.

The next steps depend on the type of sequencing that has been performed, as detailed below.

  • With short-read data, the DNA fragments are usually compared to the sequence of a known reference genome. This helps to identify DNA sequence differences between your sample and known strains, such as changes that may affect resistance to antibiotics, how the bacteria are spreading or how they may have evolved.
  • With long-read data, the software often builds the sequence of the genome from scratch – a process called genome assembly. This approach is especially good for detecting plasmids, mobile resistance genes and parts of the genome that are difficult to analyse using short-read technologies. Plasmids and other mobile genetic elements are important because they often carry AMR genes and can move between bacteria. This means they can spread resistance not just within one species but across different species and environments. Being able to detect and track these elements helps us understand how resistance is spreading and where interventions may be needed.
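The reference comparison described for short-read data can be sketched in miniature. The Python example below assumes that the sample sequence has already been aligned to the reference and contains no insertions or deletions; real pipelines first align many reads and then call variants with dedicated tools. The function name and both sequences are made up for illustration.

```python
def find_substitutions(reference, sample):
    """List (position, ref_base, sample_base) where two aligned sequences differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]

# Toy sequences (hypothetical); positions are counted from 1.
reference = "ATGGCTAGCTAA"
sample = "ATGGCTAACTAA"
variants = find_substitutions(reference, sample)
```

Here `variants` contains a single entry, `(8, 'G', 'A')`: the sample carries an A where the reference has a G at position 8. Differences of this kind are the raw material for comparing an isolate with known strains.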

After the reads are mapped or assembled into a full genome, users can upload the sequences to trusted international databases such as ResFinder, CARD (The Comprehensive Antibiotic Resistance Database) and VFDB (The Virulence Factor Database), where they are compared with sequences already in the database to see how closely related they are. This is important for tracking outbreaks, understanding how resistance is spreading, and identifying the likely source of an infection.

Figure 9 shows the AMR profile predicted from the genome of an E. coli isolate using the ResFinder tool. The panel lists the antimicrobials screened, their class, the resistance phenotype predicted from the WGS data and the genetic background (i.e. the resistance gene detected). This output demonstrates how WGS can be used to predict resistance to multiple antibiotic classes and link phenotypes to specific genetic determinants.

Figure 9 ResFinder output for a whole genome assembly of Escherichia coli.
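The kind of lookup that sits behind a report like this can be illustrated with a small Python sketch. It is not how ResFinder itself works: real tools use sequence alignment and tolerate partial or imperfect matches, whereas this sketch only finds exact matches on either DNA strand. The gene names, sequences and assembly below are invented for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def screen_assembly(assembly, gene_db):
    """Report database genes found as exact matches on either strand."""
    hits = []
    for gene, (sequence, drug_class) in gene_db.items():
        if sequence in assembly or reverse_complement(sequence) in assembly:
            hits.append({"gene": gene, "class": drug_class})
    return hits

# Made-up gene names and sequences, for illustration only.
toy_db = {
    "toyA": ("ATGCCGTTAG", "beta-lactam"),
    "toyB": ("GGGTTTCCCA", "tetracycline"),
    "toyC": ("ATGAAACCGG", "aminoglycoside"),
}
toy_assembly = "TTACGATGCCGTTAGCAACCGGTTTCATGG"
hits = screen_assembly(toy_assembly, toy_db)
```

In this toy case, `hits` lists toyA (found on the forward strand) and toyC (found on the reverse strand), each with its antimicrobial class, while toyB is absent. Checking both strands matters because a gene can lie in either orientation within an assembly.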

While this course focuses on sequencing from cultured isolates, another emerging approach is metagenomics – the direct sequencing of DNA from complex samples such as food, clinical or wastewater samples without prior culturing. These samples often contain a mixture of DNA from multiple organisms. During the data analysis stage, bioinformatic tools are used to separate and identify the different organisms’ DNA sequences. Metagenomics has the potential to reduce turnaround times and improve detection of unculturable bacteria or mixed infections. However, it still faces technical and cost-related barriers, including low DNA yield, contamination and complex data interpretation. As sequencing technologies and analysis tools continue to advance, metagenomic approaches may become increasingly practical and valuable in the future.
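One simple way to picture how reads from a mixed sample might be assigned to organisms is to compare the short subsequences (k-mers) in each read with those in reference sequences. The Python sketch below is a toy under strong assumptions: the reference sequences, the organism labels and the choice of k = 4 are all made up, and real metagenomic classifiers use far larger databases and more careful scoring.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, references, k=4):
    """Assign a read to the reference sharing the most k-mers, or None."""
    read_kmers = kmers(read, k)
    scores = {
        name: len(read_kmers & kmers(ref, k))
        for name, ref in references.items()
    }
    best = max(scores, key=scores.get)
    # A read that shares no k-mers with any reference stays unclassified.
    return best if scores[best] > 0 else None

# Hypothetical reference sequences and reads.
toy_references = {
    "organism_A": "ATATATATGCGCATATAT",
    "organism_B": "GGCCGGCCTTAAGGCCGG",
}
reads = ["ATATGCGC", "CCTTAAGG", "ACACACAC"]
assignments = [classify_read(r, toy_references) for r in reads]
```

The first read is assigned to organism_A, the second to organism_B, and the third is left unclassified (`None`) because it matches neither reference. Unclassified reads are one reason why interpreting metagenomic data is complex.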

Activity 6: The four stages of WGS

Timing: Allow 10 minutes

Drag and drop the four stages of WGS described above into the correct order.

Two lists follow. Match one item from the first with one item from the second. Each item can only be matched once. There are 4 items in each list.

  1. Sample collection

  2. DNA isolation and library preparation

  3. Sequencing

  4. Data analysis

Match each of the previous list items with an item from the following list:

  • a. 2

  • b. 3

  • c. 4

  • d. 1

The correct answers are:
  • 1 = d
  • 2 = a
  • 3 = b
  • 4 = c

3.5 Data-sharing considerations