1.3 Making sense of the sequences

In the previous section you learned that millions of DNA fragments can be sequenced. But how do scientists make sense of all this information?

What comes next is a bit like assembling a puzzle. Powerful computer programs compare the DNA fragments and look for regions with a matching sequence of letters. When the end of one fragment has the same letters as the start of another (an overlap), the program assumes they come from the same original piece of DNA and joins them together. By repeating this process many times, the fragments are combined into longer (assembled) sequences.
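The joining step described above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular assembly program works: the fragments are made up, the function names `overlap_length` and `assemble` are hypothetical, and the minimum overlap of 3 letters is an assumption chosen to suit such short sequences.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest end of `left` that matches the start of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments, min_overlap: int = 3):
    """Greedy toy assembly: keep joining the pair of fragments with the largest overlap."""
    contigs = list(fragments)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            # No remaining pair overlaps enough, so stop joining.
            return contigs
        # Join the two fragments, writing the shared letters only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up fragments for illustration only.
fragments = ["ATGCGT", "CGTTAC", "TACGGA"]
assembled = assemble(fragments)
print(assembled)
```

Here the end of the first fragment (CGT) matches the start of the second, and the end of the second (TAC) matches the start of the third, so the three fragments are combined into one longer sequence. Real data contain sequencing errors and repeated regions, which is why real assembly programs are far more elaborate than this sketch.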

Figure 4 The process of assembling DNA sequences from DNA fragments (image taken from Pavlopoulos et al., 2013).

The computers then compare the assembled sequences to a database of known DNA sequences from many different organisms and identify the organism(s) they come from (taxonomic assignment). The biological function of a gene can also be predicted. For example, scientists can predict which energy source a group of microorganisms uses by looking at which types of metabolic genes are found in the sample.

Sometimes, not all sequences in a sample can be identified, meaning that they don’t have a matching sequence in the databases. The more scientists sequence genomes and organisms from the environment, the bigger the databases become and so the capacity to identify and study genes with metagenomics expands.

You can imagine that metagenomic studies generate a lot of data, making their storage and sharing very important. Data are often stored in online repositories such as the NCBI Sequence Read Archive (SRA).

Figure 5 Metagenomics workflow.

You will now try to make sense of some metagenomic data in a short activity.

Activity 1

Allow 20 minutes for completing this activity

In this activity, you will look at the results of a metagenomics study in which total DNA was extracted from water samples from two different ponds (A and B) and then sequenced with the shotgun approach. The goal was to determine which of the two ponds had the higher number of different species in it (i.e. which pond was more biodiverse). The table below shows the DNA sequences found in the two ponds (represented as strings of A, T, G, C), but they have not been identified yet. On the right, you can see a list of reference DNA sequences of organisms expected to be found in the ponds.

Question 1

Match the DNA sequences in Pond A and Pond B with the reference sequences on the right (coloured box) and fill in the identification column with the name of the organism corresponding to each sequence.

Table 1 DNA sequences found in the two ponds and reference organism sequences

| Pond A | Identification in A | Pond B | Identification in B | Reference sequence | Organism |
| --- | --- | --- | --- | --- | --- |
| GCGCGC | | TATCCC | | AATTTA | Water flea |
| GCGCGC | | TATATA | | GCGCGC | Mosquito |
| GGGCCC | | AATTTA | | GGGCCC | Bacterium |
| GCGCGC | | GGGCCC | | TATATA | Fish species A |
| GCGCGC | | CATACA | | TATCCC | Fish species B |
| CATACA | | AATTTA | | CATACA | Water hyacinth |
| GCGCGC | | AATTTA | | ATAGGG | Frog |
| CATACA | | ATAGGG | | | |
| GGGCCC | | CGGGGG | | | |
| GCGCGC | | CCCCCC | | | |

Answer

Table 1 (completed) DNA sequences found in the two ponds and reference organism sequences

| Pond A | Identification in A | Pond B | Identification in B | Reference sequence | Organism |
| --- | --- | --- | --- | --- | --- |
| GCGCGC | Mosquito | TATCCC | Fish species B | AATTTA | Water flea |
| GCGCGC | Mosquito | TATATA | Fish species A | GCGCGC | Mosquito |
| GGGCCC | Bacterium | AATTTA | Water flea | GGGCCC | Bacterium |
| GCGCGC | Mosquito | GGGCCC | Bacterium | TATATA | Fish species A |
| GCGCGC | Mosquito | CATACA | Water hyacinth | TATCCC | Fish species B |
| CATACA | Water hyacinth | AATTTA | Water flea | CATACA | Water hyacinth |
| GCGCGC | Mosquito | AATTTA | Water flea | ATAGGG | Frog |
| CATACA | Water hyacinth | ATAGGG | Frog | | |
| GGGCCC | Bacterium | CGGGGG | Unknown | | |
| GCGCGC | Mosquito | CCCCCC | Unknown | | |
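The matching you have just done by hand can also be written as a short Python sketch. It uses the sequences and reference organisms from Table 1 and assumes that a sequence is identified only when it matches a reference exactly. Real identification tools look for similar rather than identical sequences, so this is a simplification. The names `REFERENCE`, `POND_A`, `POND_B` and `identify` are illustrative.

```python
# Reference sequences and organisms from Table 1.
REFERENCE = {
    "AATTTA": "Water flea",
    "GCGCGC": "Mosquito",
    "GGGCCC": "Bacterium",
    "TATATA": "Fish species A",
    "TATCCC": "Fish species B",
    "CATACA": "Water hyacinth",
    "ATAGGG": "Frog",
}

# Sequences found in each pond, in the order they appear in Table 1.
POND_A = ["GCGCGC", "GCGCGC", "GGGCCC", "GCGCGC", "GCGCGC",
          "CATACA", "GCGCGC", "CATACA", "GGGCCC", "GCGCGC"]
POND_B = ["TATCCC", "TATATA", "AATTTA", "GGGCCC", "CATACA",
          "AATTTA", "AATTTA", "ATAGGG", "CGGGGG", "CCCCCC"]


def identify(sequences, reference):
    """Look up each sequence; anything without an exact match is labelled 'Unknown'."""
    return [reference.get(sequence, "Unknown") for sequence in sequences]


ids_a = identify(POND_A, REFERENCE)
ids_b = identify(POND_B, REFERENCE)

for sequence, organism in zip(POND_B, ids_b):
    print(sequence, organism)
```

Sequences that are missing from the reference list, such as CGGGGG and CCCCCC, come back as 'Unknown', just as they would if a real database had no matching entry.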

Question 2

Which sequences (organisms) are found most frequently in each of the two ponds? Calculate their frequency.

As an example, if a total of 10 sequences were found in a pond C and half of them belong to a fish, then the frequency of fish sequences in the pond is 5/10 = 50%.

Answer

In pond A, mosquito sequences were the most common (6/10 = 60%), while in pond B the most common were water flea sequences (3/10 = 30%).
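The same calculation can be sketched in Python as a way to check the counting. The lists below are the identifications read from the completed Table 1; the names `POND_A_IDS`, `POND_B_IDS` and `frequencies` are illustrative.

```python
from collections import Counter

# Identifications for each pond, read from the completed Table 1.
POND_A_IDS = ["Mosquito", "Mosquito", "Bacterium", "Mosquito", "Mosquito",
              "Water hyacinth", "Mosquito", "Water hyacinth", "Bacterium", "Mosquito"]
POND_B_IDS = ["Fish species B", "Fish species A", "Water flea", "Bacterium",
              "Water hyacinth", "Water flea", "Water flea", "Frog",
              "Unknown", "Unknown"]


def frequencies(identifications):
    """Return the percentage of sequences assigned to each organism."""
    counts = Counter(identifications)
    total = len(identifications)
    return {organism: 100 * count / total for organism, count in counts.items()}


freq_a = frequencies(POND_A_IDS)
freq_b = frequencies(POND_B_IDS)

# The most frequent organism is the one with the highest percentage.
top_a = max(freq_a, key=freq_a.get)
top_b = max(freq_b, key=freq_b.get)
print(top_a, freq_a[top_a])
print(top_b, freq_b[top_b])
```

Each frequency is the number of sequences for an organism divided by the total number of sequences in that pond, multiplied by 100 to give a percentage.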

Question 3

Which of the two ponds can be considered the more biodiverse, considering the DNA sequences that you could identify?

Answer

Pond B is more diverse, because the sequencing data showed the presence of 6 different identified organisms (plus two sequences that could not be identified). By comparison, only 3 different organisms were found in pond A.
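Counting the number of different organisms can be sketched in Python as follows. The sketch counts only identified organisms and leaves out the 'Unknown' sequences, because it cannot tell whether they come from one organism or several. The names `POND_A_IDS`, `POND_B_IDS` and `richness` are illustrative.

```python
# Identifications for each pond, read from the completed Table 1.
POND_A_IDS = ["Mosquito", "Mosquito", "Bacterium", "Mosquito", "Mosquito",
              "Water hyacinth", "Mosquito", "Water hyacinth", "Bacterium", "Mosquito"]
POND_B_IDS = ["Fish species B", "Fish species A", "Water flea", "Bacterium",
              "Water hyacinth", "Water flea", "Water flea", "Frog",
              "Unknown", "Unknown"]


def richness(identifications, unknown_label="Unknown"):
    """Count the different identified organisms, ignoring unidentified sequences."""
    return len({name for name in identifications if name != unknown_label})


richness_a = richness(POND_A_IDS)
richness_b = richness(POND_B_IDS)

if richness_a > richness_b:
    more_diverse = "Pond A"
elif richness_b > richness_a:
    more_diverse = "Pond B"
else:
    more_diverse = "Neither (same number of organisms)"

print(richness_a, richness_b, more_diverse)
```

Putting the names into a set removes the repeats, so each organism is counted once however many of its sequences were found.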

Question 4

Were all the sequences identified? If not, why do you think this happened?

Answer

In Pond B two sequences did not match any of the reference organisms. This means that they did not have a reference similar enough in the database used for the identification. The most likely explanation is that these sequences belong to organism(s) for which we do not have a sequence yet and they may represent unknown organisms.