The metagenomics revolution: an introduction

1.3 Making sense of the sequences

In the previous section you learned that millions of DNA fragments can be sequenced. But how do scientists make sense of all this information?

What comes next is a bit like assembling a puzzle. Powerful computer programs compare the DNA fragments and look for regions with a matching sequence of letters. When the end of one sequence has the same letters as the start of another, the program assumes they come from the same original piece of DNA and joins them together. By repeating this process many times, the fragments are combined into longer (assembled) sequences.
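The joining step described above can be sketched in a few lines of Python. This is only a toy illustration, not how any particular assembly program works: the fragments are invented, the minimum overlap of three letters (`min_len`) is an arbitrary assumption, and real assemblers also have to cope with sequencing errors, repeats and both DNA strands.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the pair with the longest overlap until no pair overlaps."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to join
        # Join the two fragments, writing the shared letters only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three invented fragments whose ends overlap by four letters each.
fragments = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(assemble(fragments))  # ['ATGGCGTACCTTA']
```

Fragments that share no end sequence of at least `min_len` letters are simply returned unjoined.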

Figure 4 The process of assembling DNA sequences from DNA fragments (image taken from Pavlopoulos et al 2013).

The computers compare the assembled sequences to a database that contains known DNA sequences from many different organisms, and identify the organism(s) they come from (taxonomic assignment). The biological function of a gene can also be predicted. For example, scientists can predict which energy source a group of microorganisms uses by looking at which types of metabolic genes are found in the sample.

Sometimes, not all sequences in a sample can be identified, meaning that they don’t have a matching sequence in the databases. The more scientists sequence genomes and organisms from the environment, the bigger the databases become and so the capacity to identify and study genes with metagenomics expands.

You can imagine that metagenomic studies generate a lot of data, making their storage and sharing very important. Data are often stored in online repositories such as the NCBI Sequence Read Archive (SRA).

Figure 5 Metagenomics workflow.

You will now try to make sense of some metagenomic data in a short activity.

Activity 1

Timing: Allow 20 minutes for completing this activity

In this activity, you will look at the results of a metagenomics study where total DNA was extracted from water samples from two different ponds (A and B) and then sequenced with the shotgun approach. The goal was to determine which of the two ponds had the higher number of different species in it (i.e. which pond was more biodiverse). The table below shows the DNA sequences found in the two ponds (represented as strings of A, T, G, C), but they have not been identified yet. On the right, you can see a list of reference DNA sequences of organisms expected to be found in the ponds.

Question 1

Match the DNA sequences in ponds A and B with the reference sequences on the right and fill in the identification columns with the name of the organism corresponding to each sequence.

Table 1 DNA sequences found in the two ponds and reference organism sequences
| Pond A | Identification in A | Pond B | Identification in B | Reference sequence | Organism |
| --- | --- | --- | --- | --- | --- |
| GCGCGC | | TATCCC | | AATTTA | Water flea |
| GCGCGC | | TATATA | | GCGCGC | Mosquito |
| GGGCCC | | AATTTA | | GGGCCC | Bacterium |
| GCGCGC | | GGGCCC | | TATATA | Fish species A |
| GCGCGC | | CATACA | | TATCCC | Fish species B |
| CATACA | | AATTTA | | CATACA | Water hyacinth |
| GCGCGC | | AATTTA | | ATAGGG | Frog |
| CATACA | | ATAGGG | | | |
| GGGCCC | | CGGGGG | | | |
| GCGCGC | | CCCCCC | | | |

Answer

Table 1 (completed) DNA sequences found in the two ponds and reference organism sequences
| Pond A | Identification in A | Pond B | Identification in B | Reference sequence | Organism |
| --- | --- | --- | --- | --- | --- |
| GCGCGC | Mosquito | TATCCC | Fish species B | AATTTA | Water flea |
| GCGCGC | Mosquito | TATATA | Fish species A | GCGCGC | Mosquito |
| GGGCCC | Bacterium | AATTTA | Water flea | GGGCCC | Bacterium |
| GCGCGC | Mosquito | GGGCCC | Bacterium | TATATA | Fish species A |
| GCGCGC | Mosquito | CATACA | Water hyacinth | TATCCC | Fish species B |
| CATACA | Water hyacinth | AATTTA | Water flea | CATACA | Water hyacinth |
| GCGCGC | Mosquito | AATTTA | Water flea | ATAGGG | Frog |
| CATACA | Water hyacinth | ATAGGG | Frog | | |
| GGGCCC | Bacterium | CGGGGG | Unknown | | |
| GCGCGC | Mosquito | CCCCCC | Unknown | | |
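The matching done by hand in Table 1 can be sketched as a simple lookup. The sequences and organism names below are copied from the table; the function name `identify` and the use of an exact-match dictionary are illustrative assumptions, since real taxonomic assignment relies on similarity searches against much larger databases.

```python
# Reference sequences and organisms from Table 1.
REFERENCE = {
    "AATTTA": "Water flea",
    "GCGCGC": "Mosquito",
    "GGGCCC": "Bacterium",
    "TATATA": "Fish species A",
    "TATCCC": "Fish species B",
    "CATACA": "Water hyacinth",
    "ATAGGG": "Frog",
}

# Sequences found in each pond, in the order they appear in Table 1.
POND_A = ["GCGCGC", "GCGCGC", "GGGCCC", "GCGCGC", "GCGCGC",
          "CATACA", "GCGCGC", "CATACA", "GGGCCC", "GCGCGC"]
POND_B = ["TATCCC", "TATATA", "AATTTA", "GGGCCC", "CATACA",
          "AATTTA", "AATTTA", "ATAGGG", "CGGGGG", "CCCCCC"]


def identify(sequences, reference):
    """Name the organism for each sequence, or 'Unknown' if there is no exact match."""
    return [reference.get(seq, "Unknown") for seq in sequences]


ids_a = identify(POND_A, REFERENCE)
ids_b = identify(POND_B, REFERENCE)
print(ids_a[:3])               # ['Mosquito', 'Mosquito', 'Bacterium']
print(ids_b.count("Unknown"))  # 2
```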

Question 2

Which sequences (organisms) are found most frequently in each of the two ponds? Calculate their frequency.

As an example, if a total of 10 sequences were found in a pond C and half belong to a fish, then 5/10 = 50% is the frequency of fish sequences in that pond.

Answer

In pond A, mosquito sequences were the most common (6/10 = 60%), while in pond B water flea sequences were the most common (3/10 = 30%).
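The same frequencies can be checked with a short Python sketch. The identification lists are taken from the completed Table 1; the function name `most_common_frequency` is just an illustrative choice.

```python
from collections import Counter

# Identifications from the completed Table 1, one entry per sequence.
POND_A_IDS = ["Mosquito", "Mosquito", "Bacterium", "Mosquito", "Mosquito",
              "Water hyacinth", "Mosquito", "Water hyacinth", "Bacterium", "Mosquito"]
POND_B_IDS = ["Fish species B", "Fish species A", "Water flea", "Bacterium",
              "Water hyacinth", "Water flea", "Water flea", "Frog",
              "Unknown", "Unknown"]


def most_common_frequency(identifications):
    """Return the most frequent organism and its share of all sequences."""
    organism, count = Counter(identifications).most_common(1)[0]
    return organism, count / len(identifications)


print(most_common_frequency(POND_A_IDS))  # ('Mosquito', 0.6)
print(most_common_frequency(POND_B_IDS))  # ('Water flea', 0.3)
```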

Question 3

Which of the two ponds can be considered the more biodiverse, considering the DNA sequences that you could identify?

Answer

Pond B is more diverse: the sequencing data showed six identified organisms, or up to seven if the two unidentified sequences are counted as one further group. By comparison, only three organisms were found in pond A.
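Counting the different organisms in each pond can be sketched as follows. The lists come from the completed Table 1. The function name `richness` and the `count_unknown` switch are illustrative assumptions: counting only identified organisms gives six for pond B, while treating the unidentified sequences as one extra group gives seven.

```python
# Identifications from the completed Table 1, one entry per sequence.
POND_A_IDS = ["Mosquito", "Mosquito", "Bacterium", "Mosquito", "Mosquito",
              "Water hyacinth", "Mosquito", "Water hyacinth", "Bacterium", "Mosquito"]
POND_B_IDS = ["Fish species B", "Fish species A", "Water flea", "Bacterium",
              "Water hyacinth", "Water flea", "Water flea", "Frog",
              "Unknown", "Unknown"]


def richness(identifications, count_unknown=False):
    """Number of different organisms; 'Unknown' counts as one group only if requested."""
    organisms = set(identifications)
    if not count_unknown:
        organisms.discard("Unknown")
    return len(organisms)


print(richness(POND_A_IDS))                      # 3
print(richness(POND_B_IDS))                      # 6
print(richness(POND_B_IDS, count_unknown=True))  # 7
```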

Question 4

Were all the sequences identified? If not, why do you think this happened?

Answer

In Pond B, two sequences did not match any of the reference organisms. This means that the database used for the identification did not contain a reference similar enough to them. The most likely explanation is that these sequences belong to organism(s) that have not been sequenced yet, so they may represent unknown organisms.
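What "similar enough" might mean can be illustrated with a small sketch that scores each reference by the fraction of matching letters and only accepts the best one above a threshold. The reference sequences are those of Table 1, but everything else is an assumption made for illustration: the function names, the 80% threshold (`min_identity`) and the example read `GCGCGA` are hypothetical, and real tools use sequence alignment rather than a letter-by-letter comparison.

```python
# Reference sequences and organisms from Table 1.
REFERENCE = {
    "AATTTA": "Water flea",
    "GCGCGC": "Mosquito",
    "GGGCCC": "Bacterium",
    "TATATA": "Fish species A",
    "TATCCC": "Fish species B",
    "CATACA": "Water hyacinth",
    "ATAGGG": "Frog",
}


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences have the same letter."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(query, reference, min_identity=0.8):
    """Organism of the most similar reference, or None if it is below the threshold."""
    best_seq = max(reference, key=lambda ref_seq: identity(query, ref_seq))
    if identity(query, best_seq) >= min_identity:
        return reference[best_seq]
    return None


print(best_match("GCGCGA", REFERENCE))  # Mosquito (5 of 6 letters match; hypothetical read)
print(best_match("CGGGGG", REFERENCE))  # None (no reference reaches 80% identity)
```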