George Mason University     |      Departments of MMB & Mathematics

The Tree-Based Algorithm

Given a database of genes and a fixed positive integer n, the tree-sort algorithm stores the location of all n-length sequences in an entire database of genes. Additionally, given any n-length sequence, we store the location of each occurrence of this n-length sequence under the same index of the tree. Sorting the sequences in this manner makes it easy to perform analysis on the data.

Our tree-sort algorithm takes as input a database of genes, consisting of the nucleotides A, C, G, and T. To store all possible n-length sequences, we need a full tree with (n+1) layers, labeled 0 through n, where the kth layer has 4k vertices since there are four types of nucleotides. Each vertex in the kth layer has branches to four vertices in the (k+1)th layer. Thus the total number of vertices in a full tree is

Formula

For larger values of n, the full tree requires a significant amount of storage space. Fortunately, as n gets larger, the ratio of present n-length sequences to all possible n-length sequences decreases. We only need to create branches and vertices in the tree to store the n-length sequences that appear in the database. In fact, when n = 16, we only need 3% of the full tree to store all 16-length sequences in our database.

The following chart gives an overview of the steps the algorithm performs to store each of the nucleotide sequences provided:

Algorithm: To store database in a tree and find all N-nucleotide long duplicate subsequences

Input: A large database of genes (DNA)
Output: List of unique and duplicate N-words

Procedure:

Omega = set of G genes in the database
S = empty

Step 1: While Omega is nonempty, input a gene from a database. Put S = set of nucleotides in the gene. Save S0 = S.

Step 2: Read an N-length subsequence s1,…sN of nucleotides from S. Store the sequence location and create a branch in the tree if it does not exist. Delete s1 from S.

Step 3: If S is nonempty, go to Step 2. Otherwise, delete S0 from Omega and go to Step 1.


Memory requirements:

Our algorithm requires the following four main pieces of memory: character arrays to store the gene sequences, gene structures containing pointers to the each field in the gene sequences, storage structures to track the location of each n-length sequence, and the tree structures to sort the n-length sequences. Suppose our database of genes has G genes with average length ΔL. Then the memory for the gene structures is (G * ΔL). Each gene structure contains a pointer to the gene cluster, a pointer to the gene ID, a pointer to the actual gene sequence, and an integer containing the length of the sequence. Using 32-bit pointers, each gene structure requires 16 bytes of memory, and we need G gene structures. Thus the total memory for the gene structures is (16 * G) bytes. We need a storage structure for each n-length sequence in the database. That number is G * (ΔL + 1 - n). Since n << ΔL, we use G * ΔL. The storage structure contains a pointer to the associated gene, an integer specifying the location of the n-length sequence in the gene, and a pointer to another storage structure containing the previous occurrence of this exact n-length sequence. So the total memory for the storage structures is (12 * G * ΔL) bytes. For the character arrays, gene structures, and storage structures, we need a combined (G * (13 * ΔL + 16)) bytes of memory, which amounts to 700MB of RAM in our database. The tree structures contain four pointers to the next level of the tree, a pointer to a storage structure, and an integer to count the occurrences of the associated n-length sequence. Thus each tree structure requires 24 bytes of memory. The exact number of tree structures necessary depends on n and the types of n-length sequences in the database. The table below shows how many branches were used for each value of n. The right column is the tree memory plus 700MB from the other three main components.

nFull Tree# Branches Used% Branches UsedFull Tree MemoryActual Tree MemoryTotal Memory
934952534951999.998%8 MB8 MB708 MB
101398101139527199.798%32 MB31.94 MB731.94 MB
115592405536692595.968%128 MB122.84 MB822.84 MB
12223696211784990579.795%512 MB408.55 MB1.08 GB
13894784854571778051.094%2 GB1.02 GB1.70 GB
143579139418826430724.661%8 GB1.97 GB2.65 GB
1514316557651389654339.707%32 GB3.11 GB3.79 GB
1657266230611929673383.370%128 GB4.31 GB5.00 GB