Distances to calculate

A Tree is a cycle free, connected, simple finite graph. We will call the vertices of degree one leaves. We will work with binary trees, which are trees with vertices of degree one or three. The vertices of degree three will be called internal vertices. A cherry of the tree T is a pair of leaves adjacent to the same vertex.

A phylogenetic tree (PhT) with n leaves in this work is a binary tree whose leaves are distinctly labeled by the consecutive integers [n]= { 1, ... , n}, and the internal leaves are unlabeled. A binary tree with n leaves has n-2 internal vertices. A PhT-isomorphism is a tree isomorphism which keeps the labels of the leaves. A rooted binary tree is a binary tree, where one of the edges is split by a vertex which is labeled ROOT. There are (2n-5)!! different labeled binary trees and (2n-3)!! rooted labeled binary trees, on n leaves.

Each edge e divides the original tree into an unordered pair of induced subtrees uniquely. Both of these subtrees can have exactly one point of contraction, and that is at the appropriate endpoint of the edge e. We will denote the set of leaves of the one of the two induced subtrees which contains 1 by Se. We will call Se and [n] \ Se the clusters of the tree at e.

We will represent the edge e by the following bipartition of [n]: (Se | [n] \ Se), and we call this representation the split of
the tree at the edge e.

We use the notation ij | kl for the tree on four leaves i, j, k, l such that the leaves i, j are separated from the leaves j, k by an internal edge. We will also call ij | kl a quartet split or QS.

Given a binary tree T, ij || kl is a valid quartet split of T, if the reduced subtree generated on i, j, k, l is the tree ij | kl.

Given a finite set of quartet splits Q, we define the dyadic closure cl(Q) of Q as the set of quartet splits that can be inferred from Q by the repeated use of rules :

If ab | cd then ba | cd and cd | ab.
If ab | cd then ac | bd and ad | bc do not holds.
If ab | cd and ac | de then ab | ce, ab | de and bc | de holds.
If ab | cd, and ab | ce then ab | de holds.

Q is said to be inconsistent, if Q is not contained by the set of valid quartet splits of a binary tree T, otherwise Q is called consistent. A pair of quartet splits is contradictory, if they are on the same quartet but can not be presented on the same tree.

Let us be given an evolutionary tree T with the sequences of length k on the leaves 1, ... ,n over an a element alphabet. Let f(i,j)(alpha ,beta ) denote the probability that a given character in the sequence at leaf i is in state alpha and the same character in the sequence at leaf j is in state beta. Ordering the states, f(i,j)(alpha ,beta ) forms an (a x a) square matrix:

F(i, j) = [f(i, j)(alpha ,beta )]. The dissimilarity score of sequences i and j is: h(i,j)=-log det (F(i, j)). The dissimilarity score matrix H is given as H=[h(i, j)]. (This definition makes sense in this case, since the determinant of F(i, j) is positive in those cases when we want to apply the algorithm.)
The width of a quartet q={ i,j,k,l} is the value: w(q) = max { h(x,y) | x,y are distinct elements of q } where h(i, j) denotes the dissimilarity score between sequences i and j.
Q(w) is the set of quartet splits whose width is at most w.
In the practice the correct probabilities are unknown,so we are using the relative frequencies to approximate F(i, j).
The program calculates the distance matrix of the sequences and the normalized distance matrix: H.
The program decides the pattern of the split of a quartet according to the distance matrix (dist.txt), and the width of the matrix
according to the normalized Hamming matrix H (nhdist.txt).

Any symmetric matrix, which has zeros in its diagonal and positive numbers everywhere else, will be called a distance matrix.
An nxn distance matrix D(i,j) is called additive, if there exists an n-leaf tree with positive edge weights on the internal edges and non negative edge weights on the leaf edges, such that D(ij) equals the sum of edge weights in the tree along the path connecting i and j.

The matrix D(ij) satisfies the Four Point Condition if for all (not necessarily distinct) i, j, k, l the maximum of {D(ij)+D(kl), D(ik)+D(jl), D(il)+D(jk)} is not unique.

The concept of additive matrix and the Four Point condition is connected by the following theorem of Buneman:
A matrix D is additive, if and only if it satisfies the Four Point Condition. The edge weighted tree representing the additive distance matrix is unique among the trees without vertices of degree two.

GoBack