I see TF5xxxxx families are “created by hcluster”. What is the algorithm behind it?
TreeFam clusters (the TF5xxxxx series) are created by hcluster_sg, a hierarchical clustering program for sparse graphs. In essence, hcluster_sg performs hierarchical clustering under mean distance. It reads an input file that describes the similarity between pairs of sequences, and at each step it joins the two nearest nodes. When two nodes are joined, the distance between the joined node and every other node is updated to the mean distance. This procedure is iterated until one of the following three rules applies:
• Do not merge clusters A and B if the total number of edges between A and B is smaller than |A|*|B|/3, where |A| and |B| are the sizes of A and B, respectively. This rule guarantees that each cluster is compact.
• Do not join A to any other cluster if |A| > 500. This rule avoids huge clusters, which would be computationally expensive for multiple alignment and tree building.
• Do not join A and B if both A and B contain plant genes or both A and B contain fungal genes. This rule tries to find animal gene