Please enable JS
  • video_woman_cover3

Benchmark Overview

The performance of 33 alignment-free measures (methods list: ) was assessed by calculating areas under the ROC curves (AUC) with reference to SCOPe database (Structural Classification Of Proteins) (1). The implementations of the alignment-free methods used in this study are available as a Python package. The accuracy and speed of alignment-free measures were also tested against Smith-Waterman algorithm implemented in water program (from EMBOSS package).

SCOP Database Overview

The SCOP database consists of Protein Data Bank (PDB) entries and provides a detailed and reliable description of protein structure relationships. The three-dimensional (3D) structure analysis allows the detection of more remote homologous, since structure is typically more conserved than sequence.

SCOP hierarchically classifies proteins at four structural levels:

1 family (fa) Clear evolutionarily relationship
2 superfamily (sf) Most probable common evolutionary origin
3 fold (cf) Major structural similarity
4 class (cl) Overall structural similarity

This hierarchical description of proteins allows the evalutaion of each method for different levels of similarity.

Materials & Methods

Reference protein dataset

The reference protein dataset used for evaluting the accuracy of each sequence comparison method was constructed based on the SCOP database 2.06, called ASTRAL 2.06. This data set includes all SCOP sequences that share less than 40% identity to each other. This dataset has become a one of the most reliable benchmark test sets in the evaluation of different methods to detect remote protein homologies.

Similarily to the study of Vinga et al. Bioinformatics. 2004, we trimmed original ASTRAL dataset to:

  • exclude sequences with unknown amino acids
  • exclude families with less than 5 proteins
  • include only the four major classes
    • α class - constituted mainly by proteins with α helix
    • β class - essentially formed by β-sheet structures
    • α/β class - proteins with mixtures of α-helices and β-strands
    • α + β class - those where α-helices and β-strands are largely segregated

Protein dataset used in this study:

Db α β α/β α + β total
pr fa sf cf pr fa sf cf pr fa sf cf pr fa sf cf
astral40 2,439 984 513 289 2,795 890 365 177 3,970 939 246 148 3,346 1,217 561 385 12,550
alfree40 1,072 98 57 46 1,469 112 62 44 2,478 159 75 59 1,577 144 88 70 6,596
Number of sequences: proteins (pr), families (fa), superfamilies (sf) and folds (cf)
You can download the Reference Protein Datasets from Download > Benchmark Data Sets .

Benchmark methods

The 33 alignment-free methods were used to calculate the dissimilarity/distance measures between all 21,750,310 possible combination pairs of 6,596 proteins from the reference dataset. Different combinations of input parameters were used to run word-based methods (i.e. word size from 1 to 4; protein alphabet consisted of either the original 20 amino acids or 11 reduced alphabet: different vector types such as count occurrences, frequencies, weighted counts, weighted frequencies, standardized frequencies with equal/equilibrium amino acid frequencies) and W-metric (different substitution matrices: PAM120-PAM250, BLOSUM30-70). In total, 529 independent runs of alignment-free calculations were performed using the reference dataset as query. Smith-Waterman algorithm (water) was run using score matrix BLOSUM50 and default parameters for gap scoring (i.e. gap open: 10, gap extend: 0.5).

Similarily to the study by Vinga et al. (2004):

For each alignment-free method run, the distances between all proteins pairs were subsequently sorted, from maximum to minimum similarity. The comparative test procedure was based on a binary classification of each protein pair, where 1 corresponds to the two proteins sharing the same group in SCOP database, 0 otherwise. Since the group can be defined at any one of the four different levels of the database (family, superfamily, fold, class), each protein pair was associated to four binary classifications (one for each level). The similarity measure for alignment method was based on the Smith–Waterman score, with no correction for statistical significance.

R package, ROCR was used to obtain ROC curves and AUC values for the Smith-Waterman run and all alignment-free methods (at each SCOP level).


ROC curves for the reference dataset.

These ROC curves were obtained for each SCOP level (i.e. class, fold, superfamily, family). Mouse-over ROC to see the Area Under the Curve (AUC).


  1. As expected, AUC decreases from family to class level. The sequence similarity between proteins sharing the same family is still well recognized. Consequently, most distances achieve their best discrimination accuracy at this level.
  2. Suprisingly, the alignment-based method (Smith-Waterman) is outperformed at all levels (AUCs: 0.62, 0.67, 0.78, 0.81) by two alignment-free measures, Normalized Google Distance (AUCs: 0.63, 0.78, 0.80, 0.84) and Bray-Curtis (AUCs: 0.63, 0.77, 0.80, 0.84).
  3. Three other word-based methods, including two variants of Squared Euclidean Distance (d Eseq1 and d Eseq2) as well as the Canberra distance (dCanberra), though less accurate in recognition of relationships within class, obtained higher overall scores (AUC: 0.744, 0.733 and 0.725, respectively) than S-W (AUC: 0.72).
  4. At family level, the Smith-Waterman algorithm was beaten by a modified Euclidean distance (d Eseq1).
  5. At superfamily level, SW is third the most accurate sequence comparison method.
  6. At fold level, 14 alignment-free methods perform better than alignment scores, including:
    • 13 word-based metrics
      • all 4 variants of Euclidean distance, Manhattan distance, Minkowski distance, Canberra distance, Chebyshev distance, 3 Angle metrics Normalized Google Distance and Bray-Curtis distance.
    • 1 (information teory)-based measure: d_star (based on Lempel-Ziv complexity)
  7. At the class level, the Smith-Waterman is outperformed by 9 word-based alignment-free methods.
  8. Word-based methods achieve, on average, slight better performance (AUC: 0.67 ±0.04) than IT-based solutions (AUC: 0.61 ±0.06). A half of IT-based measures (i.e. two variants Lempel-Ziv measure dLZ and dLZ1, Normalized Compression Distance and Base-base correlation) are located at the end of the ranking list. The best performing IT-based method is a variant of Lempel-Ziv distance (dLZ*), which is preceded by 5 word-based measures and Smith-Waterman algorithm.

Ranking of overall accuracy

In order to assess the overall acurracy of different metrics, we averaged AUC values from different SCOP levels (i.e. class, fold, superfamily, family).

Analysis by class

We further compared metrics accuracies at each of the four hierachical SCOP levels (i.e. class, fold, superfamily, family) separetely.

Full detailed results

Here we list the AUC values of all 33 in combination of different values of parameters. For example, for word-bases alignment-free methods these parameters include: k-mer length (from 1 to 4 amino acids) and vector type (e.g. counts, frequencies, weighted counts etc.).


In this section, we present time measurements of each alignment-free method as well as Smith-Waterman algorithm implemented in EMBOSS package as water program. We measure duration time in calcualtion of 21,750,310 pairwise SCOP protein comparisons. The measurements were conducted on a 64-bit 3.2-GHz x86-compatible Intel processor.

The Smith-Waterman (SW) algorithm is computationally intensive, it takes 3 days to complete the task. Its running times can be 1000-fold longer than that of the other metrics here compared.