Flexibility is one of the most important characteristics related to protein function. It refers to the time-dependent fluctuations of a protein structure, including the equilibrium dynamics that govern biological processes such as signal transduction, allosteric regulation, protein-protein and protein-small ligand interactions, the assembly of macromolecular machines, and the thermal adaptation of enzymes. Protein function relies on reversible interactions between proteins and their surroundings, which allow proteins to respond quickly and reversibly to environmental changes and metabolic conditions. As a result, protein motions span very different time and amplitude scales, from short-time local fluctuations (0.01 to 5 Å, 10^-15 s to 10^-1 s), such as molecular vibrations that reflect the movement of residues, to large-amplitude movements on longer time scales (1 to 10 Å, 10^-9 s to 1 s), such as domain or subunit displacements.
The initial dataset was prepared from the Protein Data Bank (PDB). Structures determined by NMR with more than five models and more than 40 residues were selected. The CD-HIT server was then used to remove redundant or highly similar sequences at a 70% sequence identity cutoff.
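The selection criteria above can be sketched as a simple filter. This is only an illustration: the `Entry` record, its field names and the toy identifiers are hypothetical, and the redundancy-removal step with CD-HIT is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical summary record for one PDB entry (field names are assumptions)."""
    pdb_id: str       # placeholder identifier, not a real PDB code
    method: str       # experimental method label, e.g. "SOLUTION NMR"
    n_models: int     # number of models in the deposited ensemble
    n_residues: int   # number of residues in the chain


def select_nmr_entries(entries, min_models=5, min_residues=40):
    """Keep NMR entries with MORE than min_models models and MORE than min_residues residues."""
    return [
        e for e in entries
        if "NMR" in e.method.upper()
        and e.n_models > min_models
        and e.n_residues > min_residues
    ]


# Toy records used only to exercise the filter.
entries = [
    Entry("aaaa", "SOLUTION NMR", 20, 76),
    Entry("bbbb", "SOLUTION NMR", 5, 120),        # exactly five models: excluded
    Entry("cccc", "X-RAY DIFFRACTION", 1, 300),   # not NMR: excluded
    Entry("dddd", "SOLUTION NMR", 10, 40),        # exactly 40 residues: excluded
]
selected = select_nmr_entries(entries)
```

Both thresholds are strict inequalities, matching the wording "more than five models and more than 40 residues".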
The assembled dataset is imbalanced: flexible residues make up a smaller percentage than rigid ones. To assess the proposed method rigorously, three datasets with different flexible-to-rigid (F:R) ratios were constructed.
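One common way to obtain a dataset with a chosen F:R ratio is random undersampling of the majority (rigid) class. The text does not state how the three datasets were built or which ratios were used, so the following is a minimal sketch under that assumption; the function name, the label encoding (1 = flexible, 0 = rigid) and the ratio in the example are illustrative only.

```python
import random


def subsample_to_ratio(labels, rigid_per_flexible, seed=0):
    """Return sorted indices that keep every flexible residue (label 1) and a
    random subset of rigid residues (label 0), giving F:R of about 1:rigid_per_flexible."""
    flexible = [i for i, y in enumerate(labels) if y == 1]
    rigid = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more rigid residues than are available.
    n_rigid = min(len(rigid), round(len(flexible) * rigid_per_flexible))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept_rigid = rng.sample(rigid, n_rigid)
    return sorted(flexible + kept_rigid)


# Toy labels: 10 flexible and 90 rigid residues, reduced to F:R = 1:2.
labels = [1] * 10 + [0] * 90
idx = subsample_to_ratio(labels, rigid_per_flexible=2)
n_flex = sum(labels[i] for i in idx)
n_rigid = len(idx) - n_flex
```

Keeping all flexible residues and discarding only rigid ones preserves the minority class, which is usually the scarce information in an imbalanced set.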
The training and testing datasets can be downloaded.
The novel flexibility measure, which is based on torsion angle fluctuation (TAF) information, is computed as follows (see Figure 1 for a schematic representation):
First, the protein structure of interest is represented using the virtual bond model, in which each torsion angle is defined by four consecutive Cα atoms. Next, the Cα-based TAF of each residue is calculated according to Zhang's method, in which the TAF of a residue is defined as the average difference of its Cα-based torsion angle among the different NMR models. Finally, information theory is used to relate the computed TAFs to flexibility.
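The first two steps can be illustrated with a short sketch. It assumes that the "average difference" is the mean absolute circular difference over all pairs of NMR models, and it indexes each torsion angle by the first of its four C-alpha atoms; both choices are assumptions made for illustration and may differ from the exact definition in Zhang's method. The function names are hypothetical, and the information-theoretic step is not shown.

```python
import math
from itertools import combinations


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def virtual_torsion(p0, p1, p2, p3):
    """Signed torsion angle (degrees) defined by four consecutive C-alpha positions."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def circular_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def torsion_angle_fluctuation(models):
    """models: list of conformers, each a list of (x, y, z) C-alpha coordinates
    of equal length. Returns one value per window of four consecutive atoms:
    the mean circular difference of that torsion angle over all model pairs."""
    if len(models) < 2:
        raise ValueError("at least two models are needed")
    n_windows = len(models[0]) - 3
    angles = [[virtual_torsion(*m[i:i + 4]) for i in range(n_windows)] for m in models]
    pairs = list(combinations(range(len(models)), 2))
    return [
        sum(circular_diff(angles[a][i], angles[b][i]) for a, b in pairs) / len(pairs)
        for i in range(n_windows)
    ]


# Toy ensemble: two four-atom "models" whose single torsion angle differs by 90 degrees.
model_a = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)]
model_b = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)]
taf = torsion_angle_fluctuation([model_a, model_b])
```

Differences are taken on the circle so that angles near -180 and +180 degrees are treated as close rather than almost 360 degrees apart.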
Each residue was characterized by a 49-dimensional feature vector comprising PSSM-based features, Meiler's index, structural information and flexibility descriptors. The PSSM (20 features) was obtained from PSI-BLAST profiles generated by searching a given sequence against the NCBI non-redundant database with three iterations. Meiler's index comprises seven representative physical parameters (7 features): steric parameter, hydrophobicity, volume, polarizability, isoelectric point, helix probability and sheet probability. Two types of structural information were also used: secondary structure (3 features) and solvent accessibility (2 features), both computed by ACC/SSpro. Finally, flexibility descriptors based on information theory (17 features) were included.

The random forest (RF) algorithm was used as the classifier for flexibility prediction. RF has been shown to perform reasonably well compared with other machine learning algorithms in protein attribute prediction problems. The WEKA software package, an open-source library of machine learning methods, was used to train and build our predictor.
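The composition of the 49-dimensional feature vector (20 + 7 + 3 + 2 + 17) can be made explicit with a small sketch. The group names, the concatenation order and the placeholder values are assumptions chosen for illustration; the actual predictor was built with WEKA, not with this code.

```python
# Feature groups and their sizes as listed in the text; names and order are assumptions.
FEATURE_GROUPS = (
    ("pssm", 20),                  # PSI-BLAST position-specific scores
    ("physicochemical", 7),        # seven physical parameters per residue
    ("secondary_structure", 3),    # e.g. a three-state encoding
    ("solvent_accessibility", 2),  # e.g. a two-state encoding
    ("flexibility_info", 17),      # information-theory flexibility descriptors
)
VECTOR_LENGTH = sum(size for _, size in FEATURE_GROUPS)


def build_feature_vector(groups):
    """Concatenate per-residue feature groups into one flat vector,
    checking that every group has the expected number of values."""
    vector = []
    for name, size in FEATURE_GROUPS:
        values = list(groups[name])
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")
        vector.extend(float(v) for v in values)
    return vector


# Placeholder values only; real features come from the tools named in the text.
example = {
    "pssm": [0] * 20,
    "physicochemical": [0.0] * 7,
    "secondary_structure": [1, 0, 0],
    "solvent_accessibility": [0, 1],
    "flexibility_info": [0.0] * 17,
}
features = build_feature_vector(example)
```

Validating each group's length before concatenation catches misaligned inputs early, which matters when the features come from several separate tools.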