We have also made good progress on Sub-aims 1.2 Intrinsic Dimensionality Estimation from Clustering and 1.3 Intrinsic Dimensionality Estimation from Metric Scaling. In particular, we compared the motions of intrinsically disordered proteins (IDPs) with early-stage folding motions of natively folded proteins (NFPs). Implicit-solvent MD simulations were analyzed using metric scaling for dimensionality reduction/estimation and a hypercube dimensionality estimation technique. Interestingly, proteins from both classes exhibited extended-coil and collapsed-coil dynamics. This was evident from the metric scaling results, where maps of the conformation space were constructed by projecting the protein ensembles onto the three principal modes of motion. The maps indicate that IDPs and NFPs that exhibit extended-coil conformations quickly cover the available conformation space, while the collapsed-coil IDPs and NFPs slowly cover a rugged conformational landscape. In addition, within the extended-coil class, the metric scaling plots indicate that the NFPs exhibit higher-dimensional motion than the IDPs. The same holds when comparing NFPs to IDPs within the collapsed-coil class. The hypercube dimensionality estimation technique tends to agree with this trend once we normalize for protein length, since longer proteins inherently have more degrees of freedom. This work was presented as a platform talk titled "Probing the Conformation Landscape of the Unfolded State: Do Disordered and Unfolded Dynamics Differ?" at the 2011 Meeting of the Biophysical Society.

We published a journal paper on work performed during the first period of the grant (9/2010 - 5/2011) related to Sub-aim 1.1 Clustering Analysis of Single Replicate Trajectories. The work in this paper consisted of a thorough investigation of spectral data clustering for partitioning trajectories into meta-stable and transition states. The paper, titled "Validating clustering of molecular dynamics simulations using polymer models", was published in BMC Bioinformatics, a top computational biology journal, in preliminary form in November 2011 and in final form in January 2012.

The research focus of the second period of the grant (9/2011 - 5/2012) continued to be on Specific Aim 1: Methods for Analyzing the Intrinsic Dimensionality of MD Trajectories. We have almost completed Sub-aim 1.2 Intrinsic Dimensionality Estimation from Clustering. Instead of applying dimensionality estimation to the clusters, we decided to investigate its application to temporal windows of the full trajectories. This change was based on our observations during the first period of the grant. As proposed, we implemented the correlation dimension method for estimating the intrinsic dimensionality of a set of points given inter-point distances. We also investigated other approaches, including one based on maximum likelihood estimation. To compute the inter-structure distances, we used both the root mean square distance (RMSD) and a representation/distance measure based on the dihedral angles along the backbone of biopolymers. Because storing and manipulating the full NxN matrix of inter-point distances, where N is the number of structures, is unmanageable for trajectories of this size, significant effort was spent on a storage-efficient representation that keeps track of only the distances to a limited number of nearest neighbors for each structure/point.
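The correlation dimension idea can be illustrated with a minimal Python sketch. It is not the implementation used in this project: it works on a small, full distance matrix rather than the nearest-neighbor representation described above, and the function name, the radius range, and the synthetic test data are assumptions chosen only for illustration. The estimate is the slope of log C(r) versus log r, where C(r) is the fraction of point pairs closer than r.

```python
import numpy as np

def correlation_dimension(dist, r_lo, r_hi, n_radii=10):
    """Estimate the correlation dimension from a pairwise distance matrix.

    dist: (N, N) symmetric matrix of inter-point distances.
    r_lo, r_hi: radius range, assumed to lie inside the scaling region.
    """
    n = dist.shape[0]
    pair_d = dist[np.triu_indices(n, k=1)]   # each unordered pair once
    radii = np.logspace(np.log10(r_lo), np.log10(r_hi), n_radii)
    # Correlation sum C(r): fraction of pairs closer than r.
    c = np.array([(pair_d < r).mean() for r in radii])
    # The slope of log C(r) versus log r is the dimension estimate.
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope

# Hypothetical test data: a 2-D sheet embedded in 5-D space (true dimension 2).
rng = np.random.default_rng(0)
pts = np.zeros((1000, 5))
pts[:, :2] = rng.uniform(size=(1000, 2))
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
d_est = correlation_dimension(dist, 0.02, 0.1)
print(f"estimated dimension: {d_est:.2f}")
```

On finite samples the estimate depends on the chosen radius range, so in practice the scaling region would need to be inspected rather than fixed in advance.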

As proposed, we constructed dynamic polymer models with known dimensions to evaluate the dimensionality estimation techniques. We considered a completely random polymer, a semi-rigid polymer in which half the links are frozen out while the other half remain random, and a polymer which repeatedly coils and uncoils from a helical structure. These models show the techniques are able to estimate the dimension of the synthetic trajectories although they tend to underestimate the true dimension.
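As an illustration of a synthetic model with a known dimension, the following Python sketch generates a torsion-angle trajectory for a semi-rigid chain. The torsion-angle parameterization, the function name, and the chain length are assumptions made for this example; the actual polymer models may be constructed differently. The known dimension is simply the number of links that remain free.

```python
import numpy as np

def semi_rigid_trajectory(n_frames, n_links, frozen_fraction, rng):
    """Torsion-angle trajectory in which a fraction of the links never move."""
    n_frozen = int(round(frozen_fraction * n_links))
    frozen_idx = rng.choice(n_links, size=n_frozen, replace=False)
    # Every link starts as an independent uniform random angle per frame.
    angles = rng.uniform(-np.pi, np.pi, size=(n_frames, n_links))
    # Frozen links keep their first-frame value for the whole trajectory.
    angles[:, frozen_idx] = angles[0, frozen_idx]
    return angles, n_links - n_frozen

rng = np.random.default_rng(3)
traj, true_dim = semi_rigid_trajectory(1000, 20, 0.5, rng)
# Count the links whose angle actually changes over the trajectory.
n_varying = int(((traj.max(axis=0) - traj.min(axis=0)) > 0).sum())
print(true_dim, n_varying)
```

Setting `frozen_fraction` to 0 gives the completely random polymer, so the same generator covers two of the three model types described above.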

In applying the dimensionality estimation techniques to MD simulations of real proteins, we determined that the thermal noise in the systems results in an overestimate of the dimension when the biopolymer is in a fixed state (such as the folded state). This makes sense because this noise makes it appear as if all the degrees of freedom are being exercised. We therefore investigated techniques for suppressing this noise. One promising approach is to apply low-pass filtering to the backbone angles. We are finalizing this approach. We are also investigating other noise suppression approaches.
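A minimal sketch of low-pass filtering applied to backbone angles is shown below in Python. It is an assumption-laden illustration, not the filter being finalized in this project: the brick-wall FFT cutoff, the `cutoff_fraction` parameter, and the synthetic signal are all hypothetical. The angles are filtered as points on the unit circle so that the wrap from -pi to pi is not treated as a jump.

```python
import numpy as np

def lowpass_angles(angles, cutoff_fraction):
    """Low-pass filter a time series of angles (radians) along axis 0.

    angles: (n_frames, n_angles) array.
    cutoff_fraction: fraction of the frequency band to keep (0 < f <= 1).
    """
    # Map angles onto the unit circle to avoid artifacts at the -pi/pi wrap.
    z = np.exp(1j * angles)
    spec = np.fft.fft(z, axis=0)
    freqs = np.fft.fftfreq(angles.shape[0])   # cycles per frame, |f| <= 0.5
    keep = np.abs(freqs) <= 0.5 * cutoff_fraction
    spec[~keep] = 0.0                         # discard high-frequency bins
    return np.angle(np.fft.ifft(spec, axis=0))

# Hypothetical signal: a slow angular motion plus fast "thermal" noise.
rng = np.random.default_rng(1)
t = np.arange(2000)
slow = np.sin(2 * np.pi * t / 500.0)
noisy = slow + rng.normal(scale=0.3, size=t.size)
smooth = lowpass_angles(noisy[:, None], 0.05)[:, 0]

err_before = np.sqrt(np.mean((noisy - slow) ** 2))
err_after = np.sqrt(np.mean((smooth - slow) ** 2))
print(f"RMS error before: {err_before:.3f}, after: {err_after:.3f}")
```

The cutoff controls the trade-off: too low a value would also remove genuine fast conformational changes, so it would need to be tuned per system.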

While the work on applying the dimensionality estimation to real proteins is ongoing, our preliminary results are encouraging. We observe the expected decrease in dimension as the short Trp-cage miniprotein folds. We also observe the appropriate differences in dimension between disordered proteins that are known to differ in how collapsed or extended they are. This work was presented as an abstract titled "Dimensionality estimation of disordered protein dynamics" at the 2012 Meeting of the Biophysical Society.

The research focus of the third period of the grant has been to explore metrics of inter-structure distance besides RMSD, as mentioned in Section 2.6 of the Project Summary. We believe these new metrics will improve our previous results from Specific Aim 1 and future work on Specific Aim 2. To facilitate this investigation, we have created and validated four tools based on the libraries and interface of the widely used MD simulation software Gromacs 4 [Hess, et al. 2008], which we believe will make them familiar to a majority of the likely users. In addition, we have completed a library which currently implements ten metrics of inter-structure distance. Each of these metrics can be used individually by our four completed tools as well as by any tools we develop in the future. Our current set of applications includes: (1) a tool to visualize the inter-structure distances for all of the NxN pairs of structures in an MD trajectory of length N, (2) a tool to plot the average inter-structure distance for a given metric as the temporal distance between structures increases, (3) a tool to estimate a single value of order or disorder based on an MD trajectory and, for a subset of the metrics, plot an estimate of order per amino acid, and (4) a tool to measure statistics of average and maximum distances between structures, which is useful for comparing the metrics in our library. The metrics of inter-structure distance that we have investigated include modifications to the RMSD metric, various comparisons of backbone angles and dihedrals, calculations of correlation coefficients, and two additional structure-comparison metrics taken from recent publications: MAMMOTH [Ortiz, et al. 2002] and elastic shape analysis [Liu, et al. 2011].
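The quantity plotted by the second tool, the average inter-structure distance as a function of temporal separation, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's actual code: the function name is hypothetical, and a random walk in three dimensions with Euclidean distance stands in for a real trajectory and a real inter-structure distance metric.

```python
import numpy as np

def mean_distance_by_lag(dist, max_lag):
    """Average distance between frames separated by each lag 1..max_lag."""
    # The k-th off-diagonal of the matrix holds all frame pairs (i, i + k).
    return np.array([np.diagonal(dist, offset=k).mean()
                     for k in range(1, max_lag + 1)])

# Hypothetical stand-in trajectory: a 3-D random walk of 500 frames.
rng = np.random.default_rng(4)
frames = np.cumsum(rng.normal(size=(500, 3)), axis=0)
diff = frames[:, None, :] - frames[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

curve = mean_distance_by_lag(dist, max_lag=50)
print(curve[:5])
```

For a freely diffusing system the curve keeps rising with lag, whereas a system confined to a single state would be expected to plateau quickly.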

We have produced MD trajectories to validate and apply the tools we have written. These trajectories include replicates of fragments of the FG-nucleoporins nsp1 and nup116 as well as several mutants, with simulation times of 200-250 nanoseconds in implicit and explicit solvent. To compare these results with a simpler set of trajectories along a known spectrum of disorder, we simulated three sets of homopolymers with increasing conformational flexibility in explicit solvent. In addition, we have explored coarse-grained models in order to study larger fragment sizes and longer simulation times. These include simulations using a modified version of the MARTINI model [Monticelli, et al. 2008] and a custom bead-spring model produced using the LAMMPS MD simulation software package [Plimpton 1995].

To improve the signal-to-noise ratio of our measurements, we are beginning to implement frequency-based noise suppression methods which will be able to use the frequency information obtained from multiple metrics. We plan to apply these noise reduction tools to not only filter for noise but also reconstruct our MD trajectories. The reconstruction of the MD trajectories will allow us to apply multiple stages of filters based on metrics with only partial overlap of noise frequency spectra. The expectation is that frequency-based filtering of noise based on atom position will not directly overlap with filtering based on backbone angles and other inter-structure distance metrics. This multi-stage filtering should vastly improve our ability to remove noise from our trajectories.

One of the tools produced over this period of the grant shows utility in differentiating between ordered and disordered regions of proteins. This tool assigns a value of order or disorder to a protein based on a scaled average of all inter-structure distances and is based on a previously proposed algorithm [Stultz, et al. 2011]. In addition to new options for the inter-structure distance metric, our primary improvement to this algorithm is to assign this value of disorder to individual amino acids. The resulting tool is able to predict and differentiate ordered and disordered regions of a protein based on the results of a simulation. We have observed that this tool is sometimes able to identify portions of proteins which are locked into secondary structure during our MD trajectories. We expect that partially disordered proteins such as the tumor suppressor protein p53 will show far more contrast than the sections of transient secondary structure within the fully disordered proteins that we are currently studying. We plan to validate this usage of the tool with (1) a coarse-grained model system of several FG-nucleoporin sequences believed to have varying levels of disorder spread across different regions and (2) simulations of a partially disordered protein, possibly a fragment of p53.

[Hess, et al. 2008] Hess, B, Kutzner, C, van der Spoel, D, and Lindahl, E (2008). GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. *J Chem Theory Comput*, 4 (3): 435-447.

[Liu, et al. 2011] Liu, W, Srivastava, A, Zhang, J (2011). A Mathematical Framework for Protein Structure Comparison. *PLOS Computational Biology*, 7 (2): e1001075.

[Monticelli, et al. 2008] Monticelli, L, Kandasamy, SK, Periole, X, Larson, RG, Tieleman, DP, Marrink, SJ (2008). The MARTINI Coarse Grained Forcefield: Extension to proteins. *J Chem Theory Comput*, 4:819-834.

[Ortiz, et al. 2002] Ortiz, AR, Strauss, CEM, and Olmea, O (2002). MAMMOTH (Matching Molecular Models Obtained From Theory): An Automated Method For Model Comparison. *Protein Science*, 11: 2606-2621.

[Plimpton 1995] Plimpton, S (1995). Fast Parallel Algorithms for Short-Range Molecular Dynamics, *J Comp Phys*, 117: 1-19.

[Stultz, et al. 2011] Fisher, CK, Stultz, CM (2011). Protein Structure Along The Order-Disorder Continuum. *J Am Chem Soc*, 133: 10022-10025.

As pointed out in the grant proposal, the commonly used inter-structure distance measure (ISDM) root mean square distance (RMSD) has known deficiencies. We began implementing a number of alternate ISDMs during the third period of the grant. During the fourth period, we continued work on these ISDMs and now have a suite of over a dozen. We are continuing to evaluate these different measures based on the higher level analysis they enable, such as dimensionality reduction, on a wide range of systems.

During the fourth period of the grant, we undertook an extensive investigation into dimensionality reduction (DR) for visualizing trajectories as well as determining the intrinsic dimensionality of a system. We focused mostly on linear DR techniques such as classical multi-dimensional metric scaling but also investigated non-linear techniques such as ISOMAP. Through the use of synthetic data with known dimensionality, we have gained great insight into the usefulness of DR as a tool for studying dynamic protein systems. We are preparing the findings for publication.

The final major activity during the fourth period of the grant has been to start making our analysis tools available to the broader community. We have a preliminary toolset ready for release that uses functions and libraries from the widely used MD package Gromacs 4 [Hess, et al. 2008]. In addition to a select subset of the ISDMs mentioned above, this toolset includes: (1) the tool g_isdcalc to calculate means and basic statistics from an all-structure to all-structure ISDM matrix for a trajectory; (2) the tool g_isddecorr to calculate the decorrelation and saturation of measures of ISD to help choose the optimal measures to use with specific tools and systems; (3) the tool g_isdmap to visualize an ISDM matrix and perform basic qualitative clustering of structures; (4) the tool g_isdcmds, which implements classical multi-dimensional metric scaling and dimensionality estimation; and (5) the tool g_isdorder to assign a globally meaningful measurement of disorder to a protein and locate local regions of disorder and flexibility in folded proteins. The tool g_isdorder is based in part on a previously proposed measure of disorder [Stultz, et al. 2011]. A still-to-be-implemented tool, (6) g_isdcluster, will cluster the structures of a trajectory. It will combine the library of ISDMs with estimates of the amplitude of ISD explained by random thermal vibrations. By applying a customized clustering algorithm to this information, we plan to create a more advanced and optimized clustering method for IDPs. The tool g_isdcalc also implements the foundation of a more powerful way to analyze targeted MD simulations.

The tool g_isdcmds improves and generalizes our previous work with representations of proteins in reduced dimensionality space. Classical multi-dimensional scaling is performed directly on molecular dynamics trajectories and the representation can be displayed via an output script designed to work with the open source application GNU Octave. The tool displays the output in up to six dimensions by combining spatial and color coordinate systems. In addition, the tool self-assesses the accuracy of the representation in the reduced dimensional space. By combining the concept of an estimated thermal noise floor (the amount of protein motion that can be explained by thermal noise) with the accuracy assessment for arbitrary numbers of dimensions, the tool also provides an estimate of the number of dimensions that are necessary to describe the meaningful conformational changes within a protein.
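For readers unfamiliar with classical multi-dimensional scaling, the core computation can be sketched in a few lines of Python. This is a textbook formulation offered as an illustration, not the g_isdcmds source: the function name, the eigenvalue-based accuracy fraction, and the synthetic data are assumptions. The squared distance matrix is double-centered and the leading eigenvectors give the low-dimensional coordinates.

```python
import numpy as np

def classical_mds(dist, n_dims):
    """Embed points in n_dims dimensions from a pairwise distance matrix."""
    n = dist.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    b = -0.5 * j @ (dist ** 2) @ j            # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(b)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]           # reorder to descending
    evals, evecs = evals[order], evecs[:, order]
    top = np.clip(evals[:n_dims], 0.0, None)
    coords = evecs[:, :n_dims] * np.sqrt(top)
    # Fraction of the positive eigenvalue mass captured by the embedding.
    explained = top.sum() / np.clip(evals, 0.0, None).sum()
    return coords, explained

# Hypothetical data: 3-D points rotated into a 10-D ambient space.
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 3))
basis = np.linalg.qr(rng.normal(size=(10, 10)))[0][:3]   # 3 orthonormal rows
pts = latent @ basis
diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

coords, explained = classical_mds(dist, n_dims=3)
low = coords[:, None, :] - coords[None, :, :]
dist_low = np.sqrt((low ** 2).sum(axis=-1))
print(f"fraction explained by 3 dimensions: {explained:.4f}")
```

In this sketch the accuracy of a reduced representation is judged by the eigenvalue fraction and by comparing `dist_low` with `dist`. Relating such an accuracy curve to a thermal noise floor, as described above, would require an additional noise estimate that is not modeled here.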

All-atom molecular dynamics simulations were performed on several proteins to demonstrate the ability of the tool g_isdorder to differentiate local flexibility and disorder in otherwise folded proteins. A cyclic cystine knot protein was simulated as an example of a folded protein with a known disordered loop. The DNA binding domain of the tumor protein p53 was simulated as an example of a protein domain which is believed to be folded and stable. The analysis of these proteins reveals that the tool is able to (1) differentiate the disordered loop from the folded portions of the cystine knot molecule and (2) identify short regions of significant flexibility in the stable folded p53 molecule. The upper theoretical limit of structural variance was calculated over a representative phase space by generating random polymer chain structures grouped by size and polymer chain length. This phase space is used to scale the output of g_isdorder and create a meaningful and universally applicable measure of disorder.

[Hess, et al. 2008] Hess, B, Kutzner, C, van der Spoel, D, and Lindahl, E (2008). GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. *J Chem Theory Comput*, 4 (3): 435-447.

[Stultz, et al. 2011] Fisher, CK, Stultz, CM (2011). Protein Structure Along The Order-Disorder Continuum. *J Am Chem Soc*, 133: 10022-10025.