molFrame is a workflow tool that takes you from simulation to plot-ready data in minutes!
And more!
molFrame is a utility for processing large-scale trajectory data from molecular dynamics (MD) tools such as VMD. In programs like VMD, simulation trajectory coordinates can be exported in multiple formats, e.g. .xyz, .pdb, or .dcd. As of the current version, molFrame is most compatible with .xyz; additional classes are planned to extend molFrame to more file formats.
- Print coordinates (no header lines, just pure trajectories)
- Center of Mass
- Angular Conformation
- Order parameter
- Energy Average and standard deviation
- Radial distribution
- Protein surface area
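To make one of these analyses concrete: the center of mass is the mass-weighted average of the atomic positions, R = sum(m_i * r_i) / sum(m_i). The Python sketch below is only an illustration, not molFrame's own routine. The function name `center_of_mass`, the small mass table, and the rule that an element is the atom label with its trailing digits removed (e.g. "C1" becomes "C") are all assumptions.

```python
from typing import Iterable, Tuple

# Standard atomic weights (g/mol) for a few common elements; extend as needed.
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def center_of_mass(atoms: Iterable[Tuple[str, float, float, float]]) -> Tuple[float, float, float]:
    """Return the mass-weighted mean position of (label, x, y, z) atoms."""
    total = cx = cy = cz = 0.0
    for label, x, y, z in atoms:
        # Assumption: the element symbol is the label without trailing digits.
        mass = MASSES[label.rstrip("0123456789")]
        total += mass
        cx += mass * x
        cy += mass * y
        cz += mass * z
    if total == 0.0:
        raise ValueError("no atoms given")
    return (cx / total, cy / total, cz / total)


# Two carbon atoms 2 units apart: the center of mass is the midpoint.
com = center_of_mass([("C1", 0.0, 0.0, 0.0), ("C2", 2.0, 0.0, 0.0)])
print(com)  # prints: (1.0, 0.0, 0.0)
```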
Simulation coordinate files are cumbersome to parse because they are littered with unnecessary headers and columns, which is why this version begins with one of the least complex formats, .xyz. A typical .xyz file looks like this:
36288
generated by VMD
C1 31.869669 -20.711391 -34.581379
N1 30.844368 -20.086567 -33.779266
H1 31.610401 -21.606871 -35.055351
H2 32.800102 -20.909853 -34.014919
H3 32.144650 -20.023277 -35.347126
C2 30.857540 -18.913921 -33.099659
C3 29.650688 -18.775717 -32.479210
N2 28.933077 -19.915432 -32.859100
C4 29.690832 -20.693037 -33.642181
On top of that, the first two lines (the atom count and a comment line) recur at the start of every simulation frame. A large simulation containing hundreds of thousands of frames therefore contains just as many of these headers, which can cause parsing errors. Shell scripts that rely on grep commands and regular expressions to work around them tend to become messy.
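As a rough illustration of this frame layout, the Python sketch below walks .xyz text frame by frame, using the atom count in each header to decide how many coordinate lines to read. It is not molFrame's implementation. The function name `iter_xyz_frames` is hypothetical, and the sketch assumes an unmodified .xyz file, without the termination sequence that terminate_xyz adds.

```python
from typing import Iterable, Iterator, List, Tuple

Atom = Tuple[str, float, float, float]


def iter_xyz_frames(lines: Iterable[str]) -> Iterator[List[Atom]]:
    """Yield one list of (label, x, y, z) tuples per frame (hypothetical helper)."""
    it = iter(lines)
    for count_line in it:
        if not count_line.strip():
            continue  # skip blank lines between frames
        n_atoms = int(count_line)  # header line 1: number of atoms in the frame
        next(it, None)             # header line 2: comment, e.g. "generated by VMD"
        frame: List[Atom] = []
        for _ in range(n_atoms):
            row = next(it, None)
            if row is None:
                raise ValueError("file ended in the middle of a frame")
            label, x, y, z = row.split()[:4]
            frame.append((label, float(x), float(y), float(z)))
        yield frame


sample = """2
generated by VMD
C1 31.869669 -20.711391 -34.581379
N1 30.844368 -20.086567 -33.779266
2
generated by VMD
C1 31.9 -20.7 -34.6
N1 30.8 -20.1 -33.8
"""
frames = list(iter_xyz_frames(sample.splitlines()))
print(len(frames), len(frames[0]))  # prints: 2 2
```

Because the function is a generator, only one frame is held in memory at a time, which matters for trajectories with many frames.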
molFrame takes advantage of these headers instead. First, the terminate_xyz script appends a termination sequence to the very end of the .xyz file. molFrame then uses the header lines as checkpoints for allocating and deallocating memory, which prevents memory leaks.
$ ./terminate_xyz trajectories.xyz > terminatedTrajectories.txt
Once the termination sequence is appended, molFrame can parse the file properly. Because of how MD programs arrange their data, molFrame can then report the simulation metrics, including the following:
- Total number of simulation frames
- Total number of molecules within simulation frames
- Number of atoms per molecule
========================================================
molFrame : The data has the following array dimensions...
Simulation Frames: 500
Molecules per Frame: 672
Atoms per Molecule: 56
========================================================
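For readers curious how such dimensions could be derived from a plain .xyz file, here is a minimal Python sketch. It rests on several assumptions that may not match what molFrame does internally: every frame has the same atom count, the comment line is non-empty (as in VMD output), molecules are listed one after another with their atoms in the same order, and a new molecule starts whenever the first atom label recurs. The function name `trajectory_dimensions` is hypothetical.

```python
from typing import Sequence, Tuple


def trajectory_dimensions(lines: Sequence[str]) -> Tuple[int, int, int]:
    """Return (frames, molecules per frame, atoms per molecule)."""
    rows = [ln for ln in lines if ln.strip()]  # ignore blank lines
    n_atoms = int(rows[0])                     # atom count from the first header
    stride = n_atoms + 2                       # two header lines plus the atom lines
    if len(rows) % stride != 0:
        raise ValueError("line count does not match a whole number of frames")
    n_frames = len(rows) // stride

    # Atom labels of the first frame only.
    labels = [row.split()[0] for row in rows[2:stride]]
    try:
        # Assumption: the first label recurs exactly where the second molecule starts.
        atoms_per_molecule = labels.index(labels[0], 1)
    except ValueError:
        atoms_per_molecule = n_atoms           # label never recurs: a single molecule
    if n_atoms % atoms_per_molecule != 0:
        raise ValueError("atoms per frame is not a multiple of atoms per molecule")
    return n_frames, n_atoms // atoms_per_molecule, atoms_per_molecule


# Two frames, each holding two three-atom molecules.
molecule = ["O1 0.0 0.0 0.0", "H1 0.9 0.0 0.0", "H2 -0.3 0.9 0.0"]
frame = ["6", "generated by VMD"] + molecule * 2
dims = trajectory_dimensions(frame * 2)
print(dims)  # prints: (2, 2, 3)
```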
molFrame relies on Fortran for one main reason: speed. Typically, multiple simulated systems have to be analyzed, and because of their large size it is difficult to analyze all of them concurrently without a high-performance computing cluster, so the next best thing is to analyze them quickly one by one. To put it in perspective, interpreted languages like Python or R take minutes to analyze files that are roughly half a GB in size, whereas molFrame takes about 30 seconds to give the user simulation metrics and analyses.
That said, there are plans to port molFrame to a different language, such as C++, to support more object orientation and a wider range of data structures for more complex methods. Another proposed alternative is to strip molFrame of its main interface, keep its methods, and turn the Fortran components into a dynamic library that can be loaded and invoked in memory.
- Fix the garray subroutine to reset the simulation frame metric
- Incorporate methods to analyze protein files, such as surface area
- Expand methods on Energy module to include other statistical metrics
- Greater support for multithreading to allow concurrent execution of multiple analyses
- Development of a GUI to make it more user friendly

