Structural biology deals with the characterization of the structural (atomic coordinates) and dynamic (fluctuation of atomic coordinates over time) properties of biological macromolecules and adducts thereof. The dynamic properties of these systems are crucial to many aspects of their biological function. Researchers in the field largely rely on in silico simulations, namely molecular dynamics (MD) simulations, to tackle the dynamic aspect of structural biology.
The complexity of the methods and resources involved must be tackled by providing a user-friendly and intuitive interface for scientists with various backgrounds. This use case seeks to implement a pipeline that combines protocols automating the setup and execution of MD simulations and the analysis of MD trajectories, and that also simplifies the comparison with different kinds of experimental data for validation of the simulations.
The research group coordinated by Antonio Rosato is part of a Center for research, knowledge transfer, and higher education (CERM) of the University of Florence. CERM is a key lab of the ESFRI infrastructure for Integrated Structural Biology (INSTRUCT, www.structuralbiology.org). It has a long tradition of expertise in the characterization of the structure and dynamics of proteins in solution through NMR and is one of the world-leading institutions in the structural biology of metalloproteins, with a particular focus on the study of metals in biological systems.
The main research activities of the Rosato group focus on the development of software and web applications to assist the analysis of NMR data, including the implementation of grid-enabled web portals, and on the bioinformatics analysis of genomes and proteins. The portals include the AMPS-NMR portal for protein structure refinement and the MaxOcc portal for the study of dynamics in multi-domain proteins.
Structural biology deals with the characterization of the structural (atomic coordinates) and dynamic (fluctuation of atomic coordinates over time) properties of biological macromolecules and adducts thereof. The dynamic properties of these systems are crucial to many aspects of their biological function, such as recognition of molecular partners, signal transduction and diffusion of small molecules (substrates, products, inhibitors) to/from the active site of catalytic machineries. These properties are hard to characterize experimentally in a direct and comprehensive manner at the full-atomic level. Consequently, researchers in the field largely rely on in silico simulations to tackle the dynamic aspect of structural biology (molecular dynamics simulations). Importantly, such simulations can be validated by comparison to different types of experimental data.
Case Study and User Story
Molecular dynamics (MD) simulations of macromolecules have become a standard tool for complementing experiments, providing a structural basis for rationalizing in vitro and in vivo observations and for suggesting new experiments. Advances in algorithms and hardware have allowed ever larger systems to be simulated for ever longer times (ranging from nanoseconds to microseconds), providing, among other things, exciting views on function-related macromolecular dynamics: from conformational fluctuations of single molecules related to intermolecular recognition, signaling and catalysis, to the assembly of full virus particles, protein folding, and dynamics on even longer time scales. Methodologically, a variety of MD software tools are available, each with its own philosophy of molecular modeling. However, due to the complexity of the underlying methods, their usage requires both considerable experience and sufficient computing power. The latter requirement can be met by using cloud computing to gain access to appropriate computing resources, which can also include High Performance Computing (HPC) systems.
The complexity of methods and resources, instead, must be tackled by providing a user-friendly and intuitive interface for scientists with various backgrounds. Here we seek to implement a pipeline that combines protocols automating the setup and execution of MD simulations and the analysis of MD trajectories, and that also simplifies the comparison with different kinds of experimental data for validation of the simulations.
Some typical user stories are:
- analysis and processing of the 3D coordinates of a biological macromolecule, and automatic construction of sophisticated molecular systems (e.g. explicit inclusion in the models of non-standard residues, an explicit membrane environment, etc.)
- access to the virtual machine (VM) where the simulation is being carried out, for visualization and control of the molecular systems
- processing and analysis of the simulation output data, and integrative analysis of the job outputs, e.g. producing graphical output
- implementation of the user's own workflows for molecular system preparation, and running of multiple jobs in parallel.
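The last user story, a user-defined workflow that prepares several systems and runs the corresponding jobs in parallel, can be sketched as follows. This is only an illustration under stated assumptions: the names SimulationJob, prepare_system, run_job and run_all are hypothetical and not part of any existing portal, and the "run" step only computes the simulated time instead of launching a real MD engine.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SimulationJob:
    """Hypothetical description of one MD job (not a real portal API)."""
    system_name: str
    n_steps: int
    timestep_fs: float = 2.0  # assumed integration time step


def prepare_system(name: str) -> dict:
    """Placeholder for the system preparation step (solvation, etc.)."""
    return {"name": name, "prepared": True}


def run_job(job: SimulationJob) -> dict:
    """Placeholder for the execution step: reports the simulated time only."""
    system = prepare_system(job.system_name)
    simulated_ns = job.n_steps * job.timestep_fs / 1_000_000  # fs -> ns
    return {"system": system["name"], "simulated_ns": simulated_ns}


def run_all(jobs, max_workers: int = 4) -> list:
    """Run independent jobs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, jobs))


# Three independent jobs, e.g. one per starting conformation.
jobs = [SimulationJob(f"conformer_{i}", n_steps=500_000) for i in range(3)]
results = run_all(jobs)
for result in results:
    print(result["system"], result["simulated_ns"], "ns")
```

In a real deployment each job would be submitted to a remote resource rather than executed in a local thread; the point of the sketch is only that the jobs are independent and can therefore be dispatched in parallel.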
Portals for the setup and execution of MD simulations on the EGI computational infrastructure have been developed within the WeNMR project (www.wenmr.eu) and have been used since 2010. The applications supported are free dynamics and MD-based refinement of protein structures, with limited capabilities for analysis of trajectories.
Currently, the portals are being upgraded to execute simulations on GPU computational resources, which can significantly speed up calculations and thus allow users to perform longer simulations, up to more biologically relevant time scales. In parallel, these portals will be endowed with a larger set of predefined applications in order to support a broader range of use cases. By implementing these in the INDIGO-DataCloud environment, users will be able to exploit state-of-the-art approaches on the most appropriate computational infrastructure, in a transparent manner.
The community involves researchers from academia as well as, potentially, pharma and biotech companies. The researchers will include: i) biologists, who want to validate hypotheses on functional mechanisms relying on macromolecular dynamics; ii) pharmacologists, who are interested especially in ligand docking applications; iii) experts in MD simulations, typically with a background in physics or physical chemistry and an interest in the development of MD methods and their validation through experimental data. The first two profiles will often have only a very basic understanding of the complexity of MD simulations and little or no skill in cloud/distributed computing.
End users will be associated with projects and guided in their work via simple graphical interfaces, e.g. integrated within the web browser. The range of applications can extend from unrestrained “brute-force” molecular dynamics simulations to more methodologically sophisticated approaches: pathway-dependent free-energy calculations (umbrella sampling simulations); accelerated sampling techniques; high-throughput virtual screening by means of macromolecular docking or post-processing trajectory analysis techniques (MMGBSA); and simulation and explicit analysis of specified reaction pathways of protein-ligand binding/diffusion using a steered MD protocol. Finally, end users will have the opportunity to perform biomolecular simulations constrained by experimental data, e.g. using NMR constraints or ensembles of X-ray structures, or alternatively to perform MD-based sampling of structures solved experimentally with different techniques and/or under various external conditions.
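To make the umbrella sampling case more concrete: such protocols typically run a series of simulations ("windows"), each restrained near a different value of a reaction coordinate by a harmonic bias potential. The sketch below only illustrates how evenly spaced window centres and the corresponding bias energy could be computed; the function names, the units and the numerical values are assumptions chosen for illustration and are not taken from the portals described here.

```python
def window_centers(start: float, stop: float, n_windows: int) -> list:
    """Evenly spaced window centres along the reaction coordinate."""
    if n_windows < 2:
        raise ValueError("at least two windows are required")
    step = (stop - start) / (n_windows - 1)
    return [start + i * step for i in range(n_windows)]


def harmonic_bias(xi: float, center: float, k: float) -> float:
    """Harmonic restraint energy 0.5 * k * (xi - center)^2."""
    return 0.5 * k * (xi - center) ** 2


# Assumed example: a distance coordinate from 0.0 to 2.0 nm in 5 windows,
# with a force constant of 1000 kJ mol^-1 nm^-2 (illustrative value only).
centers = window_centers(0.0, 2.0, 5)
for center in centers:
    # Bias energy felt by a configuration sitting at xi = 1.0 nm
    print(center, harmonic_bias(1.0, center, k=1000.0))
```

Each window would then be submitted as an independent simulation, which is why this kind of protocol benefits from running multiple jobs in parallel.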
The main objective is the implementation of interfaces that allow users to run multi-threaded as well as MPI-based molecular dynamics simulations depending on specific needs, that provide access to the simulated data (trajectories), and that perform standardized analyses of the trajectories. Cloud technologies would be instrumental both to obtain access to different types of computational resources and to allow the post-simulation analysis of trajectories without the need to move the data from the researchers’ lab.
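As an example of what a standardized trajectory analysis could look like, the sketch below computes the root-mean-square deviation (RMSD) of each frame from a reference structure. It is a minimal illustration, assuming that frames are plain lists of (x, y, z) tuples and that the frames have already been superimposed on the reference; production tools normally perform a least-squares fit first, which is deliberately omitted here. The function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """RMSD between two coordinate sets of equal length (no superposition)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared_sum = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


def rmsd_series(trajectory, reference) -> list:
    """RMSD of every frame of a trajectory with respect to the reference."""
    return [rmsd(frame, reference) for frame in trajectory]


# Toy two-atom system: frame 0 equals the reference, frame 1 is shifted
# by 1.0 along z, so every atom deviates by exactly 1.0.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
]
series = rmsd_series(trajectory, reference)
print(series)
```

Running such an analysis next to the stored trajectories, rather than downloading them first, is what makes the cloud-side approach attractive for large datasets.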
There are several bottlenecks that prevent the community from benefiting from state-of-the-art molecular dynamics simulations, related to: i) the automatic adjustment of the simulation setup to the particular molecular system; ii) the computational power required to run long unrestrained simulations using multi-core processors and/or GPU-based systems, as well as a suitable implementation of MPI libraries; iii) the large output datasets produced by the simulations, which must be stored for subsequent analysis using dedicated tools. In all of these cases a researcher needs to locally install and maintain his/her own version of the MD software, and will experience a high barrier to adopting more recent tools. Moreover, huge computational resources are required when multiple protein conformations are processed in parallel. The technological infrastructure developed by INDIGO-DataCloud will be key to removing such bottlenecks, especially for users who are not computer-savvy.
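The storage bottleneck can be made tangible with a back-of-the-envelope estimate. The sketch below assumes uncompressed coordinates stored in single precision (4 bytes per value, 3 values per atom) and ignores velocities, headers and any compression, so real trajectory files may be considerably smaller or larger; the system size and saving interval are illustrative assumptions, not figures from this use case.

```python
def frame_count(sim_time_ps: int, save_interval_ps: int) -> int:
    """Number of saved frames for a given simulation length."""
    return sim_time_ps // save_interval_ps


def trajectory_size_bytes(n_atoms: int, n_frames: int,
                          bytes_per_value: int = 4) -> int:
    """Uncompressed size: three coordinates per atom per frame."""
    return n_atoms * n_frames * 3 * bytes_per_value


# Assumed example: 100,000 atoms, 1 microsecond (1,000,000 ps),
# coordinates saved every 10 ps.
n_frames = frame_count(1_000_000, 10)
size_bytes = trajectory_size_bytes(100_000, n_frames)
print(n_frames, "frames,", size_bytes / 1e9, "GB")
```

Under these assumptions a single microsecond-scale run already produces on the order of a hundred gigabytes, and the total grows linearly with the number of conformations simulated in parallel.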