For more than ten years the Bonvin lab has been developing the HADDOCK software for the modelling of biomolecular complexes. To hide its complexity and provide easy, user-friendly access to both the software and computational resources, the Bonvin team has developed, and since 2010 operated, a web-based portal that offers various access levels, exposing an increasing number of parameters and options to end users.
Within INDIGO, new tools and solutions are being developed that should make it easy to deploy, configure and customize a virtual instance of the HADDOCK portal and cluster, at the click of a mouse, without having to learn complex Cloud management operations.
The computational structural biology group in Utrecht (bonvinlab.org/people) is composed of master's and PhD students and post-doctoral researchers working on a variety of projects, all centred around the main research theme of the lab: the development of reliable bioinformatics and computational approaches to predict, model and dissect biomolecular interactions at the atomic level. For this, bioinformatics data, structural information and available biochemical or biophysical experimental data are combined to drive the modelling process. This approach is implemented and further developed in the widely used HADDOCK software for the modelling of biomolecular complexes. By following a holistic approach that integrates various experimental information sources with computational structural biology methods, we aim to obtain a comprehensive description of the structural and dynamic landscape of complex biomolecular machines, adding the structural dimension to interaction networks and opening the route to systematic and genome-wide studies of biomolecular interactions.
Cell functioning depends on many molecular processes in which biomolecular machines interact with each other. The three-dimensional structures of biomolecules and their complexes provide information on how these molecules work, and also offer a basis for therapeutic drug design. Obtaining even a partial picture of the protein network is a highly challenging task that requires the integration of both experimental and computational approaches.
This requires a large amount of computational resources, a need that INDIGO solutions can help to meet.
Developed and maintained by the Bonvin lab, the HADDOCK software is an integrative, information-driven approach for modelling biomolecular complexes. It enables the incorporation of a large variety of experimental data to generate thousands of models at atomic resolution, based on classical mechanics. The software is available through a user-friendly web interface, offered both on local resources and within the context of the WeNMR Virtual Research Community. The web-based portal provides various access levels and docking parameter options for end users. Graphical and numerical analyses of the results are also displayed on the web page, from which the end user can download all of the simulation and analysis data. Currently, HADDOCK is the most cited protein docking software and is used by a large community worldwide.
The case study for the HADDOCK portal involves the virtualization of the HADDOCK web portal and the computational infrastructure underneath it using INDIGO solutions. The aim is to be less dependent on local hardware and to facilitate the deployment of the software at other sites (for example within a company, or for teaching purposes).
DisVis and PowerFit
DisVis and PowerFit are other software packages developed by the Bonvin group to study biomolecular interactions. Both are Python packages with simple command-line programs, available at www.github.com/haddocking/disvis and www.github.com/haddocking/powerfit. Currently, users can install them in their working environment by following the steps described in the GitHub repositories. Both DisVis and PowerFit can harness multiple CPUs and GPGPUs.
The DisVis and PowerFit case studies involve the deployment of these tools into virtual machines with all their dependencies, so that the end user does not have to deal with the installation procedure.
“The HADDOCK portal has processed over 110,000 user runs since its official launch, for a growing community of more than 6500 registered users worldwide. The portal is hosted on a local cluster, but makes efficient use of the distributed grid resources offered through the European Grid Initiative. This was made possible through two former EU FP7 e-Infrastructure projects (eNMR and WeNMR). User submissions are translated into more than 8 million individual grid jobs per year, running all over Europe, but also at grid sites in Asia and the US (via the Open Science Grid), all in a manner that is transparent to the end user. The job management is based on middleware developed in the context of former European projects (EMI middleware) and current services operated by EGI-Engage (e.g. the DIRAC4EGI service).
Operating such a portal is a complex process with many risk factors. The current implementation on bare metal makes it vulnerable to hardware and/or power failures, which would have a direct and large impact on the user community.
As for PowerFit and DisVis, users can install these tools from their GitHub repositories. Although the installation procedure is described in detail on GitHub, this process can be tricky for some users.
The scientific community of the HADDOCK portal consists of scientists worldwide working in the field of life sciences in general, with a particular focus on structural biology. More than 1000 laboratories are also using a local version of the software. Results from a recent survey indicate that about 33% of the user community are PhD students, 20% post-doctoral researchers and 33% research staff members. Interestingly, there is also a substantial share (~9%) of both bachelor and master students using the portal, indicating that the HADDOCK portal is also having an impact on education. This can easily be explained by its ease of use that allows, for example, still inexperienced bachelor students to perform small research projects using HADDOCK. Next to mainly academic use, HADDOCK is also being used by a number of major pharmaceutical companies.
The HADDOCK portal has been in continuous production since 2010. Currently a majority of jobs are being sent to international grid resources. Gaining the ability to offer a fully virtualized portal, while still making efficient use of either local virtual resources (e.g. a virtual cluster) or grid resources, will in the long term improve the quality and reliability of our services.
The end user might not realize this, since all the complexity of the workflows and computing remains hidden from them, but the portal operators will greatly benefit from the developments in INDIGO, since they will be able to clone and launch new instances of the portal on demand with minimal overhead. This will make it possible to fine-tune the operation in order to respond to increased demand and/or specific purposes, such as education, workshops or applications within an industrial setting.
With INDIGO solutions, virtualizing the HADDOCK portal into what we would like to call a HADDOCK server in the Cloud would have several advantages:
- 1. it would make it less vulnerable to hardware and power failures, since new instances could be launched at any time
- 2. it would make it less dependent on the underlying software and operating system
- 3. it would increase the flexibility with respect to teaching and workshops since dedicated instances of the portal could be operated for dedicated purposes
- 4. it would make it more attractive to potential industrial users such as pharmaceutical companies and small biotech firms, which could run a local instance of the full portal within their own network in order to ensure full protection of their intellectual property.
Within INDIGO, new tools and solutions are being developed that should make it easy to deploy, configure and customize a virtual instance of the HADDOCK portal and cluster, at the click of a mouse, without having to learn complex Cloud management operations.
Furthermore, in related research areas, our software packages PowerFit and DisVis can efficiently exploit GPGPU resources. Web portals are currently under development for them, and we hope to be able to virtualize those as well, as for the HADDOCK portal, but in this particular case on GPGPU-enabled instances. Finally, the data generated by such computations might also need to be stored and shared, either privately for the duration of a project, or made public at the end of a project under an open data policy. Tools under development in INDIGO should facilitate this process.”