Next Generation Sequencing (NGS) techniques have completely reshaped scientific research on nucleic acids at the -omic scale since their introduction about ten years ago. These technologies allow an unprecedented level of detail and scope in a vast array of molecular assays. Examples include pinpointing the exact genomic loci of protein-DNA and protein-RNA interactions, massive human genome re-sequencing projects to detect disease-related mutations, and epigenomic studies to understand the subtle molecular mechanisms underlying the regulation of gene expression. Low cost is another key strength of these technologies: it is now possible to re-sequence a whole human genome for one thousand dollars or less, whereas the first attempt, the Human Genome Project, took many years and three billion dollars about two decades ago. At the same time, NGS technologies pose a serious challenge from the computational and data-handling standpoints. Small, medium and large laboratories and research institutions around the globe generate massive amounts of data every day, producing a continuous stream of raw but precious information that has to be handled, processed and safely maintained with the help of bioinformatic tools and databases. This in turn requires large amounts of computational and storage resources.
In this scenario, the adoption of proper e-Science solutions is the most viable way to ensure that this wealth of data is fruitfully exploited, with positive fallout for fields such as environmental sciences, health care and personalized medicine, and the biotechnology and pharmaceutical industries.
The Case Study: the "GALAXY as a service" scalable platform
Galaxy is an open, web-based platform for computational biomedical research. A thriving community of bioinformaticians and computer scientists develops and maintains Galaxy, which in turn has attracted a vast number of users from different life science communities. Galaxy's web-based interface is easy to learn and use: it allows scientists unaccustomed to programming and command-line interfaces to run complex bioinformatic tools, to build their own personalized workflows and pipelines for data analysis, or to adopt the ones developed by other investigators. Because it hides most of the complexity involved in running bioinformatic tools and in handling data, Galaxy can also be adopted as a flexible didactic platform for teaching and learning bioinformatics at any level of detail. The INDIGO use case on Galaxy aims to tackle some of the drawbacks involved in using Galaxy public servers or in installing custom instances on local hardware, as described in the next section.
Most users access Galaxy through one of the many public Galaxy servers, while others install and manage their own Galaxy instance on local hardware resources. Both solutions have drawbacks that will be addressed within the INDIGO project. The GALAXY use case in INDIGO aims to develop an automated provider of fully customizable GALAXY instances on the cloud, addressing these drawbacks in one fell swoop thanks to the power, scalability and flexibility of cloud platforms.
While others have already made some attempts to "cloudify" Galaxy, this is, to the best of our knowledge, the first attempt to do so in the context of a large, integrated e-Science project like INDIGO-DataCloud.
This means that the solutions adopted to fulfil the GALAXY use case will be based on state-of-the-art technologies that can reasonably be expected to remain supported over time, and will make extensive use of both the IaaS and PaaS cloud layers. This will provide the automation and scalability needed in this context. Moreover, innovative solutions developed within INDIGO to prevent unauthorized users from accessing sensitive data will open "GALAXY as a service" platforms even to those actors that have so far been excluded from using them, such as health operators interested in clinical bioinformatics.
Production of biological data is growing exponentially, but this is not the case for the number of people able to handle, preserve, manage and analyse them.
Realistically, in the coming years the European Union will face a chronic shortage of bioinformaticians, which will be only partly mitigated by the extensive training programs that are beginning to be funded in other H2020 projects such as ELIXIR-EXCELERATE. The ready availability of instances of an easy-to-learn and easy-to-use bioinformatic workflow manager like GALAXY will enable many life scientists to perform at least some of the most common analyses of their experimental data on their own. This will in turn leave bioinformaticians more time to dedicate to the more complex and unusual analyses, or to the development of novel algorithms and tools to explore experimental data. These statements will become even truer with the imminent advent of clinical bioinformatics, which will be unavoidable in the pursuit of personalized medicine and pharmacogenomics.
We are confident that the solutions developed within the INDIGO project will have an important and lasting impact on e-Science in general and on bioinformatics in particular. As an added value, the interdisciplinarity of projects like INDIGO is noteworthy, since it brings together scientists from very different areas of research who rarely have any occasion to talk to each other. This greatly helps in sharing innovative solutions to common problems (e.g. big data handling) that are usually tackled independently by the individual communities but that can greatly benefit from a unified, common approach.
Professor Graziano Pesole is Director of the Institute of Biomembranes and Bioenergetics, National Research Council, Bari, Italy.
He is Head of the Italian Node of ELIXIR.
As one of the INDIGO Champions, he leads the Bioinformatics and Comparative Genomics group working on the GALAXY use case.