This team is developing a fully customizable Galaxy instance provider platform based on the technology developed within the INDIGO-DataCloud project. It will provide an easy setup of on-demand Galaxy workspaces, ready to be used both by life scientists and by bioinformaticians lacking the necessary resources (either human or computational) to deploy their own custom instance.
The Bioinformatics and Comparative Genomics group is composed of associate and assistant professors, technical assistants, master's and PhD students, and post-doctoral researchers working on a variety of projects, most of them centred on combining Next Generation Sequencing (NGS) technologies with bioinformatics and computational approaches in several -omics fields of investigation. Its main research areas are genomics, with the sequencing, re-sequencing, and functional annotation of whole genomes; metagenomics, with the microbiome analysis of clinical and environmental samples; and transcriptomics, with pioneering studies on the role of RNA editing in pathologies like Amyotrophic Lateral Sclerosis. As both producers and users of bioinformatics data and software, we are well aware of the complexity of handling massive sequencing data and of designing user-friendly bioinformatics tools, that is, tools that are not only state of the art from the scientific and technical standpoints but also easy to learn and use for other bioinformaticians and life scientists in general. Bioinformatics workflow manager platforms like Galaxy are an outstanding help in this endeavour, simplifying the use and handling of bioinformatics tools and data in an integrated and coherent work environment.
The genomic information produced since the recent introduction of Next Generation Sequencing (NGS) and other data-intensive technologies is growing exponentially, requiring powerful computational infrastructures in order to manage and store the data and ensure reproducibility.
Moreover, quality bioinformatics software and skilled people are needed to manage data and operate tools in order to analyse and exploit this valuable information. Bioinformatics analyses are still the primary bottleneck in NGS studies. Most software for NGS analyses runs in unfamiliar Unix/Linux environments, requiring command line operations and a steep learning curve to be operated correctly.
In this scenario, the ability to access and interact seamlessly with data and software tools while containing costs becomes crucial.
Galaxy provides an elegant solution to this problem, allowing data analysis and integrating multiple tools and complex bioinformatics workflows through an easy-to-use web-based environment.
Case Study & User story
Galaxy is an open source, web-based workflow manager platform for bioinformatics analysis. It is designed to allow data analysis by integrating open-source and command line tools into an easy-to-use web-based environment. The results of each analysis can be viewed and published or shared with other users of the same platform. Galaxy's capabilities can be expanded and tailored to user needs by the platform administrator, either by installing tools from public repositories or by adding to the platform any custom bioinformatics tool developed by users.
The use case consists of the development of a fully customizable Galaxy instance provider platform based on the technology developed within the INDIGO-DataCloud project. Among other advantages, cloud technologies may help decrease the costs of bioinformatics analyses through a more efficient use of expensive computing hardware.
Each instance will allow its Galaxy administrator to add the reference data and custom tools that are needed, e.g. those available in the Galaxy Tool Shed repository (https://wiki.galaxyproject.org/ToolShed), and will allow each Galaxy user to upload their own data to the instance to perform analyses. The instance administrator will customize each Galaxy instance according to the users' needs, using the web interface to select among different sets of tools as the default configuration of the instance, to install new (custom) tools, and to integrate private reference datasets. Each end-user can easily design new pipelines and deploy them on the instance, with the possibility of sharing them with other users of the same instance.
Each Galaxy instance will be configured according to the hardware configuration of its Virtual Machine, e.g. number of CPUs, RAM, storage, and public/private IP address. The virtual hardware setup, in terms of CPU, RAM and storage, will be selectable through the web interface.
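As an illustration only, the request for a new instance described above could be modelled as a small data structure from which instance settings are derived. The following Python sketch is not part of the INDIGO-DataCloud platform: the class InstanceRequest, the function derive_settings, every field name, and the sizing rule (one CPU kept free for the Galaxy web front end, the rest used as job slots) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InstanceRequest:
    """Hypothetical description of one on-demand Galaxy instance."""
    cpus: int
    ram_gb: int
    storage_gb: int
    public_ip: bool = False
    tool_set: str = "default"               # hypothetical name of a preselected tool bundle
    reference_data: Tuple[str, ...] = ()    # hypothetical reference dataset identifiers

    def validate(self) -> None:
        # Reject requests whose virtual hardware sizes are not positive.
        for name in ("cpus", "ram_gb", "storage_gb"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


def derive_settings(req: InstanceRequest) -> dict:
    """Map the requested virtual hardware to illustrative instance settings.

    The sizing rule is an assumption for this sketch: one CPU is left to the
    web front end and each remaining CPU becomes one job slot.
    """
    req.validate()
    job_slots = max(1, req.cpus - 1)
    return {
        "job_slots": job_slots,
        "memory_per_job_gb": req.ram_gb // job_slots,
        "storage_gb": req.storage_gb,
        "network": "public" if req.public_ip else "private",
        "tool_set": req.tool_set,
        "reference_data": list(req.reference_data),
    }


if __name__ == "__main__":
    request = InstanceRequest(cpus=8, ram_gb=32, storage_gb=500, public_ip=True)
    print(derive_settings(request))
```

Keeping the request as plain, validated data would let the same description be filled in from a web form and then handed to whatever deployment mechanism the platform uses.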
The use case will provide an easy setup of on-demand Galaxy workspaces, ready to be used both by life scientists and by bioinformaticians lacking the necessary resources (either human or computational) to deploy their own custom instance.
Galaxy is currently adopted in many life science research environments to facilitate the use of many bioinformatics tools and the handling of large quantities of biological data. While using the workflow manager is relatively simple, making it a good solution for training purposes as well, setting it up and administering it require an adequate computational infrastructure and people with the necessary technical know-how.
At present, there are three main ways to use Galaxy, each with advantages and drawbacks.
The first is to use a public Galaxy server. There are more than 70 public Galaxy servers available to users, some oriented to specific types of analysis, others more general. This solution offers a fully working Galaxy deployment without any user-specific configuration. On the other hand, the hardware resources are shared among all users of the server, and computationally intensive jobs are often not allowed. The configuration of a public Galaxy instance is not customizable, so it is not possible to install new tools. This is very limiting, since it is often necessary to process data with different strategies and tools in search of the best approach for a particular experiment. Moreover, it is impossible to add new, self-developed tools.
Newly uploaded user data are accessible only to authorized users, but they remain potentially accessible to the platform administrators. Although this is not an issue in most research situations, it is a critical drawback when sensitive clinical data must be processed.
Alternatively, Galaxy can be deployed as a private, personalized instance on local hardware resources. Deploying a physical Galaxy server is costly, since it requires both hardware and the human resources to maintain it.
Galaxy installation and maintenance must be performed in a Unix/Linux environment and require command line operations, Python programming and database management. Moreover, at least one bioinformatician is needed to manage the bioinformatics tools and reference data.
A third solution is to use a commercial Galaxy instance on the cloud, like the one offered by Amazon. This solution is costly and comes with serious ethical drawbacks in terms of data privacy and data handling by the private vendor of the service.
A public, cloud-based and easy-to-manage Galaxy environment will greatly benefit many life science research environments by offering computational resources without the need to own and maintain a burdensome computational infrastructure, while at the same time freeing up bioinformaticians’ time, which could be better invested in data interpretation and algorithm development.
The “Galaxy as a service” platform aims to be adopted both by life scientists and bioinformaticians requiring a powerful computational infrastructure to run complex analyses over large datasets within the familiar Galaxy environment.
This solution is particularly suitable for small research groups, big institutions or SMEs that cannot or do not want to set up and maintain their own hardware and software infrastructure. Moreover, a Galaxy cloud service could be a practical solution for universities and other training facilities to teach basic and advanced bioinformatics. Finally, when bioinformatics protocols become available for personalized medicine (e.g. pharmacogenomics), Galaxy as a cloud service will provide an easy-to-use (and easy-to-learn) platform, ready to be used by healthcare operators.
An end-user will be able to customize their own Galaxy instance mostly through the web interface. For instance, they will be able to select among different sets of tools as the default Galaxy configuration and to install new (custom) tools and workflows. Configuration issues are resolved during the development of the service, so the administrator of the virtual instance does not need to be a Galaxy expert to run it.
Sensitive data can be transferred to and from the VM(s) because they are isolated from every other user of the cloud environment, platform administrator(s) included. Data are accessible only to the Galaxy administrator of the instance and to the other users of the same instance (each user has access to their own data).
The virtual hardware will also be customizable in terms of CPU, RAM and storage. The Galaxy instance will be configured according to the technical specification of the virtual hardware (e.g. number of CPUs, RAM, storage, public/private IP address).
ELIXIR brings together Europe’s leading life science organizations. Its goal is to manage the collection, quality control and archiving of the large amounts of biological data produced by life science experiments. ELIXIR aims to create an infrastructure that integrates research data from all corners of Europe and ensures a seamless service provision that is easily accessible to all, providing tools, like Galaxy, to exploit this infrastructure.
In this context cloud computing is a valuable opportunity to offer on-demand services deployed on a flexible computational infrastructure.
Although this is not the first attempt to deploy Galaxy as a cloud service, the integration of Galaxy into the INDIGO-DataCloud infrastructure ensures, for the first time, the development of innovative solutions to completely fulfill the project requirements. Moreover, the possibility of isolating and protecting sensitive clinical data opens the way to the adoption of this service by research groups involved in clinical bioinformatics.
The INDIGO-DataCloud community offers an unprecedented opportunity for cooperation among different research communities, sharing not only common problems but, above all, common solutions in different areas of life science.
Finally, this project is of great interest to the ELIXIR-ITA community (the Italian Node of ELIXIR), which represents the leading Italian institutions in the field of bioinformatics and aims to propose this solution to a vast and heterogeneous community of scientists who use, develop and maintain a large set of bioinformatics services.