Running a climate model intercomparison data analysis experiment is very challenging, as it usually requires large amounts of data (on the order of multiple terabytes) from multiple climate models.
In the current scenario, such datasets have to be downloaded (e.g. from the ESGF data nodes) to the end user’s local machine before the analysis steps can start. This is a strong barrier for climate scientists, as this phase can take days, weeks or even months, depending on the amount of data needed to run the analysis.
The current client-side nature of the analysis workflow also requires end users to have system management/ICT skills to install and update all the data analysis tools and libraries they need on their local machines.
Another barrier relates to the complexity of the data analysis process itself. Analysing large datasets involves running multiple data operators from a widely adopted set of command-line tools. This is usually done via scripts (e.g. bash) on the client side, which requires climate scientists to implement and replicate error-prone, workflow-like control logic in their scripts, in addition to the expected application-level part.
Moreover, the large data volumes and the strong I/O requirements pose additional performance challenges. In this regard, production-level tools for climate data analysis are mostly sequential, and there is a lack of solutions implementing fine-grained data parallelism or adopting stronger parallel I/O strategies.
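To make the notion of fine-grained data parallelism concrete, the following minimal sketch splits a gridded field into chunks and reduces each chunk in a separate worker. It is only an illustration: the function names (`chunk_mean`, `split`, `parallel_time_mean`) and the toy data layout (a list of per-cell time series) are assumptions, and threads are used purely so that the example runs anywhere with the standard library. A real deployment would rely on processes or a distributed big data framework.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def chunk_mean(chunk):
    """Return the time mean of every grid cell in one chunk."""
    return [fmean(series) for series in chunk]


def split(cells, n_chunks):
    """Split a list of grid cells into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(cells) // n_chunks))  # ceiling division
    return [cells[i:i + size] for i in range(0, len(cells), size)]


def parallel_time_mean(cells, n_chunks=4):
    """Compute per-cell time means, handing one chunk to each worker."""
    chunks = split(cells, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() returns results in the same order as the input chunks
        partial = list(pool.map(chunk_mean, chunks))
    return [value for part in partial for value in part]
```

Each chunk is processed independently, so the same decomposition could be mapped onto several cores or several nodes without changing the per-chunk operator.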
Case Study & User story
The large-scale climate model intercomparison data analysis use case is described step by step below. Each step is a user story that can be considered a small piece of the same larger picture:
- 1. According to her needs, the user selects a data analysis experiment (anomalies analysis, trend analysis, climate change signal analysis) through a specific interface provided by a scientific gateway.
- 2. The experiment will be defined in terms of a set of inputs (e.g. variables, models, scenarios, timeframe, bounding box, etc.) and a data analysis workflow. The workflows will also need to be published in a workflow repository for further re-use, community-based feedback, linking to the scientific gateway, etc. To this end, existing marketplace tools should be used.
- 3. The experiment will involve datasets at a single site or at multiple sites according to the distribution/location of the datasets.
- 4. Some of the tasks in the experiment could involve running existing software, such as visualization tools well known in the community (e.g. NCL, the NCAR Command Language) or other tools for data manipulation and analysis, based on users' needs and preferences. Moreover, specific interfaces are expected to support an easy definition of big data analytics experiments on large datasets through a declarative approach (e.g. parallel statements in the workflow definition).
- 5. Considering that experiment runs could take from minutes to hours (or even longer, depending on the set of inputs defined by the user), notification mechanisms (e.g. email) will be useful to inform users about the status of their experiments.
- 6. The results of the experiments should be easily available to the end user for inspection, download and visualization through the scientific gateway interface. To this end, the user interface (e.g. scientific gateway) should provide specific/advanced support for data analytics and visualization.
- 7. The publication of the final results on HTTP/FTP or domain-specific services (e.g. OPeNDAP/THREDDS) should be a possible option for the end users. This will make it easier to share the experiment results.
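Steps 2 and 4 can be illustrated with a small sketch of how an experiment definition might be captured and expanded into a declarative workflow with a parallel statement. Everything here is hypothetical: the `Experiment` class, the `build_workflow` function, the dictionary layout and the example values are assumptions made for illustration and do not reflect the actual gateway or workflow format.

```python
from dataclasses import dataclass

# Analysis types named in step 1; the identifiers themselves are hypothetical.
ANALYSES = ("anomalies", "trend", "climate_change_signal")


@dataclass(frozen=True)
class Experiment:
    analysis: str   # one of ANALYSES
    variable: str   # e.g. "pr" for precipitation
    models: tuple   # climate model names
    scenario: str
    years: tuple    # (first_year, last_year), inclusive
    bbox: tuple     # (lat_min, lat_max, lon_min, lon_max) in degrees

    def problems(self):
        """Return a list of validation messages (empty when the input is valid)."""
        found = []
        if self.analysis not in ANALYSES:
            found.append(f"unknown analysis: {self.analysis}")
        if not self.models:
            found.append("at least one model is required")
        if self.years[0] > self.years[1]:
            found.append("first year is after last year")
        lat_min, lat_max = self.bbox[:2]
        if not (-90 <= lat_min <= lat_max <= 90):
            found.append("latitude bounds must satisfy -90 <= min <= max <= 90")
        return found


def build_workflow(exp):
    """Expand an experiment into a declarative workflow description."""
    per_model = [
        {"name": f"{exp.analysis}:{model}", "operator": exp.analysis, "model": model}
        for model in exp.models
    ]
    return {
        "inputs": {"variable": exp.variable, "scenario": exp.scenario,
                   "years": exp.years, "bbox": exp.bbox},
        "steps": [
            {"parallel": per_model},  # independent tasks, one per model
            {"name": "intercomparison",
             "depends_on": [task["name"] for task in per_model]},
        ],
    }
```

For example, `build_workflow(Experiment("trend", "pr", ("MODEL-A", "MODEL-B"), "rcp85", (1976, 2005), (30.0, 48.0, -10.0, 40.0)))` yields one parallel task per placeholder model followed by an intercomparison step that depends on both.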
Status today (January 2016)
A preliminary implementation of a precipitation trend analysis experiment has been performed using the current prototype of the Scientific Gateway for Climate Data Analysis, developed in INDIGO.
In this regard, from a back-end point of view, an Ophidia big data cluster stack has been successfully tested on both OpenNebula and OpenStack private clouds, running the first analysis prototype of the use case (single-model scenario). The extension of the current workflow to the multi-model scenario is ongoing and will make it possible to run a geographically distributed climate model intercomparison data analysis experiment. To run, test and validate the use case, experiments involving CMIP5 data (provided by CMCC in the CMIP5 federated data archive) have been performed.
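The text does not specify how the precipitation trend is computed. A common choice, assumed here purely for illustration, is the slope of an ordinary least-squares line fitted to a time series at each grid cell. The sketch below shows that calculation for a single series; the function name `linear_trend` is hypothetical, and the actual prototype may use a different estimator.

```python
def linear_trend(years, values):
    """Return the least-squares slope of values against years.

    The result is expressed in units of `values` per year.
    """
    n = len(years)
    if n != len(values) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_t = sum(years) / n
    mean_v = sum(values) / n
    # slope = covariance(t, v) / variance(t)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(years, values))
    var = sum((t - mean_t) ** 2 for t in years)
    if var == 0:
        raise ValueError("all time coordinates are identical")
    return cov / var
```

In a multi-model experiment the same operator would be applied independently to every grid cell of every model, which is what makes the problem a natural fit for server-side, data-parallel execution.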
The case study relates to the scientific community working on climate modelling, which is organized within the European Network for Earth System modelling (ENES). The institutions involved in this network include university departments, research centres, meteorological services, computer centres and industrial partners.
The case study deals with climate model intercomparison data analysis, which is very challenging as it usually requires large amounts of data (on the order of multiple terabytes) from multiple climate models. In this context, from an infrastructure point of view, the Earth System Grid Federation (ESGF) provides a federated data infrastructure involving a large set of data providers/modeling centres around the globe (the IS-ENES project provides the European contribution to the ESGF infrastructure). ESGF has been serving the Coupled Model Intercomparison Project Phase 5 (CMIP5) experiment, providing access to 2.5 PB of data for the IPCC AR5.
The INDIGO DataCloud architectural solutions will make it possible to implement the following general workflow:
- Experiment setup: from a user interface (e.g. Scientific Gateway) the climate scientist will choose/define a specific type of data analysis (e.g. trend analysis, climate change signal, etc.). The input parameters provided at this stage will relate to the domain- and analysis-specific information (e.g. climate models, variables, time frequencies, bounding box, etc.).
- Experiment run: the data analysis workflow will be submitted to the infrastructure through the INDIGO PaaS. The INDIGO services will then handle the following steps transparently.
- Data discovery will be based on the input parameters provided by the user and performed by querying the ESGF Index nodes. Tasks will be orchestrated by WfMS components and submitted to multiple sites (taking into account data locality constraints and exploiting a server-side approach), taking advantage of the big data solutions/frameworks deployed at each site (in this regard, the INDIGO PaaS will support the dynamic and transparent instantiation of the needed components/resources). The output of this step will consist of both intermediate data and final products. Provenance will be captured throughout the entire process, and security aspects related to authentication (AuthN) and authorization (AuthZ) will be managed by specific INDIGO services.
- Results access, visualization, and publication: The results will be made easily available to the end-user through a dedicated interface for download, visualization and possibly further analysis. The user interface will provide analytics, exploration, and visualization capabilities, thanks to the integration of already existing and well-known tools in the community.
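The data locality constraint in the experiment run step can be sketched as follows: datasets returned by data discovery are grouped by the site that hosts them, one server-side task is created per site, and a final task combines the per-site outputs. The record layout (`id` and `site` keys), the task dictionaries and the function name `plan_tasks` are assumptions made for illustration; they do not describe the ESGF search results or the INDIGO orchestration interfaces.

```python
from collections import defaultdict


def plan_tasks(datasets, operator):
    """Plan one server-side task per hosting site plus a final merge task.

    `datasets` is a list of dicts with hypothetical keys "id" and "site".
    """
    by_site = defaultdict(list)
    for record in datasets:
        by_site[record["site"]].append(record["id"])

    # Each task runs where its input data lives, so no bulk download is needed.
    site_tasks = [
        {"name": f"{operator}@{site}", "site": site, "inputs": ids}
        for site, ids in sorted(by_site.items())
    ]
    # Only the (much smaller) per-site results are moved to the merge step.
    merge = {"name": "merge", "site": None,
             "inputs": [task["name"] for task in site_tasks]}
    return site_tasks + [merge]
```

The design choice illustrated here is that only reduced intermediate products cross site boundaries, while the raw model output stays at the data centre that hosts it.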
The main innovations achieved within INDIGO will be:
- a software framework deployable on heterogeneous infrastructures (e.g. HPC clusters and cloud environments) to run distributed, parallel data analysis;
- provisioning of efficient big data analysis solutions exploiting workflow-enabled, server-side, and declarative approaches;
- an interoperable solution with regard to the already existing community-based software eco-system and infrastructure (IS-ENES/ESGF);
- the adoption of workflow management system solutions for large-scale (distributed) climate data analysis experiments;
- the exploitation of cloud computing technologies offering easy-to-deploy, flexible, elastic, isolated and dynamic big data analysis solutions;
- the provisioning of interfaces, toolkits and libraries to develop high-level interfaces/applications;
- the provisioning of a Data analytics Gateway for eScience, with specific support for scientific data analysis and visualization.
As the above shows, the proposed case study deals with multiple challenges and requirements regarding: workflow management, big data analytics, distributed data analysis, scientific data management (e.g. analysis, visualization, publication), security, interoperability with existing infrastructures/eco-systems, re-usability (e.g. workflows, intermediate/final products), performance (e.g. data volume is key), dynamic and flexible use of infrastructural resources, etc.
The proposed INDIGO architectural components make it possible to address these requirements and to overcome current limitations such as client-side data analysis, sequential data analysis, static approaches and poor performance, as well as the complete lack of workflow support and of domain-oriented big data approaches/frameworks for large-scale, high-performance climate data analysis experiments. They aim to provide a core part still missing in the climate scientists’ research eco-system.
INDIGO helps in:
- dealing easily with large-scale, massive climate model intercomparison data analysis experiments;
- running complex data analysis workflows across multiple data centers;
- integrating well-known existing tools, libraries and command-line interfaces into large-scale workflows;
- reducing the time-to-solution and complexity associated with this class of large-scale experiments;
- implementing and exploiting server-side solutions for data analysis, while reducing large-scale data movement;
- addressing the re-use of final products, intermediate results and workflows.