Alberto Michelini is the Director of the National Earthquake Center at the National Institute of Geophysics and Volcanology (INGV) in Italy. Looking for help to manage vast quantities of complex and diverse seismological data, he became involved in EUDAT, a project that is establishing a Collaborative Data Infrastructure for European research data. In conversation with Erwin Laure, Director of the PDC Center for High-Performance Computing at the KTH Royal Institute of Technology in Sweden, Alberto explains the background to his involvement with EUDAT and provides a checklist for research communities about how and why they too should become involved with EUDAT.
Research communities need places to store their increasing volumes of digital data – and usually they want to make that data available to other researchers too, so it can be re-used and contribute to society by helping to solve some of the challenges we face. For example, seismologists share data to help produce accurate predictions of seismic ground motions at high frequencies (which is very important for reducing seismic hazards). Consequently, the data storage facilities we adopt to manage our research data must provide good tools that enable people to search for and find specific types of data. Many researchers also need ways to move or copy their data to and from the high-performance computing centres where it is processed. In addition, research data must be managed and stored in a secure, professional and persistent manner so that both the researchers who make their data available to others, and the researchers who use others’ data, can rely on the integrity of that data.
The cost of data handling - time and resources
In Europe we could adopt a model whereby individual institutions would be responsible for their own data handling. However, it is important to understand the size of the investment (in terms of both time and money) that would be needed to provide suitable data management facilities. The kind of computer system that can adequately store such huge amounts of data is not cheap to install, and providing the right tools for managing the data takes time and requires a lot of resources. For example, one either needs sufficient personnel to handle the data management side of things, or the researchers need to take time from their research to do their own data management – which is certainly not an efficient use of their research time. In any case, we are now at a breaking point, since individual research communities are struggling to provide all the necessary modern ‘tools’ for data archiving, curation and preservation.
The power of shared resources
In the same way that we can have access to much faster high-performance computer systems by sharing resources (for example, through the PRACE project), we can also enjoy much better data management and storage facilities if we share and work together. And this is where EUDAT comes in.
EUDAT is primarily designed to provide data services for European researchers (although naturally we want to make the data available for world-wide research, wherever appropriate). The project addresses the needs of both researchers and members of the general public who are producing or using very large data sets for research purposes. A typical example of this kind of data comes from the continuous series of observations recorded by our networks of seismic stations over time, or from waveform simulations of earthquakes. There are also researchers who generate or work with many small sets of data, such as those resulting from the analysis of ambient seismic noise cross-correlation. In all these cases, the researchers have quantities of data that are either too large to store on their local departmental computer facilities (or their personal computers and laptops) or that need to be moved across to specific high-performance computing facilities and services for analysis.
EUDAT’s purpose is to pool resources across Europe to provide us all with much better research data ‘library’ services than any of us could afford individually. In other words, EUDAT can be ‘the engine under the hood’ to drive the vehicle for all our research communities’ data needs.