Data preservation

Tools to consider


Storage: servers and collaboration tools

Data is usually stored on a computer, an external hard drive or a USB stick, the latter being practical but risky (1). Users usually realize the risk only when it is too late and the damage is done: for example, when a machine crashes, data is inadvertently overwritten, or a device is lost or stolen.

A server, which presents fewer risks and provides more storage space, is the best solution for research projects. Centralized storage is often supplemented by collaboration tools, which are useful for sharing information between members of a team, especially when the project is managed by several supervisory institutions.

For example, Université de Lyon 1 offers a collaborative platform called Box UCBL, which uses NextCloud technology (free file hosting software). Any user with a Lyon 1 account can connect to it and share files, using up to 5 GB. It operates like a drive for file sharing: a link can be sent to third parties so that they can connect to the workspace, with access rights defined for each of them. The UMS Gricad offers researchers attached to the University of Grenoble Alpes high-performance computing and data processing facilities, a cloud (with virtual servers) and a collaboration platform.

Storage and file naming conventions go hand in hand if data is to be found in the future, be it several weeks, months or years after creation. For example, for data associated with an experiment, use the title of the experiment, the associated project, and the date of execution.
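The naming convention described above can be sketched in a few lines of Python. This is only an illustration: the function names, the order of the elements (date, project, experiment) and the separators are assumptions, not a convention prescribed by any of the services mentioned here.

```python
import re
from datetime import date


def slugify(text: str) -> str:
    """Lowercase the text and replace runs of other characters with hyphens."""
    # Note: accented letters are also replaced by hyphens in this simple sketch.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, experiment: str, run_date: date,
                   extension: str = "csv") -> str:
    """Build a file name from the project, the experiment title and the date.

    The ISO date comes first so that files sort chronologically (an assumption
    made for this example; any fixed order works if applied consistently).
    """
    return f"{run_date.isoformat()}_{slugify(project)}_{slugify(experiment)}.{extension}"


print(build_filename("Soil Carbon", "Incubation test 3", date(2020, 3, 14)))
# prints: 2020-03-14_soil-carbon_incubation-test-3.csv
```

Whatever the exact pattern, the point is that it is generated the same way every time, so that a file can still be located months or years later.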

Apart from standard institutional servers, there are also secure online tools. For example, the EUDAT.eu platform, supported by European research programs (FP7 and H2020), offers several online services covering different data management needs such as storage, searching, and archiving.

B2Drop, for example, stores and synchronizes research data on several machines. Data can also be shared with colleagues or members of a group. Each user is assigned 20 GB, with a limit of 2 GB per file. It is intended only for frequently used data while a project is in progress, for example a project shared between several institutions. There is no metadata associated with the data.

Data repositories, initially designed to encourage open-access publication of scientific data, provide varied means of distribution (open access, restricted access, embargo, etc.). They can also provide storage. That is how some researchers perceive Zenodo, the data repository made available by CERN. The platform, presented as an experimental project with a minimum life of 20 years, does not however commit to data readability over time (2). The infrastructure is secure, supplies persistent identifiers (DOI) and accepts large data sets of up to 50 GB. Data deposits on Zenodo must be described: title, author, version, keywords, etc. This saves space on personal devices, and data is easily traced.
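As a rough illustration, the size limits quoted above (5 GB for Box UCBL, 20 GB per user and 2 GB per file for B2Drop, 50 GB per data set for Zenodo) can be compared against a set of files before choosing a service. The sketch below is hypothetical: it assumes decimal gigabytes, treats each quoted figure as a simple ceiling on the total size, and does not query any of the services.

```python
GB = 10**9  # assumption: decimal gigabytes

# Limits as quoted in the text; None means no per-file limit is stated there.
LIMITS = {
    "Box UCBL": {"total": 5 * GB, "per_file": None},
    "B2Drop": {"total": 20 * GB, "per_file": 2 * GB},
    "Zenodo": {"total": 50 * GB, "per_file": None},
}


def services_that_fit(file_sizes):
    """Return the names of the services whose quoted limits the files respect."""
    total = sum(file_sizes)
    largest = max(file_sizes, default=0)
    fitting = []
    for name, limit in LIMITS.items():
        if total > limit["total"]:
            continue  # the whole set exceeds the quoted ceiling
        if limit["per_file"] is not None and largest > limit["per_file"]:
            continue  # at least one file exceeds the per-file limit
        fitting.append(name)
    return fitting


# Two files of 3 GB and 1 GB: B2Drop is excluded by its 2 GB per-file limit.
print(services_that_fit([3 * GB, 1 * GB]))
```

Size is only one criterion: the services also differ in purpose (working storage versus published, described deposits), as explained above.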

Archiving: a sensitive question

When a project finishes, a decision must be made about what happens to the data produced. Can some of it be deleted? Or does it have enough intrinsic value to justify a long-term (and costly) solution?

There are several solutions for archiving data and preserving its integrity over time.

The major player in scientific archiving is CINES (Centre Informatique National de l’Enseignement Supérieur), based in Montpellier, which ensures data longevity. The first step consists of sending a letter of intent to the director of CINES, presenting the project, the type of data, the formats used, and the volume of the data sets. If a data management plan has been prepared, all this information is already available. A project team is then assigned to archive the data, a process that usually takes between 6 months and 1 year. Costs vary depending on the type of service (number of copies on disk or magnetic tape) and the volume of data (3).

For smaller projects (less than 10 TB), the basic tariff is €1,043 (incl. VAT) per TB archived per year. The service includes a local copy on disk, a local copy on tape, and data replication within 300 km.

For bigger projects (greater than 100 TB), the cost is €221 (incl. VAT) per TB archived. In this case, the service includes two local copies on tape and remote data replication.

A fixed processing fee of €2,500 (incl. VAT) is payable in advance for the preparation of the service.
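The tariffs above can be turned into a rough cost estimate. The sketch below is only indicative and rests on stated assumptions: the function name is hypothetical, the rate for bigger projects is applied per TB in the same way as the basic tariff, and no rate is computed between 10 TB and 100 TB because the text gives none for that range. Actual quotes come from CINES (3).

```python
SETUP_FEE_EUR = 2500        # fixed processing fee, incl. VAT, paid in advance
SMALL_RATE_EUR_PER_TB = 1043  # basic tariff, projects under 10 TB
LARGE_RATE_EUR_PER_TB = 221   # tariff for projects over 100 TB


def archiving_cost(volume_tb):
    """Return the volume-based cost in euros (incl. VAT), excluding the setup fee."""
    if volume_tb <= 0:
        raise ValueError("volume must be positive")
    if volume_tb < 10:
        return volume_tb * SMALL_RATE_EUR_PER_TB
    if volume_tb > 100:
        return volume_tb * LARGE_RATE_EUR_PER_TB
    # The text quotes no tariff for this range, so none is invented here.
    raise ValueError("no tariff quoted for volumes between 10 TB and 100 TB")


# A 5 TB project: 5 x 1,043 = 5,215 euros, plus the 2,500 euro setup fee.
print(archiving_cost(5) + SETUP_FEE_EUR)  # prints: 7715
```

The sharp drop in the per-TB rate for large volumes reflects the different service levels described above (copies on tape only, rather than disk and tape).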

The FACILE platform, an online format validation tool, can also be accessed via the CINES site. The platform features a list of the formats eligible for deposit and a contact function if expert assistance is required.

The previously mentioned EUDAT.eu platform has a “long-term data preservation system” with the service B2SHARE. However, unlike CINES, B2SHARE does not commit to long-term content readability.

The service is free of charge for all European researchers, whether or not they are affiliated with research organizations or universities. Data sets are given a persistent identifier issued by the platform. Certain basic metadata must be entered, such as the title and description of the data. More metadata can be entered, particularly using extensions and interfaces specific to certain communities.

As the “share” in its name suggests, data can be published and shared amongst communities. However, users control access to their data and can restrict it if they prefer.

To improve data searches in B2SHARE, EUDAT has also incorporated an annotation service: B2NOTE. These annotations are used to classify groups of data or files. Three types of annotation are possible. Firstly, semantic tags taken from existing ontologies (currently from Bioportal only (4), with ontologies for biology). Secondly, it is possible to create and associate your own keywords when no suitable tag exists. Thirdly, it is possible to leave comments describing the resource more thoroughly.

While this tool is not indispensable, it can improve indexing of your own data or enable more refined searches in B2SHARE data.

  1. “The storage of data on laptops, external hard disks or storage devices such as USB sticks, is not recommended.” See the ANR’s data management plan (PGD) template: https://anr.fr/fileadmin/documents/2019/ANR-modele-PGD.pdf
  2. “Zenodo makes no promises of usability and understandability of deposited objects over time.” https://about.zenodo.org/policies/
  3. How to archive at CINES: https://www.cines.fr/archivage/comment-archiver-au-cines/
  4. https://bioportal.bioontology.org/