PANGAEA is highly committed to preserving its data and metadata. In addition to redundant storage, several logical measures keep data and metadata in a consistent state at all times. Data is not simply written to hard disks; PANGAEA stores metadata about all parts of each dataset in its relational database (PostgreSQL). The metadata schema is designed to allow easy dissemination into various other metadata formats. The concepts of schema.org and ISO 19115 are used to describe data. PANGAEA considers metadata essential for the reusability of data and therefore collects and preserves it carefully. The following metadata is collected:
- for the citation of PANGAEA datasets (“citation”): author / contributor names with their ORCID iDs and affiliations with ROR identifiers; title of the dataset; date of dataset publication; a DOI name
- funding information (including projects, grant numbers)
- event information detailing when and where data was collected, including information about methods/devices
- links (including DOI names or other persistent identifiers) to related documentation (scientific articles / papers). If documentation is not stored in external repositories (e.g., at libraries or publishers) and is not reachable by a persistent identifier, PANGAEA stores a copy of the file in PDF/A format in its own systems. Using a standard such as PDF/A makes it unlikely that the files will need to be converted again. PANGAEA nevertheless strives to adopt new long-term standards for documents and will transform PDF files to newer versions of the standard if needed.
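As an illustration of how the citation and funding metadata listed above could be expressed with schema.org terms, the following is a minimal sketch. The function name `dataset_jsonld`, the selection of fields, and all example values (names, identifiers, DOI) are placeholders chosen for illustration; this is not PANGAEA's actual export format.

```python
import json


def dataset_jsonld(title, authors, date_published, doi, funders=()):
    """Build a schema.org Dataset description as a plain dict.

    `authors` is a list of dicts with the (hypothetical) keys
    name, orcid, affiliation, ror.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "datePublished": date_published,
        "identifier": f"https://doi.org/{doi}",
        "creator": [
            {
                "@type": "Person",
                "name": a["name"],
                "identifier": f"https://orcid.org/{a['orcid']}",
                "affiliation": {
                    "@type": "Organization",
                    "name": a["affiliation"],
                    "identifier": f"https://ror.org/{a['ror']}",
                },
            }
            for a in authors
        ],
        "funder": [{"@type": "Organization", "name": f} for f in funders],
    }


# Placeholder values only; none of these identifiers refer to real records.
record = dataset_jsonld(
    title="Example dataset title",
    authors=[{
        "name": "Jane Doe",
        "orcid": "0000-0000-0000-0000",
        "affiliation": "Example Institute",
        "ror": "00example",
    }],
    date_published="2024-01-01",
    doi="10.0000/example.1",
    funders=["Example Funding Agency"],
)
print(json.dumps(record, indent=2))
```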
The actual data objects in PANGAEA are stored as “data series” (a series of data points in numerical, date/time, string or binary form). Each data entry in a data series refers to metadata about the object:
- type of data point (numeric, date, string, binary file)
- responsible scientist (PI)
- for binary files, also hashes, file size, and absolute location in the bucket store
- for numerical data, also format information such as significant digits
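The per-entry metadata listed above can be pictured as a small record type. The sketch below is only an assumption about how such a record might look: the class and field names are hypothetical, SHA-256 stands in for the unspecified "hashes", and the real PostgreSQL schema is not shown in this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PointType(Enum):
    NUMERIC = "numeric"
    DATE = "date"
    STRING = "string"
    BINARY = "binary"


@dataclass(frozen=True)
class DataPointMeta:
    point_type: PointType
    principal_investigator: str
    significant_digits: Optional[int] = None  # numeric data only
    sha256: Optional[str] = None              # binary files only (assumed hash)
    size_bytes: Optional[int] = None          # binary files only
    bucket_location: Optional[str] = None     # binary files only

    def __post_init__(self):
        # Binary entries must carry hash, size, and storage location.
        if self.point_type is PointType.BINARY:
            required = ("sha256", "size_bytes", "bucket_location")
            missing = [n for n in required if getattr(self, n) is None]
            if missing:
                raise ValueError(f"binary entry is missing: {', '.join(missing)}")


def format_numeric(value: float, meta: DataPointMeta) -> str:
    """Render a value with the stored number of significant digits."""
    if meta.point_type is not PointType.NUMERIC or meta.significant_digits is None:
        raise ValueError("meta does not describe a formatted numeric data point")
    # The '#' flag keeps trailing zeros, so 2.5 with 3 digits becomes '2.50'.
    return f"{value:#.{meta.significant_digits}g}"
```

Storing the number of significant digits next to the value lets a later export reproduce the precision the scientist reported instead of whatever a floating-point conversion happens to print.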
PANGAEA’s database schema is continuously adapted to new metadata standards. During this process, the existing metadata of datasets is adapted and extended according to the new standards. Great care is taken not to introduce incompatible changes to objects’ metadata.
Not all of PANGAEA’s data is provided in tabular form. Some datasets are only available in compact, community-specific binary formats such as NetCDF, or encoded as videos or static images. Long-term preservation of those formats is a complex problem, so PANGAEA has defined format rules that must be met before data in binary formats is accepted. At the moment, the following formats are accepted for archival. If possible, uncompressed formats are preferred:
- images: JPEG, PNG, TIFF
- documents: PDF/A (preferred), ODF, OOXML
- media containers: MP4, MPG, OGG, Matroska; the contents of media containers (audio/video codecs) need to comply with the following standards:
  - Video: no compression, MPEG-1, MPEG-4 Part 2, H.264 (AVC), H.265
  - Audio: no compression, MPEG Layer III (MP3), MPEG-4 Part 3, AAC
- NetCDF, preferably using “Climate and Forecast Metadata Conventions” - in all other cases detailed documentation is required
This list is not complete; PANGAEA may add more formats to it. If any of these formats is deprecated or replaced by a later standard, PANGAEA will do its best to transform the affected files to modern replacements while keeping the original data available.
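The container and codec rules above amount to an allow-list check. The following sketch assumes that container and codec names have already been extracted by some probing tool and normalized to the labels used in the list; the function name and the label spellings are illustrative only.

```python
from typing import List, Optional

# Labels follow the list above; H.264 and AVC name the same codec.
ACCEPTED_CONTAINERS = frozenset({"MP4", "MPG", "OGG", "Matroska"})
ACCEPTED_VIDEO_CODECS = frozenset(
    {"uncompressed", "MPEG-1", "MPEG-4 Part 2", "H.264", "H.265"}
)
ACCEPTED_AUDIO_CODECS = frozenset(
    {"uncompressed", "MP3", "MPEG-4 Part 3", "AAC"}
)


def media_problems(
    container: str,
    video_codec: Optional[str] = None,
    audio_codec: Optional[str] = None,
) -> List[str]:
    """Return a list of reasons why a media file would not be accepted.

    An empty list means the file passes this (simplified) format check.
    """
    problems = []
    if container not in ACCEPTED_CONTAINERS:
        problems.append(f"container {container!r} is not on the accepted list")
    if video_codec is not None and video_codec not in ACCEPTED_VIDEO_CODECS:
        problems.append(f"video codec {video_codec!r} is not on the accepted list")
    if audio_codec is not None and audio_codec not in ACCEPTED_AUDIO_CODECS:
        problems.append(f"audio codec {audio_codec!r} is not on the accepted list")
    return problems


print(media_problems("MP4", "H.264", "AAC"))      # accepted combination
print(media_problems("Matroska", "VP9", "Opus"))  # both codecs rejected
```

Because the list may grow, keeping the accepted formats in data rather than in control flow means a new format only requires a change to the sets.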
Raw data is not accepted by PANGAEA for preservation unless a variant at a higher processing level is provided to complement it. Raw data is clearly flagged with a low “processing level” (0 or 1, see Processing levels). No guarantees are given for the long-term usability of those datasets.
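Read literally, the rule above says that a submission consisting only of level 0 or level 1 data is rejected, while raw data accompanied by a higher-level variant is accepted. A minimal sketch of that reading follows; the function name and the representation of a submission as a list of integer levels are assumptions.

```python
RAW_LEVELS = frozenset({0, 1})  # levels treated as raw data in the text above


def submission_acceptable(processing_levels) -> bool:
    """Accept a submission only if at least one variant is above raw level."""
    levels = set(processing_levels)
    if not levels:
        return False
    return any(level not in RAW_LEVELS for level in levels)
```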
In coordination with scientific communities, PANGAEA has developed documentation on how to harmonize metadata and data for archival, e.g. for CTD data, Thermosalinograph TSG Underway Data, Bathymetry. Those documents also contain information on how long-term preservation is handled (if applicable).
To ensure physical access to archived data, the computer center of the AWI takes care of the proper functioning of hardware and software systems, including backup of data and migration of data from outdated media, as stated in the AWI/MARUM/University Bremen (AMAR) cooperation contract. AWI has implemented the following technical-organizational measures (TOM):
- Fire and smoke detection systems and fire extinguishers
- Server room monitoring of temperature and humidity
- Server room air-conditioning
- UPS system and emergency diesel generators
- RAID system / hard disk mirroring in virtualization environment
- Storage of backup media in physically separated, secure locations
- Backup concept and existence of an emergency plan
- Backup monitoring and reporting, regular checksum validation
- MediaWiki / Confluence is used for documentation of all systems
- User permission management
- Email scanning with anti-virus software
- Network firewall
- Intrusion Detection Systems
- A ticket system is used to track incidents
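The regular checksum validation mentioned in the list above can be sketched as a fixity check that compares a file against its recorded size and hash. The hash algorithm is not named in this document, so SHA-256 is an assumption here, as are the function names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large objects need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the file still matches its recorded size and hash."""
    p = Path(path)
    # Cheap size comparison first; skip hashing if the size already differs.
    if p.stat().st_size != expected_size:
        return False
    return sha256_of(p) == expected_sha256.lower()
```

A periodic job could run `verify_fixity` over every stored object and report any mismatch, which would then be repaired from a backup copy.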
A transfer of custody can be managed by reducing PANGAEA to a file-based repository. In this case, a file-based copy of all datasets, including any binary object files, would be created and made available by the AWI and/or the University of Bremen. In any case, the host institutions guarantee that the data and metadata remain available for at least 10 more years after the formal decommissioning of PANGAEA.
For legal reasons (e.g., copyright law / Art. 17 GDPR), copyright owners, data subjects, or authorities may request that published datasets and their contents be deleted permanently. In this case, a tombstone page linked to the DOI name of the dataset will be created, informing potential users about the deletion.