Abstract | Simulation studies of molecules primarily produce data that represent the configuration
of the system as a function of a progress variable, usually time. Because these data are
high-dimensional and grow very quickly in size, compromises are often necessary. They are
typically achieved by storing only a subset of the system’s components, for example by
stripping solvent, and by restricting the time resolution to a scale significantly coarser
than the basic timestep of the simulation. The resultant trajectories
thus describe the essentially stochastic evolution of molecules of interest. Maintaining
their interpretability through metadata is valuable not only because the data can then aid
researchers interested in specific systems, but also because it supports reproducibility
studies and model refinement. Here, we introduce a standard for the storage of data created by molecular
simulations that improves compliance with the FAIR (Findable, Accessible, Interoperable,
Reusable) principles. We describe a solution conceived in PostgreSQL, along
with reference implementations, that provides stringent links between metadata and
raw data; the lack of such links is a major weakness of the established file formats used for storing
these data. A possible structure for the logic of SQL queries is included along with
salient performance testing. To close, we suggest that PostgreSQL-based storage of
simulation data, in particular when coupled to a visual user interface, can improve the
FAIR compliance of molecular simulation data at all levels of visibility, and we present
a prototype solution for accomplishing this.
|