Data store#
This interface lets you transparently store and load objects outside the database.
The DataStore class#
The DataStore class is a dictionary whose values can also be accessed as object attributes.
Upon serialization, its contents are written to the filesystem (or another storage location), and only a URL reference is retained as its value. On deserialization, the object is restored by loading the contents referenced by the URL.
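The behavior described above can be sketched with the standard library alone. This is not MLtraq's implementation: `AttrDict`, `dump_to_url`, and `load_from_url` are hypothetical names, `pickle` stands in for whatever serializer the library uses, and the `file:///` reference relative to a base directory is an assumption modeled on the URLs shown in the example output on this page.

```python
import pickle
import tempfile
from pathlib import Path


class AttrDict(dict):
    """Hypothetical stand-in: a dict whose values are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc

    def __setattr__(self, name, value):
        self[name] = value


def dump_to_url(store, base_dir, relative_path):
    """Write the contents under base_dir and return only a URL reference."""
    path = Path(base_dir) / relative_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(dict(store)))
    return f"file:///{relative_path}"


def load_from_url(url, base_dir):
    """Restore the contents referenced by the URL."""
    relative_path = url[len("file:///"):]
    data = pickle.loads((Path(base_dir) / relative_path).read_bytes())
    return AttrDict(data)


store = AttrDict()
store.a = [0.0, 0.0, 0.0]  # attribute-style access
store["b"] = 123           # dictionary-style access

with tempfile.TemporaryDirectory() as base_dir:
    url = dump_to_url(store, base_dir, "exp-id/blob-0")
    restored = load_from_url(url, base_dir)
```

After the round trip, only `url` would need to be kept in the database; the values themselves live in the referenced file.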
There is no limit on how many objects you can store within a DataStore object, and you can add as many DataStore objects as you want to run.fields.
Info
By default, the data store relies on the filesystem. The base directory is defined by the option datastore.url. The files are organized in subdirectories named after the experiment IDs.
Important
If an experiment is deleted, its associated datastore folder is also removed. If an experiment is persisted, its associated datastore folder is wiped before using it.
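The folder lifecycle described in this note can be illustrated with a small filesystem sketch. It assumes one subdirectory per experiment ID under a base directory; `persist_blobs` and `delete_experiment` are hypothetical helpers, not MLtraq functions.

```python
import shutil
import tempfile
from pathlib import Path


def persist_blobs(base_dir, experiment_id, blobs):
    """Wipe the experiment's folder, then write each blob as a file."""
    folder = Path(base_dir) / experiment_id
    # Assumed semantics: leftovers from an earlier persist are removed first.
    shutil.rmtree(folder, ignore_errors=True)
    folder.mkdir(parents=True)
    for name, payload in blobs.items():
        (folder / name).write_bytes(payload)
    return folder


def delete_experiment(base_dir, experiment_id):
    """Remove the experiment's folder together with everything in it."""
    shutil.rmtree(Path(base_dir) / experiment_id, ignore_errors=True)


with tempfile.TemporaryDirectory() as base_dir:
    persist_blobs(base_dir, "exp-1", {"old": b"stale"})
    folder = persist_blobs(base_dir, "exp-1", {"new": b"fresh"})
    names_after_persist = sorted(p.name for p in folder.iterdir())

    delete_experiment(base_dir, "exp-1")
    exists_after_delete = folder.exists()
```

The second persist leaves only the newly written file, which is why anything you place manually in an experiment's folder should be considered disposable.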
Can I use third-party storage services?
Yes! The datastore interface is designed to be flexible. If you are interested in using more storage options, please request it on the issue tracker or open a discussion.
In the following, we demonstrate how to add two DataStore objects. Each DataStore object is serialized separately. Depending on the use case, you might want to increase storage efficiency by adding more values to a single DataStore, resulting in I/O with a single reference.
DataStore example
ID experiment: d65df69e-1175-44a5-be2f-2232765703b8
--
Contents of datastore directory:
mltraq.datastore/
mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8
mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba
mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703bb
--
a <class 'numpy.ndarray'> [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
b <class 'pandas.core.series.Series'> [1 2 3]
c <class 'int'> 123
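The trade-off between separate and grouped DataStore objects can be illustrated by counting the references produced in each case. This is a minimal sketch, assuming each container maps to exactly one file; `write_containers` is a hypothetical helper and `pickle` stands in for the library's serializer.

```python
import pickle
import tempfile
from pathlib import Path


def write_containers(base_dir, containers):
    """Write each container to its own file; return one reference per container."""
    refs = []
    for index, values in enumerate(containers):
        path = Path(base_dir) / f"blob-{index}"
        path.write_bytes(pickle.dumps(values))
        refs.append(path.name)
    return refs


# Two containers: two files, two references, two reads to load everything.
with tempfile.TemporaryDirectory() as base_dir:
    separate = write_containers(base_dir, [{"a": 1}, {"b": 2}])

# One container holding both values: a single file and a single reference.
with tempfile.TemporaryDirectory() as base_dir:
    grouped = write_containers(base_dir, [{"a": 1, "b": 2}])
```

Grouping reduces the number of I/O operations, while separate containers let you load one value without reading the others.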
The DataStoreIO class#
In some cases, you might need more fine-grained control over how, when, and where the objects are serialized and written. The DataStoreIO class comes to the rescue.
In the following example, we show how to store two values directly. Value a is a NumPy array and requires serialization. Value b is a bytes object and can be written directly, skipping serialization (faster). The values are written to the data store even though the experiment itself is not persisted.
Warning
The value of relative_path_prefix must be different from the experiment ID. Why? Whenever the experiment is persisted with experiment.persist(), the contents of the experiment ID subdirectory are wiped, consistent with the semantics of the DataStore class. With DataStoreIO, you are in charge of managing the files, including their deletion.
DataStoreIO example
ID experiment: d65df69e-1175-44a5-be2f-2232765703b8
--
Contents of datastore directory:
mltraq.datastore/
mltraq.datastore/self_managed
mltraq.datastore/self_managed/d65df69e-1175-44a5-be2f-2232765703b8
mltraq.datastore/self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba
mltraq.datastore/self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703bb
--
Field values:
url_a file:///self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba
url_b file:///self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703bb
--
Loaded values:
a <class 'numpy.ndarray'> [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
b <class 'bytes'> b'11111'
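The difference between serialized and directly written values can be sketched as follows. This does not use the DataStoreIO API: `write_value` and `read_value` are hypothetical helpers, `pickle` stands in for the library's serializer, and the `self_managed` prefix is borrowed from the output above to keep the files outside any experiment ID subdirectory.

```python
import pickle
import tempfile
from pathlib import Path


def write_value(base_dir, relative_path, value):
    """Write bytes as-is; serialize anything else first. Return a URL reference."""
    path = Path(base_dir) / relative_path
    path.parent.mkdir(parents=True, exist_ok=True)
    # bytes objects skip the serialization step entirely.
    payload = value if isinstance(value, bytes) else pickle.dumps(value)
    path.write_bytes(payload)
    return f"file:///{relative_path}"


def read_value(base_dir, url, deserialize):
    """Read the referenced file, deserializing only when requested."""
    payload = (Path(base_dir) / url[len("file:///"):]).read_bytes()
    return pickle.loads(payload) if deserialize else payload


with tempfile.TemporaryDirectory() as base_dir:
    # A prefix that differs from the experiment ID, so a persist would not wipe it.
    prefix = "self_managed/exp-1"
    url_a = write_value(base_dir, f"{prefix}/a", [0.0] * 10)
    url_b = write_value(base_dir, f"{prefix}/b", b"11111")

    loaded_a = read_value(base_dir, url_a, deserialize=True)
    loaded_b = read_value(base_dir, url_b, deserialize=False)
```

In this sketch the caller must remember which references need deserialization, which mirrors the point that self-managed storage leaves file handling, including cleanup, to you.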