Data store

This interface lets you transparently store and load objects outside the database.

The DataStore class

The DataStore class is a dictionary whose values can also be accessed as object attributes.
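The attribute-style access described above can be illustrated with a minimal sketch in plain Python. The class `AttrDict` is a hypothetical stand-in written for this illustration; it is not MLtraq's implementation of DataStore and it does not serialize anything.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys are also usable as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Attribute assignment is stored as a dictionary item.
        self[name] = value

    def __delattr__(self, name):
        try:
            del self[name]
        except KeyError:
            raise AttributeError(name) from None


store = AttrDict()
store.a = [1, 2, 3]    # attribute-style write
store["b"] = "hello"   # dictionary-style write

print(store["a"], store.b)
print(sorted(store.keys()))
```

Both access styles reach the same underlying items, which is why the examples on this page can write `run.fields.ds.a = ...` and later read the value back as `ds.a`.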

Upon serialization, its contents are written to the filesystem (or to another storage location) and only a URL reference is retained as its value. The object is deserialized by loading the contents referenced by the URL.
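The idea of keeping only a URL reference can be sketched with the standard library alone. The helpers `dump_to_url` and `load_from_url` are hypothetical names introduced for this illustration, and the use of `pickle` is an assumption; the sketch does not reflect how MLtraq actually encodes or names its files.

```python
import pickle
import tempfile
import uuid
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def dump_to_url(obj, base_dir):
    """Pickle `obj` into a new file under `base_dir` and return its file:// URL."""
    path = Path(base_dir) / uuid.uuid4().hex
    path.write_bytes(pickle.dumps(obj))
    return path.resolve().as_uri()


def load_from_url(url):
    """Load the object referenced by a file:// URL created by `dump_to_url`."""
    path = Path(url2pathname(urlparse(url).path))
    return pickle.loads(path.read_bytes())


with tempfile.TemporaryDirectory() as base_dir:
    contents = {"a": [0.0] * 10, "b": [1, 2, 3]}
    url = dump_to_url(contents, base_dir)  # only this string needs to be retained
    restored = load_from_url(url)          # contents are loaded back on demand

print(url.startswith("file://"), restored == contents)
```

The value kept in place of the object is a short string, so the record that holds it stays small regardless of how large the stored contents are.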

There is no limit on how many objects you can store within a DataStore object, and you can add as many DataStore objects as you want to run.fields.

Info

By default, the data store relies on the filesystem. The base directory is defined by the option datastore.url. The files are organized in subdirectories named after the experiment IDs.

Important

If an experiment is deleted, its associated datastore folder is also removed. If an experiment is persisted, its associated datastore folder is wiped before it is used.
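The folder semantics above can be mimicked with a small filesystem sketch. The functions `persist_files` and `delete_files`, and the placeholder ID `exp-1`, are hypothetical and exist only for this illustration; they are not part of MLtraq.

```python
import shutil
import tempfile
from pathlib import Path


def persist_files(base_dir, id_experiment, files):
    """Wipe the experiment's folder, then write `files` (name -> bytes) into it."""
    folder = Path(base_dir) / id_experiment
    shutil.rmtree(folder, ignore_errors=True)  # discard files from earlier persists
    folder.mkdir(parents=True)
    for name, payload in files.items():
        (folder / name).write_bytes(payload)
    return folder


def delete_files(base_dir, id_experiment):
    """Remove the experiment's folder together with everything in it."""
    shutil.rmtree(Path(base_dir) / id_experiment, ignore_errors=True)


with tempfile.TemporaryDirectory() as base_dir:
    folder = persist_files(base_dir, "exp-1", {"old": b"1"})
    persist_files(base_dir, "exp-1", {"new": b"2"})  # second persist wipes "old"
    after_persist = sorted(p.name for p in folder.iterdir())

    delete_files(base_dir, "exp-1")
    exists_after_delete = folder.exists()

print(after_persist, exists_after_delete)
```

After the second persist only the newly written file remains, and deleting the experiment leaves no folder behind.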

Can I use third-party storage services?

Yes! The datastore interface is designed to be flexible. If you are interested in using more storage options, please request it on the issue tracker or open a discussion.

In the following, we demonstrate how to add two DataStore objects. Each DataStore object is serialized separately. Depending on the use case, you might increase storage efficiency by adding more values to a single DataStore, so that they are read and written together through a single reference.

DataStore example

import glob

import numpy as np
import pandas as pd

from mltraq import DataStore, create_experiment
from mltraq.utils.fs import tmpdir_ctx

with tmpdir_ctx():
    # Create a new experiment and execute a run
    experiment = create_experiment()
    with experiment.run() as run:
        # DataStore with two values
        run.fields.ds = DataStore()
        run.fields.ds.a = np.zeros(10)
        run.fields.ds.b = pd.Series([1, 2, 3])

        # DataStore with a single value
        run.fields.ds2 = DataStore()
        run.fields.ds2.c = 123

    # Persist experiment
    experiment.persist()

    print(f"ID experiment: {experiment.id_experiment}")
    print("--\n")

    # List files in datastore directory
    print("Contents of datastore directory:\n")
    for pathname in glob.glob("*/**", recursive=True):
        print(pathname)
    print("--\n")

    # Reload experiment from database
    experiment.reload()

    # Show stored values
    ds = experiment.runs.first().fields.ds
    ds2 = experiment.runs.first().fields.ds2
    print("a", type(ds.a), ds.a)
    print("b", type(ds.b), ds.b.values)
    print("c", type(ds2.c), ds2.c)

Output

ID experiment: d65df69e-1175-44a5-be2f-2232765703b8
--

Contents of datastore directory:

mltraq.datastore/
mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8
mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba
mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703bb
--

a <class 'numpy.ndarray'> [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
b <class 'pandas.core.series.Series'> [1 2 3]
c <class 'int'> 123

The DataStoreIO class

In some cases, you might need more fine-grained control over how, when, and where the objects are serialized and written. The class DataStoreIO comes to the rescue.

In the following example, we show how to store two values directly. Value a is a NumPy array and requires serialization. Value b is a bytes object and can be written as-is, skipping serialization (faster). The values are written to the datastore even though the experiment itself is never persisted.
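The difference between the two write paths can be sketched without MLtraq. The snippet below assumes `pickle` as the serializer and uses a plain list in place of the NumPy array; both choices are made for illustration only and say nothing about the format DataStoreIO uses.

```python
import pickle
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as base_dir:
    base = Path(base_dir)

    # Arbitrary object: it must be converted to bytes before it can be written.
    a = [0.0] * 10
    (base / "a").write_bytes(pickle.dumps(a))
    a_loaded = pickle.loads((base / "a").read_bytes())

    # Bytes object: already in its on-disk form, so it is written and read as-is.
    b = b"11111"
    (base / "b").write_bytes(b)
    b_loaded = (base / "b").read_bytes()

print(a_loaded == a, b_loaded)
```

Skipping the encode and decode steps is what makes the direct path cheaper when the payload is already a bytes object.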

Warning

The value of relative_path_prefix must be different from the experiment ID. Why? Whenever the experiment is persisted with experiment.persist(), the contents of the experiment ID subdirectory are wiped, consistent with the semantics of the class DataStore. With DataStoreIO, you are in charge of managing the files, including their deletion.
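The risk described in this warning can be reproduced with a toy directory tree. The layout, the file name `blob`, and the placeholder ID `exp-1` are assumptions made for this sketch, and the `shutil.rmtree` call only stands in for the wipe that precedes a persist.

```python
import shutil
import tempfile
from pathlib import Path

id_experiment = "exp-1"  # placeholder experiment ID

with tempfile.TemporaryDirectory() as base_dir:
    base = Path(base_dir)

    # Two self-managed files: one under the experiment ID, one under a separate prefix.
    unsafe = base / id_experiment / "blob"
    safe = base / "self_managed" / id_experiment / "blob"
    for path in (unsafe, safe):
        path.parent.mkdir(parents=True)
        path.write_bytes(b"11111")

    # Stand-in for the wipe before a persist: the experiment ID folder is cleared.
    shutil.rmtree(base / id_experiment)

    unsafe_survived, safe_survived = unsafe.exists(), safe.exists()

print(unsafe_survived, safe_survived)
```

Only the file stored under the separate prefix survives, which is why the example that follows nests the experiment ID under `self_managed/` instead of using it as the prefix itself.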

DataStoreIO example

import glob

import numpy as np

from mltraq import DataStoreIO, create_experiment
from mltraq.utils.fs import tmpdir_ctx

with tmpdir_ctx():
    # Create a new experiment and execute a run
    experiment = create_experiment()
    with experiment.run() as run:
        a = np.zeros(10)
        b = b"11111"
        run.fields.url_a = DataStoreIO.serialize_write(
            a, relative_path_prefix=f"self_managed/{run.id_experiment}"
        ).url
        run.fields.url_b = DataStoreIO.write(
            b, relative_path_prefix=f"self_managed/{run.id_experiment}"
        ).url

    print(f"ID experiment: {experiment.id_experiment}")
    print("--\n")

    print("Contents of datastore directory:")
    # List files in datastore directory
    for pathname in glob.glob("*/**", recursive=True):
        print(pathname)
    print("--\n")

    # Show urls and loaded values
    fields = experiment.runs.first().fields
    print("Field values:")
    print("url_a", fields.url_a)
    print("url_b", fields.url_b)
    print("--\n")

    print("Loaded values:")

    # `a` is read and deserialized
    a = DataStoreIO(fields.url_a).read_deserialize()
    print("a", type(a), a)

    # `b` is loaded directly, no serialization/deserialization
    b = DataStoreIO(fields.url_b).read()
    print("b", type(b), b)

    # Delete directory containing the stored objects
    DataStoreIO.delete("self_managed")

Output

ID experiment: d65df69e-1175-44a5-be2f-2232765703b8
--

Contents of datastore directory:
mltraq.datastore/
mltraq.datastore/self_managed
mltraq.datastore/self_managed/d65df69e-1175-44a5-be2f-2232765703b8
mltraq.datastore/self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba
mltraq.datastore/self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703bb
--

Field values:
url_a file:///self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba
url_b file:///self_managed/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703bb
--

Loaded values:
a <class 'numpy.ndarray'> [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
b <class 'bytes'> b'11111'