State storage#

The state of experiments is persisted to database using the method Experiment.persist(...) in two tables, "experiments" and "experiment_{name}" where name is the sanitized name of the experiment. In this section, we cover the database schema, the compression and the serialization methods.

Tip

The default table names can be changed via options.

List of supported types#

There are four classes of Python object types that can be persisted to database:

NATIVE_DATABASE_TYPES: The values are stored as columns with native database types
BASIC_TYPES: Serialized with DATAPAK format, stored in LargeBinary column
CONTAINER_TYPES: Serialized with DATAPAK format, stored in LargeBinary column
COMPLEX_TYPES: Serialized with DATAPAK format, stored in LargeBinary column

Supported types for database storage

from pprint import pprint

from mltraq.storage.serialization import NATIVE_DATABASE_TYPES
from mltraq.storage.serializers.datapak import (
    BASIC_TYPES,
    COMPLEX_TYPES,
    CONTAINER_TYPES,
)

print("NATIVE_DATABASE_TYPES")
pprint(NATIVE_DATABASE_TYPES, indent=2, width=70)
print("--\n")

print("BASIC_TYPES")
pprint(BASIC_TYPES, indent=2, width=70)
print("--\n")

print("CONTAINER_TYPES")
pprint(CONTAINER_TYPES, indent=2, width=70)
print("--\n")

print("COMPLEX_TYPES")
pprint(COMPLEX_TYPES, indent=2, width=70)

Output

NATIVE_DATABASE_TYPES
[ <class 'bool'>,
  <class 'int'>,
  <class 'numpy.int32'>,
  <class 'numpy.int64'>,
  <class 'float'>,
  <class 'numpy.float32'>,
  <class 'numpy.float64'>,
  <class 'str'>,
  <class 'datetime.time'>,
  <class 'datetime.datetime'>,
  <class 'datetime.date'>,
  <class 'uuid.UUID'>,
  <class 'bytes'>]
--

BASIC_TYPES
[ <class 'bool'>,
  <class 'int'>,
  <class 'float'>,
  <class 'str'>,
  <class 'bytes'>,
  <class 'NoneType'>]
--

CONTAINER_TYPES
[<class 'tuple'>, <class 'list'>, <class 'set'>, <class 'dict'>]
--

COMPLEX_TYPES
[ <class 'mltraq.utils.bunch.Bunch'>,
  <class 'mltraq.utils.sequence.Sequence'>,
  <class 'mltraq.storage.datastore.DataStore'>,
  <class 'mltraq.storage.archivestore.ArchiveStore'>,
  <class 'mltraq.storage.archivestore.Archive'>,
  <class 'pandas.core.frame.DataFrame'>,
  <class 'pandas.core.series.Series'>,
  <class 'pyarrow.lib.Table'>,
  <class 'numpy.ndarray'>,
  <class 'numpy.datetime64'>,
  <class 'uuid.UUID'>]

Tip

Types in NATIVE_DATABASE_TYPES and BASIC_TYPES overlap. E.g., int. Whenever possible, native database types are used.
For fields with type in NATIVE_DATABASE_TYPES, there is no serialization, native SQL column types are used instead to maximize accessibility and enable SQL interoperability. The translation is handled using some of the generic SQLAlchemy types.
Safe objects like numpy.int32 or datetime.date cannot be part of BASIC_TYPES as they depend on the REDUCE Pickle opcode, which is generally dangerous and unsafe.
The class mltraq.utils.bunch.Bunch is an ordered dictionary that mimics howsklearn.utils.Bunch works: It extends dictionaries by enabling values to be accessed by key, bunch["value_key"], or by an attribute, bunch.value_key.
The class mltraq.storage.datastore.DataStore extends Bunch and its values are stored separately, as defined by the datastore strategy. At the moment, the only datastore option is the filesystem. The datastore is recommended to store large objects to limit the size of the database.
The class mltraq.utils.sequence.Sequence models a multidimensional time series, with append and access as a Pandas dataframe.

Persisting complex objects#

In the next example, we persist and reload an experiment with a NumPy array:

Persistence for NumPy arrays

import numpy as np

from mltraq import create_experiment

experiment = create_experiment("example")

with experiment.run() as run:
    run.fields.result = np.array([0.1, 0.2, 0.3])

experiment = experiment.persist().reload()
value = experiment.runs.first().fields.result
print("Type", type(value))
print("Value", value)

Output

Type <class 'numpy.ndarray'>
Value [0.1 0.2 0.3]

The DATAPAK format#

The procedure to serialize and deserialize the types BASIC_TYPES, CONTAINER_TYPES and COMPLEX_TYPES is named DATAPAK and specifies how existing open formats are used together.

Specification#

Serialization#

Python types listed in BASIC_TYPES and CONTAINER_TYPES are serialized with the Python pickle library, allowing only a subset of safe Pickle opcodes.
Python types listed in COMPLEX_TYPES are encoded as regular Python dict objects with one element: the key (type: str) specified the type of the encoded object and the value (type: bytes) represents the encoded object.

An encoded complex type uses CONTAINER_TYPES and BASIC_TYPES, and it can be serialized with Pickle. The COMPLEX_TYPES types can be nested inside CONTAINER_TYPES types.

The encoding of complex objects relies on open formats:
- Arrow IPC format for Pandas and Arrow tables
- NumPy NEP format for NumPy arrays

If requested, the resulting binary blob is compressed (see separate section on this).

Deserialization#

The deserialization applies the inverse procedure of the serialization:

Given a binary blob A, we decompress it, obtaining B. If the blob was not compressed, A == B.
B is by definition a pickled Python object we can safely unpickle, obtaining C.
1. If the type of C is in BASIC_TYPES, the deserialization is complete and we return C.
2. If C is a dict that contains the DATAPAK magic key, we decode it, obtaining and returning D.
3. Otherwise, if C is any of the CONTAINER_TYPES types, we decode it recursively, obtaining and returning D.
4. If an unknown type is encountered, an exception is raised.

Compression#

Compression is optional and its behaviour is controlled via options. By default, the compression is disabled and "zlib" can be specified. If enabled, the serialized object of type bytes is prefixed by a magic string that specifies the compression algorithm:

Supported compression codecs

from pprint import pprint

from mltraq.storage.serializers.serializer import (
    MAGIC_COMPRESSION_PREFIX_MAP,
)

pprint(MAGIC_COMPRESSION_PREFIX_MAP)

Output

{b'C00': <CompressionCodec.uncompressed: 1>, b'C01': <CompressionCodec.zlib: 2>}

The decompression is transparent. If any matching magic prefix is found, the decompression is attempted, returning the input if it fails.

Example#

In this example, we demonstrate how to manually deserialize an experiment field queried from database and containing a NumPy array.

Decompression: The first two bytes contain b'C1' (zlib compression)
Depickling: Complex objects are represented as dictionaries with one key/value pair that describe their encoded contents.
Safe loading of NumPy arrays, without trusting potentially harmful pickled objects.

Example of DATAPAK manual deserialization

import pickle
import zlib
from io import BytesIO
from pprint import pprint

import numpy as np

from mltraq import create_experiment, options

experiment = create_experiment("example")

with experiment.run() as run:
    run.fields.result = np.linspace(0, 100, num=20)

# Set explicitly the compression codec used.
with options().ctx({"serialization.compression.codec": "zlib"}):
    experiment.persist()

serialized = experiment.db.query("SELECT result from experiment_example")[
    "result"
].iloc[0]

print("Serialized:")
pprint(serialized, width=70)
print("\n--")

# Identify compression, if any
print("Compression codec magic prefix:")
pprint(serialized[:2], width=70)
print("\n--")

decompressed = zlib.decompress(serialized[3:])

print("Decompressed:")
pprint(decompressed, width=70)
print("\n--")

depickled = pickle.loads(decompressed)  # noqa: S301

print("Depickled:")
pprint(depickled, width=70)
print("\n--")

memfile = BytesIO()
memfile.write(depickled["value"])
memfile.seek(0)
array = np.load(memfile, allow_pickle=False)

print("NumPy array:")
pprint(array, width=70)
print("\n--")

Output

Serialized:
(b'C01x\x9ck`\x9d\x1a\xc8\xc8\x00\x06\xb5S4z8]\x1cC\x1c\x03\x1c'
 b'\xbdu\r\xa6\xf4\xf0\xe7\x95\xe6\x16T\xea\xe5\xa5$\x16\x15%V\x82'
 b'DX\xcb\x12sJS\xa78)\x00uL\xf6\x0b\xf5\r\x88dd(c\xa8VOI-N.R\xb7R'
 b'P\xb7I\xb3P\xd7QPO\xcb/*)J\xcc\x8b\xcf/JI\x05\x89\xbb%'
 b'\xe6\x14\xa7\x02\xc5\x8b3\x12\x0bR\x81|\r#\x03\x1dM\x1d\x85Z'
 b'\x05\xf2\x01\x17\x03\x14\xdc\x08\x88s\xae\xe4\x15u\x80\xd0'
 b'\xaa\x0e.\x95\xbcOM\xa7\xe8C\xf9\xa6\x0e|@\xde\xf5\x00+\xa8'
 b'\xb8\xbd\xc3\xde\xb6OR\xa7\xb2\x9d\xa0\xf2\xae\x0e\x9f\x81'
 b'<\x8d\xf5\xeePu^\x0e\x9a\xeb\x17\xeei\xfb\xe4\x03U\xef\xef\xb0'
 b"\x01\xc4\x95\n\x84\xea\x0br\x00\xa9^\xb8'\x18\xaa?\xd4\xe1\x19H"
 b'[\\\x18\xd4\x9cp\x07\x88\xab"\x1d\xa6\x94\xea\x01\x00\xa6\x0ek\x05')

--
Compression codec magic prefix:
b'C0'

--
Decompressed:
(b'\x80\x05\x95Q\x01\x00\x00\x00\x00\x00\x00}\x94(\x8c\tDATAPAK-'
 b'0\x94\x8c\x0fnumpy.ndarray-0\x94\x8c\x05value\x94B \x01\x00'
 b"\x00\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': Fal"
 b"se, 'shape': (20,), }                                           "
 b'                \n\x00\x00\x00\x00\x00\x00\x00\x00\xd8P^Cy\r\x15'
 b'@\xd8P^Cy\r%@Dy\r\xe55\x94/@\xd8P^Cy\r5@\x0e\xe55\x94\xd7P:@Dy\r'
 b'\xe55\x94?@\xbd\x86\xf2\x1a\xcakB@\xd8P^Cy\rE@\xf3\x1a\xcak(\xafG'
 b'@\x0e\xe55\x94\xd7PJ@)\xaf\xa1\xbc\x86\xf2L@Dy\r\xe55\x94O'
 b'@\xb0\xa1\xbc\x86\xf2\x1aQ@\xbd\x86\xf2\x1a\xcakR@\xcak('
 b'\xaf\xa1\xbcS@\xd8P^Cy\rU@\xe65\x94\xd7P^V@\xf3\x1a\xcak(\xafW'
 b'@\x00\x00\x00\x00\x00\x00Y@\x94u.')

--
Depickled:
{'DATAPAK-0': 'numpy.ndarray-0',
 'value': b"\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': "
          b"False, 'shape': (20,), }                                "
          b'                           \n\x00\x00\x00\x00'
          b'\x00\x00\x00\x00\xd8P^Cy\r\x15@\xd8P^Cy\r%@Dy\r\xe5'
          b'5\x94/@\xd8P^Cy\r5@\x0e\xe55\x94\xd7P:@Dy\r\xe55\x94?@'
          b'\xbd\x86\xf2\x1a\xcakB@\xd8P^Cy\rE@\xf3\x1a\xcak(\xafG@'
          b'\x0e\xe55\x94\xd7PJ@)\xaf\xa1\xbc\x86\xf2L@Dy\r\xe5'
          b'5\x94O@\xb0\xa1\xbc\x86\xf2\x1aQ@\xbd\x86\xf2\x1a\xcakR@'
          b'\xcak(\xaf\xa1\xbcS@\xd8P^Cy\rU@\xe65\x94\xd7P^V@'
          b'\xf3\x1a\xcak(\xafW@\x00\x00\x00\x00\x00\x00Y@'}

--
NumPy array:
array([  0.        ,   5.26315789,  10.52631579,  15.78947368,
        21.05263158,  26.31578947,  31.57894737,  36.84210526,
        42.10526316,  47.36842105,  52.63157895,  57.89473684,
        63.15789474,  68.42105263,  73.68421053,  78.94736842,
        84.21052632,  89.47368421,  94.73684211, 100.        ])

--

Handling of unsupported types#

If we need a type not currently supported, we can always encode/decode to field of type bytes, which is supported. If we try to serialize unsupported types, an exception is raised:

Handling of unsupported types

from pprint import pprint

from mltraq import create_experiment
from mltraq.storage.serializers.datapak import UnsupportedObjectType


class SomethingNotSupported:
    pass


experiment = create_experiment("example")


with experiment.run() as run:
    run.fields.failing = SomethingNotSupported()

try:
    experiment.persist()
except UnsupportedObjectType as e:
    pprint(str(e), width=70)

Output

('DataPakSerializer does not support type <class '
 "'__main__.local_python_func.<locals>.SomethingNotSupported'>")

Storing large artifacts#

The Datastore interface is designed to facilitate the storage and reloading of large objects such as datasets, weights and models. See its separate article for a comprehensive discussion.

Unsafe pickling#

It is possible, but not advised, to pickle/unpickle complete Experiment objects.

Safe with limitations

This procedure is safe if experiments are produced, stored and reloaded locally. It is generally much faster than trying to store the state in a controlled way. However, changes in the environment (architecture/package versions) might result in broken reloads and data loss, making it unsuitable for long-term storage.

In the following, we demonstrate how an unsafe object can be stored, pickled and unpickled in the run.state dictionary. The run.steps dictionary is always safeguarded form unsafe objects, and cannot be used.

Danger

Upon loading the experiment from database, the method SomethingUnsafe.__setstate__ is evaluated as part of the unpickling procedure, with potential harmful instructions.

Unsafe unpickling

from mltraq import create_session


class SomethingUnsafe:
    def __getstate__(self):
        return {}

    def __setstate__(self, state):
        print("__setstate__")


# Create session and experiment
session = create_session()
experiment = session.create_experiment("example")

with experiment.run() as run:
    # add field value with unsafe type
    run.state.unsafe = SomethingUnsafe()

# Persist experiment, pickling the Experiment object
experiment.persist(store_unsafe_pickle=True)

print("Reloading the pickled experiment from database")
experiment = session.load_experiment("example", unsafe_pickle=True)

Output

Reloading the pickled experiment from database
__setstate__

Info

The serialization format and more details on the persistence logic are presented in more detail at State storage.

Database schema#

Experiments are persisted on two tables:

Table `"experiments"`#

id_experiment: UUID of the experiment
name: Name of the experiment, by default a 6-alphanum hash of id_experiment
meta: Serialized dictionary with properties on experiment and runs columns
fields: Experiment fields as set by the user
unsafe_pickle: Pickled Experiment object, disabled by default

Tables `"experiment_xyz"`#

Each experiment is associated to a dedicated table, ending with its sanitized name. E.g., "experiment_xyz" if name=="xyz". There are only two fixed columns:

id_experiment: UUID of the experiment
id_run: UUID of the run

Additional columns are named as the keys in the run.fields dictionary present in all runs of the experiment. Each row represents a run of the experiment.

Columns either use the native database SQL type, or DATAPAK.

Tip

In case of experiments persisted with experiment.persist(store_unsafe_pickle=True) and loaded with experiment.load(unsafe_pickle=True), the experiment is also persisted as a binary blob which is unpickled upon loading (including its runs). Having the pickled blob does not limit/interfere with the regular storage semantics: the fields column in the experiments table, as well as the individual experiment tables, continue to operate as expected, and does not depend on the pickled object. This guarantees an extra level of interoperability and accessibility for the fields dictionaries.

State storage#

List of supported types#

Persisting complex objects#

The DATAPAK format#

Specification#

Serialization#

Deserialization#

Compression#

Example#

Handling of unsupported types#

Storing large artifacts#

Unsafe pickling#

Database schema#

Table "experiments"#

Tables "experiment_xyz"#

Table `"experiments"`#

Tables `"experiment_xyz"`#