State storage#
The state of experiments is persisted to the database using the method `Experiment.persist(...)` in two tables, `"experiments"` and `"experiment_{name}"`, where `name` is the sanitized name of the experiment. In this section, we cover the database schema, the compression, and the serialization methods.
Tip
The default table names can be changed via options.
List of supported types#
There are four classes of Python object types that can be persisted to the database:

- `NATIVE_DATABASE_TYPES`: The values are stored as columns with native database types
- `BASIC_TYPES`: Serialized with DATAPAK format, stored in `LargeBinary` column
- `CONTAINER_TYPES`: Serialized with DATAPAK format, stored in `LargeBinary` column
- `COMPLEX_TYPES`: Serialized with DATAPAK format, stored in `LargeBinary` column
Supported types for database storage
NATIVE_DATABASE_TYPES
[ <class 'bool'>,
<class 'int'>,
<class 'numpy.int32'>,
<class 'numpy.int64'>,
<class 'float'>,
<class 'numpy.float32'>,
<class 'numpy.float64'>,
<class 'str'>,
<class 'datetime.time'>,
<class 'datetime.datetime'>,
<class 'datetime.date'>,
<class 'uuid.UUID'>,
<class 'bytes'>]
--
BASIC_TYPES
[ <class 'bool'>,
<class 'int'>,
<class 'float'>,
<class 'str'>,
<class 'bytes'>,
<class 'NoneType'>]
--
CONTAINER_TYPES
[<class 'tuple'>, <class 'list'>, <class 'set'>, <class 'dict'>]
--
COMPLEX_TYPES
[ <class 'mltraq.utils.bunch.Bunch'>,
<class 'mltraq.utils.sequence.Sequence'>,
<class 'mltraq.storage.datastore.DataStore'>,
<class 'mltraq.storage.archivestore.ArchiveStore'>,
<class 'mltraq.storage.archivestore.Archive'>,
<class 'pandas.core.frame.DataFrame'>,
<class 'pandas.core.series.Series'>,
<class 'pyarrow.lib.Table'>,
<class 'numpy.ndarray'>,
<class 'numpy.datetime64'>,
<class 'uuid.UUID'>]
Tip
- Types in `NATIVE_DATABASE_TYPES` and `BASIC_TYPES` overlap, e.g., `int`. Whenever possible, native database types are used.
- For fields with type in `NATIVE_DATABASE_TYPES`, there is no serialization; native SQL column types are used instead to maximize accessibility and enable SQL interoperability. The translation is handled using some of the generic SQLAlchemy types.
- Safe objects like `numpy.int32` or `datetime.date` cannot be part of `BASIC_TYPES`, as they depend on the `REDUCE` Pickle opcode, which is generally dangerous and unsafe.
- The class `mltraq.utils.bunch.Bunch` is an ordered dictionary that mimics how `sklearn.utils.Bunch` works: it extends dictionaries by enabling values to be accessed by key, `bunch["value_key"]`, or by attribute, `bunch.value_key` (see the sketch after this list).
- The class `mltraq.storage.datastore.DataStore` extends `Bunch`, and its values are stored separately, as defined by the datastore strategy. At the moment, the only datastore option is the filesystem. The datastore is recommended for storing large objects, to limit the size of the database.
- The class `mltraq.utils.sequence.Sequence` models a multidimensional time series, with `append` and access as a Pandas dataframe.
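A minimal sketch of the `Bunch` access pattern described above (assuming only the dict-or-attribute behavior documented here, not the full API):

```python
from mltraq.utils.bunch import Bunch

bunch = Bunch()
bunch.value_key = 42             # set by attribute ...
assert bunch["value_key"] == 42  # ... read by key, and vice versa

bunch["other"] = "abc"
assert bunch.other == "abc"
```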
Persisting complex objects#
In the next example, we persist and reload an experiment with a NumPy array:
Persistence for NumPy arrays
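A minimal sketch of what such an example could look like; the session helpers (`create_session`, `load_experiment`) and the `with experiment.run()` idiom are assumptions based on the MLtraq quickstart, not verbatim from this page:

```python
import numpy as np
from mltraq import create_session

# Session on a SQLite database; the experiment state will be persisted there.
session = create_session("sqlite:///mltraq.db")
experiment = session.create_experiment("example")

# Track a NumPy array as a run field (a COMPLEX_TYPES value, DATAPAK-encoded).
with experiment.run() as run:
    run.fields.array = np.linspace(0, 100, 20)

experiment.persist()

# Reload the experiment: the array field is deserialized transparently.
reloaded = session.load_experiment("example")  # loading helper name assumed
```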
The DATAPAK format#
The procedure to serialize and deserialize the types `BASIC_TYPES`, `CONTAINER_TYPES` and `COMPLEX_TYPES` is named DATAPAK and specifies how existing open formats are used together.
Specification#
Serialization#
- Python types listed in `BASIC_TYPES` and `CONTAINER_TYPES` are serialized with the Python pickle library, allowing only a subset of safe Pickle opcodes.

- Python types listed in `COMPLEX_TYPES` are encoded as regular Python `dict` objects with one element: the key (type: `str`) specifies the type of the encoded object and the value (type: `bytes`) represents the encoded object.

  An encoded complex type uses `CONTAINER_TYPES` and `BASIC_TYPES`, and it can be serialized with Pickle. The `COMPLEX_TYPES` types can be nested inside `CONTAINER_TYPES` types.

  The encoding of complex objects relies on open formats:

  - Arrow IPC format for Pandas and Arrow tables
  - NumPy NEP format for NumPy arrays

If requested, the resulting binary blob is compressed (see the separate section on compression).
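To make the encoding concrete, here is a sketch of the `dict` wrapper applied to a NumPy array (the magic key names `DATAPAK-0` and `numpy.ndarray-0` are taken from the decoded example later on this page; the real implementation may differ in its details):

```python
import io
import pickle

import numpy as np

array = np.linspace(0, 100, 20)

# Encode the array with the NumPy NEP (.npy) format, an open format.
buffer = io.BytesIO()
np.save(buffer, array)

# Wrap it as a dict: the magic key describes the type, "value" holds the bytes.
encoded = {"DATAPAK-0": "numpy.ndarray-0", "value": buffer.getvalue()}

# dict/str/bytes are CONTAINER_TYPES/BASIC_TYPES: safe to pickle.
blob = pickle.dumps(encoded)

# Decoding reverses the steps, loading the array without trusting pickle.
decoded = pickle.loads(blob)
restored = np.load(io.BytesIO(decoded["value"]), allow_pickle=False)
assert np.array_equal(array, restored)
```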
Deserialization#
The deserialization applies the inverse procedure of the serialization:

- Given a binary blob `A`, we decompress it, obtaining `B`. If the blob was not compressed, `A == B`.
- `B` is by definition a pickled Python object we can safely unpickle, obtaining `C`.
- If the type of `C` is in `BASIC_TYPES`, the deserialization is complete and we return `C`.
- If `C` is a `dict` that contains the DATAPAK magic key, we decode it, obtaining and returning `D`.
- Otherwise, if `C` is any of the `CONTAINER_TYPES` types, we decode it recursively, obtaining and returning `D`.
- If an unknown type is encountered, an exception is raised.
Compression#
Compression is optional, and its behaviour is controlled via options. By default, compression is disabled; `"zlib"` can be specified. If enabled, the serialized object of type `bytes` is prefixed by a magic string that specifies the compression algorithm:

Supported compression codecs

The decompression is transparent: if a matching magic prefix is found, decompression is attempted, returning the input unchanged if it fails.
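A sketch of this transparent rule (the `b"C0"` prefix value is an assumption, taken from the example output below):

```python
import zlib

def decompress_if_prefixed(blob: bytes, prefix: bytes = b"C0") -> bytes:
    """Attempt decompression if the magic prefix matches; fall back to the input."""
    if blob.startswith(prefix):
        try:
            return zlib.decompress(blob[len(prefix):])
        except zlib.error:
            pass  # not actually compressed data: return the input unchanged
    return blob
```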
Example#
In this example, we demonstrate how to manually deserialize an experiment field queried from the database and containing a NumPy array:

- Decompression: the blob starts with the magic prefix `b'C0'` (zlib compression).
- Depickling: complex objects are represented as dictionaries whose key/value pairs describe their encoded contents.
- Safe loading of NumPy arrays, without trusting potentially harmful pickled objects.
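A sketch of these three steps, matching the output below (the prefix handling is an assumption; here we locate the zlib header `b"\x78\x9c"` instead of hard-coding the prefix length):

```python
import io
import pickle
import zlib

import numpy as np

def manual_deserialize(serialized: bytes) -> np.ndarray:
    # 1) Decompression: skip the codec magic prefix and inflate.
    start = serialized.index(b"\x78\x9c")  # default zlib header
    depickled = pickle.loads(zlib.decompress(serialized[start:]))

    # 2) Depickling yields a dict describing the encoded object, e.g.
    #    {'DATAPAK-0': 'numpy.ndarray-0', 'value': b'\x93NUMPY...'}.
    assert depickled.get("DATAPAK-0") == "numpy.ndarray-0"

    # 3) Safe loading of the NumPy array from the NEP (.npy) payload.
    return np.load(io.BytesIO(depickled["value"]), allow_pickle=False)
```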
Example of DATAPAK manual deserialization
Serialized:
(b'C01x\x9ck`\x9d\x1a\xc8\xc8\x00\x06\xb5S4z8]\x1cC\x1c\x03\x1c'
b'\xbdu\r\xa6\xf4\xf0\xe7\x95\xe6\x16T\xea\xe5\xa5$\x16\x15%V\x82'
b'DX\xcb\x12sJS\xa78)\x00uL\xf6\x0b\xf5\r\x88dd(c\xa8VOI-N.R\xb7R'
b'P\xb7I\xb3P\xd7QPO\xcb/*)J\xcc\x8b\xcf/JI\x05\x89\xbb%'
b'\xe6\x14\xa7\x02\xc5\x8b3\x12\x0bR\x81|\r#\x03\x1dM\x1d\x85Z'
b'\x05\xf2\x01\x17\x03\x14\xdc\x08\x88s\xae\xe4\x15u\x80\xd0'
b'\xaa\x0e.\x95\xbcOM\xa7\xe8C\xf9\xa6\x0e|@\xde\xf5\x00+\xa8'
b'\xb8\xbd\xc3\xde\xb6OR\xa7\xb2\x9d\xa0\xf2\xae\x0e\x9f\x81'
b'<\x8d\xf5\xeePu^\x0e\x9a\xeb\x17\xeei\xfb\xe4\x03U\xef\xef\xb0'
b"\x01\xc4\x95\n\x84\xea\x0br\x00\xa9^\xb8'\x18\xaa?\xd4\xe1\x19H"
b'[\\\x18\xd4\x9cp\x07\x88\xab"\x1d\xa6\x94\xea\x01\x00\xa6\x0ek\x05')
--
Compression codec magic prefix:
b'C0'
--
Decompressed:
(b'\x80\x05\x95Q\x01\x00\x00\x00\x00\x00\x00}\x94(\x8c\tDATAPAK-'
b'0\x94\x8c\x0fnumpy.ndarray-0\x94\x8c\x05value\x94B \x01\x00'
b"\x00\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': Fal"
b"se, 'shape': (20,), } "
b' \n\x00\x00\x00\x00\x00\x00\x00\x00\xd8P^Cy\r\x15'
b'@\xd8P^Cy\r%@Dy\r\xe55\x94/@\xd8P^Cy\r5@\x0e\xe55\x94\xd7P:@Dy\r'
b'\xe55\x94?@\xbd\x86\xf2\x1a\xcakB@\xd8P^Cy\rE@\xf3\x1a\xcak(\xafG'
b'@\x0e\xe55\x94\xd7PJ@)\xaf\xa1\xbc\x86\xf2L@Dy\r\xe55\x94O'
b'@\xb0\xa1\xbc\x86\xf2\x1aQ@\xbd\x86\xf2\x1a\xcakR@\xcak('
b'\xaf\xa1\xbcS@\xd8P^Cy\rU@\xe65\x94\xd7P^V@\xf3\x1a\xcak(\xafW'
b'@\x00\x00\x00\x00\x00\x00Y@\x94u.')
--
Depickled:
{'DATAPAK-0': 'numpy.ndarray-0',
'value': b"\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': "
b"False, 'shape': (20,), } "
b' \n\x00\x00\x00\x00'
b'\x00\x00\x00\x00\xd8P^Cy\r\x15@\xd8P^Cy\r%@Dy\r\xe5'
b'5\x94/@\xd8P^Cy\r5@\x0e\xe55\x94\xd7P:@Dy\r\xe55\x94?@'
b'\xbd\x86\xf2\x1a\xcakB@\xd8P^Cy\rE@\xf3\x1a\xcak(\xafG@'
b'\x0e\xe55\x94\xd7PJ@)\xaf\xa1\xbc\x86\xf2L@Dy\r\xe5'
b'5\x94O@\xb0\xa1\xbc\x86\xf2\x1aQ@\xbd\x86\xf2\x1a\xcakR@'
b'\xcak(\xaf\xa1\xbcS@\xd8P^Cy\rU@\xe65\x94\xd7P^V@'
b'\xf3\x1a\xcak(\xafW@\x00\x00\x00\x00\x00\x00Y@'}
--
NumPy array:
array([ 0. , 5.26315789, 10.52631579, 15.78947368,
21.05263158, 26.31578947, 31.57894737, 36.84210526,
42.10526316, 47.36842105, 52.63157895, 57.89473684,
63.15789474, 68.42105263, 73.68421053, 78.94736842,
84.21052632, 89.47368421, 94.73684211, 100. ])
--
Handling of unsupported types#
If we need a type that is not currently supported, we can always encode/decode it to a field of type `bytes`, which is supported. If we try to serialize unsupported types, an exception is raised:
Handling of unsupported types
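For example, a minimal sketch of a user-defined encoding for `complex` values (a type absent from the lists above; the field name in the comment is hypothetical):

```python
import json

def encode(value: complex) -> bytes:
    return json.dumps({"re": value.real, "im": value.imag}).encode()

def decode(blob: bytes) -> complex:
    d = json.loads(blob.decode())
    return complex(d["re"], d["im"])

# `bytes` is in BASIC_TYPES, so the encoded value can be stored as a field:
# run.fields.my_value = encode(3 + 4j)
assert decode(encode(3 + 4j)) == 3 + 4j
```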
Storing large artifacts#
The Datastore interface is designed to facilitate the storage and reloading of large objects such as datasets, weights and models. See its separate article for a comprehensive discussion.
Unsafe pickling#
It is possible, but not advised, to pickle/unpickle complete Experiment objects.
Safe with limitations
This procedure is safe if experiments are produced, stored and reloaded locally. It is generally much faster than trying to store the state in a controlled way. However, changes in the environment (architecture/package versions) might result in broken reloads and data loss, making it unsuitable for long-term storage.
In the following, we demonstrate how an unsafe object can be stored, pickled and unpickled in the run.state
dictionary. The run.fields
dictionary is always safeguarded from unsafe objects, and cannot be used.
Danger
Upon loading the experiment from the database, the method `SomethingUnsafe.__setstate__` is evaluated as part of the unpickling procedure, with potentially harmful instructions.
Unsafe unpickling
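A sketch of the hazard (with `SomethingUnsafe` as named in the warning above; here it only prints, but `__setstate__` could execute anything):

```python
import pickle

class SomethingUnsafe:
    def __init__(self):
        self.payload = "anything"  # non-empty state, so __setstate__ is invoked

    def __setstate__(self, state):
        # Evaluated automatically during unpickling: arbitrary code could run here.
        print("__setstate__!")

blob = pickle.dumps(SomethingUnsafe())
obj = pickle.loads(blob)  # prints "__setstate__!"
```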
Info
The serialization format and the persistence logic are presented in more detail above, under State storage.
Database schema#
Experiments are persisted in two tables:
Table "experiments"#

- `id_experiment`: UUID of the experiment
- `name`: Name of the experiment, by default a 6-alphanum hash of `id_experiment`
- `meta`: Serialized dictionary with properties on the experiment and `runs` columns
- `fields`: Experiment `fields` as set by the user
- `unsafe_pickle`: Pickled `Experiment` object, disabled by default
Tables "experiment_xyz"
#
Each experiment is associated to a dedicated table, ending with its sanitized name. E.g., "experiment_xyz"
if name=="xyz"
. There are only two fixed columns:
id_experiment
: UUID of the experimentid_run
: UUID of the run
Additional columns are named as the keys in the run.fields
dictionary present in all runs of the experiment.
Each row represents a run
of the experiment.
Columns either use the native database SQL type, or DATAPAK.
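Because native-typed fields are stored as plain SQL columns, they remain queryable without mltraq. A sketch (the database URL, table name and `accuracy` column are hypothetical):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///mltraq.db")

# One row per run; fields with native database types are regular SQL columns.
df = pd.read_sql("SELECT id_run, accuracy FROM experiment_xyz", engine)
```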
Tip
In case of experiments persisted with `experiment.persist(store_unsafe_pickle=True)` and loaded with `experiment.load(unsafe_pickle=True)`, the experiment is also persisted as a binary blob, which is unpickled upon loading (including its runs). Having the pickled blob does not limit or interfere with the regular storage semantics: the `fields` column in the `experiments` table, as well as the individual `experiment` tables, continue to operate as expected and do not depend on the pickled object. This guarantees an extra level of interoperability and accessibility for the `fields` dictionaries.