State storage#
The state of experiments is persisted to the database using the method Experiment.persist(...) in two tables, "experiments" and "experiment_{name}", where name is the sanitized name of the experiment. In this section, we cover the database schema and the compression and serialization methods.
Tip
The default table names can be changed via options.
List of supported types#
There are four classes of Python object types that can be persisted to the database:
- NATIVE_DATABASE_TYPES: Values are stored as columns with native database types
- BASIC_TYPES: Serialized with the DATAPAK format, stored in a LargeBinary column
- CONTAINER_TYPES: Serialized with the DATAPAK format, stored in a LargeBinary column
- COMPLEX_TYPES: Serialized with the DATAPAK format, stored in a LargeBinary column
Supported types for database storage
NATIVE_DATABASE_TYPES
[ <class 'bool'>,
<class 'int'>,
<class 'numpy.int32'>,
<class 'numpy.int64'>,
<class 'float'>,
<class 'numpy.float32'>,
<class 'numpy.float64'>,
<class 'str'>,
<class 'datetime.time'>,
<class 'datetime.datetime'>,
<class 'datetime.date'>,
<class 'uuid.UUID'>,
<class 'bytes'>]
--
BASIC_TYPES
[ <class 'bool'>,
<class 'int'>,
<class 'float'>,
<class 'str'>,
<class 'bytes'>,
<class 'NoneType'>]
--
CONTAINER_TYPES
[<class 'tuple'>, <class 'list'>, <class 'set'>, <class 'dict'>]
--
COMPLEX_TYPES
[ <class 'mltraq.utils.bunch.Bunch'>,
<class 'mltraq.utils.sequence.Sequence'>,
<class 'mltraq.storage.datastore.DataStore'>,
<class 'mltraq.storage.archivestore.ArchiveStore'>,
<class 'mltraq.storage.archivestore.Archive'>,
<class 'pandas.core.frame.DataFrame'>,
<class 'pandas.core.series.Series'>,
<class 'pyarrow.lib.Table'>,
<class 'numpy.ndarray'>,
<class 'numpy.datetime64'>,
<class 'uuid.UUID'>]
Tip
- Types in NATIVE_DATABASE_TYPES and BASIC_TYPES overlap. E.g., int. Whenever possible, native database types are used.
- For fields with type in NATIVE_DATABASE_TYPES, there is no serialization; native SQL column types are used instead to maximize accessibility and enable SQL interoperability. The translation is handled using some of the generic SQLAlchemy types.
- Safe objects like numpy.int32 or datetime.date cannot be part of BASIC_TYPES as they depend on the REDUCE Pickle opcode, which is generally dangerous and unsafe.
- The class mltraq.utils.bunch.Bunch is an ordered dictionary that mimics how sklearn.utils.Bunch works: it extends dictionaries by enabling values to be accessed by key, bunch["value_key"], or by attribute, bunch.value_key (see the example after this list).
- The class mltraq.storage.datastore.DataStore extends Bunch and its values are stored separately, as defined by the datastore strategy. At the moment, the only datastore option is the filesystem. The datastore is recommended for storing large objects, limiting the size of the database.
- The class mltraq.utils.sequence.Sequence models a multidimensional time series, with append and access as a Pandas dataframe.
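For instance, a minimal illustration of the dual access offered by Bunch (import path as listed above):

from mltraq.utils.bunch import Bunch

bunch = Bunch()
bunch.value_key = 42             # set via attribute ...
assert bunch["value_key"] == 42  # ... and read via key: same entry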
Persisting complex objects#
In the next example, we persist and reload an experiment with a NumPy array:
Persistence for NumPy arrays
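A minimal sketch of this round trip; the experiment name and field name are placeholders, and the session load call is an assumption modeled on the persist API described above:

import mltraq
import numpy as np

# Create a session (SQLite by default) and an experiment.
session = mltraq.create_session()
experiment = session.create_experiment("example")

# Track a NumPy array as an experiment field and persist it.
experiment.fields.array = np.linspace(0, 100, 20)
experiment.persist()

# Reload it by name (method name assumed) and access the field.
reloaded = session.load_experiment("example")
print(reloaded.fields.array)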
The DATAPAK format#
The procedure used to serialize and deserialize the types BASIC_TYPES, CONTAINER_TYPES and COMPLEX_TYPES is named DATAPAK; it specifies how existing open formats are used together.
Specification#
Serialization#
- Python types listed in BASIC_TYPES and CONTAINER_TYPES are serialized with the Python pickle library, allowing only a subset of safe Pickle opcodes.
- Python types listed in COMPLEX_TYPES are encoded as regular Python dict objects with one element: the key (type str) specifies the type of the encoded object and the value (type bytes) represents the encoded object. An encoded complex type uses only CONTAINER_TYPES and BASIC_TYPES, so it can be serialized with Pickle. COMPLEX_TYPES types can be nested inside CONTAINER_TYPES types.

The encoding of complex objects relies on open formats:
- Arrow IPC format for Pandas and Arrow tables
- NumPy NEP format for NumPy arrays
If requested, the resulting binary blob is compressed (see separate section on this).
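To make the encoding step concrete, here is a simplified sketch for a NumPy array; the magic key and type name mirror the depickled output shown later in this section, and plain pickle.dumps stands in for MLtraq's opcode-restricted pickling:

import io
import pickle

import numpy as np

def encode_ndarray(array: np.ndarray) -> bytes:
    # Encode the array payload with the NumPy NEP (.npy) format, no pickle.
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    # Wrap it in a dict of basic types; key names mirror the example below.
    encoded = {"DATAPAK-0": "numpy.ndarray-0", "value": buffer.getvalue()}
    # A dict holding only str/bytes values can be pickled with safe opcodes.
    return pickle.dumps(encoded)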
Deserialization#
The deserialization applies the inverse procedure of the serialization:
- Given a binary blob A, we decompress it, obtaining B. If the blob was not compressed, A == B.
- B is by definition a pickled Python object that we can safely unpickle, obtaining C.
- If the type of C is in BASIC_TYPES, the deserialization is complete and we return C.
- If C is a dict that contains the DATAPAK magic key, we decode it, obtaining and returning D.
- Otherwise, if C is any of the CONTAINER_TYPES types, we decode it recursively, obtaining and returning D.
- If an unknown type is encountered, an exception is raised.
Compression#
Compression is optional and its behaviour is controlled via options.
By default, compression is disabled; "zlib" can be specified to enable it.
If enabled, the serialized object of type bytes is prefixed by a magic string that specifies the compression algorithm:
Supported compression codecs
The decompression is transparent: if a matching magic prefix is found, decompression is attempted, returning the input unchanged if it fails.
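A minimal sketch of this behaviour, assuming the b'C0' zlib prefix shown in the example below:

import zlib

ZLIB_PREFIX = b"C0"  # assumed prefix, taken from the example below

def maybe_decompress(blob: bytes) -> bytes:
    # Attempt decompression only if the magic prefix matches.
    if blob.startswith(ZLIB_PREFIX):
        try:
            return zlib.decompress(blob[len(ZLIB_PREFIX):])
        except zlib.error:
            pass  # prefix collision: the blob was not compressed
    return blob  # transparent: input returned unchanged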
Example#
In this example, we demonstrate how to manually deserialize an experiment field queried from the database and containing a NumPy array:
- Decompression: The first two bytes contain the magic prefix b'C0' (zlib compression)
- Depickling: Complex objects are represented as dictionaries whose key/value pairs describe their encoded contents
- Safe loading of NumPy arrays, without trusting potentially harmful pickled objects
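These three steps can be reproduced with a short sketch; plain pickle.loads stands in for MLtraq's opcode-restricted unpickling, and the prefix and dict layout are taken from the output below:

import io
import pickle
import zlib

import numpy as np

def decode_ndarray(blob: bytes) -> np.ndarray:
    # 1. Decompression: strip the b"C0" magic prefix and inflate.
    if blob.startswith(b"C0"):
        blob = zlib.decompress(blob[2:])
    # 2. Depickling: MLtraq allows only safe opcodes at this step;
    #    plain pickle.loads is used here for brevity.
    obj = pickle.loads(blob)
    # 3. Safe loading of the NumPy array from its NEP-encoded bytes,
    #    refusing any embedded pickled objects.
    return np.load(io.BytesIO(obj["value"]), allow_pickle=False)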
Example of DATAPAK manual deserialization
Serialized:
(b'C01x\x9ck`\x9d\x1a\xc8\xc8\x00\x06\xb5S4z8]\x1cC\x1c\x03\x1c'
b'\xbdu\r\xa6\xf4\xf0\xe7\x95\xe6\x16T\xea\xe5\xa5$\x16\x15%V\x82'
b'DX\xcb\x12sJS\xa78)\x00uL\xf6\x0b\xf5\r\x88dd(c\xa8VOI-N.R\xb7R'
b'P\xb7I\xb3P\xd7QPO\xcb/*)J\xcc\x8b\xcf/JI\x05\x89\xbb%'
b'\xe6\x14\xa7\x02\xc5\x8b3\x12\x0bR\x81|\r#\x03\x1dM\x1d\x85Z'
b'\x05\xf2\x01\x17\x03\x14\xdc\x08\x88s\xae\xe4\x15u\x80\xd0'
b'\xaa\x0e.\x95\xbcOM\xa7\xe8C\xf9\xa6\x0e|@\xde\xf5\x00+\xa8'
b'\xb8\xbd\xc3\xde\xb6OR\xa7\xb2\x9d\xa0\xf2\xae\x0e\x9f\x81'
b'<\x8d\xf5\xeePu^\x0e\x9a\xeb\x17\xeei\xfb\xe4\x03U\xef\xef\xb0'
b"\x01\xc4\x95\n\x84\xea\x0br\x00\xa9^\xb8'\x18\xaa?\xd4\xe1\x19H"
b'[\\\x18\xd4\x9cp\x07\x88\xab"\x1d\xa6\x94\xea\x01\x00\xa6\x0ek\x05')
--
Compression codec magic prefix:
b'C0'
--
Decompressed:
(b'\x80\x05\x95Q\x01\x00\x00\x00\x00\x00\x00}\x94(\x8c\tDATAPAK-'
b'0\x94\x8c\x0fnumpy.ndarray-0\x94\x8c\x05value\x94B \x01\x00'
b"\x00\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': Fal"
b"se, 'shape': (20,), } "
b' \n\x00\x00\x00\x00\x00\x00\x00\x00\xd8P^Cy\r\x15'
b'@\xd8P^Cy\r%@Dy\r\xe55\x94/@\xd8P^Cy\r5@\x0e\xe55\x94\xd7P:@Dy\r'
b'\xe55\x94?@\xbd\x86\xf2\x1a\xcakB@\xd8P^Cy\rE@\xf3\x1a\xcak(\xafG'
b'@\x0e\xe55\x94\xd7PJ@)\xaf\xa1\xbc\x86\xf2L@Dy\r\xe55\x94O'
b'@\xb0\xa1\xbc\x86\xf2\x1aQ@\xbd\x86\xf2\x1a\xcakR@\xcak('
b'\xaf\xa1\xbcS@\xd8P^Cy\rU@\xe65\x94\xd7P^V@\xf3\x1a\xcak(\xafW'
b'@\x00\x00\x00\x00\x00\x00Y@\x94u.')
--
Depickled:
{'DATAPAK-0': 'numpy.ndarray-0',
'value': b"\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': "
b"False, 'shape': (20,), } "
b' \n\x00\x00\x00\x00'
b'\x00\x00\x00\x00\xd8P^Cy\r\x15@\xd8P^Cy\r%@Dy\r\xe5'
b'5\x94/@\xd8P^Cy\r5@\x0e\xe55\x94\xd7P:@Dy\r\xe55\x94?@'
b'\xbd\x86\xf2\x1a\xcakB@\xd8P^Cy\rE@\xf3\x1a\xcak(\xafG@'
b'\x0e\xe55\x94\xd7PJ@)\xaf\xa1\xbc\x86\xf2L@Dy\r\xe5'
b'5\x94O@\xb0\xa1\xbc\x86\xf2\x1aQ@\xbd\x86\xf2\x1a\xcakR@'
b'\xcak(\xaf\xa1\xbcS@\xd8P^Cy\rU@\xe65\x94\xd7P^V@'
b'\xf3\x1a\xcak(\xafW@\x00\x00\x00\x00\x00\x00Y@'}
--
NumPy array:
array([ 0. , 5.26315789, 10.52631579, 15.78947368,
21.05263158, 26.31578947, 31.57894737, 36.84210526,
42.10526316, 47.36842105, 52.63157895, 57.89473684,
63.15789474, 68.42105263, 73.68421053, 78.94736842,
84.21052632, 89.47368421, 94.73684211, 100. ])
--
Handling of unsupported types#
If we need a type that is not currently supported, we can always encode/decode it to a field of type bytes,
which is supported. If we try to serialize an unsupported type, an exception is raised:
Handling of unsupported types
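As a hypothetical workaround, a complex number (not in any of the supported type lists) can be round-tripped through a bytes field:

import numpy as np

value = 3 + 4j  # complex is not a supported type

# Encode to bytes (a supported BASIC_TYPE) before storing the field ...
encoded = np.complex128(value).tobytes()
# ... and decode it after reloading.
decoded = complex(np.frombuffer(encoded, dtype=np.complex128)[0])
assert decoded == value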
Storing large artifacts#
The Datastore interface is designed to facilitate the storage and reloading of large objects such as datasets, weights and models. See its separate article for a comprehensive discussion.
Unsafe pickling#
It is possible, but not advised, to pickle/unpickle complete Experiment objects.
Safe with limitations
This procedure is safe if experiments are produced, stored and reloaded locally. It is generally much faster than trying to store the state in a controlled way. However, changes in the environment (architecture/package versions) might result in broken reloads and data loss, making it unsuitable for long-term storage.
In the following, we demonstrate how an unsafe object can be stored, pickled and unpickled via the run.state dictionary. The run.fields dictionary is always safeguarded from unsafe objects and cannot be used for this purpose.
Danger
Upon loading the experiment from the database, the method SomethingUnsafe.__setstate__ is evaluated as part of the unpickling procedure, with potentially harmful instructions.
Unsafe unpickling
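A toy SomethingUnsafe that makes the risk visible; the print stands in for arbitrary, potentially harmful instructions:

import pickle

class SomethingUnsafe:
    def __init__(self):
        self.payload = 123

    def __setstate__(self, state):
        # Evaluated during unpickling: anything could run here.
        print("__setstate__ executed!")
        self.__dict__.update(state)

blob = pickle.dumps(SomethingUnsafe())
obj = pickle.loads(blob)  # prints "__setstate__ executed!"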
Info
The serialization format and the persistence logic are presented in more detail in State storage.
Database schema#
Experiments are persisted to two tables:
Table "experiments"#
- id_experiment: UUID of the experiment
- name: Name of the experiment, by default a 6-alphanum hash of id_experiment
- meta: Serialized dictionary with properties on the experiment and runs columns
- fields: Experiment fields as set by the user
- unsafe_pickle: Pickled Experiment object, disabled by default
Tables "experiment_xyz"#
Each experiment is associated with a dedicated table whose name ends with its sanitized name, e.g., "experiment_xyz" if name == "xyz". There are only two fixed columns:
- id_experiment: UUID of the experiment
- id_run: UUID of the run
Additional columns are named after the keys of the run.fields dictionary present in all runs of the experiment.
Each row represents a run of the experiment.
Columns use either native database SQL types or DATAPAK serialization.
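Because of this, runs remain queryable with plain SQL; a sketch assuming a SQLite database file "mltraq.db" and an experiment named "xyz":

import pandas as pd
import sqlalchemy

# Hypothetical database URL and experiment name.
engine = sqlalchemy.create_engine("sqlite:///mltraq.db")
df = pd.read_sql("SELECT * FROM experiment_xyz", engine)  # one row per run
print(df.columns)  # id_experiment, id_run, plus one column per field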
Tip
In case of experiments persisted with experiment.persist(store_unsafe_pickle=True) and loaded with experiment.load(unsafe_pickle=True), the experiment is also persisted as a binary blob, which is unpickled upon loading (including its runs). Having the pickled blob does not limit or interfere with the regular storage semantics: the fields column in the "experiments" table, as well as the individual experiment tables, continue to operate as expected and do not depend on the pickled object. This guarantees an extra level of interoperability and accessibility for the fields dictionaries.