Options management

MLtraq manages global preferences through the object mltraq.options. Its class follows the Singleton pattern, and its values are transparently replicated, in read-only mode, to the other processes that execute runs and their step functions.

Options are organized in a tree-like structure, with values at the leaves, and are indexed by dot-separated strings. The data is stored as a nested Python dictionary. If a query string does not reach a leaf, the matching sub-tree is returned as a dictionary.

The options tree

Diagram of the options tree, query strings and returned values

flowchart LR
X(((Options)))

X --> xa(a)
xa --> ab(b) --> abc(c) --> v1[12]
xa --> ad(d) --> v2['hello']
X --> xe(e) --> v3["#123;'k':3#125;"]
X --> xf(f)
xf(f) --> xfg(g) --> v4[46]
xf(f) --> xfh(h) --> v5[Object]

style X stroke-width:3px
style v1 stroke-width:3px
style v2 stroke-width:3px    
style v3 stroke-width:3px 
style v4 stroke-width:3px
style v5 stroke-width:3px

options().get("a.b.c") returns the int 12
options().get("a.d") returns the string 'hello'
options().get("e") returns the dictionary {'k': 3}
options().get("f") returns the dictionary {'g': 46, 'h': Object}
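
The lookup rule can be sketched in a few lines of plain Python. This illustrates dot-separated indexing over a nested dictionary and is not MLtraq's implementation; the helper name get_by_path is hypothetical, and a plain string stands in for the opaque Object leaf of the diagram.

```python
def get_by_path(tree, query):
    """Walk a nested dict following a dot-separated query (hypothetical helper)."""
    node = tree
    for key in query.split("."):
        node = node[key]  # fails if the path does not exist
    return node


# The tree from the diagram; the string "Object" stands in for the Object leaf.
tree = {
    "a": {"b": {"c": 12}, "d": "hello"},
    "e": {"k": 3},
    "f": {"g": 46, "h": "Object"},
}

print(get_by_path(tree, "a.b.c"))  # a leaf value
print(get_by_path(tree, "a"))      # a sub-tree, returned as a dictionary
```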

Context manager

You can use the context manager options().ctx to modify the configuration temporarily. The previous values are restored when the with block exits.

Using the context manager with options

from mltraq import options

print("Before", options().get("reproducibility.random_seed"))
with options().ctx({"reproducibility.random_seed": 444}):
    print("Inside", options().get("reproducibility.random_seed"))
print("After", options().get("reproducibility.random_seed"))

Output

Before 123
Inside 444
After 123
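
The override-and-restore behaviour shown above can be sketched with contextlib. This is a minimal illustration over a plain nested dictionary, assuming a snapshot-and-restore strategy; it does not reproduce how options().ctx is implemented, and the names set_by_path and temporary are hypothetical.

```python
import copy
from contextlib import contextmanager


def set_by_path(tree, query, value):
    """Set a leaf addressed by a dot-separated query (hypothetical helper)."""
    *parents, leaf = query.split(".")
    node = tree
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


@contextmanager
def temporary(tree, overrides):
    """Apply overrides on entry and restore the original values on exit."""
    snapshot = copy.deepcopy(tree)
    try:
        for query, value in overrides.items():
            set_by_path(tree, query, value)
        yield tree
    finally:
        # Restore in place so existing references to `tree` see the old values.
        tree.clear()
        tree.update(snapshot)


opts = {"reproducibility": {"random_seed": 123}}
with temporary(opts, {"reproducibility.random_seed": 444}):
    inside = opts["reproducibility"]["random_seed"]
after = opts["reproducibility"]["random_seed"]
print(inside, after)
```

The restore step sits in a finally clause, so the original values come back even if the body of the with block raises.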

Nesting options

You can define a new group of options by extending the class BaseOptions, defined in mltraq.utils.base_options, and requesting its singleton instance. Options are stored in the .values attribute and can be nested within other option groups, as the following example demonstrates.

Nesting options

from pprint import pprint

from mltraq.utils.base_options import BaseOptions

class OptionsA(BaseOptions):
    pass

class OptionsB(BaseOptions):
    pass

options_a = OptionsA.instance()
options_a.set("hello", 123)

options_b = OptionsB.instance()
options_b.set("world", 456)

options_a.set("options_b", options_b.values)

pprint(options_a.values, width=70)

Output

{'hello': 123, 'options_b': {'world': 456}}

Default options

Overview

The listing below shows the default values of the options. The generation of the documentation, which relies on the options().ctx context manager, alters two of them:

- "tqdm.disable" is set to True to improve readability in the docs.
- "reproducibility.sequential_uuids" is set to True to avoid random UUIDs in the documentation.

Default option values

from pprint import pprint

from mltraq import options

pprint(options().values, width=60)

Output

{'app': {},
 'archivestore': {'format': 1,
                  'mode': 'x',
                  'relative_path_prefix': 'undefined',
                  'url': 'file:///mltraq.archivestore'},
 'bunchstore': {'pathname': 'bunchstore.data'},
 'cli': {'logging': {'format': '%(levelname)-9s '
                               '%(asctime)s  %(message)s',
                     'level': 'INFO'},
         'tabulate': {'maxcolwidths': 70}},
 'codelog': {'disable': True, 'field_name': 'codelog'},
 'database': {'ask_password': False,
              'echo': False,
              'experiment_tableprefix': 'experiment_',
              'experiments_tablename': 'experiments',
              'pool_pre_ping': True,
              'query_read_chunk_size': 1000,
              'query_write_chunk_size': 1000,
              'url': 'sqlite:///:memory:'},
 'datastore': {'relative_path_prefix': 'undefined',
               'url': 'file:///mltraq.datastore'},
 'datastream': {'cli_address': 'mltraq.sock',
                'cli_throttle_send': 0.001,
                'disable': True,
                'kind': 'UNIX',
                'srv_address': 'mltraq.sock',
                'srv_throttle_persist': 1,
                'srv_throttle_recv': 0.0001},
 'execution': {'args_field': False,
               'backend': 'loky',
               'backend_params': {},
               'exceptions': {'compact_message': False,
                              'report_basenames': False},
               'loky_chdir': True,
               'n_jobs': -1,
               'return_as': 'list'},
 'reproducibility': {'random_seed': 123,
                     'sequential_uuids': True},
 'serialization': {'compression': {'codec': 'uncompressed'},
                   'serializer': 'DataPakSerializer',
                   'store_unsafe_pickle': False},
 'sysmon': {'disable': True,
            'field_name': 'sysmon',
            'interval': 1,
            'path': '/',
            'percpu': False},
 'tqdm': {'delay': 0.5, 'disable': True, 'leave': False}}

Reference documentation

The prefix "app.*" is reserved for the application, is empty by default, and can be used by the application to customize the behaviour of steps.
Options "database.*" control the behaviour of the connection to the database, chunking, and table names/prefixes.
- I/O operations are chunked by number of rows, "database.query_read_chunk_size" and "database.query_write_chunk_size", to implement progress bar reporting.
- If "tqdm.disable" is set to True, there is no chunking.
- If "database.ask_password" is set to True, the password of the connection string is requested interactively.
- "database.echo", "database.pool_pre_ping", "database.url" are passed to SQLAlchemy.
- "database.experiments_tablename" defines the name of the table used to index the experiments and their metadata. "database.experiment_tableprefix" is the prefix used for the tables of individual experiments.
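
Chunking by number of rows, as described for the database options, can be sketched as follows. This is a generic illustration of reporting progress once per chunk, not MLtraq's database code; iter_chunks is a hypothetical helper, and the chunk size of 1000 mirrors the default "database.query_write_chunk_size".

```python
def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of at most chunk_size rows (hypothetical helper)."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start : start + chunk_size]


rows = list(range(2500))
written = 0
progress = []
for chunk in iter_chunks(rows, chunk_size=1000):
    written += len(chunk)  # a real writer would send the chunk to the database here
    progress.append(written)  # one progress update per chunk

print(progress)  # → [1000, 2000, 2500]
```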
Options "execution.*" cover how experiments (and their runs) are executed.
- If "execution.exceptions.compact_message" is set to True, exceptions raised within runs are reported in a compact, friendly format. Because this might hide context that is useful for debugging, it is False by default.
Options "reproducibility.*" control how outputs can be reproduced accurately.
- The random seeds of Python's random module and of NumPy are reset to "reproducibility.random_seed" before executing runs, ensuring reproducibility.
- If "reproducibility.sequential_uuids" is set to True, there is no randomness for UUIDs generated for experiments and run IDs, simplifying tests and avoiding unnecessary changes in the documentation.
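
The two reproducibility mechanisms can be sketched with the standard library alone. This is an illustration under stated assumptions, not MLtraq's code: the functions run and next_sequential_uuid are hypothetical, only Python's random module is seeded here, and the counter-based UUIDs show just one possible way to remove randomness from identifiers.

```python
import random
import uuid
from itertools import count


def run(seed):
    """Reset the seed before the 'run', so repeated runs draw the same numbers."""
    random.seed(seed)
    return [random.random() for _ in range(3)]


# Assumption: deterministic UUIDs derived from a simple counter.
_counter = count(1)


def next_sequential_uuid():
    return uuid.UUID(int=next(_counter))


print(run(123) == run(123))  # → True
first = next_sequential_uuid()
second = next_sequential_uuid()
print(first)  # → 00000000-0000-0000-0000-000000000001
```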
Options "serialization.*" set defaults for the compression and storage of experiments.
Options "tqdm.*" are parameters passed to tqdm to render the progress bars used in the evaluation of runs and SQL queries.
Options "datastore.*" define how objects are serialized outside the database, e.g., on the filesystem.
- "datastore.url" defines the storage location. Three slashes indicate a relative path.
- "datastore.relative_path_prefix" is appended to "datastore.url" and defines the relative directory that should be used to store the file(s). DataStore objects manage it transparently, temporarily setting it to the experiment ID.
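
To make the three-slash convention concrete, the sketch below parses the default "datastore.url" with urllib.parse and joins it with a prefix. How MLtraq actually resolves the location is not shown in this page, so treating the parsed path as relative and joining it with the prefix are assumptions; resolve_location and the value "experiment-id" are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_location(url, relative_path_prefix):
    """Split a file URL and join its path with a prefix (hypothetical helper)."""
    parsed = urlparse(url)
    # "file:///name" parses with an empty host and the path "/name".
    # Assumption: the leading slash is dropped to obtain a relative path.
    base = PurePosixPath(parsed.path.lstrip("/"))
    return parsed.scheme, base / relative_path_prefix


scheme, location = resolve_location("file:///mltraq.datastore", "experiment-id")
print(scheme, location)  # → file mltraq.datastore/experiment-id
```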
Options "datastream.*" handle all things streaming.
- "datastream.cli_address": Address the client sends the messages to.
- "datastream.cli_throttle_send": Delay (s) introduced after each sent message.
- "datastream.kind": Type of socket, either "UNIX" or "INET".
- "datastream.srv_address": Address to listen to.
- "datastream.srv_throttle_recv": Delay (s) introduced after each received message.
- "datastream.srv_throttle_persist": Delay (s) introduced after each persist to the database.