Blog

March 19, 2024
in examples
3 min read

The ArchiveStore interface lets you manage TAR archives transparently. In the example below, the directory datasets is archived and stored as a regular DataStore asset. Persisting the experiment creates the archive; loading the experiment extracts it into the directory mltraq.archivestore, which is organized by experiment ID in the same way as mltraq.datastore.

Warning

Persisting an experiment is equivalent to removing it and saving it again, which deletes and recreates its associated datastore assets, including its archives. You can implement different behaviors with ArchiveStoreIO and DataStoreIO.
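To make the consequence of this warning concrete, here is a minimal standard-library sketch of "remove, then save" semantics. It is not MLtraq code: the function persist_assets, the directory layout, and the asset names are hypothetical and only illustrate why an asset that is not supplied again on the next persist no longer exists afterwards.

```python
import os
import shutil
import tempfile


def persist_assets(asset_dir, assets):
    """Hypothetical helper: wipe asset_dir, then write `assets` (name -> bytes)."""
    # Remove everything stored by a previous persist, if any
    shutil.rmtree(asset_dir, ignore_errors=True)
    os.makedirs(asset_dir)
    # Recreate the assets from scratch
    for name, data in assets.items():
        with open(os.path.join(asset_dir, name), "wb") as f:
            f.write(data)
    return sorted(os.listdir(asset_dir))


def demo_persist():
    with tempfile.TemporaryDirectory() as tmp:
        asset_dir = os.path.join(tmp, "store", "experiment-id")
        first = persist_assets(asset_dir, {"a.tar": b"old", "b.tar": b"old"})
        # The second persist does not mention b.tar, so it is gone afterwards
        second = persist_assets(asset_dir, {"a.tar": b"new"})
        return first, second


if __name__ == "__main__":
    print(demo_persist())
```

In this sketch the second call leaves only a.tar behind, which mirrors the behavior described above: anything you want to keep must be part of the experiment again when it is persisted.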

ArchiveStore example

from os import mkdir

import pandas as pd

from mltraq import create_session
from mltraq.storage.archivestore import ArchiveStore
from mltraq.utils.fs import glob, tmpdir_ctx

with tmpdir_ctx():
    # Work in a temporary directory

    # Create a directory with two files
    mkdir("datasets")
    pd.Series([1, 2, 3]).to_csv("datasets/first.csv")
    pd.Series([4, 5, 6]).to_csv("datasets/second.csv")

    # Create an experiment
    s = create_session()
    e = s.create_experiment("test")

    # Define an archive (no tar file created!)
    e.fields.archived = ArchiveStore(src_dir="datasets", arc_dir="e")

    # Persist the experiment, creating the tar file
    e.persist()

    # Load the experiment, unarchiving the tar file
    e = s.load_experiment("test")

    print(f"Destination directory: '{e.fields.archived.get_target()}'")

    # Print contents of current directory
    print("Contents of current directory:")
    for idx, name in enumerate(glob("**", root_dir=".", recursive=True)):
        print(f"[{idx:2d}] {name}")

Output

Destination directory: 'mltraq.archivestore/d65df69e-1175-44a5-be2f-2232765703b8'
Contents of current directory:
[ 0] datasets
[ 1] datasets/first.csv
[ 2] datasets/second.csv
[ 3] mltraq.archivestore
[ 4] mltraq.archivestore/d65df69e-1175-44a5-be2f-2232765703b8
[ 5] mltraq.archivestore/d65df69e-1175-44a5-be2f-2232765703b8/e
[ 6] mltraq.archivestore/d65df69e-1175-44a5-be2f-2232765703b8/e/first.csv
[ 7] mltraq.archivestore/d65df69e-1175-44a5-be2f-2232765703b8/e/second.csv
[ 8] mltraq.datastore
[ 9] mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8
[10] mltraq.datastore/d65df69e-1175-44a5-be2f-2232765703b8/d65df69e117544a5be2f2232765703ba

ArchiveStoreIO

The class ArchiveStoreIO provides a lower-level interface for managing archives that bypasses the organization by experiment ID. Its implementation relies on the glob and tarfile modules from the standard library. You can pass patterns for files to include or exclude, and optionally include hidden files.
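As a rough illustration of how glob and tarfile can be combined for pattern-based archiving, the sketch below builds a TAR file from a source directory. It is an assumption-laden stand-in, not the ArchiveStoreIO implementation: the function create_archive and its parameters include and exclude are hypothetical names, and only src_dir and arc_dir echo the example that follows. Hidden files are skipped here simply because glob does not match them by default.

```python
import glob
import os
import tarfile
import tempfile
from fnmatch import fnmatch


def create_archive(src_dir, tar_path, arc_dir="", include="**", exclude=None):
    """Hypothetical helper: archive files matching `include` but not `exclude`."""
    members = []
    with tarfile.open(tar_path, "w") as tar:
        pattern = os.path.join(src_dir, include)
        for path in sorted(glob.glob(pattern, recursive=True)):
            if not os.path.isfile(path):
                continue  # directories are implied by the file paths
            rel = os.path.relpath(path, src_dir)
            if exclude and fnmatch(rel, exclude):
                continue  # skip files matching the exclusion pattern
            # Store the file under arc_dir inside the archive
            arcname = os.path.join(arc_dir, rel)
            tar.add(path, arcname=arcname)
            members.append(arcname)
    return members


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "datasets")
        os.mkdir(src)
        for name in ("first.csv", "second.csv", "notes.txt"):
            with open(os.path.join(src, name), "w") as f:
                f.write("x\n")
        tar_path = os.path.join(tmp, "archive.tar")
        create_archive(
            src, tar_path, arc_dir="assets", include="**/*.csv", exclude="second*"
        )
        # List what ended up in the archive
        with tarfile.open(tar_path) as tar:
            return sorted(tar.getnames())


if __name__ == "__main__":
    print(demo())
```

With these inputs, only first.csv is archived: notes.txt does not match the include pattern and second.csv matches the exclude pattern.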

ArchiveStoreIO example

from os import mkdir

import pandas as pd

from mltraq.opts import options
from mltraq.storage.archivestore import ArchiveStoreIO
from mltraq.utils.fs import glob, tmpdir_ctx

with tmpdir_ctx():
    # Work in a temporary directory

    # Create a directory with two files
    mkdir("datasets")
    pd.Series([1, 2, 3]).to_csv("datasets/first.csv")
    pd.Series([4, 5, 6]).to_csv("datasets/second.csv")

    with options().ctx(
        {
            "datastore.relative_path_prefix": "archives",
            "archivestore.relative_path_prefix": "all",
        }
    ):
        # Create an archive and extract it
        archive = ArchiveStoreIO.create(
            src_dir="datasets", arc_dir="assets"
        )
        archive.extract()

    # Print contents of current directory
    print("Contents of current directory:")
    for idx, name in enumerate(glob("**", root_dir=".", recursive=True)):
        print(f"[{idx:2d}] {name}")

Output

Contents of current directory:
[ 0] datasets
[ 1] datasets/first.csv
[ 2] datasets/second.csv
[ 3] mltraq.archivestore
[ 4] mltraq.archivestore/all
[ 5] mltraq.archivestore/all/assets
[ 6] mltraq.archivestore/all/assets/first.csv
[ 7] mltraq.archivestore/all/assets/second.csv
[ 8] mltraq.datastore
[ 9] mltraq.datastore/archives
[10] mltraq.datastore/archives/d65df69e117544a5be2f2232765703b8
