Tools for backward incompatible DB upgrades #1833

Open
gpetretto wants to merge 1 commit into datalab-org:main from Matgenix:gp/upgrade
@gpetretto

As mentioned in #48, we would like to introduce a backward incompatible change that splits the blocks data from the items. We are therefore proposing a general procedure for handling backward incompatible upgrades to the DB (see #1184).

The implementation is heavily based on the one we developed for jobflow-remote. It should be used only when truly needed, as we recognize the pain and the potential issues associated with DB migrations.

Overview

The main change is the addition of the upgrade module, which contains a DatabaseUpgrader. A new migration requires writing a function that performs the required updates on the DB and decorating it with @DatabaseUpgrader.register_upgrade("X.Y.Z"), where "X.Y.Z" is the new version that will be released.
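To make the registration pattern concrete, here is a minimal sketch of a version-keyed decorator registry. It is not the code in this PR: `SketchUpgrader` and `example_upgrade` are hypothetical names, and the real DatabaseUpgrader may store and validate upgrades differently.

```python
from typing import Callable

UpgradeFunc = Callable[..., list]


class SketchUpgrader:
    """Hypothetical stand-in for DatabaseUpgrader, for illustration only."""

    # Maps a target version string to the function that upgrades the DB to it.
    _upgrades: dict[str, UpgradeFunc] = {}

    @classmethod
    def register_upgrade(cls, version: str) -> Callable[[UpgradeFunc], UpgradeFunc]:
        """Return a decorator that records the decorated function under `version`."""

        def decorator(func: UpgradeFunc) -> UpgradeFunc:
            if version in cls._upgrades:
                raise ValueError(f"An upgrade to {version} is already registered")
            cls._upgrades[version] = func
            return func  # the function itself is left unchanged

        return decorator


@SketchUpgrader.register_upgrade("0.8.0")
def example_upgrade(db, session=None, dry_run=False) -> list:
    # Placeholder body: a real upgrade would modify `db` and report its actions.
    return []
```

Because the decorator returns the function unchanged, an upgrade can still be called directly, which keeps it easy to unit test.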

To determine the datalab version associated with the DB, a new collection (database_metadata) is created, in which a document stores the current datalab version number.

When the code is updated to a newer version, the upgrade procedure reads the datalab version stored in the DB and compares it with the version of the code being executed. It then applies all the intermediate upgrades to the DB in sequence and finally updates the document in database_metadata with the current datalab version. This makes it possible to build incremental upgrades between different versions.
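The selection step described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `parse_version`, `pending_upgrades`, and `run_upgrades` are hypothetical helpers, versions are assumed to be plain "X.Y.Z" strings, and a dict with a `version` key stands in for the document in database_metadata.

```python
from typing import Callable


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "X.Y.Z" into a comparable tuple; pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split("."))


def pending_upgrades(registered: list[str], stored: str, target: str) -> list[str]:
    """Return the versions with stored < version <= target, in ascending order."""
    low, high = parse_version(stored), parse_version(target)
    return sorted(
        (v for v in registered if low < parse_version(v) <= high),
        key=parse_version,
    )


def run_upgrades(
    upgrades: dict[str, Callable[[], None]],
    metadata: dict[str, str],
    target: str,
) -> list[str]:
    """Apply the pending upgrades in order, then record the target version."""
    applied = pending_upgrades(list(upgrades), metadata["version"], target)
    for version in applied:
        upgrades[version]()
    metadata["version"] = target
    return applied


calls: list[str] = []
upgrades = {
    "0.10.0": lambda: calls.append("0.10.0"),
    "0.8.0": lambda: calls.append("0.8.0"),
    "0.9.0": lambda: calls.append("0.9.0"),
}
metadata = {"version": "0.8.0"}  # stand-in for the stored version document
applied = run_upgrades(upgrades, metadata, target="0.10.1")
```

Comparing parsed tuples rather than raw strings matters here: as strings, "0.10.0" sorts before "0.9.0", which would apply the upgrades in the wrong order.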

This upgrade can be executed with an invoke task: migration.upgrade.

Initialization

A downside of the current proposal is that storing the datalab version in the DB requires an initialization step. This is done through the invoke task admin.initialise-schema-version, which needs to be executed during the initial deployment. We considered doing this automatically in create_app, but the main issue is telling apart a DB associated with an already existing deployment from a pristine one, since in both cases database_metadata will be empty. In principle, one could define a procedure to guess the status of the DB (e.g., no items present, or all collections empty), but it is easy to imagine cases where these checks lead to the wrong conclusion. For example, a check that all collections are empty may fail to recognize a new deployment if someone later adds another metadata document, or if a user is added to the DB before the server is started. We therefore ended up proposing this manual initialization procedure. It would be easy to switch to an automated one if preferred, or if a reliable criterion can be determined.
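The failure mode described above can be illustrated with a small sketch. `looks_pristine` is a hypothetical heuristic, not something proposed in this PR, and the `users` collection name and the document counts are made up for the example.

```python
def looks_pristine(collection_counts: dict[str, int]) -> bool:
    """Heuristic: treat the DB as a new deployment only if every collection is empty."""
    return all(count == 0 for count in collection_counts.values())


# A truly fresh deployment: nothing has been written yet.
fresh = {"items": 0, "users": 0, "database_metadata": 0}

# Also a fresh deployment, but a user was created before the server started.
fresh_with_user = {"items": 0, "users": 1, "database_metadata": 0}

# An existing deployment that predates the database_metadata collection.
existing = {"items": 1250, "users": 40, "database_metadata": 0}

fresh_result = looks_pristine(fresh)
fresh_with_user_result = looks_pristine(fresh_with_user)
existing_result = looks_pristine(existing)
```

The second case is the problematic one: the heuristic classifies a new deployment as an existing one, so the wrong version could be assumed and upgrades could be run against data that never needed them.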

Example

As an example of how to create a new upgrade, here is a draft that modifies relationships to be based on refcode instead of item_id (see #1184). I don't think it properly covers all the cases, and I did not attempt to make it complete, as that is beyond the scope of this PR. It is just meant to give an idea of how this would work. More examples can be found in jobflow-remote, whose implementation is very similar.

relationships upgrade example
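import pymongo.database
from pymongo.client_session import ClientSession

# DatabaseUpgrader and UpgradeAction are provided by the new upgrade module.
@DatabaseUpgrader.register_upgrade("0.8.0")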
def pin_item_id_references_to_refcodes(
    db: pymongo.database.Database,
    session: ClientSession | None = None,
    dry_run: bool = False,
) -> list[UpgradeAction]:
    # Build a lookup from item_id to refcode
    item_id_to_refcode = {
        doc["item_id"]: doc["refcode"]
        for doc in db.items.find(
            {"item_id": {"$ne": None}, "refcode": {"$ne": None}},
            projection={"item_id": 1, "refcode": 1},
            session=session,
        )
    }

    pinned = 0
    deprecated = 0

    for item in db.items.find({}, projection={"_id": 1, "relationships": 1}, session=session):
        relationships = item.get("relationships") or []

        changed = False
        for rel in relationships:
            # Skip relationships that already use refcode or don't use an item_id.
            if rel.get("refcode") or not rel.get("item_id"):
                continue

            refcode = item_id_to_refcode.get(rel["item_id"])
            if refcode is not None:
                rel["refcode"] = refcode
                pinned += 1
            else:
                # Deprecate when no item matches
                rel["deprecated"] = True
                deprecated += 1
            changed = True

        if changed and not dry_run:
            db.items.update_one(
                {"_id": item["_id"]},
                {"$set": {"relationships": relationships}},
                session=session,
            )

    return [
        UpgradeAction(
            description=f"Pin {pinned} item_id reference(s) to their refcode",
            collection="items",
            action_type="update",
            details={"pinned": pinned},
        ),
        UpgradeAction(
            description=f"Deprecate {deprecated} unresolvable item_id reference(s)",
            collection="items",
            action_type="update",
            details={"deprecated": deprecated},
        ),
    ]
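The per-item logic of the draft above can also be exercised without a database. The sketch below extracts it into a pure function; `pin_relationships` and the sample item_ids and refcodes are hypothetical and only serve to show the three cases the loop distinguishes.

```python
def pin_relationships(
    relationships: list[dict],
    item_id_to_refcode: dict[str, str],
) -> tuple[int, int]:
    """Mutate `relationships` in place and return (pinned, deprecated) counts."""
    pinned = deprecated = 0
    for rel in relationships:
        # Skip relationships that already use a refcode or have no item_id.
        if rel.get("refcode") or not rel.get("item_id"):
            continue
        refcode = item_id_to_refcode.get(rel["item_id"])
        if refcode is not None:
            rel["refcode"] = refcode
            pinned += 1
        else:
            # No item matches this item_id, so the relationship is deprecated.
            rel["deprecated"] = True
            deprecated += 1
    return pinned, deprecated


lookup = {"sample_1": "demo:AAAAAA"}  # made-up item_id and refcode
relationships = [
    {"item_id": "sample_1"},      # resolvable: gets a refcode
    {"item_id": "missing_item"},  # unresolvable: gets deprecated
    {"refcode": "demo:BBBBBB"},   # already refcode-based: untouched
]
counts = pin_relationships(relationships, lookup)
```

Keeping this logic separate from the pymongo calls would make it straightforward to cover the edge cases with unit tests before running the upgrade against a real DB.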

TODO

Document the upgrade and initialization procedures if approved.

Let us know what you think or if you need more details.