Chain jobs on the JUWELS Booster

Sometimes 24 hours are not enough to train your model. Yet big compute providers like the Jülich Supercomputing Centre enforce 24-hour job limits to make sure everyone gets access.

Chaining jobs lets us split the work into parts without having to submit every part by hand. Here is a minimal example:

# python_train.py

import pickle
import os
import subprocess

if __name__ == '__main__':
    # Load the previous state if a checkpoint exists, otherwise start fresh.
    if not os.path.exists('model.pkl'):
        model_var = 0
    else:
        with open('model.pkl', 'rb') as f:
            model_var = pickle.load(f)

    # Simulate one chunk of training, then checkpoint the result.
    model_var += 1
    with open('model.pkl', 'wb') as f:
        pickle.dump(model_var, f)
    print(f"saved model: {model_var}")

    # Resubmit the batch script until the stopping condition is met.
    if model_var < 5:
        subprocess.call(["sbatch", "run_chain.sh"])
    else:
        print("Model trained! Done!")

The script above uses pickle to simulate your code's results. Save it in a file called python_train.py. It checks model_var and resubmits the job until the stopping condition is met. A Slurm batch script like the one below lets you run this example:

#!/bin/bash
#
#SBATCH -A TODO:your-account-name
#SBATCH --nodes=1
#SBATCH --job-name=test_chain
#SBATCH --output=test_chain-%j.out
#SBATCH --error=test_chain-%j.err
#SBATCH --time=00:05:00
#SBATCH --partition=develbooster

module load Python

python python_train.py

Save this file as run_chain.sh.

Submit everything with sbatch run_chain.sh and voilà, your job executes itself in a chain. This is useful when your deep learning code needs more than 24 hours to converge. The pickle code in the example merely simulates storing and loading a model from disk.
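
In a real project, the pickle placeholder would be replaced by your framework's own checkpointing. The sketch below shows one way the pattern could look, assuming a hypothetical train_one_chunk function, a checkpoint file called checkpoint.pkl, and a TARGET_EPOCHS stopping condition; none of these names come from a specific library, they are only illustrative.

# chain_train.py (illustrative sketch)

import os
import pickle
import subprocess

CHECKPOINT = 'checkpoint.pkl'   # hypothetical checkpoint file name
TARGET_EPOCHS = 100             # assumed stopping condition

def train_one_chunk(state):
    # Placeholder: restore weights/optimizer from `state`, train for as many
    # epochs as fit into the wall-clock limit, and return the updated state.
    state['epoch'] += 1
    return state

if __name__ == '__main__':
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, 'rb') as f:
            state = pickle.load(f)
    else:
        state = {'epoch': 0}

    state = train_one_chunk(state)

    # Write the checkpoint before the job ends so the next link can resume.
    with open(CHECKPOINT, 'wb') as f:
        pickle.dump(state, f)

    # Resubmit the batch script until the target number of epochs is reached.
    if state['epoch'] < TARGET_EPOCHS:
        subprocess.call(["sbatch", "run_chain.sh"])
    else:
        print("Target reached, not resubmitting.")

You can follow the chain with squeue -u $USER; every resubmitted link shows up as a new job ID.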