Hyperparameter Search on a Cluster (Part 3)

In our previous blog posts, we showed how to use Rain for hyperparameter search on MNIST models and how to distribute the whole process in a cloud environment. This time, we will modify the example so that it employs an external application to perform the actual training, instead of doing the training inside a Python function decorated with Rain's remote decorator.

Executing external programs

Before we start, let us take a brief detour and look at how Rain handles the execution of external programs. Rain provides the task type tasks.Execute. When such a task is executed on a governor, Rain creates a local temporary directory, maps all inputs and outputs of the program to files in this directory, and then runs the program. When the program finishes, its output files are exported and the temporary directory is removed.

(Figure: Rain execute)

The executed program may do whatever it wants inside its temporary directory. However, it should not touch files located outside of this directory, since it cannot assume on which governor (worker machine) it actually runs. If the program requires additional files or directories, they should be properly mapped as inputs; Rain then distributes the data across the cluster as necessary.
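To see this mechanism in isolation, here is a minimal self-contained sketch of ours. It assumes a locally running Rain server and the blob helper from rain.client for wrapping constant data; the grep example is purely illustrative and not part of the MNIST pipeline:

from rain.client import Client, tasks, blob

client = Client("localhost", 7210)
session = client.new_session("execute demo", default=True)

# 'blob' wraps constant data as a Rain data object. Rain maps it to a
# file in the task's temporary directory and substitutes the file name
# into the argument list, as described above.
words = blob("hello\nworld\n")
grep = tasks.Execute(["grep", "wor", words], stdout=True)
grep.output.keep()

session.submit()
print(grep.output.fetch().get_str())  # prints "world"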

Simple external program - mnist.py

Our external application is TensorFlow code in a standalone Python file mnist.py with a command line interface:

python3 mnist.py --data <PATH_TO_DATA> --dropout <DROPOUT> --size <SIZE>

The dropout and size parameters have the same meaning as in the previous examples. The --data argument expects a path to the MNIST dataset. The program prints the prediction accuracy to the standard output. The full source code of the mnist.py file can be found at the end of this post.

Note that, in this tutorial, we use the application only through its CLI as we would use any other third-party black-box application.
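For instance, a local run could look like this (the printed accuracy is made up for illustration; the actual value varies between runs):

python3 mnist.py --data mnist.npz --size 64 --dropout 0.2
0.9612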

Initialization

Now, let’s start with our Rain client code. Most of the initialization process remains the same as explained in the previous blog posts. We only add a new variable MNIST_PROGRAM which contains a path to mnist.py. Do not forget to modify it so it points to the correct path on your system.

MNIST_PROGRAM = "/path/to/mnist.py" # <-- The only difference

SIZES = [32, 64, 128, 256]
DROPOUTS = [0.0, 0.2, 0.4, 0.6, 0.8]

client = Client("localhost", 7210)
session = client.new_session("MNIST test", default=True)

Data download

Next, we need to download the data. In the previous version, we used TensorFlow's built-in function to download it and then distributed it as pickled data. This time, mnist.py expects the dataset as a file on disk, so we download it ourselves in a separate task; this ensures that the data gets downloaded only once, and Rain distributes it across the cluster as needed. In Rain, this can be done as follows:

mnist_data = tasks.Execute(
    "wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
    output_paths=["mnist.npz"])

The first argument is the actual command to be executed. The output_paths argument specifies which files represent the task's outputs; each path is relative to the working directory of the task.
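The downloaded file can be kept and fetched like any other data object, which is handy for a quick sanity check. The following sketch is an optional aside of ours, not part of the pipeline; it assumes the task's single file output is exposed as .output and that fetched data instances offer get_bytes(), analogous to the get_str() call used later:

# Optional sanity check: keep the downloaded file and fetch its raw
# bytes to the client to verify the download succeeded.
mnist_data.output.keep()
session.submit()
print(len(mnist_data.output.fetch().get_bytes()))  # size of mnist.npz in bytes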

Training

As mentioned before, the main goal of this blog post is to delegate the training process to an external program, and here we are. Previously, the model training was done by the decorated Python function train_mnist. This time, instead of calling this function, we parametrize and invoke our external program mnist.py as follows:

mnist_tasks = [
    tasks.Execute(
        ["python3", MNIST_PROGRAM, "--data", mnist_data, "--size", str(size), "--dropout", str(dropout)],
        stdout=True,
        name="mnist size={} dropout={}".format(size, dropout))
    for size, dropout in itertools.product(SIZES, DROPOUTS)
]

for task in mnist_tasks:
    task.output.keep()

Note that we directly pass our data-downloading task mnist_data as the 4th element of the list of command arguments. In such a case, the data object produced by mnist_data is mapped to a randomly named file created in the working directory of the executed task, and this random name is then substituted into the arguments of mnist.py in place of mnist_data. In other words, the command is actually executed like this:

python3 /path/to/mnist.py --data <RANDOM_NAME_WHERE_DATA_ARE_MAPPED> --size <SIZE> --dropout <DROPOUT>

The stdout argument indicates that we want to capture the standard output of the executed program and use it as the result of the task.
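To see stdout capture in isolation, here is a tiny sketch of ours (not part of the pipeline; it reuses the session from above, and the echo command is chosen purely for illustration):

# The program's standard output becomes the task's single output.
echo = tasks.Execute("echo 42", stdout=True)
echo.output.keep()
session.submit()
print(echo.output.fetch().get_str())  # prints "42"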

Getting the result

Getting the results is again very similar to the previous version. The only difference is that this time we do not get pickled Python objects but raw data captured from the standard outputs of the executed programs. Hence, we have to convert them into floats explicitly:

session.submit()

accuracies = [float(task.output.fetch().get_str()) for task in mnist_tasks]
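With the accuracies fetched, you can, for example, pick the best configuration directly. This small addition is ours, not part of the original pipeline; it relies only on the fact that itertools.product enumerates the combinations in the same order used to build mnist_tasks:

# Pair each (size, dropout) combination with its measured accuracy
# and report the best one.
params = list(itertools.product(SIZES, DROPOUTS))
(best_size, best_dropout), best_acc = max(zip(params, accuracies), key=lambda p: p[1])
print("best size={}, dropout={}: accuracy {:.4f}".format(best_size, best_dropout, best_acc))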

That is all. Now you can test the pipeline locally:

rain start --simple
python3 <our-rain-script.py>

Running on Exoscale

We have thoroughly described how to use Rain on Exoscale in our previous blog post. Therefore, here we only focus on the differences.

To run our example on the Exoscale cloud, we need to make the mnist.py program available on all nodes. For that, we can use our exoscale.py deployment script:

python3 exoscale.py scp /path/to/mnist.py mnist.py

This copies the program to all nodes into their home directories (that is, /home/ubuntu). The last thing that we need to modify is the MNIST_PROGRAM variable:

MNIST_PROGRAM = "/home/ubuntu/mnist.py

Now, the script is ready to be executed on the Exoscale cluster.

Full source code of the example

from rain.client import Client, tasks

import itertools
import numpy as np
import matplotlib.pyplot as plt

MNIST_PROGRAM = "/path/to/mnist.py"

SIZES = [32, 64, 128, 256]
DROPOUTS = [0.0, 0.2, 0.4, 0.6, 0.8]

client = Client("localhost", 7210)
session = client.new_session("MNIST test", default=True)

mnist_data = tasks.Execute(
    "wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz",
    output_paths=["mnist.npz"])

mnist_tasks = [
    tasks.Execute(
        ["python3", MNIST_PROGRAM, "--data", mnist_data, "--size", str(size), "--dropout", str(dropout)],
        stdout=True,
        name="mnist size={} dropout={}".format(size, dropout))
    for size, dropout in itertools.product(SIZES, DROPOUTS)
]

for task in mnist_tasks:
    task.output.keep()

session.submit()

accuracies = [float(task.output.fetch().get_str()) for task in mnist_tasks]


def make_heatmap(ax, data):
    # Reshape the flat accuracy list into a SIZES x DROPOUTS grid;
    # the order matches itertools.product(SIZES, DROPOUTS) above.
    data = np.array(data).reshape((len(SIZES), len(DROPOUTS)))

    ax.imshow(data)
    # Set the tick positions before the labels so that the labels
    # are not overwritten by the default locator.
    ax.set_yticks(np.arange(len(SIZES)))
    ax.set_xticks(np.arange(len(DROPOUTS)))
    ax.set_yticklabels(SIZES)
    ax.set_xticklabels(DROPOUTS)

    # Annotate each cell with its accuracy.
    for j in range(len(DROPOUTS)):
        for i in range(len(SIZES)):
            ax.text(j, i, data[i, j],
                    ha="center", va="center", color="w")


fig, ax1 = plt.subplots()

make_heatmap(ax1, accuracies)

fig.tight_layout()
plt.show()

File ‘mnist.py’

import tensorflow as tf
import argparse
import numpy as np


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', help='Path to data', required=True)
    parser.add_argument('--size', type=int, help='Size of hidden layer', default=256)
    parser.add_argument('--dropout', type=float, help='Dropout', default=0.2)
    args = parser.parse_args()

    with np.load(args.data) as f:
        x_train, y_train = f['x_train'], f['y_train']
        x_test, y_test = f['x_test'], f['y_test']

    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(args.size, activation=tf.nn.relu),
        tf.keras.layers.Dropout(args.dropout),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=1, verbose=False)
    result = model.evaluate(x_test, y_test, verbose=False)
    print(result[1])


if __name__ == "__main__":
    main()