In an NCI ARE JupyterLab session, viewing TensorBoard directly within the notebook is not supported due to a limitation of the current Open OnDemand architecture. However, you can work around this by accessing TensorBoard via SSH port forwarding.
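In outline, the workaround is to start TensorBoard on the Gadi worker node hosting your JupyterLab session, and then forward a local port on your desktop to it over SSH. The template below is only a sketch; the placeholders are filled in with real values in the steps that follow:

ssh -N -L <local_port>:<worker_node>:<tensorboard_port> <username>@gadi.nci.org.au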

To demonstrate this, you can set up an MNIST workflow in a Jupyter notebook as described below:

MNIST with tensorboard
# Load the TensorBoard notebook extension
%load_ext tensorboard

import tensorflow as tf
import datetime
import os

# Change to your working directory.
os.chdir("/g/data/z00/rxy900/PROJECT/NCI-MLENV/tensorboard_test")

# Load MNIST dataset from /g/data/wb00 file system.
mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data(path="/g/data/wb00/admin/staging/MNIST/npz/mnist.npz")
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define the model.
def create_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
  ])

# Create the dataset for training and test.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))

train_dataset = train_dataset.shuffle(60000).batch(64)
test_dataset = test_dataset.batch(64)

loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

# Define our metrics
train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy('train_accuracy')
test_loss = tf.keras.metrics.Mean('test_loss', dtype=tf.float32)
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy('test_accuracy')

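# Run one training step: forward pass under tf.GradientTape, apply the
# gradients, and accumulate the training metrics.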
def train_step(model, optimizer, x_train, y_train):
  with tf.GradientTape() as tape:
    predictions = model(x_train, training=True)
    loss = loss_object(y_train, predictions)
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))

  train_loss(loss)
  train_accuracy(y_train, predictions)

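# Run one evaluation step: forward pass only, then accumulate the test metrics.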
def test_step(model, x_test, y_test):
  predictions = model(x_test)
  loss = loss_object(y_test, predictions)

  test_loss(loss)
  test_accuracy(y_test, predictions)

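# Create time-stamped log directories so each run writes its summaries to a
# separate location and earlier runs are not overwritten.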
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
train_log_dir = 'logs/gradient_tape/' + current_time + '/train'
test_log_dir = 'logs/gradient_tape/' + current_time + '/test'
train_summary_writer = tf.summary.create_file_writer(train_log_dir)
test_summary_writer = tf.summary.create_file_writer(test_log_dir)


model = create_model() # reset our model

EPOCHS = 5

for epoch in range(EPOCHS):
  for (x_train, y_train) in train_dataset:
    train_step(model, optimizer, x_train, y_train)
  with train_summary_writer.as_default():
    tf.summary.scalar('loss', train_loss.result(), step=epoch)
    tf.summary.scalar('accuracy', train_accuracy.result(), step=epoch)

  for (x_test, y_test) in test_dataset:
    test_step(model, x_test, y_test)
  with test_summary_writer.as_default():
    tf.summary.scalar('loss', test_loss.result(), step=epoch)
    tf.summary.scalar('accuracy', test_accuracy.result(), step=epoch)

  template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
  print(template.format(epoch+1,
                        train_loss.result(),
                        train_accuracy.result()*100,
                        test_loss.result(),
                        test_accuracy.result()*100))

  # Reset metrics every epoch
  train_loss.reset_states()
  test_loss.reset_states()
  train_accuracy.reset_states()
  test_accuracy.reset_states()

Running the above script gives output similar to the following:

Epoch 1, Loss: 0.2446601688861847, Accuracy: 92.78666687011719, Test Loss: 0.11248774081468582, Test Accuracy: 96.54000091552734 
Epoch 2, Loss: 0.10678393393754959, Accuracy: 96.7933349609375, Test Loss: 0.08360177278518677, Test Accuracy: 97.39999389648438
Epoch 3, Loss: 0.07272648066282272, Accuracy: 97.76499938964844, Test Loss: 0.07011018693447113, Test Accuracy: 97.86000061035156
Epoch 4, Loss: 0.054958563297986984, Accuracy: 98.28166961669922, Test Loss: 0.06938254088163376, Test Accuracy: 97.95999908447266
Epoch 5, Loss: 0.04389793425798416, Accuracy: 98.58499908447266, Test Loss: 0.060392312705516815, Test Accuracy: 98.1199951171875
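Before launching TensorBoard, you can optionally confirm that the summary writers above actually produced event files on disk. This is a minimal sketch that only assumes the logs/gradient_tape directory defined earlier:

listing the event files
# List the TensorBoard event files written by the summary writers above.
import pathlib

for event_file in sorted(pathlib.Path("logs/gradient_tape").rglob("events.out.tfevents.*")):
    print(event_file)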

Next, you can run the following notebook magic command (provided by the tensorboard extension loaded above) to open the TensorBoard interface.

running the tensorboard magic command
%tensorboard --logdir logs/gradient_tape

However, in an ARE JupyterLab session you cannot view the TensorBoard interface this way. This is because the port used by the TensorBoard interface is blocked in ARE due to the architecture design of Open OnDemand.

Alternatively, you can manually set up SSH port forwarding to view the TensorBoard dashboard, as described below.

First, open a terminal in JupyterLab by clicking the "+" button and then the Terminal button.

Then change to your working directory and start TensorBoard as below:

starting tensorboard on the worker node
NCI-ai-ml_22.08 >  tensorboard --logdir 'logs/gradient_tape' --host `hostname`
NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784
TensorBoard 2.8.0 at http://gadi-gpu-v100-0112.gadi.nci.org.au:6006/ (Press CTRL+C to quit)
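Note that --host `hostname` makes TensorBoard listen on the compute node's network interface rather than only on localhost; this is what allows the SSH tunnel in the next step to reach it from outside the node.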

Copy the Gadi node hostname and port, i.e. "gadi-gpu-v100-0112.gadi.nci.org.au:6006", as shown in the above output.
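By default TensorBoard serves on port 6006, but if that port is already in use on the node it will pick a nearby free one, so always copy the port from your own output rather than assuming 6006. If you prefer a fixed port, the --port flag can pin it (an optional variation of the command above):

tensorboard --logdir 'logs/gradient_tape' --host `hostname` --port 6006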

Open a terminal on your local desktop and type in the command below (with the Gadi worker node hostname and port you just copied):

client_desktop:$ ssh -N -L 6006:gadi-gpu-v100-0112.gadi.nci.org.au:6006 rxy900@gadi.nci.org.au
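Here -N tells ssh not to run a remote command (the connection is used only for forwarding), and -L 6006:gadi-gpu-v100-0112.gadi.nci.org.au:6006 maps port 6006 on your desktop to port 6006 on the Gadi worker node. Replace rxy900 with your own NCI username.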

Now open your web browser and navigate to the address below:

http://localhost:6006

You should now see the TensorBoard interface in your browser.


