
In this chapter you are going to run your first jobs on your virtual cluster. All the hard work you have put in has led to this point: you now have a functional HPC system deployment! Let's get started!

Note

Anticipated time to complete this chapter: TBC from user feedback.

8.1 Interactive Job


In Chapter 6 you prepared your resource manager and created a test user to run jobs. Before running jobs you must log in as this user and compile your application.

  1. Switch to the test user:

    [root@smshost vagrant]#
    sudo su - test
    

    You will now be in a new shell session for the test user. Your prompt should look like this:

    [test@smshost ~]$
    
  2. Compile the MPI 'hello world' example:

    Users on the HPC system

    Everything performed as test simulates what a traditional user on an HPC system can do. In this virtual lab, this includes:

    • Compiling source code
    • Submitting a binary to the Slurm job scheduler
    • Querying the Slurm workload manager
    [test@smshost ~]$
    mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c 
    

    Note

    The file hello.c is a simple 'hello world' application, provided by OpenHPC, that can be used as a test job for quick compilation and execution.

    OpenHPC also provides a companion job-launch utility named prun that is installed along with the pre-packaged MPI toolchains.

    This convenience script provides a mechanism to abstract job launch across different resource managers and MPI stacks, such that a single launch command can be used for parallel job launch in a variety of OpenHPC environments.
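
    As a rough illustration (not an OpenHPC-documented equivalence), once you hold a Slurm allocation (see step 3 below) and have the default openmpi4 toolchain loaded, as the test user in this lab does, prun effectively constructs and runs the underlying MPI launch command for you:

    [test@smshost ~]$
    # Hypothetical direct equivalent of 'prun ./a.out' for the openmpi4 family;
    # prun reports this as its "Launch cmd" when it runs.
    mpirun ./a.out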

  3. Submit an interactive job request and verify allocation.

    Note

    An interactive job request is submitted to Slurm to request an allocation of resources to the user; if the request is granted, then the user has direct access to the compute resource (typically a bash shell).

    [test@smshost ~]$
    salloc -n 4 -N 2 
    
    OUTPUT:
    
    # salloc: Granted job allocation <jobID> 
    
    A closer look at the above command:

    salloc:
    requests a Slurm job allocation, i.e. a set of resources reserved for the user.

    -n 4:
    specifies the total number of tasks. By default Slurm allocates one task per node; here we are asking for a total of 4 tasks.

    -N 2:
    specifies the number of nodes to allocate for this job. Here we are requesting 2 nodes.


    Recall that each compute node has a single CPU socket with 2 cores, so by requesting 4 tasks we will use all 2 × 2 cores across the 2 compute nodes.
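
    If you would like to confirm the node shape from Slurm's point of view, scontrol can report it. This is an optional check, shown without sample output since the exact fields depend on your Slurm version:

    [test@smshost ~]$
    scontrol show node compute00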

    To verify which nodes (and how many) have been allocated, a user can run squeue to list information about the current interactive job:

    [test@smshost ~]$
    squeue
    
    OUTPUT:
    
    # JOBID   PARTITION       NAME   USER ST   TIME  NODES   NODELIST(REASON)
    # <jobID>    normal   interact   test  R   0:03      2     compute[00-01]
    

    Interpreting 'squeue'

    squeue will list some pertinent information to the user, including:

    • JOBID - useful for job output logs, controlling jobs, etc.
    • ST - the state of the current job (such as R for RUNNING)
    • NODES - the number of nodes reserved for this job
    • NODELIST - the names of the nodes reserved for this job
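
    If you need more detail than squeue provides, scontrol can print the full record for a specific job (substitute the job ID reported by salloc or squeue):

    [test@smshost ~]$
    scontrol show job <jobID>
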
  4. Use prun to launch the executable.

    [test@smshost ~]$
    prun ./a.out
    
    OUTPUT:
    
    # [prun] Master compute host = smshost  
    # [prun] Resource manager = slurm  
    # [prun] Launch cmd = mpirun ./a.out (family=openmpi4)  
    
    # Hello, world (4 procs total)  
    # --> Process #   1 of   4 is alive. -> compute00  
    # --> Process #   0 of   4 is alive. -> compute00  
    # --> Process #   2 of   4 is alive. -> compute01  
    # --> Process #   3 of   4 is alive. -> compute01 
    
    Why the order of responses is not sequential:

    In the world of parallel processing, the only predictable outcome from a parallel program is that the order is not predictable.

    For many reasons, one process (or processor) will take slightly more or less time than another, and this accounts for the seemingly 'random' order of responses that you see in the output from the test job above. If you run your program multiple times, you should see a different order most of the time. This is normal and to be expected.

    If you encounter an ORTE daemon error:
    [test@smshost ~]$ prun ./a.out
    [prun] Master compute host = smshost
    [prun] Resource manager = slurm
    [prun] Launch cmd = mpirun ./a.out (family=openmpi4)
    --------------------------------------------------------------------------
    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to a lack of common network interfaces and/or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).
    --------------------------------------------------------------------------
    

    At this point you may experience an error along the lines of An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun. This is a common issue, and is typically associated with two possible problems:


    1. The /etc/hosts file is incorrectly populated.

      Please ensure that your hosts file is correctly populated and try again: review the FAQ.

    2. The time synchronisation between the compute nodes and smshost has drifted.

      Compare the timedatectl output on the compute nodes and the smshost (a quick spot check is shown below). If they do not agree, you will need to synchronise them: review the FAQ for steps.
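
    A quick spot check of the clocks (this assumes root on smshost can reach the compute nodes over ssh; if not, run timedatectl on each node's console instead):

    [root@smshost vagrant]#
    ssh compute00 timedatectl; ssh compute01 timedatectl; timedatectl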

    Important

    The resources that you have allocated to run this interactive job are still reserved! You will need to free them to run any further jobs.

    This can be done very easily by pressing CTRL+D, which exits the allocation shell and releases the allocation, or by using the squeue and scancel commands provided by Slurm.

    The command squeue will help you identify the ID of the job that you wish to cancel and scancel <jobID> will allow you to cancel the job and free up the resources.

    # EXAMPLE
    #
    # [test@smshost ~]$ scancel 1
    # salloc: Job allocation 1 has been revoked.
    # Hangup
    
    You may need to press CTRL+D to disconnect if it does not automatically disconnect.

    Once the interactive job is completed and successfully revoked, you can verify the state of the Slurm queue by running sinfo. If the available resources are returned to idle in the STATE column, then the nodes are ready to accept new jobs.
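
    For example (the exact columns and time limit depend on your Slurm configuration; the key thing to check is idle under STATE):

    [test@smshost ~]$
    sinfo
    
    OUTPUT:
    
    # PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    # normal*      up   infinite      2   idle compute[00-01]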

    If there are any issues with the job queue, you may need to manually reset the compute resources:

    [test@smshost ~]$
    sudo scontrol update nodename=compute[00-01] state=down reason="not behaving"
    [test@smshost ~]$
    sudo scontrol update nodename=compute[00-01] state=resume
    

Congratulations

You have a functioning (although not quite finished) cluster!

Let's take on the next challenge - submitting batch jobs.

8.2 Batch Job


A batch job is submitted to the queue and executed at some point in the future, once the requested resources become available. There is no user interaction with the compute resources; when a job terminates (completes, crashes, or expires), the user can review an output file to determine the outcome of the job submission.

Once again, before running your batch job, you need to ensure you are logged in as the test user and that you have compiled your application.

An example job script is provided by OpenHPC. In this job script you will see that the same pre-compiled executable from the interactive job example is used. You will not have to repeat the compilation step in this section. We will copy and modify this file to match our lab configuration.

  1. Copy example job script to the current directory.

    [test@smshost ~]$
    cp /opt/ohpc/pub/examples/slurm/job.mpi . 
    
  2. View the example job script.

    [test@smshost ~]$
    cat job.mpi
    

    Which should look like:

    #!/bin/bash  
    #SBATCH -J test        # Job name  
    #SBATCH -o job.%j.out  # Name of stdout output file (%j expands to jobId)  
    #SBATCH -N 2           # Total number of nodes requested  
    #SBATCH -n 16          # Total number of mpi tasks requested  
    #SBATCH -t 01:30:00    # Run time (hh:mm:ss) - 1.5 hours  
    
    # Launch MPI-based executable  
    prun ./a.out 
    

    Important

    Recall that your virtual cluster consists of two compute nodes, each with a single socket, and two cores per socket (See Chapter 6).

    In order for Slurm to schedule our job and run the application, the resources requested in the job script must match what the compute hosts can actually provide.

    If this is not properly configured, your job will not run, because the workload manager will hold the job in the queue until adequate resources become available (which will never happen).

    Tip

    The use of the %j option in the example batch job script is a convenient way to track the output of the application on an individual job basis. The %j token is replaced with the Slurm job allocation number, <jobID>, once the job is assigned; for example, job 12 would write its output to job.12.out.

  3. Update the job script for our configuration, and verify the change:

    • Change the number of MPI tasks from 16 to 4 (2 compute nodes × 2 cores each = 4 MPI tasks).
    [test@smshost ~]$
    perl -pi -e "s/-n 16/-n 4 /" ./job.mpi
    [test@smshost ~]$
    grep tasks job.mpi
    
    OUTPUT:
    
    # #SBATCH -n 4                  # Total number of mpi tasks requested
    
  4. Launch your batch job via Slurm by using the sbatch command.

    [test@smshost ~]$
    sbatch job.mpi 
    
    OUTPUT:
    
    Submitted batch job <jobID> 
    
    Tips on how to watch the job unfold:

    In a second pane or shell, you can run the following command:

    watch "squeue && sinfo"  
    

    For a brief moment you should see a job in the squeue section in the R (RUNNING) state, and in the sinfo section you should see the compute nodes' STATE change from idle to alloc.

  5. Once complete, view the job output, found in the file job.<jobID>.out.
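
    For example, using cat (substitute the job ID reported when you ran sbatch):

    [test@smshost ~]$
    cat job.<jobID>.out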

    If successful you can expect an output similar to the following:

    # [prun] Master compute host = compute00  
    # [prun] Resource manager = slurm  
    # [prun] Launch cmd = mpirun ./a.out (family=openmpi4)  
    #
    # Hello, world (4 procs total)  
    #     --> Process #   1 of   4 is alive. -> compute00  
    #     --> Process #   0 of   4 is alive. -> compute00  
    #     --> Process #   2 of   4 is alive. -> compute01  
    #     --> Process #   3 of   4 is alive. -> compute01 
    

That's it! You're done!

Believe it or not, you've managed to set up your very own OpenHPC virtual cluster and have successfully run batch jobs and interactive jobs!

At this point, you can close up the lab or take a few additional steps in your journey with OpenHPC and your virtual cluster. We would recommend considering:

  • Taking a final snapshot of your smshost in this working state.
  • Packaging your VM as a new Vagrant .box file to re-import it or transfer it to another machine (see the sketch after this list).
  • Continuing to explore HPC features and tools, including OpenOnDemand (watch this space), Spack, and so much more!
  • Above all, have fun! And buy yourself a cake to celebrate!
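
A minimal sketch of the snapshot and packaging steps, run on your host machine (not inside a VM) from the directory containing your Vagrantfile. The machine name smshost and the file names below are assumptions based on this lab's naming, so check your Vagrantfile and adjust as needed:

    # Save a named snapshot of the smshost VM in its current state
    vagrant snapshot save smshost final-working-state

    # Package the smshost VM as a reusable Vagrant box file
    vagrant package smshost --output smshost-final.box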

A recap of Chapter 8:

You have successfully run an interactive job on your cluster. This means that users of your cluster can now successfully run their own interactive jobs using the system scheduler, Slurm.

You have also successfully run a batch job on your cluster! This means that users of the cluster can now submit their own jobs for processing and execution on the cluster using Slurm.

Congratulations

You have reached the end of the virtual lab!

Have a cake!

Fill out our feedback survey!

Let us know what you think!

