In this chapter you are going to run your first jobs on your virtual cluster. All the hard work you have put in has led to this point, where you now have a functional HPC system deployment! Let's get started!
Note
Anticipated time to complete this chapter: TBC from user feedback.
8.1 Interactive Job
In Chapter 6 you prepared your resource manager and created a test user to run jobs. Before running jobs, you must log in as this user and compile your application.
- Switch to the test user.

  You will now be in a new shell session for the test user. Your prompt should look like this:
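A minimal sketch, assuming you are currently root on the smshost and that the test account was created in Chapter 6:

```bash
# Switch to the test user created in Chapter 6
[root@smshost ~]# su - test
[test@smshost ~]$
```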
- Compile the MPI 'hello world' example:
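The exact commands depend on your toolchain; a sketch, assuming a stock OpenHPC install where the example lives under /opt/ohpc/pub/examples and the gnu12/openmpi4 modules were installed in Chapter 6:

```bash
# Load a compiler and MPI toolchain (module names are assumptions;
# use whichever compiler/MPI modules you installed)
module load gnu12 openmpi4

# Compile the packaged 'hello world' example; the output binary is ./a.out
mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c
```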
Users on the HPC system

Everything performed as test simulates what a traditional user on an HPC system can do. In this virtual lab, this includes:

- Compiling source code
- Submitting a binary to the Slurm job scheduler
- Querying the Slurm workload manager
Note

The file hello.c is a simple 'hello world' application, provided by OpenHPC, that can be used as a test job for quick compilation and execution.

OpenHPC also provides a companion job-launch utility named prun that is installed along with the pre-packaged MPI toolchains. This convenience script provides a mechanism to abstract job launch across different resource managers and MPI stacks, such that a single launch command can be used for parallel job launch in a variety of OpenHPC environments.
- Submit an interactive job request and verify the allocation.
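Based on the flags described in the note below, the request looks like this (run as the test user):

```bash
# Request an interactive allocation of 4 tasks across 2 nodes
salloc -n 4 -N 2
```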
Note

An interactive job request is submitted to Slurm to request an allocation of resources for the user; if the request is granted, then the user has direct access to the compute resource (typically a bash shell).

Click here to learn more about the above command.

- salloc: is used to allocate a Slurm job allocation.
- -n 4: specifies the number of tasks. By default this would be one task per node. We are asking for a total of 4 tasks in the above step.
- -N 2: specifies the number of nodes to allocate for this job. In the above command, we are requesting 2 nodes.

Recall that each compute node has 1 CPU with 2 cores, so by asking for 4 tasks we will use 2 * 2 cores across the 2 compute nodes.
To verify/query the number of nodes (and which nodes), a user can run squeue to list some information about the current interactive job:

OUTPUT:
```
# JOBID   PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
# <jobID>    normal interact test  R  0:03     2 compute[00-01]
```
Interpreting 'squeue'

squeue will list some pertinent information to the user, including:

- JOBID - useful for job output logs, controlling jobs, etc.
- ST - the state of the current job (such as R for RUNNING)
- NODES - the number of nodes reserved for this job
- NODELIST - the names of the nodes reserved for this job
- Use prun to launch the executable.
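The launch command is the same one shown in the troubleshooting output further below; assuming the compiled binary a.out from the earlier step is in the current directory:

```bash
# Launch the MPI binary through the OpenHPC prun wrapper
prun ./a.out
```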
OUTPUT:
```
# [prun] Master compute host = smshost
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpirun ./a.out (family=openmpi4)
# Hello, world (4 procs total)
# --> Process # 1 of 4 is alive. -> compute00
# --> Process # 0 of 4 is alive. -> compute00
# --> Process # 2 of 4 is alive. -> compute01
# --> Process # 3 of 4 is alive. -> compute01
```
Click here to understand why the order of responses is not sequential.
In the world of parallel processing, the only predictable outcome from a parallel program is that the order is not predictable.
For many reasons, one thread (or processor) will take slightly more or less time than another, and this affects the seemingly 'random' order of responses that you see in the output from the test job above. If you run your program multiple times, you should see a different order most of the time. This is normal and to be expected.
Click here if you encounter an ORTE daemon error.

```
[test@smshost ~]$ prun ./a.out
[prun] Master compute host = smshost
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi4)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of
factors, including an inability to create a connection back to mpirun
due to a lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls and network
routing requirements).
--------------------------------------------------------------------------
```
At this point you may experience an error along the lines of 'An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun'. This is a common issue, and is typically associated with two possible problems:
- The /etc/hosts file is incorrectly populated. Please ensure that your hosts file is correctly populated and try again: review the FAQ.
- The time synchronisation between the compute nodes and the smshost has drifted: review the FAQ. Compare the timedatectl output for the compute nodes and the smshost. If they are not the same, you need to synchronise them. Review the FAQ for steps.
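A quick way to check both possible causes from the smshost; the node names and passwordless SSH access are assumptions based on this lab's setup:

```bash
# Compare clocks on the head node and a compute node
timedatectl
ssh compute00 timedatectl

# Confirm the compute nodes resolve via /etc/hosts
getent hosts compute00 compute01
```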
Important

The resources that you have allocated to run this interactive job are still reserved! You will need to free them to run any further jobs.

This can be done very easily by pressing the 'disconnect' hotkey combination CTRL+D, or by using the squeue and scancel commands provided by Slurm. You may need to press CTRL+D to disconnect if the session does not disconnect automatically.

The command squeue will help you identify the ID of the job that you wish to cancel, and scancel <jobID> will allow you to cancel the job and free up the resources.
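A sketch of this clean-up, keeping the document's <jobID> placeholder:

```bash
# List your jobs and note the JOBID of the interactive allocation
squeue

# Cancel that job to release the nodes (substitute the real job ID)
scancel <jobID>
```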
Once the interactive job is completed and successfully revoked, you can verify the state of the Slurm queue by running sinfo. If the available resources are returned to idle in the STATE column, then the nodes are ready to accept new jobs.

If there are any issues with the job queue, you may need to manually reset the compute resources:
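One way to do this, assuming the compute nodes are named compute00 and compute01 and that you run the command as root on the smshost:

```bash
# Return the compute nodes to service so they can accept new jobs
scontrol update nodename=compute[00-01] state=resume
```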
Congratulations

You have a functioning (although not yet quite finished) cluster!
Let's take on the next challenge - submitting batch jobs.
8.2 Batch Job
A batch job is submitted to the queue and executed at some undetermined time in the future. There is no user interaction with the compute resources, and when a job terminates (completes, crashes, or expires), an output file can be reviewed by the user to determine the outcome of the job submission.
Once again, before running your batch job, you need to ensure you are logged in as the test user and that you have compiled your application.
An example job script is provided by OpenHPC. In this job script you will see that the same pre-compiled executable from the interactive job example is used. You will not have to repeat the compilation step in this section. We will copy and modify this file to match our lab configuration.
- Copy the example job script to the current directory.
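A sketch of this copy; the path assumes the stock OpenHPC examples location:

```bash
# Copy the packaged example Slurm job script into the current directory
cp /opt/ohpc/pub/examples/slurm/job.mpi .
```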
- View the example job script.
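Assuming the copied file is named job.mpi, as in the sketch above:

```bash
# Display the copied job script
cat job.mpi
```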
The script should look like this:
```bash
#!/bin/bash

#SBATCH -J test           # Job name
#SBATCH -o job.%j.out     # Name of stdout output file (%j expands to jobId)
#SBATCH -N 2              # Total number of nodes requested
#SBATCH -n 16             # Total number of mpi tasks requested
#SBATCH -t 01:30:00       # Run time (hh:mm:ss) - 1.5 hours

# Launch MPI-based executable
prun ./a.out
```
Important
Recall that your virtual cluster consists of two compute nodes, each with a single socket, and two cores per socket (See Chapter 6).
In order for the workload manager to schedule the job and for MPI to run the application, the resources requested in the job script must match the parameters of the compute hosts.
If this is not properly configured, your job will not run, because the workload manager will hold the job in the queue until adequate resources become available (which will never happen).
Tip

The use of the %j option in the example batch job script is a convenient way to track the output of the application on an individual job basis. The %j token is replaced with the Slurm job allocation number, <jobID>, once the job is assigned.

- Update the job script for our configuration, and verify the change:
- Change the number of mpi tasks from 16 to 4 (2 compute nodes * 2 cores each = 4 mpi tasks); one way to make and verify this edit is sketched below.
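A sketch of the edit, assuming the copied script is named job.mpi (you can equally make the change in a text editor):

```bash
# Reduce the requested task count from 16 to 4
sed -i 's/#SBATCH -n 16/#SBATCH -n 4/' job.mpi

# Verify the change
grep "SBATCH -n" job.mpi
```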
- Launch your batch job via Slurm by using the sbatch command.
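For example, assuming the job script is still named job.mpi:

```bash
# Submit the batch job; Slurm replies with the assigned job ID
sbatch job.mpi
```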
Click here for tips on how to watch the job unfold.

In a second pane or shell, you can run the following command:
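A minimal sketch, assuming the standard watch utility is available:

```bash
# Refresh the queue and node state every second while the job runs
watch -n 1 "squeue; sinfo"
```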
For a brief moment you should see a job in the squeue section that is in an R state (for RUNNING), and in the sinfo section you should see the compute nodes' STATE change from idle to alloc.
- Once complete, view the job output, found in the file job.<jobID>.out. If successful, you can expect output similar to the following:
```
# [prun] Master compute host = compute00
# [prun] Resource manager = slurm
# [prun] Launch cmd = mpirun ./a.out (family=openmpi4)
#
# Hello, world (4 procs total)
# --> Process # 1 of 4 is alive. -> compute00
# --> Process # 0 of 4 is alive. -> compute00
# --> Process # 2 of 4 is alive. -> compute01
# --> Process # 3 of 4 is alive. -> compute01
```
That's it! You're done!
Believe it or not, you've managed to set up your very own OpenHPC virtual cluster and have successfully run batch jobs and interactive jobs!
At this point, you can close up the lab or take a few additional steps in your journey with OpenHPC and your virtual cluster. We would recommend considering:
- Taking a final snapshot of your smshost in this working state.
- You can package your VM as a new Vagrant .box file to re-import or transfer to another machine (see the sketch after this list).
- You can continue to explore HPC features and tools, including Open OnDemand (watch this space), Spack, and so much more!
- Above all, have fun! And buy yourself a cake to celebrate!
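If you choose to package the VM, here is a sketch using Vagrant's built-in packaging; the machine name and output filename are illustrative:

```bash
# Package the smshost VM into a reusable Vagrant box file
vagrant package smshost --output smshost.box
```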
Click here for a recap of Chapter 8.
You have successfully run an interactive job on your cluster. This means that users of your cluster can now successfully run their own interactive jobs using the system scheduler, Slurm.
You have also successfully run a batch submission job on your cluster! This means that users of the cluster can now successfully submit their own jobs for processing and execution on the cluster using Slurm.
Congratulations
You have reached the end of the virtual lab!
Have a cake!
Fill out our feedback survey!
Let us know what you think!