Frequently Asked Questions (FAQ) for the OpenHPC 2.x virtual lab.
How do I use this FAQ?
We have categorised this FAQ into sections based on the common tools used in the OpenHPC 2.x virtual lab. You can use the search button (top right) or scroll through the categories (right sidebar).
Issues relate to common faults in the deployment that require attention to progress. These are represented by a red X in a red box.
Questions relate to general queries about the deployment that do not typically affect progress in the lab. These are represented by a green ? in a green box.
This is an FAQ question example.
This is an FAQ issue example.
Virtual Lab Software
What are the verified versions of the virtual lab software stack?
The following versions of VirtualBox have been successfully used in the test environment for this virtual lab:
Windows 11
- VirtualBox Version 7.0.8, verified 202308.
- VirtualBox Version 6.0.24.
Shell / Terminal
I receive a no such file or directory error when trying to access /vagrant.
This is likely to occur in step 3.2.7, when the smshost is rebooted after setting SELINUX=disabled.
bash: /vagrant/input.local.lab: No such file or directory
Exit the VM (Ctrl+D disconnects one layer of session at a time; repeat until you reach the host machine's Git root) and run vagrant reload smshost.
I am getting errors in my deployment all over the place.
If you have rebooted or started a new shell, you must source the input.local.lab file again, otherwise you may experience unpredictable behaviour. If you run commands that refer to environment variables (e.g. sms_ip) without first sourcing the input.local.lab file, then sms_ip will be blank (i.e. hold an empty value), and that empty value is passed on to any command line instruction expecting a `10.10.10.10`.
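One way to catch this early is to check a variable before relying on it. The following is a minimal sketch; require_lab_var is a hypothetical helper name and is not part of the lab material:

```shell
# Hypothetical helper: fail early if a lab variable is unset or empty.
require_lab_var() {
    # $1 is the NAME of the variable to check (for example sms_ip).
    eval "_value=\${$1:-}"
    if [ -z "$_value" ]; then
        echo "$1 is empty: source /vagrant/input.local.lab first" >&2
        return 1
    fi
}

# Example use inside the VM:
#   source /vagrant/input.local.lab
#   require_lab_var sms_ip && echo "sms_ip is set to $sms_ip"
```

The helper only reports the problem; it does not source the file for you.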
I get a ")syntax error: invalid arithmetic operator (error token is " error when sourcing input.local.lab.
This is a common problem with Windows and Git. If you are seeing this error, it is likely that Git replaced the LF line endings with CRLF during the git clone.
You can find and replace the CRLF in BASH, or you can set Git's default behaviour to not convert line endings by issuing this command in a host machine terminal:
git config --global core.autocrlf false
To fix the existing file, you can run a regular expression to remove the CR in the CRLF:
sed 's/\r$//' /vagrant/input.local.lab > cleaned.txt
This removes the CR (written as \r) and creates a new file, cleaned.txt. You can then source cleaned.txt, or write the output to any other file name you wish. Do not redirect the output straight back onto /vagrant/input.local.lab, because the shell empties that file before sed can read it.
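The check and the fix can be wrapped in two small helpers. This is a minimal sketch: has_cr and strip_cr are hypothetical names, and the sed expression is the one shown above, which relies on GNU sed understanding \r.

```shell
# Hypothetical helper: succeed if the file contains a carriage return.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Hypothetical helper: write a copy of $1 without trailing CRs to $2.
strip_cr() {
    sed 's/\r$//' "$1" > "$2"
}

# Example use inside the VM:
#   has_cr /vagrant/input.local.lab && strip_cr /vagrant/input.local.lab cleaned.txt
#   source cleaned.txt
```

Writing to a second file keeps the original untouched, so you can compare the two if sourcing still fails.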
How do I resume a tmux session on reconnect?
tmux ls will list existing sessions. Usually, you will only have one, and it can be resumed by running:
tmux a
where a is short for attach.
Warewulf
How do I propagate changes that I've made to a stateless boot image?
To answer this question, it is important to understand how Warewulf behaves.
The TLDR is: after any modifications to the chroot, you must rebuild the VNFS image so that Warewulf provisions the updated filesystem.
The explanation behind this is: Compute nodes need to be provisioned with an image. We create a bootstrap image which is used to boot the nodes and complete this provisioning process. Creating a bootstrap image is accomplished in the guide through:
While most of the provisioned image's configuration is conducted in a chroot filesystem, these chroots cannot be directly provisioned by Warewulf. Once we are satisfied with our chroot configuration, we need to encapsulate and then compress this new filesystem into a Virtual Node File System (VNFS) image which Warewulf can provision.
Think of the chroot behaving as the source code, and the VNFS behaving as the compiled binary of that source.
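Turning the chroot into a VNFS image can be sketched as follows. This assumes the upstream OpenHPC 2.x Warewulf recipe, where the image is built with wwvnfs --chroot; the exact invocation in this lab's guide may differ. rebuild_vnfs and the DRY_RUN switch are hypothetical conveniences.

```shell
# Hypothetical wrapper; set DRY_RUN=1 to print the command instead of running it.
rebuild_vnfs() {
    # CHROOT must point at the compute image chroot used in the guide.
    if [ -z "${CHROOT:-}" ]; then
        echo "CHROOT is not set" >&2
        return 1
    fi
    ${DRY_RUN:+echo} wwvnfs --chroot "$CHROOT"
}

# Example use on the smshost after editing files under $CHROOT:
#   rebuild_vnfs
```

Stateless nodes load the VNFS at boot, so a compute node picks up the rebuilt image on its next reboot.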
Vagrant
The machine with the name “client01” was not found configured for this Vagrant environment?
This happens if the Vagrantfile is modified during a live Vagrant deployment. The original Vagrant image is then no longer consistent with the Vagrantfile.
Run the following command:
Once done, remove the VirtualBox VM from the VirtualBox Manager.
How do I find the vagrant image ID number?
Use vagrant global-status to list all Vagrant image IDs.
How do I save the current VM state and pause the VM?
Use the vagrant suspend command from your terminal.
To resume the VM later, use vagrant resume.
For more information: https://developer.hashicorp.com/vagrant/docs/cli/suspend
PXE
The compute node VM is failing to boot via PXE
There are many reasons for a VM to not boot through PXE (network, firewall, tftp, DHCP, PXE, etc.).
In a working PXE environment, the compute nodes may still fail to boot with PXE when they do not have sufficient RAM available.
As a first step, if the VM has previously successfully booted and now is failing on a reboot, this is a known issue with Vagrant and VirtualBox - to fix this, force a hard reboot of the VM in VirtualBox (i.e. Stop and Start the VM).
If the VM has never successfully booted, try increasing the RAM allocated to the compute node image from 3GB to 4GB.
Slurm
My compute node is not receiving configuration parameters from the Slurm controller?
Check that the compute node image is configured properly on the smshost to receive its configuration parameters through a dynamic reference to the Slurm controller (smshost with IP address 10.10.10.10):
cat $CHROOT/etc/sysconfig/slurmd
should reveal
SLURMD_OPTIONS=--conf-server 10.10.10.10
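The same check can be scripted rather than read by eye. A minimal sketch, where conf_server_matches is a hypothetical helper name:

```shell
# Hypothetical helper: succeed if the sysconfig file points slurmd at the given controller.
conf_server_matches() {
    # $1: path to the slurmd sysconfig file, $2: expected controller IP.
    # -F matches the text literally; -w rejects longer addresses such as 10.10.10.100.
    grep -qwF -e "--conf-server $2" "$1"
}

# Example use on the smshost, with CHROOT set as in the guide:
#   conf_server_matches "$CHROOT/etc/sysconfig/slurmd" 10.10.10.10 && echo "OK"
```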
My compute node is still not talking to the Slurm controller even when set to the RESUME state?
On rare occasions, the compute node may require a little nudge.
From the compute node, try running ping smshost and, alternatively, sinfo to re-establish the connection to the Slurm controller.
My compute node is in a COMPLETING state loop?
You can manually force an interrupt on the compute nodes that are not exiting the CG state:
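A common way to do this is to mark the node down and then resume it. The sketch below assumes that approach and the standard scontrol update syntax; it may not match the commands originally shown here. unstick_completing, the DRY_RUN switch and the reason text are hypothetical.

```shell
# Hypothetical wrapper; set DRY_RUN=1 to print the commands instead of running them.
unstick_completing() {
    # $1: node name or range as Slurm knows it.
    ${DRY_RUN:+echo} scontrol update "nodename=$1" state=down reason=hung_completing
    ${DRY_RUN:+echo} scontrol update "nodename=$1" state=resume
}

# Example use on the smshost (the node name is a placeholder):
#   unstick_completing compute01
```

Marking a node down ends any job still attached to it, so check the node's jobs first.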
I am seeing an error: slurm_load_partitions: Unable to contact slurm controller (connect failure)
9 out of 10 times, Slurm's problems are related to DNS. The local hosts file is read from top to bottom, so the first valid entry found for a hostname is the one that is used.
On the affected compute node/s, check the local /etc/hosts and make sure that the first line relating to smshost refers to its accessible IP address (in the Lab's case, the default sms_ip is 10.10.10.10) and not to a localhost IP address (such as 127.*.*.*).
If both entries are in the hosts file, you can safely remove the 127.*.*.* reference, or place it after the 10.10.10.10 reference.
How do I know that my compute nodes are in a usable state ready to accept tasks?
When you run sinfo, you should see idle under the STATE column to confirm that they are ready to accept jobs.
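With several nodes it can be easier to list only the ones that are not idle. A minimal sketch; non_idle_nodes is a hypothetical helper that reads one "node state" pair per line, which sinfo can produce with its -h, -N and -o options:

```shell
# Hypothetical helper: read "NODE STATE" lines on stdin, print nodes that are not idle.
non_idle_nodes() {
    awk '$2 != "idle" { print $1 }'
}

# Example use on the smshost (requires Slurm):
#   sinfo -h -N -o '%N %T' | non_idle_nodes
```

No output means every listed node is idle. A state such as idle* is reported too, because the * marks a node that is not responding.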
How do I start Slurm on compute nodes remotely?
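A minimal sketch, assuming pdsh is installed on the smshost as in the upstream OpenHPC recipe. start_slurmd and the DRY_RUN switch are hypothetical conveniences, and the [00-01] range is only a placeholder:

```shell
# Hypothetical wrapper; set DRY_RUN=1 to print the command instead of running it.
start_slurmd() {
    # $1: pdsh host list, for example "${compute_prefix}[00-01]"
    ${DRY_RUN:+echo} pdsh -w "$1" systemctl start slurmd
}

# Example use on the smshost after sourcing input.local.lab:
#   start_slurmd "${compute_prefix}[00-01]"
```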
Replace the compute_prefix with the appropriate values, or ensure you have sourced your input.local.lab parameters. Likewise, the compute range should be updated to match the nodes you wish to interact with.
How do I set compute nodes to RESUME state if they are in another state, such as UNK*, DOWN, DRAIN, etc?
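A minimal sketch, assuming the standard scontrol update syntax. resume_nodes and the DRY_RUN switch are hypothetical conveniences, and the node range in the example is a placeholder:

```shell
# Hypothetical wrapper; set DRY_RUN=1 to print the command instead of running it.
resume_nodes() {
    # $1: node name or range as Slurm knows it.
    ${DRY_RUN:+echo} scontrol update "nodename=$1" state=resume
}

# Example use on the smshost:
#   resume_nodes "${compute_prefix}[00-01]"
```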
If there are still connection errors (i.e. sinfo shows an *), you can try to revive the connection using the tips under the question "My compute node is still not talking to the Slurm controller even when set to the RESUME state?".
My compute nodes are not automatically set to RESUME state on initial boot?
There are 3 ReturnToService settings available for Slurm clients. As a general rule, it is recommended not to force a failed compute node back to a RESUME state, so that the reason for the failure can be debugged before the node returns to service.
Nonetheless, if you wish to force a (potentially faulty) node back into service automatically on reboot, you can hardcode this in the Slurm configuration file:
sudo vim /etc/slurm/slurm.conf
and either add the ReturnToService parameter or change its value to 2.
ReturnToService=2
General
How do we learn more about the HPC Ecosystems Project?
Thanks for your interest in our project! Please visit our very basic website for additional information - you may want to start at the Getting Started section!
DNS
My smshost DNS is not resolving correctly.
Ensure that your hosts file is correctly populated:
If you look at the contents of /etc/hosts and find that the definition for smshost is present for both a 127.*.*.* address and 10.10.10.10, there is a risk: the clients read the hosts file from top to bottom, so smshost may resolve to the localhost IP range first, instead of 10.10.10.10. You can remove the localhost entry for smshost from the /etc/hosts file on the compute nodes, or ensure the ordering is correct.
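To see which entry wins, you can print the first address listed for a hostname. A minimal sketch; first_ip_for_host is a hypothetical helper that mimics the top-to-bottom lookup by reading the file directly:

```shell
# Hypothetical helper: print the first IP address that a hosts file lists for a name.
first_ip_for_host() {
    # $1: hostname to look up, $2: hosts file (defaults to /etc/hosts).
    awk -v host="$1" '
        /^[[:space:]]*#/ { next }     # skip comment lines
        {
            sub(/#.*/, "")            # drop trailing comments
            for (i = 2; i <= NF; i++)
                if ($i == host) { print $1; exit }
        }
    ' "${2:-/etc/hosts}"
}

# Example use on a compute node; the result should be 10.10.10.10:
#   first_ip_for_host smshost
```

If the result is a 127.*.*.* address, reorder or remove that line as described above.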
NTP / Chrony
System clock synchronized is no.
If System clock synchronized: no and NTP service: n/a:
On the compute node/s:
- chronyc -n sources should show 10.10.10.10 with ^*.
- check cat /etc/chrony.conf | grep pool and, if anything besides 10.10.10.10 is present, remove it.
- run systemctl is-active chronyd and systemctl restart chronyd.
- run timedatectl to verify that System clock synchronized: yes.
The BASH equivalent is:
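The following is a minimal sketch of the manual steps above and not necessarily the snippet originally shown here. clock_is_synced and check_chrony are hypothetical names, and check_chrony is meant to be run as root on a compute node.

```shell
# Hypothetical helper: succeed when timedatectl-style text on stdin reports a synchronised clock.
clock_is_synced() {
    grep -Eq 'System clock synchronized:[[:space:]]*yes'
}

# Hypothetical wrapper around the manual steps listed above.
check_chrony() {
    chronyc -n sources              # expect 10.10.10.10 marked with ^*
    grep pool /etc/chrony.conf      # expect nothing besides 10.10.10.10
    systemctl is-active chronyd
    systemctl restart chronyd
    # Synchronisation can take a few seconds after the restart.
    timedatectl | clock_is_synced
}

# Example use on a compute node:
#   check_chrony && echo "clock synchronised"
```

The wrapper does not edit /etc/chrony.conf; removing unwanted pool entries remains a manual step.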
Running Jobs
An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun.
If you see an error similar to the following:
[test@smshost ~]$ prun ./a.out
[prun] Master compute host = smshost
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi4)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
This may be related to two problems:
- The /etc/hosts file is incorrectly populated. See the DNS section.
- The time synchronisation between the compute nodes and smshost has drifted. See the NTP / Chrony section.