Frequently Asked Questions (FAQ) for the OpenHPC 2.x virtual lab.

How do I use this FAQ?

We have categorised this FAQ into sections based on the common tools used in the OpenHPC 2.x virtual lab. You can use the search button (top right) or scroll through the categories (right sidebar).

Issues relate to common faults in the deployment that require attention before you can progress. These are represented by a red X in a red box.

Questions relate to general queries about the deployment that do not typically affect progress in the lab. These are represented by a green ? in a green box.

This is an FAQ question example.

This is an FAQ issue example.


Virtual Lab Software

What are the verified versions of the virtual lab software stack?

The following versions of VirtualBox have been successfully used in the test environment for this virtual lab:

Windows 11
- VirtualBox Version 7.0.8 (verified August 2023).
- VirtualBox Version 6.0.24.


Shell / Terminal

I receive a no such file or directory error when trying to access /vagrant.

This is likely to occur in step 3.2.7, after updating SELINUX=disabled and rebooting the smshost.

bash: /vagrant/input.local.lab: No such file or directory

Exit the VM (Ctrl+D disconnects one layer of session at a time; repeat until you are back in the cloned repository directory on your host machine), then run:

vagrant reload smshost

I am getting errors in my deployment all over the place.

If you have rebooted or started a new shell, you need to re-source the input.local.lab file, otherwise you may experience unpredictable behaviour. If you run commands that refer to its environment variables (e.g. sms_ip) without sourcing input.local.lab first, sms_ip will be empty, and any command expecting `10.10.10.10` will instead receive a blank value.
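
A minimal check, assuming the file lives at /vagrant/input.local.lab as in the lab deployment:

# verify whether the lab variables are set in the current shell
echo "sms_ip=${sms_ip}"

# if the output is empty, re-source the lab parameters and check again
source /vagrant/input.local.lab
echo "sms_ip=${sms_ip}"   # should now print sms_ip=10.10.10.10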

I get a ")syntax error: invalid arithmetic operator (error token is " error when sourcing input.local.lab.

This is a common problem with Windows and Git. If you are seeing this error, it is likely that CRLF line endings replaced LF when the repository was cloned.

You can find and replace the CRLF line endings in BASH, or you can set Git's default behaviour not to convert them, by issuing this command in a host machine terminal:

git config --global core.autocrlf false

To fix the existing file, you can run a regular expression to remove the CR in the CRLF:
sed 's/\r$//' /vagrant/input.local.lab > cleaned.txt

This will remove the CR (written as \r) and create a new file cleaned.txt. You can then source cleaned.txt, or instead redirect the output back to input.local.lab (or whatever name you wish).
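
Alternatively, a minimal in-place sketch using sed's -i flag, which edits the original file directly instead of creating cleaned.txt:

sed -i 's/\r$//' /vagrant/input.local.lab
source /vagrant/input.local.lab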

How do I resume a tmux session on reconnect?

tmux ls

will list existing sessions. Usually you will only have one, and it can be resumed by running:

tmux a

(a is short for attach).
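
If more than one session is listed, you can attach to a specific one by its name or index; for example (session 0 here is illustrative):

tmux ls            # list sessions, e.g. 0: 1 windows (created ...)
tmux attach -t 0   # attach to session 0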


Warewulf

How do I propagate changes that I've made to a stateless boot image?

To answer this question, it is important to understand how Warewulf behaves. The TL;DR is:
after any modifications to the chroot you must run

sudo wwvnfs --chroot $CHROOT


The explanation behind this is: Compute nodes need to be provisioned with an image. We create a bootstrap image which is used to boot the nodes and complete this provisioning process. Creating a bootstrap image is accomplished in the guide through:

[root@smshost ~]#
wwbootstrap $(uname -r)

While most of the provisioned image's configuration is conducted in a chroot filesystem, these chroots cannot be directly provisioned by Warewulf. Once we are satisfied with our chroot configuration, we need to encapsulate and then compress this new filesystem into a Virtual Node File System (VNFS) image which Warewulf can provision.

Think of the chroot behaving as the source code, and the VNFS behaving as the compiled binary of that source.

[root@smshost ~]#
wwvnfs --chroot $CHROOT

See the Warewulf docs for more information.
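
Putting it together, a typical propagate-and-reboot cycle might look like the sketch below (the package placeholder and the compute00 node name are illustrative; adapt them to your deployment):

# make a change inside the chroot, e.g. install a hypothetical package
sudo dnf -y --installroot=$CHROOT install <package>

# repackage the chroot into the VNFS image that Warewulf provisions
sudo wwvnfs --chroot $CHROOT

# reboot the compute node so it picks up the new image on its next PXE boot
sudo pdsh -w compute00 reboot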


Vagrant

The machine with the name “client01” was not found configured for this Vagrant environment?

This happens if the Vagrantfile is modified during a live Vagrant deployment. The original Vagrant image will no longer be consistent with the Vagrantfile.

Run the following command:

vagrant global-status --prune

Once done, remove the VirtualBox VM from the VirtualBox Manager.

How do I find the vagrant image ID number?

Use vagrant global-status to list all Vagrant image IDs.

How do I save the current VM state and pause the VM?

Use the vagrant suspend command from your terminal. To resume the VM later, use vagrant resume.

For more information: https://developer.hashicorp.com/vagrant/docs/cli/suspend


PXE

The compute node VM is failing to boot via PXE

There are many reasons for a VM to not boot through PXE (network, firewall, tftp, DHCP, PXE, etc.).

In a working PXE environment, the compute nodes may still fail to boot with PXE when they do not have sufficient RAM available.

As a first step, if the VM has previously successfully booted and now is failing on a reboot, this is a known issue with Vagrant and VirtualBox - to fix this, force a hard reboot of the VM in VirtualBox (i.e. Stop and Start the VM).

If the VM has never successfully booted, try increasing the allocated RAM from 3 GB to 4 GB on the compute node image, as sketched below.
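
With the VirtualBox provider, the allocation is typically set in the Vagrantfile; a minimal sketch, assuming a node block named compute01 (match the names and values to your own Vagrantfile):

# inside the existing Vagrant.configure("2") block of your Vagrantfile
config.vm.define "compute01" do |compute|
  compute.vm.provider "virtualbox" do |vb|
    vb.memory = 4096   # raise from 3072 (3 GB) to 4096 (4 GB)
  end
end

Restart the VM afterwards so the new allocation takes effect.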

Link to an OpenHPC community thread on this issue.


Slurm

My compute node is not receiving configuration parameters from the Slurm controller?

Check that the compute node image is configured properly on the smshost to receive its configuration parameters through a dynamic reference to the Slurm controller (smshost with IP address 10.10.10.10):

cat $CHROOT/etc/sysconfig/slurmd

should reveal:

SLURMD_OPTIONS=--conf-server 10.10.10.10
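
If the line is missing or wrong, one way to set it on the smshost (a sketch assuming input.local.lab has been sourced so ${sms_ip} holds 10.10.10.10, and that you are running as root) is:

echo SLURMD_OPTIONS="--conf-server ${sms_ip}" > $CHROOT/etc/sysconfig/slurmd

# rebuild the VNFS so the change reaches the compute nodes (see the Warewulf section)
sudo wwvnfs --chroot $CHROOT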

My compute node is still not talking to the Slurm controller even when set to the RESUME state?

On rare occasions, the compute node may require a little nudge.

From the compute node, try running ping smshost and/or sinfo to re-establish the connection to the Slurm controller.

My compute node is in a COMPLETING state loop?

You can manually force an interrupt on the compute nodes that are not exiting the CG state:

sudo scontrol update nodename=compute01 state=down reason="CG loop"

I am seeing an error: slurm_load_partitions: Unable to contact slurm controller (connect failure)

Nine times out of ten, Slurm's problems are related to DNS. The local hosts file is read top-down on a first-match basis, so the first valid entry found for a hostname is the one that will be used.

On the affected compute node/s check the local /etc/hosts and make sure that the first line relating to smshost is referring to its accessible IP address (in the Lab's case, the default sms_ip is 10.10.10.10) and not its localhost IP address (such as 127.*.*.*).

If both entries are in the hosts file, you can safely remove the 127.*.*.* reference, or place it after the 10.10.10.10 reference.

How do I know that my compute nodes are in a usable state ready to accept tasks?

When you run sinfo, you should see idle under the STATE column to confirm that they are ready to accept jobs.
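
For illustration, healthy output might look something like the following (the partition name, time limit and node names will vary per deployment):

[test@smshost ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      2   idle compute[00-01]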

How do I start Slurm on compute nodes remotely?

Replace the compute_prefix with the appropriate values, or ensure you have sourced your input.local.lab parameters. Likewise, the compute range should be updated to match the nodes you wish to interact with.

sudo pdsh -w ${compute_prefix}[00-19] "sudo systemctl start slurmd"

How do I set compute nodes to RESUME state if they are in another state, such as UNK*, DOWN, DRAIN, etc.?

sudo scontrol update nodename=compute00 state=resume

If there are still connection errors (i.e. sinfo shows an *), you can try to revive the connection using the tips under the question "My compute node is still not talking to the Slurm controller even when set to the RESUME state?" above.

My compute nodes are not automatically set to RESUME state on initial boot?

There are three ReturnToService settings available in Slurm. As a general rule, it is recommended not to force a failed compute node back to the RESUME state automatically, so that the reason for its failure to return to service can be properly debugged.

Nonetheless, if you wish to force a (potentially faulty) node back into service automatically on reboot, you can hardcode this in the Slurm configuration file:

sudo vim /etc/slurm/slurm.conf

and either add the ReturnToService parameter or edit it so that it is set to 2:

ReturnToService=2


General

How do we learn more about the HPC Ecosystems Project?

Thanks for your interest in our project! Please visit our very basic website for additional information - you may want to start at the Getting Started section!

HPC Ecosystems Project website.

DNS

My smshost DNS is not resolving correctly.

Ensure that your hosts file is correctly populated:

If you look at the contents of /etc/hosts and find that the definition for smshost is present for both a localhost address (127.*.*.*) and 10.10.10.10, remember that clients read the hosts file on a first-match basis, so there is a risk that smshost will resolve to the localhost address first instead of 10.10.10.10. You can remove the localhost entry from the /etc/hosts file on the compute nodes, or ensure the ordering is correct.
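
For illustration, a compute node's /etc/hosts along these lines will resolve smshost correctly (note that the localhost line must not also name smshost):

127.0.0.1     localhost localhost.localdomain
10.10.10.10   smshost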

NTP / Chrony

System clock synchronized is no.

If System clock synchronized: no and NTP service: n/a:

On the compute node/s:

  • chronyc -n sources should show 10.10.10.10 with ^*.
  • check cat /etc/chrony.conf | grep pool and, if anything besides 10.10.10.10 is present, remove it.
  • run systemctl is-active chronyd to confirm the service is running.
  • run systemctl restart chronyd.
  • run timedatectl to verify that System clock synchronized: yes.

The BASH equivalent (commenting out any pool entries in the compute image's chrony.conf) is:

sed -i 's/pool/#pool/g' $CHROOT/etc/chrony.conf
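
If the compute image's chrony.conf does not reference the smshost at all, a sketch for adding it (run as root on the smshost, with input.local.lab sourced, and remembering to rebuild the VNFS afterwards) is:

echo "server ${sms_ip} iburst" >> $CHROOT/etc/chrony.conf
sudo wwvnfs --chroot $CHROOT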

Running Jobs

An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun.

If you see an error similar to the following:

[test@smshost ~]$ prun ./a.out
[prun] Master compute host = smshost
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./a.out (family=openmpi4)
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

This may be related to one of two problems:

  1. The /etc/hosts file is incorrectly populated - see the DNS section.

  2. The time synchronisation between the compute nodes and the smshost has drifted - see the NTP / Chrony section.


Bug report

Click here if you wish to report a bug.

Provide feedback

Click here if you wish to provide us feedback on this chapter.