3 Management Node Setup
This section will guide you through the setup of your System Management Server (SMS) host - the node responsible for managing the virtual cluster. To align with the OpenHPC install recipe, we will call it the `smshost`.
Click here to learn more about some of the different node terminologies that you may encounter in an HPC environment.
In an HPC environment, much like a traditional server environment in a datacentre or business, you will typically see various nodes with different roles.
"What's a node?" A node is another term for a server or computer in the HPC environment. Typically the term given to a node is based on the role that this node performs in the cluster environment. Some roles include:
- management node: manages the HPC cluster environment
- master node: another term for the management node
- boss node: yet another term for the management node
- host node: a less common term for the management node
- compute node: the worker node that performs computations in the cluster
- accelerator node: a node that also performs computations, but primarily through an accelerator
- GPU node: a specific type of accelerator node, that uses GPUs
- storage node: a node that manages storage for the cluster
- login node: a public-facing node that fields login attempts to the cluster
- head node: another term for the login node
In the virtual lab, the smshost will perform most of these roles, acting as the head node, management node, provisioning node and storage node.
Note
Anticipated time to complete this chapter: TBC from user feedback.
3.1 Deploy smshost
We will deploy the smshost using a simple Vagrant command: `vagrant up`.
Click here to learn what tasks the `vagrant up` command performs:
- initialises the Vagrant environment,
- downloads and initialises the base smshost VM,
- instructs VirtualBox to modify the VM configuration according to the parameters set out in the `Vagrantfile`.
This `vagrant up` process may take some time, depending on the speed of your internet connection and whether or not you have previously downloaded the Vagrant `.box` file referenced in the `Vagrantfile`.
Click here to learn more about the `Vagrantfile`.
There are many parameters that can be defined within the `Vagrantfile`, including:
`.vm.box`: this references the base `.box` file to be used for provisioning the VM. In the virtual lab, this points to one of two Vagrant boxes, depending on the VM in question:
`smshost.vm.box = "bento/rockylinux-8"`: uses the Bento box labelled `rockylinux-8` to provision the VM named `smshost`.
To see what other Bento boxes are available, go to the HashiCorp Vagrant Cloud.
The compute VMs in the virtual lab both point to a local file called `compute-node.box`:

```
compute00.vm.box = "file://./compute-node.box"
compute01.vm.box = "file://./compute-node.box"
```
Tip - click here to learn how Vagrant behaves with the `Vagrantfile`
Running an `ls` command (or equivalent) should at least list the `Vagrantfile` in the current working directory. The `vagrant up` instruction will reference this configuration file. If it is not located in the current working directory, Vagrant will climb up the directory tree (towards the root) looking for the first `Vagrantfile` it can find. This could lead to the wrong `Vagrantfile` being used, so please make sure you are in the correct working directory before running `vagrant up`.
See this link for more information.
Important - where to run `vagrant up`
Be careful when and where you run `vagrant up`. Make sure to run it from your Git root or root lab directory. Vagrant will initialise the VM with your current working directory as a shared directory, mounted within the VM at `/vagrant/`. Doing this ensures that your VM has access to the configuration files that you downloaded from Git - `input.local.lab`, `compute-node.box`, etc.
-
Navigate to your lab root directory, i.e. `~/openhpc-2.x-virtual-lab/`:
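For example (assuming you cloned the lab repository to the default location used throughout this guide):
```
cd ~/openhpc-2.x-virtual-lab/
```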
-
Run the command to initialise and deploy the smshost VM with Vagrant, as follows:
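Because this lab uses a multi-machine `Vagrantfile` (see the Important note below), the VM name is appended to the command:
```
vagrant up smshost
```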
The above command will have Vagrant read the parameters in the `Vagrantfile` for the smshost, create a VirtualBox VM definition (vCPUs, RAM, NICs, etc.), download and install the Rocky Linux image into the VM, boot the VM and install any additional stipulated packages.
Note
The Rocky Linux 8 image is approximately 680MB in size.
Running the `vagrant up` command may fail with an error relating to "The IP address configured for host-only network is not within the allowed ranges". This is a known issue with the latest versions of VirtualBox. To resolve this problem please follow the VirtualBox documentation on the matter.
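On most host platforms the documented fix amounts to allowing the lab's host-only range in VirtualBox's `networks.conf`. The path and range below are assumptions based on this lab's `10.10.10.0/24` network; check the VirtualBox documentation for your platform:
```
# On the host machine (Linux/macOS), as an administrator:
sudo mkdir -p /etc/vbox
echo "* 10.10.10.0/24" | sudo tee -a /etc/vbox/networks.conf
```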
Important
This virtual lab uses a single `Vagrantfile` to manage the smshost VM as well as the compute VMs - known as a multi-machine Vagrantfile. To ensure you are accessing the correct VM definition and VM, you are required to add the name of the VM after any Vagrant command, such as `vagrant up smshost` or `vagrant ssh smshost`.
-
Once the VM is booted and the additional packages have been installed, you should be able to access your smshost VM via `ssh`.
Tip
During the `vagrant up smshost` step, you can open the VirtualBox interface and watch the processes side-by-side. As Vagrant performs certain steps, you can see how they affect the VirtualBox VM configuration.
You can `ssh` to the VM at any time using one of the following methods:
- Using Vagrant:
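From your lab root directory on the host machine:
```
vagrant ssh smshost
```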
- Using any SSH client to connect to `127.0.0.1:2299` with the default Vagrant credentials (username::password) `vagrant::vagrant`.
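For example, with the OpenSSH command-line client (the forwarded port 2299 comes from the lab's `Vagrantfile`):
```
ssh -p 2299 vagrant@127.0.0.1
# password: vagrant
```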
While `vagrant ssh` is the easiest method, some people report that clipboard copy/paste functionality is not available on the VM.
Note
Your host machine has a shared directory with the VM, as defined in the `Vagrantfile`. By default this directory is the location of the `Vagrantfile` on your workstation and is mounted at `/vagrant/` on the VM.
Congratulations
You have deployed the smshost virtual machine with a base Rocky Linux OS and configured its VirtualBox parameters!
3.2 Add and Configure Parameters
Prior to continuing with the installation of the OpenHPC components on the smshost, several commands must be issued to set up the shell environment. To ensure that all defined variables are set for the current shell environment, the configuration file must be sourced.
Tip
The official OpenHPC recipe mentions an `input.local` environment file. This file is not present in the OpenHPC 2.x guide. For the purposes of this virtual lab we are using `input.local.lab` in its place, which is a simplified, pre-customised environment file. In either case, the configuration file (or local input file) must be sourced in the existing shell (i.e. loaded into the current shell environment).
If you make any updates to the configuration file, the `source` command must be run again to update the environment variables in the current shell.
Recall that `/vagrant/` on the VM is shared with the local host system at the directory location of your `Vagrantfile`. You will have pulled these configuration files when cloning the lab Git repo.
-
`ssh` to the smshost VM and elevate to the `root` user. You should be at the following prompt in your terminal:
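A minimal sketch of this step (the exact elevation command is an assumption; any method of obtaining a root shell will do):
```
$ vagrant ssh smshost
[vagrant@smshost ~]$ sudo -i
[root@smshost ~]#
```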
Tip - click here to learn more about best practice and why running commands as `root` is not a good idea!
Running commands as `root` is not best practice!
Since this is a test lab, it is considerably easier to issue commands as `root` and not have to worry about occasional `sudo` workarounds. In general, however, it is not recommended, and to maintain the best-practice habit of not running as `root`, we will still issue commands with `sudo`, even when logged in as `root`.
Running commands as `root` makes it very difficult to keep an audit trail of exactly which user ran a privileged command, and it is much easier to make an irreversible error when you are not consciously aware that you are a privileged user.
-
Examine the current environment variable status by echoing two of the variables that the input file defines:
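For example, using the two variables that are verified later in this step (`sms_name` and `sms_ip`):
```
[root@smshost ~]# echo ${sms_name}
[root@smshost ~]# echo ${sms_ip}
```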
Both commands will show a blank response.
-
After sourcing the local input file `input.local.lab`, the OpenHPC environment variables return the definitions read from that file.
If you are not in the correct working directory (as the lab anticipates you to be), first navigate to the `/vagrant` directory. Then source the variable file and verify that everything loaded correctly by noting the output of the `echo` commands:
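A sketch of these commands (the `cd` is only needed if you are not already in `/vagrant`; the echoed values come from `input.local.lab`):
```
[root@smshost ~]# cd /vagrant
[root@smshost vagrant]# source input.local.lab
[root@smshost vagrant]# echo ${sms_name}
smshost
[root@smshost vagrant]# echo ${sms_ip}
10.10.10.10
```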
Click here for a quick explanation of what just happened with the `echo` example above.
When you source `input.local.lab` you are essentially running line-by-line through the text file as a series of shell commands.
As with a standard BASH session, sections prefixed by a `#` are considered comments, and are safely ignored.
Knowing that a `source` will run line-by-line through the file, let's take a look at the first three input lines in `input.local.lab` (recall that we are effectively running three separate commands in the shell, one after the other):
`# sms host information` - this is treated as a comment and has no effect when it is run in the terminal.
`sms_name=smshost # Hostname for SMS server` - this will load the string `smshost` into an environment variable `sms_name` for the duration of this shell session. As with the previous line, the section following the `#` is safely ignored as a comment.
`sms_ip=10.10.10.10 # Internal IP address on SMS server` - as with the above line, this will load the string `10.10.10.10` into an environment variable `sms_ip` for the duration of this shell session. Again, the section following the `#` is safely ignored as a comment.
When we run an `echo` command, we are asking the terminal to show us what is stored in the variables named `sms_name` and `sms_ip`. The `${}` syntax properly encapsulates the environment variable names for referencing.
Click here if you are experiencing `syntax error` messages with `source input.local.lab`
Depending on your local environment, there may be some potential pitfalls that lead to syntax errors when sourcing the input file. There are some workarounds to consider, depending on your local environment setup:
- Download the file directly via HTTP using `wget`:
- To fix the local `input.local.lab` file, install `dos2unix` (`sudo dnf install dos2unix`) and run it on the file before invoking the above `source` command (see the sketch after this list).
- Before cloning with Git, stop it from changing line endings: `git config --global core.autocrlf false`. This only works before cloning the Git repository and is a blunt approach if you use Git for other repositories on the same machine.
- Run a `sed` command to fix the `input.local.lab` file without installing `dos2unix` (also shown in the sketch after this list):
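The following is a rough sketch of the `dos2unix` and `sed` workarounds described above; it assumes your copy of `input.local.lab` lives in `/vagrant`:
```
# Option 1: convert the Windows (CRLF) line endings with dos2unix
[root@smshost vagrant]# sudo dnf install dos2unix
[root@smshost vagrant]# dos2unix input.local.lab
[root@smshost vagrant]# source input.local.lab

# Option 2: strip the carriage returns with sed instead
[root@smshost vagrant]# sed -i 's/\r$//' input.local.lab
[root@smshost vagrant]# source input.local.lab
```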
-
Once the input file has been sourced successfully (which is verified by the output of the `echo` commands), all the environment variables will be set for the current shell session.
Tip - remember to re-source on every shell
Every new shell instance must have the input file re-sourced. This applies to a new `tmux` window, a new `tmux` shell, or after a disconnection, reboot, and so forth.
-
Add the DNS entry to the `/etc/hosts` file:
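Based on the explanation below (an `echo` whose output is appended to the file with `>>`), the step looks like this; the follow-up `cat` to inspect the result is an assumption:
```
[root@smshost ~]# echo "${sms_ip} ${sms_name}" >> /etc/hosts
[root@smshost ~]# cat /etc/hosts
```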
Click here to understand what just happened with `/etc/hosts`.
The `/etc/hosts` file is a local DNS file that is private to the host machine.
The `echo` command would usually display the result on the terminal screen; however, the use of `>>` redirects the output to `/etc/hosts`. Using `>>` appends the output to the end of `/etc/hosts`. If we used `>` instead, we would overwrite `/etc/hosts` and be left with only one line in the new file!
By adding the `sms_ip` and `sms_name` to the `/etc/hosts` file, we make it possible to reference the smshost locally by a fully-qualified name. In this case, the line `10.10.10.10 smshost` in the `/etc/hosts` file means that any reference to `smshost` will resolve locally to `10.10.10.10`.
Note that entries in `/etc/hosts` are scanned from the first line onwards until the first valid match is found or the end of the file is reached. This means that the order of your entries is very important.
As an aside, can you see how convenient it is to use environment variables to modify system files? If you were to type these values in manually every time you made a change, you would run the risk of inconsistencies when parameters differ across commands due to typos or out-of-date information.
By storing all the variable parameters in a single file, `input.local.lab`, it is generally easier to review the entire cluster configuration by simply reading over the contents of the file.
-
OpenHPC recommends disabling SELinux
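The command used here is the same one that appears in the `tmux` walk-through further below:
```
[root@smshost vagrant]# sudo sed -i "s/^SELINUX=.*/SELINUX=disabled/" /etc/selinux/config
```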
Click here to learn more about what the above command does.
`sed` is a stream editor that we use to work on text files. Throughout the virtual lab you will need to make changes to existing config files that hold default values. A common habit is to manually edit each config file with a text editor (`vim`, `vi`, etc.), and this is perfectly fine for a once-in-a-while edit, but there is always an inherent risk of configuration drift or simple human error.
A safer approach is to use as much automation as possible. This is where a tool like `sed` proves valuable. We will follow this approach throughout the virtual lab.
In the above step, we are searching the file `/etc/selinux/config` for a line that begins with `SELINUX=` and replacing it with `SELINUX=disabled`.
The default configuration in `/etc/selinux/config` is `SELINUX=permissive`, and if you run a `watch` in a separate session / `tmux` pane, you will see the line `SELINUX=permissive` replaced with `SELINUX=disabled`.
To invoke `tmux`:
- `tmux`
- Press CTRL+B and then `"` to open another horizontal pane.
- Type in `watch "cat /etc/selinux/config"` - pay attention to about halfway through the output, where `SELINUX=permissive` appears.
- Switch to the top pane by pressing CTRL+B and then the up arrow.
- Prepare the guide command: `sudo sed -i "s/^SELINUX=.*/SELINUX=disabled/" /etc/selinux/config`
- While watching the bottom pane, press ENTER and watch the result.
- After a short refresh interval (2 seconds by default), `SELINUX=permissive` will be replaced with `SELINUX=disabled`.
. -
Reboot the VM to have the SELINUX settings propagate.
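Any standard reboot from inside the guest will do; for example:
```
[root@smshost vagrant]# sudo reboot
```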
Tip
Always run as `root` and re-source the environment variables on every new shell.
As stated before, this is a test environment, so it is considerably easier to bypass occasional `sudo` workarounds by running directly as `root`. This is not best practice in a production environment!
OpenHPC makes extensive use of environment variable substitution. If you do not source the local input file `input.local.lab`, these variables will hold blank values and lead to unpredictable behaviour.
Always remember to source the `input.local.lab` file on every new shell instance.
-
On reboot, `vagrant ssh smshost` to the smshost, return to the `root` profile and `source` the correct environment:
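A sketch of reconnecting and re-sourcing (the elevation command is an assumption, as before):
```
$ vagrant ssh smshost
[vagrant@smshost ~]$ sudo -i
[root@smshost ~]# cd /vagrant && source input.local.lab
```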
Important - losing shared directories after reboot
When rebooting the VM from within the guest OS, the `/vagrant` directory may not be remapped on reboot. If the mapping is lost, a `vagrant reload` will reload the configuration file.
Alternatively, `vagrant halt` (graceful shutdown) followed by `vagrant up` will reboot the VM. This process also reloads the `Vagrantfile`, re-establishing the shared directory mapping.
`vagrant reload` is functionally equivalent to `vagrant halt` followed by `vagrant up`.
-
Disable the Firewall:
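A common way to do this (the exact commands are an assumption; the goal is simply that `firewalld` is stopped and will not start on boot):
```
[root@smshost vagrant]# sudo systemctl disable firewalld
[root@smshost vagrant]# sudo systemctl stop firewalld
```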
To verify the Firewall is indeed disabled:
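For example:
```
[root@smshost vagrant]# sudo systemctl status firewalld
```
The output should report the service as `inactive (dead)` and `disabled`.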
Click here to understand why we disable the firewall and why it's generally a bad idea!
It is not best practice to run a public-facing server without a firewall!
Again, since this is a test lab, it simplifies the learning objectives when we know that commands will not be blocked by an unexpected firewall issue, which would only serve as an added distraction and detract from the learning experience.
Troubleshooting network issues can be a daunting task for the best of us, so we mitigate the risk of confusion in the lab by removing the possibility of a firewall rule blocking service traffic.
The OpenHPC install recipe follows this philosophy, but you should be very aware of the risks of following this approach in a production environment.
Luckily, the risk to your smshost and virtual cluster environment is low, since this is a locally-hosted cluster that is very difficult for a remote malicious actor to access (and even if they did, not much could be stolen or harmed).
Congratulations
You have now installed and prepared the smshost virtual machine for the OpenHPC components.
The next step is to add the necessary packages in order to provision and manage the virtual cluster.
3.3 Add OpenHPC Components
Now that the base operating system is installed and booted, the next step is to add the desired OpenHPC packages to the smshost. These packages will provide provisioning and resource management services to the rest of the virtual cluster.
3.3.1 OpenHPC Repository
You will need to enable the OpenHPC repository for local use. This requires external internet access from your smshost, since the OpenHPC repository is hosted on the internet.
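A sketch of enabling the repository, assuming the `ohpc-release` package for OpenHPC 2.x on an EL8 system (the exact release RPM version and URL may differ; check the OpenHPC install recipe for the current one):
```
[root@smshost ~]# sudo dnf -y install http://repos.openhpc.community/OpenHPC/2/CentOS_8/x86_64/ohpc-release-2-1.el8.x86_64.rpm
```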
3.3.2 EPEL Release
In addition to the OpenHPC repository, the smshost needs access to other base OS distro repositories so that it can resolve the necessary dependencies. These include:
- BaseOS
- Appstream
- Extras
- PowerTools, and
- EPEL repositories
From the prior output, you may have noticed that `epel-release` is enabled automatically when installing `ohpc-release` (see the `Installed: epel-release-*` line). Unlike the other repositories, which are enabled by default, `PowerTools` must be enabled manually, as follows:
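A sketch of enabling the PowerTools repository (the `dnf-plugins-core` install is only needed if `dnf config-manager` is not already available):
```
[root@smshost vagrant]# sudo dnf -y install dnf-plugins-core
[root@smshost vagrant]# sudo dnf config-manager --set-enabled powertools
```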
Click here to learn more about the EPEL repository.
The EPEL repository is a volunteer-based community repository for Red Hat Enterprise Linux and compatible distributions, and the name stands for Extra Packages for Enterprise Linux.
3.3.3 Provisioning and Resource Management
In this virtual lab, system provisioning and workload management will be performed using Warewulf and Slurm, respectively.
To add support for provisioning services, one must add the common base package provided by OpenHPC, as well as the Warewulf Provisioning System.
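A sketch of these two installs, using the package names from the OpenHPC recipe (`ohpc-base` for the common base and `ohpc-warewulf` for the provisioning system):
```
[root@smshost vagrant]# sudo dnf -y install ohpc-base
[root@smshost vagrant]# sudo dnf -y install ohpc-warewulf
```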
To add support for workload management, we will install the Slurm Workload Manager. Simply put, Slurm will perform the role of a job scheduler for our HPC cluster.
Run the installation command:
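Assuming the OpenHPC server-side Slurm meta-package, this looks like:
```
[root@smshost vagrant]# sudo dnf -y install ohpc-slurm-server
```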
Click here to understand what role the smshost plays with Slurm.
The smshost acts as our Slurm server. This means that all jobs submitted to the cluster will be administered by Slurm, which is hosted on the smshost.
Conveniently, in our virtual lab the smshost serves multiple roles, which simplifies the setup slightly.
Users (in this case, you) will remotely connect to the smshost (in its role as a login node) and then submit jobs through the smshost (in its role as the Slurm server) for processing on the compute nodes.
The client-side components of the workload management system will be added to the corresponding compute node image that will eventually be used to boot the compute nodes, in the next chapter.
Note
In order for the Slurm server to function correctly, a number of conditions must be satisfied.
- It is essential that your Slurm configuration file, `slurm.conf`, is correctly configured. No need to worry! We will do this in the chapter on Resource Management.
- Slurm (and HPC systems in general) requires synchronised clocks throughout the system. We will utilise NTP for this purpose in the next section.
3.4 Configure Time Server
We will make use of the Network Time Protocol (NTP) to synchronise the clocks of all nodes in our virtual cluster. The following commands will enable NTP services on the smshost using the time server `${ntp_server}`, and allow this server to act as a local time server for the cluster. We will be using chrony, which is an alternative to ntpd:
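A sketch of these commands, based on the `/etc/chrony.conf` explanation further below (the final restart is an assumption):
```
[root@smshost vagrant]# sudo systemctl enable chronyd.service
[root@smshost vagrant]# echo "server ${ntp_server}" >> /etc/chrony.conf
[root@smshost vagrant]# sudo systemctl restart chronyd
```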
The official OpenHPC recipe opts to allow all servers on the local network to synchronise with the smshost. In this lab, we will restrict access to the fixed IP addresses of our virtual cluster using the variable `cluster_ip_range`, as follows:
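Based on the `allow ${cluster_ip_range}` explanation below (the restart is an assumption):
```
[root@smshost vagrant]# echo "allow ${cluster_ip_range}" >> /etc/chrony.conf
[root@smshost vagrant]# sudo systemctl restart chronyd
```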
To verify that the `chronyd` service has started correctly:
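For example:
```
[root@smshost vagrant]# sudo systemctl status chronyd
```
You can also query the configured time sources with `chronyc sources`.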
Click here to understand what just happened with `/etc/chrony.conf`.
By using `echo` we are once again redirecting output into a configuration file; in this case `/etc/chrony.conf`.
`server ${ntp_server}`: if you look at `input.local.lab` you will notice that `ntp_server=time.google.com`. In the previous steps, we added the NTP server details for `time.google.com` into our cluster configuration for NTP.
`allow ${cluster_ip_range}`: if you look at `input.local.lab`, this is defined as `10.10.10.0/24`, which is CIDR notation for all valid IPv4 addresses in the range 10.10.10.1 to 10.10.10.254. This means that the smshost will only serve time synchronisation to nodes on the IPv4 private network `10.10.10.0/24`, which we know is the HPC private IP range.
`>>` will append to an existing file, whereas `>` will create a new file.
Congratulations
You have now successfully completed the basic configuration of your smshost!
Before moving on to the configuration of the compute images, we will quickly cover how one can make backups of progress throughout this virtual lab.
3.5 Snapshot smshost (Recommended)
While it is entirely possible to run the whole virtual lab without making any backups of your progress, it is recommended to at least make snapshots at major milestones (such as at the end of each chapter of this guide). Be aware that too many snapshots can bloat your resource usage and will increase the amount of disk space you need to host the VMs.
You can make snapshots using either the VirtualBox GUI or the command line:
-
Through the VirtualBox Manager GUI:
Figure 2: How to snapshot a VM using the VirtualBox GUI
-
Through the command prompt:
Call the `snapshot save` instruction from the command line, followed by the `<vm_name>` (where applicable) and then the desired `<snapshot_name>`.
Run this command from your primary host machine's shell, not the VM environment! `snapshot save` is not to be invoked within the Vagrant session but at the command prompt outside the `vagrant ssh` session.
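Based on the sample output that follows, the milestone snapshot for this chapter would be taken like this (the snapshot name is just a label; choose any name you like):
```
$ vagrant snapshot save chapter3-smshost-complete
```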
```
==> smshost: Snapshotting the machine as 'chapter3-smshost-complete'...
==> smshost: Snapshot saved! You can restore the snapshot at any time by
==> smshost: using `vagrant snapshot restore`. You can delete it using
==> smshost: `vagrant snapshot delete`.
Machine 'compute00' has not been created yet, and therefore cannot save snapshots. Skipping...
Machine 'compute01' has not been created yet, and therefore cannot save snapshots. Skipping...
```
Note
Running the above command will take a snapshot of all VMs defined by your `Vagrantfile` at that moment (see the sample output above, where `compute00` and `compute01` are skipped). It may be useful to take a snapshot of only a subset of your VMs later on (to avoid backing up VMs whose configuration has not changed since the last snapshot).
To specify the VM that you want to snapshot:
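For example, to snapshot only the smshost (again, the snapshot name is arbitrary):
```
$ vagrant snapshot save smshost chapter3-smshost-complete
```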
For more information on how to manage and view your saved snapshots, see the Vagrant snapshot documentation.
Click here to recap what you have accomplished in this chapter.
You used `vagrant up` to provision the smshost with the definitions in the `Vagrantfile`.
The smshost is running Rocky Linux and has been configured with two network interface cards (public-facing and `hpcnet`-facing).
To access the smshost you can use `vagrant ssh` or an SSH client of your choice.
There is a shared folder between the VM and your local host machine; it is located wherever the `Vagrantfile` is present on your local host machine, and it maps to `/vagrant/` on the VM's OS.
You have loaded system environment variables by sourcing `input.local.lab` and used these parameters to configure the DNS entries in `/etc/hosts`.
For the virtual lab, the firewall has been disabled.
Additional OpenHPC components have been installed to prepare for the deployment of the HPC software stack.
Finally, you have saved your smshost state with a snapshot.
Congratulations
You have reached the end of Chapter 3 - Well done!
In this chapter you successfully deployed and configured your smshost VM. You are well on your way to your virtual cluster deployment. In the next chapter you will define and configure your compute node image.