This page documents how we build PBS for use at OSC. We run PBS on four different clusters of machines, with different architectures and operating systems. Most of these machines use the Maui scheduler rather than any of the provided PBS schedulers. The process starts with a stock OpenPBS 2.3.12 source distribution, applies about 36 patches, and then runs the usual configure, make, and make install steps.
Information related to the use of PBS at OSC can be found here, including many useful scripts for checking job status, accounting, and scheduling.
This line is what we use to configure a PBS tree, after unpacking and applying the patches described below. It assumes you will be using gcc to compile the code.
CFLAGS='-g -Wall -Wno-unused -Wno-parentheses -DNO_SECURITY_CHECK' \
./configure \
    --prefix=/usr/local/pbs \
    --set-server-home=/var/spool/pbs \
    --set-sched=no \
    --disable-shell-pipe \
    --enable-shell-use-argv
An explanation of the flags follows:
-g
: We do not use optimization since debugging is frequently
necessary, and performance without -O is not too bad.
-Wall -Wno-unused -Wno-parentheses
: The three "-W" flags turn
on extra gcc warnings, but turn off ones that are so frequently violated as to
be distracting. There will still be plenty of warnings generated by the
sloppy PBS code.
-DNO_SECURITY_CHECK
: Without this flag, the PBS server will
walk the path from / down to the location of its binary checking the
permissions along the way. We like to keep our /usr/local 775 and owned by
a systems group, rather than root, to allow owners of software to install
their codes directly. The PBS server exits when it finds this, unless we
turn off its silly "security" checks.
--prefix=/usr/local/pbs
: Install binaries here.
--set-server-home=/var/spool/pbs
: Use this directory for
PBS related files, including spool, private space for mom and server, logs.
--set-sched=no
: Do not compile a scheduler; we use Maui.
--disable-shell-pipe --enable-shell-use-argv
: See below for
a detailed description of the options near the "shell-use-argv" patch.
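To make the -DNO_SECURITY_CHECK entry above more concrete, here is a minimal shell sketch of the kind of path walk that flag disables. It is not the PBS code. The function name `loose_dirs` is made up for illustration, and the rule it applies (flag any directory that is group- or world-writable) is an assumption about what such a check would object to in a 775 /usr/local.

```shell
# Hypothetical helper, not part of PBS: print every directory between /
# and the given file that is writable by group or other.
loose_dirs() {
    d=$(dirname "$1")
    while :; do
        # The first ten characters of "ls -ld" output are the file type
        # and mode bits, for example "drwxrwxr-x".
        mode=$(ls -ld "$d" | cut -c1-10)
        case $mode in
            ?????w????|????????w?) echo "$d" ;;
        esac
        # Stop at the root (or at "." when given a relative path).
        case $d in
            /|.) break ;;
        esac
        d=$(dirname "$d")
    done
}

# Example call (the binary path is only an example):
# loose_dirs /usr/local/pbs/sbin/pbs_server
```

With /usr/local at mode 775, a walk like this would report /usr/local, which is the situation that makes the unpatched server refuse to start.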
To build, type "make". Parallel builds work fine too.
To install, be root, as some of the binaries are root-owned setuid, and do "make install". Then to install the man pages, do:
( cd doc ; make install )
The first time you install you will have to install the docs twice as some part of the make gets confused.
To build editor tags for your particular architecture, use:
(
find src/{cmds,iff,include,lib,mom_rcp,resmom/linux,server} -name '*.[ch]'
find src/resmom -maxdepth 1 -name '*.[ch]'
) | ctags -L-
where you can substitute some other architecture for linux above. The complexity of those lines is to avoid tagging unused files.
Finally, to put everything back to the way you found it, "make distclean".
Start with a fresh copy of the OpenPBS 2.3.12 distribution, unpack it, and give it a reasonable name:
tar xfz /home/pw/src/Tars/pbs-2.3.12.tgz
mv OpenPBS_2_3_12 pbs-2.3.12
cd pbs-2.3.12
Then apply this whole slew of patches, in the order given below. Using a different order may work fine, but you may get "orig" files due to large offsets for some of the patches.
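Applying patches in a fixed order is easy to script. The sketch below is not the Makefile that ships with the patch collection. It assumes a hypothetical text file (called `series` here) that lists one patch file per line in the desired order, and it assumes the patches apply with `-p1`, which this page does not state.

```shell
# Hypothetical helper: apply the patches listed in a series file, in order,
# from inside the unpacked source tree. Stops at the first failure.
apply_series() {
    series=$1
    [ -r "$series" ] || { echo "cannot read $series" >&2; return 1; }
    while IFS= read -r p; do
        # Skip blank lines and comment lines.
        case $p in
            ''|'#'*) continue ;;
        esac
        echo "applying $p"
        # -p1 is an assumption; adjust to match how the patches were made.
        patch -p1 < "$p" || return 1
    done < "$series"
    # List any leftover ".orig" files, which the text above mentions as a
    # symptom of applying patches out of order.
    find . -name '*.orig' -print
}
```

Run it from the top of the source tree, for example `apply_series ../patches/series`, where that path is again only an example.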
You can download the patches one at a time using the little curly symbol in front of each name, or you can get the entire collection in a single tarball, which also comes with a handy Makefile to apply all the patches in order:
If you want to experiment with a source RPM format, the patches along with the original pbs tarball and a spec file to put it all together can be found here:
Add a few fixes to the TM interface and some functionality enhancements for the MPI parallel code launcher, mpiexec (http://www.osc.edu/~pw/mpiexec/). Copied from mpiexec/patch/pbs-2.3.12-mpiexec.diff.
Convert almost all blocking system calls to non-blocking to avoid hanging the server when a mom dies. This is based on the now-classic CPlant fault tolerance patch, but heavily modified from that original.
This adds an environment variable which is received by the prologue and epilogue scripts and can be used to modify the system based on the "-lnodes=" request made by the user.
This increases some communication timeouts to allow for busier and larger clusters and networks. This second version decreases the TCP timeout for communication with the scheduler as something seems to be broken with moab.
On Linux systems, this fixes parsing of /proc/pid/meminfo to avoid overflow for values larger than a 32-bit integer. It also reads the total memory on the system from /proc/meminfo, rather than /proc/kcore, as the latter source is no longer accurate. Finally, it no longer reads the three header lines that 2.4 kernels put at the top of the file, as those disappeared in 2.6.
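As a rough illustration of the approach (look up total memory by its label in /proc/meminfo rather than relying on header lines at fixed positions), the sketch below uses awk. It is not the patch itself, which changes the C code of the mom. The function name `mem_total_bytes` is invented here, and the sketch assumes the usual `MemTotal: <n> kB` line format.

```shell
# Hypothetical helper: print total memory in bytes from a meminfo-style file.
# Defaults to /proc/meminfo; a different file can be passed for testing.
mem_total_bytes() {
    awk '$1 == "MemTotal:" {
        # awk does its arithmetic in double precision floating point,
        # so kB * 1024 stays exact well beyond the 32-bit integer range.
        printf "%.0f\n", $2 * 1024
        exit
    }' "${1:-/proc/meminfo}"
}
```

Matching on the field name means the lookup keeps working whether or not the kernel prints extra header lines first.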
This fixes the install script not to rewrite the contents of /var/spool/pbs/server_name on every install. Handy if you make and install from a machine which will not be your PBS server, or if the server_name includes a port number or something more complex than just the short hostname of the installing machine.
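The behavior described amounts to "only write the file if it is not already there". A minimal sketch of that idea follows, with a made-up function name and arguments. The real install script is structured differently.

```shell
# Hypothetical helper: create server_name only if it does not exist yet.
#   $1 = PBS home directory (for example /var/spool/pbs)
#   $2 = default server name to write on a first install
write_server_name() {
    f=$1/server_name
    if [ -f "$f" ]; then
        echo "keeping existing $f"
    else
        echo "$2" > "$f"
    fi
}
```

A hand-edited entry, such as a host name with a port number, then survives later installs from any build machine.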
Edit the manpage for pbsnodes to fix a typo.
Include newer config.guess and config.sub files from gnu.org. They are still quite old (2001-02-24 compared to 1997, though), but are good enough to know about the ia64 architecture.
Cause "make distclean" to remove generated files in the ers/ subdirectory.
PBS mom was incorrectly including linux kernel headers, which no longer works on modern systems. This fixes those includes.
Increase a static buffer used by tracejob to avoid truncating long lists of nodes used by a parallel job.
Similar to the above, increase some PBS server limits to avoid it truncating long node lists.
Grab bag of ANSI-fication, warning removal, and comment fixes.
Plug a file descriptor leak in the mom that occurs when jobs do not start correctly.
Delete unused, and somewhat confusing, variable to break out of TCP poll loops.
When a job is requeued during the prologue step, be sure that the other moms involved in a multinode allocation find out. Otherwise they will report errors when the job is rerun on them later.
Change default shutdown behavior of PBS server to leave jobs alone. Previously it would kill everything off by default. From the Ben collection.
Fix bug in server node allocation code. From the Ben collection.
Disable a default behavior for qterm. The default was especially dangerous, thus this at least makes one think about the action. From the Ben collection.
This extra handy patch enables readline support for qmgr. The command-line editing features of that library are quite nice for those of us used to it in Unix shells. From the Ben collection, modified a bit.
Remove arbitrary limit of 15 characters in the job name field. The claim is that this is required by a specification somewhere, but our users get annoyed at the short names it enforces.
Prevent the operating system from swapping out the pages of a mom process. Inspired by a patch from NCSA, but fixed to make sure that children spawned by the mom do not continue to have all their pages locked too. Also quite simplified.
Prevent "make install" from hiding the fact that it creates a directory.
Another grab bag of ANSI-fication, warning removal, and comment fixes.
Add a new method of job invocation to the already existing two choices. Now you may pick one of three:
--enable-shell-pipe
: The _name_ of the shell script is
written into a pipe connected to standard input of the shell process. This
causes the shell (as specified by -S or passwd file) to spawn another shell to
run the actual script file. That spawning uses the standard unix mechanism:
check for #! on the first line, and if not, use /bin/sh. Hence the user
choice for shell via PBS -S option is ignored and shell aliases and other
non-environment settings are lost.
--disable-shell-pipe
: The standard input of the shell is
connected to the script file. This is better because then the invoked shell
will read commands out of the script file and execute them. It is bad,
however, if anything in the shell script tries to read from stdin since it
will see the command stream and move the shared file pointer for the parent
shell as well leading to corruption there. Running mpiexec codes is one way
to see it: mpiexec will read from stdin and ship it to process #0 of the
parallel task. This stdin is just the shell command stream.
--enable-shell-use-argv
: This new option added by pw invokes
the shell with a single command line argument, the script file. This seems to
be the best of both worlds in that only a single shell is used (no need to
execute the wrong subshell) and the user's preference is honored, and commands
are read from the script file and do not corrupt stdin. There were concerns
that with an argument on the command line that the shell would not "act" like
a login shell and miss reading some environment or other settings, but on the
clusters, csh and bash do the right thing.
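The difference between feeding the script on standard input and passing it as an argument can be tried outside PBS. The sketch below is only a stand-in: plain `sh` plays the part of the user's shell, a throwaway script plays the job, and `cat` plays a program such as mpiexec that reads stdin. None of this is how the mom actually launches jobs.

```shell
# Build a throwaway "job script" whose middle command reads stdin.
tmp=$(mktemp -d)
cat > "$tmp/job.sh" <<'EOF'
echo "step 1"
cat > /dev/null
echo "step 2"
EOF

# Like --disable-shell-pipe: the script itself is the shell's stdin, so
# "cat" shares that stream and can swallow the rest of the script.
stdin_style=$(sh < "$tmp/job.sh")

# Like --enable-shell-use-argv: the script is an argument, and stdin is a
# separate stream that the script's commands are free to read.
argv_style=$(sh "$tmp/job.sh" < /dev/null)

echo "script on stdin:    $stdin_style"
echo "script as argument: $argv_style"
rm -rf "$tmp"
```

In the argument case both steps run. In the stdin case, a shell that reads its commands incrementally will typically lose "step 2" because `cat` has already consumed it, though the exact result can vary between shells.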
Add a magic resource manager field which helps Maui and its companion metascheduler, Silver.
The third grab bag of ANSI-fication, warning removal, and comment fixes.
For mpiexec-spawned jobs to survive across a mom restart, and to enable proper accounting for all jobs which continue across a mom restart, this patch fixes some behavior of mom when restarted with the "-p" flag. Note that this patch adds functionality to the machine-specific part of the mom code for linux only. Users of other system types could cut-n-paste that code without too much problem, but as it stands, this patch will break compilation on non-linux systems.
This patch does four things:
Add some "$(ROOT)" prefixes to variables in the install scripts to allow PBS packages, such as RPM, to build anywhere.
This patch overhauls the way that memory resource usage is managed:
This is a hack to avoid having a version-specific path end up in libpbs.a, because when Maui gets built it will also get the same path. If the path were version dependent, we would have to install a new version of Maui every time we upgraded PBS.
Fix a small bug with the code that closes all file descriptors when the PBS server starts. Although this fix would be obvious, it exposes some other bugs that depend on it being broken. Instead just comment out the code and add a warning comment.
Allow the scheduler to change the Resource_List.neednodes of jobs that tried but failed to run, otherwise they are stuck waiting for the same set of nodes to run again.
Do not declare errno extern, instead include the proper header file. Fixes broken compilation on modern glibc systems.
Modify the dependency calculation scripts to work with gcc-3.2 and newer compilers. The makedepend command will generate lines like "attr_atomic.o: <built-in>" and "attr_atomic.o: <command", which mean nothing to the Makefile and just end up causing errors. We want to remove these lines.
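The kind of filtering involved can be sketched with sed. This is an illustration only, not the patch: the function name `filter_deps` is made up, and the patterns simply match the two example lines quoted above.

```shell
# Hypothetical filter: drop dependency lines whose "prerequisite" is one of
# gcc's pseudo file names rather than a real file on disk.
filter_deps() {
    sed -e '/: <built-in>/d' -e '/: <command/d'
}
```

It would be used as a pipeline stage on the generated dependency lines, for example `some_dependency_generator | filter_deps`, where the generator name is a placeholder.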
Listen for and service lots more sockets in server code.
If the user specifies a batch script to an interactive job, the old behavior was to parse the script for #PBS directives but otherwise ignore the contents. With this patch, the script is run inside a shell connected to the terminal, allowing the user to see the output and provide interactive input to the script. Patch by Michal Kouril of ECECS at University of Cincinnati.
Replace the call to ruserok() in the authentication procedure with a call to another function which has the same signature, but instead uses the PAM facilities to determine whether a user is allowed to submit a job.
The default 15 sec during "hot recovery" was much too much, and every 5 minutes in general is fine I think.
Some of the above patches come from the Ben collection which also includes other potentially useful ones.
Last updated 6 Aug 2004.