Using dsh for System Monitoring

dsh is a component of the clusterit toolkit that runs shell commands in parallel across a set of machines. This makes it useful for quickly checking the status of a large number of hosts.

Installation

Installing dsh and its associated tools is straightforward. After downloading and extracting the archive contents, follow the usual ./configure, make, make install routine on Linux. On a clean install of OS X 10.6.5, the following procedure worked:

./configure --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib
make
sudo make install

Configuration

dsh reads its list of machines from a plain text file whose path is given by the CLUSTER environment variable. The file data/cluster.txt, reproduced below, contains the list of machines as of late 2010. Ideally this list would be generated dynamically from a service database rather than maintained by hand.

# Set this file as target for CLUSTER env var for use with dsh
GROUP:hardware
host-unm-1.dataone.org
host-orc-1.dataone.org
host-ucsb-1.dataone.org
controller-unm-1.dataone.org
host-unm-2.dataone.org

GROUP:CNs
cn-ucsb-1.dataone.org
cn-dev.dataone.org
cn-unm-1.dataone.org
cn-dev-2.dataone.org
cn-orc-1.dataone.org

GROUP:MNs
dev-dryad-mn.dataone.org
dev-fedora-mn.dataone.org
daacmn-dev.dataone.org
#knb-mn.ecoinformatics.org

GROUP:operations
monitor.dataone.org
mule2.dataone.org
mule1.dataone.org
public-web.dataone.org
redmine.dataone.org
epad.dataone.org
#lists.dataone.org
#trac.dataone.org
#docs.dataone.org
#repository.dataone.org
#d1sweb.dataone.utk.edu
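
As a minimal sketch of putting this file to use, the snippet below points CLUSTER at it and defines a small helper that prints the active host names. The path and the helper name list_hosts are assumptions made for illustration, and the filter only understands the GROUP: headers and # comments seen in the listing above.

```shell
# Point dsh at the cluster file. The location is an assumption;
# adjust it to wherever data/cluster.txt lives on your machine.
export CLUSTER="$PWD/data/cluster.txt"

# list_hosts FILE (hypothetical helper): print the active host names
# in a cluster file, skipping blank lines, "#" comments and GROUP: headers.
list_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' -e '^GROUP:' "$1"
}
```

Running list_hosts "$CLUSTER" is a quick way to check which machines dsh will contact before issuing a command.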

When invoked, dsh by default executes the specified command on every machine defined in CLUSTER, which requires authenticating with each of them. Setting up public key authentication on each machine avoids being prompted for a password by every host on every run.
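
One possible way to distribute a public key is sketched below. It assumes a key pair already exists (for example, created with ssh-keygen), that the ssh-copy-id utility is installed locally (it is not present on every platform), and that the cluster file contains only GROUP: headers, # comments and host names. The function name push_key and its optional second argument are hypothetical and exist only for this illustration.

```shell
# push_key FILE [COPY_CMD] (hypothetical helper): run COPY_CMD, by default
# ssh-copy-id, once for every active host listed in the cluster file FILE.
push_key() {
    cluster_file=$1
    copy_cmd=${2:-ssh-copy-id}
    # A for loop is used rather than "while read" so that ssh cannot
    # swallow the remaining host names from standard input.
    for host in $(grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' \
                          -e '^GROUP:' "$cluster_file"); do
        "$copy_cmd" "$host" || echo "failed: $host" >&2
    done
}
```

Calling push_key "$CLUSTER" prompts for each host's password once; after that, dsh should be able to connect without further prompts. Passing echo as the second argument gives a dry run that only prints the hosts.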

Examples

Run uptime on every machine, with -e so that connection errors are shown:

$ dsh -e uptime
host-unm-1.dataone.org      :  15:25:47 up 189 days,  5:34,  0 users,  load average: 1.78, 1.56, 1.52
host-orc-1.dataone.org      :  17:32:42 up 91 days,  4:54,  0 users,  load average: 0.54, 0.36, 0.24
host-ucsb-1.dataone.org     :  14:14:22 up 114 days, 23:33,  0 users,  load average: 0.80, 0.44, 0.37
controller-unm-1.dataone.org:  15:26:02 up 3 days,  5:26,  0 users,  load average: 0.04, 0.03, 0.00
host-unm-2.dataone.org      :  15:25:43 up  3:58,  0 users,  load average: 0.00, 0.00, 0.00
cn-ucsb-1.dataone.org       :  22:25:08 up 5 days, 21:24,  1 user,  load average: 0.00, 0.00, 0.00
cn-dev.dataone.org          :  14:25:45 up 13 days, 34 min,  0 users,  load average: 0.00, 0.00, 0.00
cn-unm-1.dataone.org        :  22:25:46 up 9 days, 22:40,  1 user,  load average: 0.02, 0.07, 0.08
cn-dev-2.dataone.org        :  16:25:43 up 24 min,  3 users,  load average: 0.00, 0.00, 0.00
cn-orc-1.dataone.org        :  22:26:13 up 7 days,  5:33,  1 user,  load average: 0.01, 0.03, 0.00
dev-dryad-mn.dataone.org    :  22:32:31 up 189 days,  5:34,  1 user,  load average: 0.46, 0.11, 0.03
dev-fedora-mn.dataone.org   : ssh: connect to host dev-fedora-mn.dataone.org port 22: Operation timed out
daacmn-dev.dataone.org      :  22:32:35 up 91 days,  4:49,  0 users,  load average: 0.00, 0.04, 0.06
monitor.dataone.org         :  22:25:47 up 189 days,  5:34,  0 users,  load average: 0.20, 0.09, 0.09
mule1.dataone.org           :  22:25:46 up 189 days,  4:06,  0 users,  load average: 0.00, 0.00, 0.00
public-web.dataone.org      : ssh: connect to host public-web.dataone.org port 22: Operation timed out
redmine.dataone.org         :  22:26:03 up 3 days,  5:22,  0 users,  load average: 0.00, 0.00, 0.00
epad.dataone.org            :  22:25:46 up 3 days,  4:32,  0 users,  load average: 0.00, 0.00, 0.00
trac.dataone.org            :  14:25:43 up 202 days, 19:19,  1 user,  load average: 0.20, 0.10, 0.07
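
With output this long, the failed connections are easy to miss. The sketch below filters output in the format shown above and prints only the hosts that reported an ssh connection error. The function name unreachable_hosts is hypothetical, and the match relies on the "ssh: connect to host" wording seen in this listing, which may differ between ssh versions.

```shell
# unreachable_hosts (hypothetical helper): read "dsh -e" output on stdin
# and print the name of each host whose line reports an ssh connection error.
unreachable_hosts() {
    awk -F':' '/: ssh: connect to host/ {
        sub(/[ \t]+$/, "", $1)   # drop the padding after the host name
        print $1
    }'
}
```

Typical use would be dsh -e uptime 2>&1 | unreachable_hosts, where 2>&1 is added in case the errors arrive on standard error. The dsh shipped with clusterit also accepts -g followed by a comma-separated list of group names, so something like dsh -e -g CNs uptime should limit the check to a single group; consult the dsh man page for your installed version to confirm.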