dsh for System Monitoring
dsh is a component of the clusterit toolkit that enables parallel execution of shell scripts. This can be quite useful for quickly checking the status of a large number of machines.
Installing dsh and its associated tools is straightforward. After downloading and extracting the archive contents, follow the usual ./configure, make, make install routine on Linux. On a clean install of OS X 10.6.5, the following procedure worked:
./configure --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib
make
sudo make install
dsh reads its list of machines from a simple text file at the location defined in the CLUSTER environment variable. The file data/cluster.txt, repeated below, contains a current (late 2010) list of machines. Ideally, this list would be generated dynamically from a service database rather than maintained by hand.
# Set this file as target for CLUSTER env var for use with dsh
GROUP:hardware
host-unm-1.dataone.org
host-orc-1.dataone.org
host-ucsb-1.dataone.org
controller-unm-1.dataone.org
host-unm-2.dataone.org
GROUP:CNs
cn-ucsb-1.dataone.org
cn-dev.dataone.org
cn-unm-1.dataone.org
cn-dev-2.dataone.org
cn-orc-1.dataone.org
GROUP:MNs
dev-dryad-mn.dataone.org
dev-fedora-mn.dataone.org
daacmn-dev.dataone.org
#knb-mn.ecoinformatics.org
GROUP:operations
monitor.dataone.org
mule2.dataone.org
mule1.dataone.org
public-web.dataone.org
redmine.dataone.org
epad.dataone.org
#lists.datone.org
#trac.dataone.org
#docs.dataone.org
#repository.dataone.org
#d1sweb.dataone.utk.edu
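As a rough sketch of how the file above can be put to use, the snippet below defines a small helper that prints the active (uncommented) hosts of a single group. The function name cluster_hosts and the example path are illustrative assumptions and are not part of clusterit. The clusterit documentation also describes a -g option for limiting a dsh run to named groups; check the man page of the installed version before relying on it.

```shell
# Hypothetical helper (not part of clusterit): print the active hosts of one
# GROUP from a cluster file. Usage: cluster_hosts GROUPNAME [FILE]
# FILE defaults to the value of the CLUSTER environment variable.
cluster_hosts() {
    awk -v group="$1" '
        /^GROUP:/ { current = substr($0, 7); next }   # remember the current group name
        /^[[:space:]]*#/ || /^[[:space:]]*$/ { next } # skip comments and blank lines
        current == group { print $1 }                 # host line in the requested group
    ' "${2:-$CLUSTER}"
}

# Example usage (the path is an assumption about where the file is kept):
#   export CLUSTER="$PWD/data/cluster.txt"
#   cluster_hosts CNs
```

Commented-out entries such as #knb-mn.ecoinformatics.org are skipped, so the helper lists only the hosts that are currently enabled in the file.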
When invoked, dsh by default executes the specified command on every machine defined in CLUSTER, which requires authenticating with each of those machines. Setting up public key authentication on each machine defined in CLUSTER avoids having to authenticate by hand on every run.
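One possible way to distribute a public key to every machine in the cluster file is sketched below. It assumes that a key pair already exists (for example, one created with ssh-keygen) and that the ssh-copy-id utility is installed, which is not the case on every system. The function name push_key, the DRY_RUN variable, and the default key path are illustrative assumptions.

```shell
# Hypothetical helper: copy a public key to every active host in a cluster file.
# Usage: push_key [PUBLIC_KEY] [FILE]
# By default it only prints the commands; set DRY_RUN=0 to actually run them.
push_key() {
    key="${1:-$HOME/.ssh/id_rsa.pub}"   # assumed default key location
    file="${2:-$CLUSTER}"               # defaults to the dsh cluster file
    # Drop comments, GROUP: headers, and blank lines, leaving only host names.
    for host in $(grep -v -e '^[[:space:]]*#' -e '^GROUP:' -e '^[[:space:]]*$' "$file"); do
        if [ "${DRY_RUN:-1}" = "0" ]; then
            ssh-copy-id -i "$key" "$host"
        else
            echo "ssh-copy-id -i $key $host"
        fi
    done
}
```

Running push_key with no arguments prints one ssh-copy-id command per enabled host, which makes it easy to review the list before running it with DRY_RUN=0.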
Run uptime on every machine and show any connection errors:
$ dsh -e uptime
host-unm-1.dataone.org : 15:25:47 up 189 days, 5:34, 0 users, load average: 1.78, 1.56, 1.52
host-orc-1.dataone.org : 17:32:42 up 91 days, 4:54, 0 users, load average: 0.54, 0.36, 0.24
host-ucsb-1.dataone.org : 14:14:22 up 114 days, 23:33, 0 users, load average: 0.80, 0.44, 0.37
controller-unm-1.dataone.org: 15:26:02 up 3 days, 5:26, 0 users, load average: 0.04, 0.03, 0.00
host-unm-2.dataone.org : 15:25:43 up 3:58, 0 users, load average: 0.00, 0.00, 0.00
cn-ucsb-1.dataone.org : 22:25:08 up 5 days, 21:24, 1 user, load average: 0.00, 0.00, 0.00
cn-dev.dataone.org : 14:25:45 up 13 days, 34 min, 0 users, load average: 0.00, 0.00, 0.00
cn-unm-1.dataone.org : 22:25:46 up 9 days, 22:40, 1 user, load average: 0.02, 0.07, 0.08
cn-dev-2.dataone.org : 16:25:43 up 24 min, 3 users, load average: 0.00, 0.00, 0.00
cn-orc-1.dataone.org : 22:26:13 up 7 days, 5:33, 1 user, load average: 0.01, 0.03, 0.00
dev-dryad-mn.dataone.org : 22:32:31 up 189 days, 5:34, 1 user, load average: 0.46, 0.11, 0.03
dev-fedora-mn.dataone.org : ssh: connect to host dev-fedora-mn.dataone.org port 22: Operation timed out
daacmn-dev.dataone.org : 22:32:35 up 91 days, 4:49, 0 users, load average: 0.00, 0.04, 0.06
monitor.dataone.org : 22:25:47 up 189 days, 5:34, 0 users, load average: 0.20, 0.09, 0.09
mule1.dataone.org : 22:25:46 up 189 days, 4:06, 0 users, load average: 0.00, 0.00, 0.00
public-web.dataone.org : ssh: connect to host public-web.dataone.org port 22: Operation timed out
redmine.dataone.org : 22:26:03 up 3 days, 5:22, 0 users, load average: 0.00, 0.00, 0.00
epad.dataone.org : 22:25:46 up 3 days, 4:32, 0 users, load average: 0.00, 0.00, 0.00
trac.dataone.org : 14:25:43 up 202 days, 19:19, 1 user, load average: 0.20, 0.10, 0.07
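With this many machines, the failures are easy to miss in the output. A minimal sketch of a filter that keeps only the unreachable hosts is shown below. It assumes the "host : output" line format seen above and the ssh error wording shown for dev-fedora-mn and public-web; the function name unreachable_hosts is illustrative.

```shell
# Hypothetical helper: read "host : output" lines on stdin and print the
# names of hosts whose output is an ssh connection error.
unreachable_hosts() {
    awk -F':' '
        /: *ssh: connect to host / {
            gsub(/[[:space:]]+/, "", $1)   # strip padding around the host name
            print $1
        }
    '
}

# Example usage. The 2>&1 is an assumption, added in case the error lines
# arrive on stderr rather than stdout:
#   dsh -e uptime 2>&1 | unreachable_hosts
```

For the run shown above, this filter would report dev-fedora-mn.dataone.org and public-web.dataone.org.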