Using ``dsh`` for System Monitoring
===================================

*dsh* is a component of the clusterit_ toolkit that enables parallel execution
of shell scripts. This can be quite useful for quickly checking the status of a
large number of machines.

Installation
------------

Installing *dsh* and its associated tools is straightforward. After downloading
and extracting the archive contents, follow the usual *./configure*, *make*,
*make install* routine on Linux. On a clean install of OS X 10.6.5, the
following procedure worked::

  ./configure --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib
  make
  sudo make install

Configuration
-------------

*dsh* uses a simple text file at the location defined in the *CLUSTER*
environment variable. The file ``data/cluster.txt`` repeated below contains a
current (late 2010) list of machines. Note that this list should be dynamically
generated from a service database.

.. include:: ../data/cluster.txt
   :literal:

When invoked, *dsh* will by default execute the specified command on all the
machines defined in *CLUSTER*, which in turn requires authenticating with each
of those machines. Needless to say, some frustration may be alleviated by
setting up public key authentication for each of the machines defined in
*CLUSTER*.

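
As a rough sketch of the public key setup mentioned above, the following shell
functions list the hosts in the *CLUSTER* file and copy a public key to each of
them. This is not part of clusterit: ``cluster_hosts`` and ``push_key`` are
hypothetical helper names. The sketch assumes that the file is a plain list of
one hostname per line (blank lines and ``#`` comments are skipped; files that
use grouping directives would need additional handling), that ``ssh-copy-id``
is installed (it does not ship with every OpenSSH installation), and that a key
pair already exists at ``~/.ssh/id_rsa.pub``::

  # Print one hostname per line from a cluster file (default: $CLUSTER).
  # Assumption: blank lines and lines starting with "#" are not hosts.
  cluster_hosts() {
      grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "${1:-$CLUSTER}"
  }

  # Copy a public key to every host listed in $CLUSTER.
  # Expect one password prompt per host; failures are reported on stderr.
  push_key() {
      key="${1:-$HOME/.ssh/id_rsa.pub}"
      for host in $(cluster_hosts); do
          ssh-copy-id -i "$key" "$host" || echo "failed: $host" >&2
      done
  }

After sourcing these definitions, running ``push_key`` once should allow
subsequent *dsh* invocations to proceed without a password prompt on the hosts
that accepted the key.
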
Examples
--------

Uptime on everyone, show any connection errors::

  $ dsh -e uptime
  host-unm-1.dataone.org      : 15:25:47 up 189 days, 5:34, 0 users, load average: 1.78, 1.56, 1.52
  host-orc-1.dataone.org      : 17:32:42 up 91 days, 4:54, 0 users, load average: 0.54, 0.36, 0.24
  host-ucsb-1.dataone.org     : 14:14:22 up 114 days, 23:33, 0 users, load average: 0.80, 0.44, 0.37
  controller-unm-1.dataone.org: 15:26:02 up 3 days, 5:26, 0 users, load average: 0.04, 0.03, 0.00
  host-unm-2.dataone.org      : 15:25:43 up 3:58, 0 users, load average: 0.00, 0.00, 0.00
  cn-ucsb-1.dataone.org       : 22:25:08 up 5 days, 21:24, 1 user, load average: 0.00, 0.00, 0.00
  cn-dev.dataone.org          : 14:25:45 up 13 days, 34 min, 0 users, load average: 0.00, 0.00, 0.00
  cn-unm-1.dataone.org        : 22:25:46 up 9 days, 22:40, 1 user, load average: 0.02, 0.07, 0.08
  cn-dev-2.dataone.org        : 16:25:43 up 24 min, 3 users, load average: 0.00, 0.00, 0.00
  cn-orc-1.dataone.org        : 22:26:13 up 7 days, 5:33, 1 user, load average: 0.01, 0.03, 0.00
  dev-dryad-mn.dataone.org    : 22:32:31 up 189 days, 5:34, 1 user, load average: 0.46, 0.11, 0.03
  dev-fedora-mn.dataone.org   : ssh: connect to host dev-fedora-mn.dataone.org port 22: Operation timed out
  daacmn-dev.dataone.org      : 22:32:35 up 91 days, 4:49, 0 users, load average: 0.00, 0.04, 0.06
  monitor.dataone.org         : 22:25:47 up 189 days, 5:34, 0 users, load average: 0.20, 0.09, 0.09
  mule1.dataone.org           : 22:25:46 up 189 days, 4:06, 0 users, load average: 0.00, 0.00, 0.00
  public-web.dataone.org      : ssh: connect to host public-web.dataone.org port 22: Operation timed out
  redmine.dataone.org         : 22:26:03 up 3 days, 5:22, 0 users, load average: 0.00, 0.00, 0.00
  epad.dataone.org            : 22:25:46 up 3 days, 4:32, 0 users, load average: 0.00, 0.00, 0.00
  trac.dataone.org            : 14:25:43 up 202 days, 19:19, 1 user, load average: 0.20, 0.10, 0.07

.. _clusterit: http://www.garbled.net/clusterit.html
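
With this many machines, the connection errors are easy to miss in the full
listing. A minimal sketch that reduces the ``dsh -e`` output to the names of
the unreachable hosts, assuming the ``host : ssh: connect to host ...`` line
layout shown in the uptime example; ``unreachable_hosts`` is a hypothetical
helper name, not a clusterit command::

  # Read "dsh -e" output on stdin and print the hosts ssh could not reach.
  # The hostname is the text before the first ":"; trailing padding is trimmed.
  unreachable_hosts() {
      awk -F: '/ssh: connect to host/ { gsub(/[ \t]+$/, "", $1); print $1 }'
  }

It might be used as ``dsh -e uptime 2>&1 | unreachable_hosts``, where the
``2>&1`` is a precaution in case the error lines are written to standard error
rather than standard output.
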