Monitoring Systems
==================

Synopsis
--------

There are three monitoring services with distinct roles:

:Java JMX: JConsole or VisualVM can be used to monitor Java processes that
  have JMX instrumentation enabled.

:Statsd: A metrics collector and rendering solution, available at
  statsd.dataone.org (129.237.201.114), that provides near real-time
  reporting of arbitrary measurements.

:Check-mk: Available at https://monitor.dataone.org/check_mk/, it provides
  general system monitoring and problem notification.


Java JMX
--------

In a nutshell, we open a SOCKS proxy to the host using SSH. This is necessary
because JMX will respond on a random port even though the initial connection
is made on a defined port.

In this example 7778 is the port for the SOCKS proxy, and port 8010 is the JMX
instrumentation port.

Start the application with JMX instrumentation enabled::

  JAVA_OPTS="-Djava.awt.headless=true -Xmx8192m -XX:+UseParallelGC -Xms1024M -XX:MaxPermSize=512M \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=8010 \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Djava.rmi.server.hostname=127.0.0.1 \
    -Dhazelcast.jmx=true"

Open a SOCKS proxy with an SSH tunnel to the machine running the Java
service::

  ssh -D7778 cn-dev-ucsb-1.test.dataone.org

Connect JConsole to the service::

  jconsole -J-DsocksProxyHost=localhost \
    -J-DsocksProxyPort=7778 \
    service:jmx:rmi:///jndi/rmi://127.0.0.1:8010/jmxrmi

Or using VisualVM:

1. In VisualVM Preferences | Network, set it to use a localhost SOCKS proxy
   (port 7778 in this example) and remove "127.0.0.1" from No Proxy Hosts.

2. Add a new Local JMX connection with the JMX parameter::

     service:jmx:rmi:///jndi/rmi://127.0.0.1:8010/jmxrmi


Statsd
------

StatsD_ is a statistics collection service able to record arbitrary metrics.
It can be likened to a set of instruments to which measurements are sent. The
graphite_ service tracks and renders the measurement values for each
instrument.

Measurements are sent to the statsd service over UDP in plain text, with
individual measurements terminated by a newline character. A measurement is
composed of at least three components, with an optional fourth::

  name:value|modifier[|@sample]

:name: The name of the instrument to which the measurement is sent. If an
  instrument with that name does not exist then a new one is created.
  Instruments can be grouped by using periods to separate levels, as in
  *parent.child.grandchild*. In general it's best to use only letters,
  numbers, underscore, dash, and period for the name. Do not use colon
  (``:``), comma (``,``), pipe (``|``) or at (``@``).

:value: The value of the measurement reading. Except for gauges, this will be
  an integer value.

:modifier: Indicates which type of measurement is being collected. Can be one
  of "g" (gauge), "c" (counter), "ms" (timer), or "s" (set).

:sample: If present, indicates that the measurement is being sampled at some
  proportion of the time. For example, a sample of "@0.1" indicates that the
  measurement is being sampled 1/10 of the time.

Example Measurements:

``hz.create.rate:10.1|g``
  Current reading of "rate" within the group "hz" and sub-group "create" is
  10.1 (of whatever units).

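
As a rough illustration of the message format described above, the following
Python sketch builds measurement lines and sends them in a single UDP
datagram. The helper names ``format_measurement`` and ``send_measurements``
are hypothetical and not part of any statsd client library. Port 8125 is
assumed here because it is the usual statsd listening port. The name check
only rejects the four characters listed above and is not a full validator.

```python
import socket

# Characters that should not appear in an instrument name (see the text above).
FORBIDDEN = set(":,|@")
# Measurement types: gauge, counter, timer, set.
MODIFIERS = ("g", "c", "ms", "s")


def format_measurement(name, value, modifier, sample=None):
    """Build one 'name:value|modifier[|@sample]' measurement line."""
    if any(ch in FORBIDDEN for ch in name):
        raise ValueError("reserved character in instrument name: %r" % name)
    if modifier not in MODIFIERS:
        raise ValueError("unknown modifier: %r" % modifier)
    line = "%s:%s|%s" % (name, value, modifier)
    if sample is not None:
        # Optional fourth component, e.g. 0.1 means sampled 1/10 of the time.
        line += "|@%s" % sample
    return line


def send_measurements(lines, host, port=8125):
    """Send newline-terminated measurements to statsd over UDP.

    Returns the bytes that were sent. UDP is fire-and-forget, so a
    successful return does not prove that statsd received the data.
    """
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload
```

For example, ``send_measurements([format_measurement("hz.create.rate", 10.1,
"g")], "129.237.201.114")`` would send the example gauge reading shown above.
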
Using bash
~~~~~~~~~~

Measurements can be sent to statsd using bash. In this case the value *29.5*
is being sent to the instrument *test* in the *mymetrics* group::

  $ echo "mymetrics.test:29.5|g" | nc -w 1 -u 129.237.201.114 8125


Using Java
~~~~~~~~~~

A Java client, `java-statsd-client`_, is available from Maven::

  <dependency>
    <groupId>com.timgroup</groupId>
    <artifactId>java-statsd-client</artifactId>
    <version>1.0.1</version>
  </dependency>

To use the client::

  import com.timgroup.statsd.StatsDClient;

  class SomeClass {

    private StatsDClient stats = null;

    public SomeClass() {
      // All metric names sent through this client are prefixed with "mymetrics."
      String prefix = "mymetrics";
      String host = "129.237.201.114";
      int port = 8125;
      stats = new StatsDClient(prefix, host, port);
    }

    public void someOp() {
      // CalculateSomeValue() stands in for whatever produces the reading
      Integer measurement = CalculateSomeValue();
      stats.gauge("test", measurement);
    }
  }

The measurement will then appear in graphite under the name "mymetrics/test".

Note that this Java client seems to only support integer values for gauge
measurements.


Using Python
~~~~~~~~~~~~

There are several statsd Python clients available. One that works well is
`python-statsd-client`_. Documentation for the client is available on that
site.


.. _StatsD: https://github.com/etsy/statsd

.. _graphite: http://graphite.readthedocs.org/

.. _python-statsd-client: https://github.com/gaelenh/python-statsd-client

.. _java-statsd-client: https://github.com/youdevise/java-statsd-client


Check_mk
--------

Check_mk provides a layer of functionality over Nagios that simplifies
configuration and monitoring of remote machines.

The check_mk installation is located at https://monitor.dataone.org/check_mk/
and uses the central LDAP for authentication.


Adding a Server to Check_mk
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To monitor a new server with check_mk, it is necessary to install
``check-mk-agent``, enable it as a service using xinetd, and ensure that
firewalls are set to allow requests from the check_mk server
(monitor.dataone.org, 129.237.201.155). By default, the check-mk-agent service
listens on TCP port 6556.

For Ubuntu servers, install the ``check-mk-agent``::

  sudo apt-get update
  sudo apt-get install xinetd check-mk-agent

Edit the xinetd configuration::

  service check_mk
  {
          type           = UNLISTED
          port           = 6556
          socket_type    = stream
          protocol       = tcp
          wait           = no
          user           = root
          server         = /usr/bin/check_mk_agent

          # If you use fully redundant monitoring and poll the client
          # from more than one monitoring server in parallel you might
          # want to use the agent cache wrapper:
          #server         = /usr/bin/check_mk_caching_agent

          # configure the IP address(es) of your Nagios server here:
          #only_from      = 127.0.0.1 10.0.20.1 10.0.20.2
          only_from      = 127.0.0.1 129.237.201.155

          # Don't be too verbose. Don't log every check. This might be
          # commented out for debugging. If this option is commented out
          # the default options will be used for this service.
          log_on_success =

          disable        = no
  }

Then restart xinetd and poke a hole through the firewall::

  sudo service xinetd restart
  sudo ufw allow from 129.237.201.155 to any port 6556

You can check that the agent is running by connecting with telnet from an
address listed in the ``only_from`` configuration parameter::

  telnet MY_HOST 6556

The response should be immediate and verbose.

Add the server to the monitored set of servers by logging in to
https://monitor.dataone.org/check_mk, then under WATO | Hosts add a new host
to the appropriate group. Check the services, save the configuration, and the
status should appear in the monitored servers.
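
The telnet check described above can also be scripted. The following Python
sketch connects to the agent port, reads whatever the agent sends, and applies
a simple heuristic to the result. The function names ``read_agent_output`` and
``looks_like_agent_output`` are hypothetical. The heuristic assumes that agent
output is organised into ``<<<section>>>`` header lines, so treat it as a
sanity check rather than a validation of the agent.

```python
import socket


def read_agent_output(host, port=6556, timeout=5.0):
    """Connect to the agent port and return everything it sends as text.

    Like the telnet check, this must be run from an address listed in
    the only_from parameter, otherwise xinetd refuses the connection.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # the agent closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def looks_like_agent_output(text):
    """Return True if any line has the form '<<<section>>>' (assumed format)."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("<<<") and stripped.endswith(">>>"):
            return True
    return False
```

A possible use is ``looks_like_agent_output(read_agent_output("MY_HOST"))``,
where ``MY_HOST`` is the same placeholder as in the telnet example.
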