Monitoring Systems
==================

Synopsis
--------

There are three monitoring services with distinct roles:

:Java JMX:

  JConsole or VisualVM can be used to monitor Java processes that have JMX
  instrumentation enabled.

:Statsd: 

  A metrics collector and rendering solution, available at statsd.dataone.org
  (129.237.201.114), that provides near real-time reporting of arbitrary
  measurements.

:Check_mk:

  General system monitoring and problem notification, available at
  https://monitor.dataone.org/check_mk/.


Java JMX
--------

In a nutshell, we open a SOCKS proxy to the host using SSH. This is necessary
because JMX responds on a randomly assigned port even though the initial
connection is made on a defined port.

In this example, 7778 is the port for the SOCKS proxy and 8010 is the JMX
instrumentation port.


Start the application with JMX instrumentation enabled::

  JAVA_OPTS="-Djava.awt.headless=true -Xmx8192m -XX:+UseParallelGC -Xms1024M -XX:MaxPermSize=512M \
   -Dcom.sun.management.jmxremote \
   -Dcom.sun.management.jmxremote.port=8010 \
   -Dcom.sun.management.jmxremote.authenticate=false \
   -Dcom.sun.management.jmxremote.ssl=false \
   -Djava.rmi.server.hostname=127.0.0.1 \
   -Dhazelcast.jmx=true"

Open a SOCKS proxy with an SSH tunnel to the machine running the Java service::

  ssh -D7778 cn-dev-ucsb-1.test.dataone.org

Connect JConsole to the service::

  jconsole -J-DsocksProxyHost=localhost \
    -J-DsocksProxyPort=7778 \
    service:jmx:rmi:///jndi/rmi://127.0.0.1:8010/jmxrmi

Or, using VisualVM:

1. In VisualVM Preferences | Network, configure a SOCKS proxy on localhost
   (port 7778 in this example) and remove "127.0.0.1" from No Proxy Hosts.

2. Add a new Local JMX connection with the JMX parameter::
    
    service:jmx:rmi:///jndi/rmi://127.0.0.1:8010/jmxrmi



Statsd
------

StatsD_ is a statistics collection service able to record arbitrary metrics.
It can be likened to a set of instruments to which measurements are sent. The
graphite_ service tracks and renders the measurement values for each
instrument.

Measurements are sent to the statsd service over UDP in plain text, with
individual measurements terminated by a newline character. A measurement is
composed of at least three components, with an optional fourth::

  name:value|modifier[|@sample]

:name:

  The name of the instrument to which the measurement is sent. If an
  instrument with that name does not exist, a new one is created. Instruments
  can be grouped by using periods to separate *parent.child.grandchild*. In
  general it's best to use only letters, numbers, underscore, dash, and period
  for the name. Do not use colon (:), comma (,), pipe (|) or at (@).

:value:

  The value of the measurement reading. Except for gauges, this will be an
  integer value.

:modifier:

  Indicates which type of measurement is being collected. Can be one of "g"
  (gauge), "c" (counter), "ms" (timer), or "s" (set).

:sample:

  If present, indicates that the measurement is being sampled at some
  proportion of the time. For example, a sample of "@0.1" indicates that the
  measurement is being sampled 1/10 of the time.
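
As a rough sketch of how client-side sampling might work (the function name
``sampled_counter`` is hypothetical and not part of any statsd client), a
counter increment is emitted only a fraction of the time and tagged with the
rate used::

  import random


  def sampled_counter(name, rate, rng=random.random):
      """Return a counter line for roughly `rate` of calls, otherwise None.

      `rng` is injectable so the behaviour can be tested deterministically.
      """
      if not 0 < rate <= 1:
          raise ValueError("rate must be in the interval (0, 1]")
      # Skip this measurement most of the time when sampling is enabled.
      if rate < 1 and rng() >= rate:
          return None
      # Only annotate the line when the measurement is actually sampled.
      suffix = "" if rate == 1 else "|@%s" % rate
      return "%s:1|c%s" % (name, suffix)

With ``rate=0.1`` about one call in ten yields a line such as
``mymetrics.hits:1|c|@0.1``; the remaining calls return ``None`` and nothing
would be sent.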


Example measurement:

``hz.create.rate:10.1|g`` : Current reading of "rate" within the group "hz"
and sub-group "create" is 10.1 (of whatever units).
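
The format described above can be assembled with a small helper. This is a
minimal sketch, not taken from any statsd client: the function name
``format_measurement`` and the name check are assumptions that simply follow
the recommendations in this section::

  import re

  # Characters recommended above for instrument names.
  _SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")
  # Modifiers listed above: gauge, counter, timer, set.
  _MODIFIERS = ("g", "c", "ms", "s")


  def format_measurement(name, value, modifier, sample=None):
      """Build one 'name:value|modifier[|@sample]' line (no newline)."""
      if not _SAFE_NAME.fullmatch(name):
          raise ValueError("unsafe instrument name: %r" % (name,))
      if modifier not in _MODIFIERS:
          raise ValueError("unknown modifier: %r" % (modifier,))
      line = "%s:%s|%s" % (name, value, modifier)
      if sample is not None:
          line += "|@%s" % sample
      return line

For instance, ``format_measurement("hz.create.rate", 10.1, "g")`` produces the
example line shown above.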


Using bash
~~~~~~~~~~

Measurements can be sent to statsd using bash. In this case the value *29.5*
is being sent to the instrument *test* in the *mymetrics* group::

  $ echo "mymetrics.test:29.5|g" | nc -w 1 -u 129.237.201.114 8125


Using Java
~~~~~~~~~~

A Java client, `java-statsd-client`_, is available from Maven::

  <dependency>
    <groupId>com.timgroup</groupId>
    <artifactId>java-statsd-client</artifactId>
    <version>1.0.1</version>
  </dependency>

To use the client::

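  import com.timgroup.statsd.StatsDClient;

  class SomeClass {

    // Client used to send measurements to statsd
    private StatsDClient stats = null;

    public SomeClass() {
      String prefix = "mymetrics";      // group under which instruments appear
      String host = "129.237.201.114";  // statsd.dataone.org
      int port = 8125;
      stats = new StatsDClient(prefix, host, port);
    }

    public void someOp() {
      // calculateSomeValue() is a placeholder for application logic
      Integer measurement = calculateSomeValue();
      stats.gauge("test", measurement);
    }
  }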


The measurement will then appear in graphite under the name "mymetrics/test".

Note that this Java client appears to support only integer values for gauge
measurements.


Using Python
~~~~~~~~~~~~

There are several statsd Python clients available.

One that works well is `python-statsd-client`_. Documentation for the client
is available on that site.
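
For quick experiments a client library is not strictly needed, since the wire
format is plain text over UDP. The following is a minimal sketch using only
the Python standard library, mirroring the bash example; the function name
``send_gauge`` is hypothetical, and the default host and port are the ones
used elsewhere in this document::

  import socket


  def send_gauge(name, value, host="129.237.201.114", port=8125):
      """Send a single gauge reading to statsd over UDP.

      Returns the bytes that were sent. UDP is fire-and-forget, so a
      successful return does not prove that statsd received the reading.
      """
      payload = ("%s:%s|g\n" % (name, value)).encode("ascii")
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      try:
          sock.sendto(payload, (host, port))
      finally:
          sock.close()
      return payload

Calling ``send_gauge("mymetrics.test", 29.5)`` sends the same measurement as
the bash example.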


.. _StatsD: https://github.com/etsy/statsd
.. _graphite: http://graphite.readthedocs.org/
.. _python-statsd-client: https://github.com/gaelenh/python-statsd-client
.. _java-statsd-client: https://github.com/youdevise/java-statsd-client


Check_mk
--------

Check_mk provides a layer of functionality over Nagios that simplifies
configuration and monitoring of remote machines. The check_mk installation is
located at:

  https://monitor.dataone.org/check_mk/

and uses the central LDAP for authentication.


Adding a Server to Check_mk
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To monitor a new server with check_mk, it is necessary to install
``check-mk-agent``, enable it as a service using xinetd, and ensure that
firewalls are set to allow requests from the check_mk server
(monitor.dataone.org, 129.237.201.155). By default, the check_mk agent listens
on TCP port 6556.

For Ubuntu servers, install the ``check-mk-agent``::

  sudo apt-get update
  sudo apt-get install xinetd check-mk-agent

Edit the xinetd configuration (typically ``/etc/xinetd.d/check_mk``)::

  service check_mk
  {
      type           = UNLISTED
      port           = 6556
      socket_type    = stream
      protocol       = tcp
      wait           = no
      user           = root
      server         = /usr/bin/check_mk_agent

      # If you use fully redundant monitoring and poll the client
      # from more than one monitoring server in parallel you might
      # want to use the agent cache wrapper:
      #server         = /usr/bin/check_mk_caching_agent

      # configure the IP address(es) of your Nagios server here:
      #only_from      = 127.0.0.1 10.0.20.1 10.0.20.2
      only_from      = 127.0.0.1 129.237.201.155

      # Don't be too verbose. Don't log every check. This might be
      # commented out for debugging. If this option is commented out
      # the default options will be used for this service.
      log_on_success =

      disable        = no
  }

Then restart xinetd and poke a hole through the firewall::

  sudo service xinetd restart
  sudo ufw allow from 129.237.201.155 to any port 6556

You can check that the agent is running by connecting with telnet from an
address listed in the ``only_from`` configuration parameter::

  telnet MY_HOST 6556

The response should be immediate and verbose.
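
The same check can be scripted. The sketch below is one possible approach
using only the Python standard library; ``read_agent_output`` is a
hypothetical helper name, and it must be run from an address allowed by
``only_from``::

  import socket


  def read_agent_output(host, port=6556, timeout=5.0):
      """Connect to a check_mk agent and return everything it sends.

      The agent writes its report and then closes the connection, so we
      read until recv() returns an empty byte string.
      """
      chunks = []
      with socket.create_connection((host, port), timeout=timeout) as sock:
          while True:
              data = sock.recv(4096)
              if not data:
                  break
              chunks.append(data)
      return b"".join(chunks).decode("utf-8", errors="replace")

Agent output normally begins with a ``<<<check_mk>>>`` section header, so a
check such as ``read_agent_output("MY_HOST").startswith("<<<check_mk>>>")``
is a reasonable smoke test.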

Add the server to the monitored set of servers by logging in to
https://monitor.dataone.org/check_mk/, then under WATO | Hosts add a new host
to the appropriate group. Check the services and save the configuration; the
status should then appear in the monitored servers.