Service Monitoring 
==================

.. contents:: Contents
   :local:
   :backlinks: entry


Logs
----

The following logs are created by Coordinating Node services::

  /var/log/dataone/replicate/cn-replication.log
  /var/log/dataone/synchronize/cn-synchronization.log

  TODO: Need a complete list of logs on the CNS


Splunk
------

TBD


LogStash
--------

TBD


Monitoring Java Processes with JMX
----------------------------------

**Step 0.** If ``hostname -i`` does not report the public IP address of the
system, then edit ``/etc/hosts`` and set the public IP there. For example, on
*cn-dev*, ``hostname -i`` reported 127.0.1.1. The *hosts* file was updated with
the correct value::

  127.0.0.1 localhost
  #127.0.1.1  cn-dev.dataone.org  cn-dev
  128.111.220.50  cn-dev.dataone.org  cn-dev
  
  ...

See: http://docs.oracle.com/javase/1.5.0/docs/guide/management/faq.html#linux1
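
The check in Step 0 can be scripted. The following is a minimal sketch, not part of the DataONE tooling: the function names ``is_loopback_addr`` and ``check_hostname_addr`` are hypothetical, and it assumes that the first address printed by ``hostname -i`` is the one of interest.

```shell
# Succeeds (exit 0) when the given IPv4 address is in the loopback
# range 127.0.0.0/8, as in the 127.0.1.1 case described above.
is_loopback_addr() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical helper: warn when the first address reported by
# `hostname -i` is a loopback address.
check_hostname_addr() {
    addr=$(hostname -i | awk '{print $1}')
    if is_loopback_addr "$addr"; then
        echo "WARNING: hostname -i reports $addr; update /etc/hosts"
        return 1
    fi
    echo "OK: hostname -i reports $addr"
}
```

Running ``check_hostname_addr`` on the target host prints a warning and returns a non-zero status when ``/etc/hosts`` still needs to be updated.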

Watching Hazelcast
~~~~~~~~~~~~~~~~~~

In this example, *d1-processing* is enabled for JMX monitoring.

Create the file ``/etc/dataone/process/jmx.passwd`` with contents::

  monitorRole {PASSWORD}

and the file ``/etc/dataone/process/jmx.access`` with contents::

  monitorRole readonly

Change the owner of both files to user *tomcat6* and make them readable only by
that user (this must be the same user that launches the process providing the
JMX service)::

  sudo chown tomcat6:tomcat6 /etc/dataone/process/jmx.*
  sudo chmod 600 /etc/dataone/process/jmx.*
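
The file creation steps above can be combined into one helper. This is a sketch under stated assumptions: the function name ``write_jmx_credentials`` is hypothetical, the target directory must already exist and be writable, and changing the owner to *tomcat6* is left as the separate ``chown`` step shown above.

```shell
# Hypothetical helper: write jmx.passwd and jmx.access into directory $1
# using password $2 for the read-only "monitorRole" account.
write_jmx_credentials() {
    dir=$1
    password=$2
    (
        # Create both files with owner-only permissions from the start.
        umask 077
        printf 'monitorRole %s\n' "$password" > "$dir/jmx.passwd"
        printf 'monitorRole readonly\n' > "$dir/jmx.access"
    )
    # Also tighten permissions in case the files already existed.
    chmod 600 "$dir/jmx.passwd" "$dir/jmx.access"
}
```

For example, ``write_jmx_credentials /etc/dataone/process "$PASSWORD"`` run as root, followed by the ``chown`` command above.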

Shut down *d1-processing*::

  sudo /etc/init.d/d1-processing stop

Now start *d1-processing* with the JMX startup flags::

  sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx4096M -Xms1024M \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=8010 \
    -Dcom.sun.management.jmxremote.authenticate=true \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Dcom.sun.management.jmxremote.password.file=/etc/dataone/process/jmx.passwd \
    -Dcom.sun.management.jmxremote.access.file=/etc/dataone/process/jmx.access \
    -Djava.rmi.server.hostname=128.111.220.50 \
    -Dhazelcast.jmx=true" \
    /etc/init.d/d1-processing start
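
Because the same set of flags is reused for other services with only the port, host address, and file locations changing, it may help to generate them. The sketch below simply reproduces the JMX flags from the command above; the function name ``jmx_opts`` and its argument order are hypothetical.

```shell
# Hypothetical helper: print the JMX-related JVM flags on one line.
#   $1 = JMX port
#   $2 = public IP address of the host
#   $3 = directory holding jmx.passwd and jmx.access
jmx_opts() {
    printf '%s ' \
        "-Dcom.sun.management.jmxremote" \
        "-Dcom.sun.management.jmxremote.port=$1" \
        "-Dcom.sun.management.jmxremote.authenticate=true" \
        "-Dcom.sun.management.jmxremote.ssl=false" \
        "-Dcom.sun.management.jmxremote.password.file=$3/jmx.passwd" \
        "-Dcom.sun.management.jmxremote.access.file=$3/jmx.access" \
        "-Djava.rmi.server.hostname=$2"
    printf '%s\n' "-Dhazelcast.jmx=true"
}
```

The output can then be spliced into the start command, for example ``sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx4096M -Xms1024M $(jmx_opts 8010 128.111.220.50 /etc/dataone/process)" /etc/init.d/d1-processing start``.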

Temporarily disable the firewall. This is necessary because even though the JMX
service will listen on the specified port, the RMI service, which the JMX client
will be directed to by the JMX service, will be listening on a random port::

  sudo ufw disable

Open jconsole on your desktop, and select "Remote process", entering in::

  hostname:port

and the username "monitorRole" and the password specified in
``/etc/dataone/process/jmx.passwd``.

After a couple of seconds the JMX client should be connected and start
collecting statistics.

Remember to restart the firewall when you're done::

  sudo ufw enable


Watching Tomcat
~~~~~~~~~~~~~~~

Java security currently interferes with this setup, probably because the JVM
needs permission to read the password and access files. As an interim measure,
disable ``JAVA_SECURITY`` in ``/etc/init.d/tomcat6``, stop *tomcat6*, and
restart it with the following parameters to enable JMX monitoring of Tomcat::

  sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx2048M -Xms1024M \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=8020 \
    -Dcom.sun.management.jmxremote.authenticate=true \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Dcom.sun.management.jmxremote.password.file=/etc/dataone/process/jmx.passwd \
    -Dcom.sun.management.jmxremote.access.file=/etc/dataone/process/jmx.access \
    -Djava.rmi.server.hostname=128.111.220.50 \
    -Dhazelcast.jmx=true" \
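    /etc/init.d/tomcat6 start

A variant of the same command with JMX authentication disabled, pointing at
password and access files under ``/etc/dataone/monitor`` and at a different
host address::

  sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx2048M -Xms1024M \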
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=8020 \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Dcom.sun.management.jmxremote.password.file=/etc/dataone/monitor/jmx.passwd \
    -Dcom.sun.management.jmxremote.access.file=/etc/dataone/monitor/jmx.access \
    -Djava.rmi.server.hostname=129.24.0.109 \
    -Dhazelcast.jmx=true" \
    /etc/init.d/tomcat6 start


The ``check_jmx`` tool can then query the JMX service. Its usage text is::

  Usage: check_jmx -U url -O object_name -A attribute [-K compound_key] [-I attribute_info] [-J attribute_info_key] -w warn_limit -c crit_limit [-v[vvv]] [-help]

  , where options are:

  -help  Prints this page

  -U   JMX URL, for example: "service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi"

  -O   Object name to be checked, for example, "java.lang:type=Memory"

  -A   Attribute of the object to be checked, for example, "NonHeapMemoryUsage"

  -K   Attribute key for -A attribute compound data, for example, "used" (optional)

  -I   Attribute of the object containing information for text output (optional)

  -J   Attribute key for -I attribute compound data, for example, "used" (optional)

  -v[vvv] verbatim level controlled as a number of v (optional)

  -w   warning integer value

  -c   critical integer value

  Note that if warning level > critical, system checks object attribute value to be LESS THAN OR EQUAL warning, critical
  If warning level < critical, system checks object attribute value to be MORE THAN OR EQUAL warning, critical

Example invocations against the JMX port configured for Tomcat above::

  ./check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:8020/jmxrmi \
    -O java.lang:type=Memory -A HeapMemoryUsage -K used -I HeapMemoryUsage \
    -J used -vvvv -w 4248302272 -c 5498760192

  ./check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:8020/jmxrmi \
    -O java.lang:type=Memory -A LoadedClassCount -K used -I HeapMemoryUsage \
    -J used -vvvv -w 4248302272 -c 5498760192
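
The threshold rule quoted in the usage text can be easier to follow as code. The sketch below is an illustration of that rule as described, not the tool's actual implementation; the function name ``jmx_threshold_state`` is hypothetical, and the handling of equal warning and critical levels is an assumption because the usage text does not cover that case.

```shell
# Print the Nagios-style state for value $1 given warning level $2 and
# critical level $3. Exit status: 0 = OK, 1 = WARNING, 2 = CRITICAL.
jmx_threshold_state() {
    if [ "$2" -gt "$3" ]; then
        # warning > critical: low values are the problem
        if [ "$1" -le "$3" ]; then echo CRITICAL; return 2; fi
        if [ "$1" -le "$2" ]; then echo WARNING; return 1; fi
    else
        # warning < critical: high values are the problem
        # (treating warning == critical the same way is an assumption)
        if [ "$1" -ge "$3" ]; then echo CRITICAL; return 2; fi
        if [ "$1" -ge "$2" ]; then echo WARNING; return 1; fi
    fi
    echo OK
}
```

With the heap thresholds from the example above, ``jmx_threshold_state 4300000000 4248302272 5498760192`` reports ``WARNING`` because the value lies between the warning and critical levels.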

Listing JMX Beans
~~~~~~~~~~~~~~~~~

Get a JMX console tool. The one used in the examples here is ``jmxterm``
available from: http://wiki.cyclopsgroup.org/jmxterm

Fire up jmxterm with something like ``java -jar jmxterm.jar``, then connect to
the target using the open command::

  java -jar jmxterm.jar
  $> open 127.0.0.1:8020
  #Connection to 127.0.0.1:8020 is opened

Get a list of domains::

  $>domains
  #following domains are available
  Catalina
  JMImplementation
  Users
  com.sun.management
  java.lang
  java.util.logging
  solr/

Select a domain, in this case Catalina, and see what beans it offers::

  $>domain Catalina
  #domain is set to Catalina
  $>beans
  #domain = Catalina:
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/,j2eeType=Servlet,name=default
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/,j2eeType=Servlet,name=jsp
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/,name=jsp,type=JspMonitor
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Filter,name=SolrRequestFilter
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=Logging
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=SolrServer
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=SolrUpdate
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=default
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=jsp
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=ping
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,name=jsp,type=JspMonitor
  Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,name=ping,type=JspMonitor
  Catalina:J2EEApplication=none,J2EEServer=none,j2eeType=WebModule,name=//localhost/
  Catalina:J2EEApplication=none,J2EEServer=none,j2eeType=WebModule,name=//localhost/solr
  Catalina:class=org.apache.catalina.UserDatabase,name="UserDatabase",resourcetype=Global,type=Resource
  Catalina:host=localhost,name=ErrorReportValve,type=Valve
  Catalina:host=localhost,name=StandardContextValve,path=/,type=Valve
  Catalina:host=localhost,name=StandardContextValve,path=/solr,type=Valve
  Catalina:host=localhost,name=StandardHostValve,type=Valve
  Catalina:host=localhost,name=solr/home,path=/solr,resourcetype=Context,type=Environment
  Catalina:host=localhost,path=/,resourcetype=Context,type=NamingResources
  Catalina:host=localhost,path=/,type=Cache
  Catalina:host=localhost,path=/,type=Loader
  Catalina:host=localhost,path=/,type=Manager
  Catalina:host=localhost,path=/,type=WebappClassLoader
  Catalina:host=localhost,path=/solr,resourcetype=Context,type=NamingResources
  Catalina:host=localhost,path=/solr,type=Cache
  Catalina:host=localhost,path=/solr,type=Loader
  Catalina:host=localhost,path=/solr,type=Manager
  Catalina:host=localhost,path=/solr,type=WebappClassLoader
  Catalina:host=localhost,type=Deployer
  Catalina:host=localhost,type=Host
  Catalina:name=StandardEngineValve,type=Valve
  Catalina:name=common,type=ServerClassLoader
  Catalina:name=http-8080,type=GlobalRequestProcessor
  Catalina:name=http-8080,type=ThreadPool
  Catalina:name=server,type=ServerClassLoader
  Catalina:name=shared,type=ServerClassLoader
  Catalina:port=8080,type=Connector
  Catalina:port=8080,type=Mapper
  Catalina:port=8080,type=ProtocolHandler
  Catalina:realmPath=/realm0,type=Realm
  Catalina:resourcetype=Global,type=NamingResources
  Catalina:serviceName=Catalina,type=Service
  Catalina:type=Engine
  Catalina:type=MBeanFactory
  Catalina:type=Server
  Catalina:type=StringCache

The Host bean looks interesting::

  $>bean Catalina:host=localhost,type=Host
  #bean is set to Catalina:host=localhost,type=Host
  $>info
  #mbean = Catalina:host=localhost,type=Host
  #class name = org.apache.tomcat.util.modeler.BaseModelMBean
  # attributes
    %0   - aliases ([Ljava.lang.String;, rw)
    %1   - appBase (java.lang.String, rw)
    %2   - autoDeploy (boolean, rw)
    %3   - children ([Ljavax.management.ObjectName;, rw)
    %4   - configClass (java.lang.String, rw)
    %5   - deployOnStartup (boolean, rw)
    %6   - deployXML (boolean, rw)
    %7   - managedResource (java.lang.Object, rw)
    %8   - modelerType (java.lang.String, r)
    %9   - name (java.lang.String, rw)
    %10  - realm (org.apache.catalina.Realm, rw)
    %11  - unpackWARs (boolean, rw)
    %12  - valveNames ([Ljava.lang.String;, rw)
    %13  - valveObjectNames ([Ljavax.management.ObjectName;, rw)
    %14  - xmlNamespaceAware (boolean, rw)
    %15  - xmlValidation (boolean, rw)
  # operations
    %0   - void addAlias(java.lang.String alias)
    %1   - void addChild(org.apache.catalina.Container child)
    %2   - void destroy()
    %3   - [Ljava.lang.String; findAliases()
    %4   - void init()
    %5   - void removeAlias(java.lang.String alias)
    %6   - void start()
    %7   - void stop()
  #there's no notifications

Now let's get a couple of attribute values::

  $>get appBase
  #mbean = Catalina:host=localhost,type=Host:
  appBase = webapps;

  $>get children
  #mbean = Catalina:host=localhost,type=Host:
  children = [ Catalina:j2eeType=WebModule,name=//localhost/,J2EEApplication=none,J2EEServer=none, Catalina:j2eeType=WebModule,name=//localhost/solr,J2EEApplication=none,J2EEServer=none ];
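
For scripted checks, a single attribute value can be pulled out of this output. The following is a sketch only: the function name ``jmxterm_value`` is hypothetical, and it relies on the ``name = value;`` layout shown in the session above.

```shell
# Hypothetical helper: read jmxterm output on stdin and print the value
# of attribute $1, stripping the "name = " prefix and trailing ";".
jmxterm_value() {
    sed -n "s/^$1 = \(.*\);\$/\1/p"
}
```

Assuming the installed jmxterm version accepts commands on standard input with its ``-l`` (URL) and ``-n`` (non-interactive) options, a call might look like ``echo "get -b Catalina:host=localhost,type=Host appBase" | java -jar jmxterm.jar -l 127.0.0.1:8020 -n | jmxterm_value appBase``.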


Check_mk Monitoring
-------------------

Check_mk provides a layer of functionality over Nagios that simplifies
configuration and monitoring of remote machines. The check_mk installation is
located at:

  https://monitor.dataone.org/check_mk/

and uses the central LDAP for authentication.


Adding a Server to Check_mk
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To monitor a new server with check_mk, install ``check-mk-agent``, enable it as
a service using xinetd, and ensure that firewalls allow requests from the
check_mk server (monitor.dataone.org, 129.237.201.155). By default, the
check_mk agent listens on TCP port 6556.

For Ubuntu servers, install the ``check-mk-agent``::

  sudo apt-get update
  sudo apt-get install xinetd check-mk-agent

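Edit the xinetd configuration for the agent (typically
``/etc/xinetd.d/check_mk``, though the path may differ between package
versions) so that it resembles::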

  service check_mk
  {
      type           = UNLISTED
      port           = 6556
      socket_type    = stream
      protocol       = tcp
      wait           = no
      user           = root
      server         = /usr/bin/check_mk_agent

      # If you use fully redundant monitoring and poll the client
      # from more then one monitoring servers in parallel you might
      # want to use the agent cache wrapper:
      #server         = /usr/bin/check_mk_caching_agent

      # configure the IP address(es) of your Nagios server here:
      #only_from      = 127.0.0.1 10.0.20.1 10.0.20.2
      only_from    = 127.0.0.1 129.237.201.155

      # Don't be too verbose. Don't log every check. This might be
      # commented out for debugging. If this option is commented out
      # the default options will be used for this service.
      log_on_success =

      disable        = no
  }
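
Before restarting xinetd, it can be worth confirming that the monitoring server's address is on an active ``only_from`` line rather than a commented one. This is a minimal sketch; the function name ``xinetd_allows`` is hypothetical, and the file path in the usage note is the assumed package default.

```shell
# Hypothetical helper: succeed when the xinetd service file $1 has an
# uncommented "only_from = ..." line that lists address $2.
xinetd_allows() {
    awk -v ip="$2" '
        $1 == "only_from" && $2 == "=" {
            for (i = 3; i <= NF; i++) if ($i == ip) found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

For example, ``xinetd_allows /etc/xinetd.d/check_mk 129.237.201.155 && echo allowed``.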

Then restart xinetd and poke a hole through the firewall::

  sudo service xinetd restart
  sudo ufw allow from 129.237.201.155 to any port 6556

You can check this is running by connecting with telnet from an address listed
in the ``only_from`` configuration parameter::

  telnet MY_HOST 6556

The response should be immediate and verbose.
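
To reduce that verbose response to something quickly readable, the section headers can be listed. This sketch assumes the agent marks each section with a line of the form ``<<<name>>>``; the function name ``check_mk_sections`` is hypothetical, and the usage note assumes ``nc`` is installed on the machine running the test.

```shell
# Hypothetical helper: read check_mk agent output on stdin and print the
# section names, i.e. the text between "<<<" and ">>>" on header lines.
check_mk_sections() {
    sed -n 's/^<<<\([^>]*\)>>>$/\1/p'
}
```

For example, ``nc MY_HOST 6556 | check_mk_sections`` run from an address listed in ``only_from``. An empty result suggests the agent is not answering or the connection was refused.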

Add the server to the monitored set by logging in to
https://monitor.dataone.org/check_mk, then under WATO | Hosts add a new host to
the appropriate group. Check the services and save the configuration; the
status should then appear in the monitored servers.