Service Monitoring

Logs

The following logs are created by Coordinating Node services:

/var/log/dataone/replicate/cn-replication.log
/var/log/dataone/synchronize/cn-synchronization.log

TODO: Need a complete list of logs on the CNs

Splunk

TBD

Monitoring Java Processes with JMX

Step 0. If hostname -i does not report the public IP address of the system, edit /etc/hosts and set the public IP address there. For example, on cn-dev, hostname -i reported 127.0.1.1, so the hosts file was updated with the correct value:

127.0.0.1 localhost
#127.0.1.1  cn-dev.dataone.org  cn-dev
128.111.220.50  cn-dev.dataone.org  cn-dev

...

See: http://docs.oracle.com/javase/1.5.0/docs/guide/management/faq.html#linux1
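
The check in Step 0 can be scripted. The following is a minimal sketch, assuming a Bash shell; the function names is_loopback and check_host_address are illustrative and not part of any DataONE tooling. It reports whether an address, such as the one printed by hostname -i, is in the 127.x.x.x loopback range.

```shell
#!/usr/bin/env bash
# is_loopback ADDRESS
# Succeeds (exit 0) when ADDRESS is an IPv4 loopback address (127.x.x.x).
# Hypothetical helper, for illustration only.
is_loopback() {
  case "$1" in
    127.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# check_host_address ADDRESS
# Prints a one-line verdict for an address reported by `hostname -i`.
check_host_address() {
  if is_loopback "$1"; then
    echo "WARNING: $1 is a loopback address; set the public IP in /etc/hosts"
  else
    echo "OK: $1"
  fi
}
```

A typical call would be check_host_address "$(hostname -i)". If hostname -i prints several addresses, only a leading loopback address is detected by this sketch.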

Watching Hazelcast

In this example, d1-processing is enabled for JMX monitoring.

Create the file /etc/dataone/process/jmx.passwd with contents:

monitorRole {PASSWORD}

and the file /etc/dataone/process/jmx.access with contents:

monitorRole readonly

Change the owner of both files to user tomcat6 and make them readable only by that user. The owner must be the same user that runs the process launching the JMX service:

sudo chown tomcat6:tomcat6 /etc/dataone/process/jmx.*
sudo chmod 600 /etc/dataone/process/jmx.*
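
Before starting the service it can be worth confirming that the permissions took effect. The sketch below assumes GNU stat (as shipped on Ubuntu) and uses two hypothetical helper names, has_mode_600 and report_jmx_files; it only checks the permission bits, not the owner.

```shell
#!/usr/bin/env bash
# has_mode_600 FILE
# Succeeds when FILE exists and its permission bits are exactly 600.
# Assumes GNU stat (`stat -c`), as found on Ubuntu.
has_mode_600() {
  [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "600" ]
}

# report_jmx_files DIR
# Prints "ok PATH" or "bad PATH" for jmx.passwd and jmx.access under DIR.
report_jmx_files() {
  local f
  for f in "$1/jmx.passwd" "$1/jmx.access"; do
    if has_mode_600 "$f"; then
      echo "ok $f"
    else
      echo "bad $f"
    fi
  done
}
```

For the files above, run it as the owning user or with sudo: report_jmx_files /etc/dataone/process.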

Shut down d1-processing:

sudo /etc/init.d/d1-processing stop

Now start d1-processing with the JMX startup flags:

sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx4096M -Xms1024M \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8010 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.password.file=/etc/dataone/process/jmx.passwd \
  -Dcom.sun.management.jmxremote.access.file=/etc/dataone/process/jmx.access \
  -Djava.rmi.server.hostname=128.111.220.50 \
  -Dhazelcast.jmx=true" \
  /etc/init.d/d1-processing start
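
Because the same flags recur with a different port, address, and file directory, they can be assembled by a small helper. This is only a sketch: jmx_java_opts is a hypothetical function name, and it emits exactly the JMX flags shown above, nothing more.

```shell
#!/usr/bin/env bash
# jmx_java_opts PORT PUBLIC_IP CONF_DIR
# Prints the JMX-related JVM flags on one line, separated by spaces.
# CONF_DIR is the directory holding jmx.passwd and jmx.access.
jmx_java_opts() {
  local port=$1 ip=$2 dir=$3
  printf '%s ' \
    "-Dcom.sun.management.jmxremote" \
    "-Dcom.sun.management.jmxremote.port=${port}" \
    "-Dcom.sun.management.jmxremote.authenticate=true" \
    "-Dcom.sun.management.jmxremote.ssl=false" \
    "-Dcom.sun.management.jmxremote.password.file=${dir}/jmx.passwd" \
    "-Dcom.sun.management.jmxremote.access.file=${dir}/jmx.access" \
    "-Djava.rmi.server.hostname=${ip}"
  printf '%s\n' "-Dhazelcast.jmx=true"
}
```

The result can then be spliced into the start command, for example JAVA_OPTS="-Djava.awt.headless=true -Xmx4096M -Xms1024M $(jmx_java_opts 8010 128.111.220.50 /etc/dataone/process)".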

Temporarily disable the firewall. This is necessary because, although the JMX service listens on the specified port, it directs the JMX client to an RMI service that listens on a random port:

sudo ufw disable

Open jconsole on your desktop and select “Remote process”, entering:

hostname:port

and the username “monitorRole” and the password specified in /etc/dataone/process/jmx.passwd.

After a couple of seconds the JMX client should be connected and start collecting statistics.
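
If jconsole does not connect, a first step is to confirm that the JMX port itself is reachable from the desktop. The sketch below assumes Bash (for its /dev/tcp redirection) and the coreutils timeout command; port_open is a hypothetical helper name. It tests only the configured JMX port, not the randomly assigned RMI port.

```shell
#!/usr/bin/env bash
# port_open HOST PORT
# Succeeds when a TCP connection to HOST:PORT can be opened within 3 seconds.
# Relies on Bash's /dev/tcp pseudo-device and on coreutils `timeout`.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example: port_open cn-dev.dataone.org 8010 && echo reachable.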

Remember to restart the firewall when you’re done:

sudo ufw enable

Watching Tomcat

There is an unresolved issue with Java security: the JVM probably needs permission to read the password and access files. As an interim measure, disable JAVA_SECURITY in /etc/init.d/tomcat6, stop tomcat6, and restart it with the following parameters to enable JMX monitoring of Tomcat:

sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx2048M -Xms1024M \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8020 \
  -Dcom.sun.management.jmxremote.authenticate=true \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.password.file=/etc/dataone/process/jmx.passwd \
  -Dcom.sun.management.jmxremote.access.file=/etc/dataone/process/jmx.access \
  -Djava.rmi.server.hostname=128.111.220.50 \
  -Dhazelcast.jmx=true" \
  /etc/init.d/tomcat6 start

A variant with authentication disabled (note that it also uses a different address and different file locations):

sudo env JAVA_OPTS="-Djava.awt.headless=true -Xmx2048M -Xms1024M \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8020 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.password.file=/etc/dataone/monitor/jmx.passwd \
  -Dcom.sun.management.jmxremote.access.file=/etc/dataone/monitor/jmx.access \
  -Djava.rmi.server.hostname=129.24.0.109 \
  -Dhazelcast.jmx=true" \
  /etc/init.d/tomcat6 start

The check_jmx tool prints the following usage information:

Usage: check_jmx -U url -O object_name -A attribute [-K compound_key] [-I attribute_info] [-J attribute_info_key] -w warn_limit -c crit_limit [-v[vvv]] [-help]

, where options are:

-help  Prints this page

-U   JMX URL, for example: "service:jmx:rmi:///jndi/rmi://localhost:1616/jmxrmi"

-O   Object name to be checked, for example, "java.lang:type=Memory"

-A   Attribute of the object to be checked, for example, "NonHeapMemoryUsage"

-K   Attribute key for -A attribute compound data, for example, "used" (optional)

-I   Attribute of the object containing information for text output (optional)

-J   Attribute key for -I attribute compound data, for example, "used" (optional)

-v[vvv] verbatim level controlled as a number of v (optional)

-w   warning integer value

-c   critical integer value

Note that if warning level > critical, system checks object attribute value to be LESS THAN OR EQUAL warning, critical
If warning level < critical, system checks object attribute value to be MORE THAN OR EQUAL warning, critical
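
The threshold rule quoted above can be restated as a small function. This is a sketch of the rule as worded in the help text, not the actual check_jmx implementation; jmx_status is a hypothetical name, and treating equal warning and critical limits like the "higher is worse" case is an assumption.

```shell
#!/usr/bin/env bash
# jmx_status VALUE WARN CRIT
# Prints OK, WARNING or CRITICAL according to the rule in the help text.
jmx_status() {
  local value=$1 warn=$2 crit=$3
  if (( warn > crit )); then
    # Lower values are worse: compare with "less than or equal".
    if   (( value <= crit )); then echo CRITICAL
    elif (( value <= warn )); then echo WARNING
    else echo OK
    fi
  else
    # Higher values are worse: compare with "greater than or equal".
    # (Equal limits are handled here too; that choice is an assumption.)
    if   (( value >= crit )); then echo CRITICAL
    elif (( value >= warn )); then echo WARNING
    else echo OK
    fi
  fi
}
```

With the heap limits used in this section, jmx_status 1000000000 4248302272 5498760192 falls below both limits, while a value at or above 5498760192 is critical.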



./check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:8020/jmxrmi \
  -O java.lang:type=Memory -A HeapMemoryUsage -K used -I HeapMemoryUsage \
  -J used -vvvv -w 4248302272 -c 5498760192

./check_jmx -U service:jmx:rmi:///jndi/rmi://localhost:8020/jmxrmi \
  -O java.lang:type=ClassLoading -A LoadedClassCount \
  -vvvv -w {WARN_LIMIT} -c {CRIT_LIMIT}

Listing JMX Beans

Get a JMX console tool. The examples here use jmxterm, available from: http://wiki.cyclopsgroup.org/jmxterm

Start jmxterm with java -jar jmxterm.jar, then connect to the target using the open command:

java -jar jmxterm.jar
$> open 127.0.0.1:8020
#Connection to 127.0.0.1:8020 is opened

Get a list of domains:

$>domains
#following domains are available
Catalina
JMImplementation
Users
com.sun.management
java.lang
java.util.logging
solr/

Select a domain, in this case Catalina, and see what beans it offers:

$>domain Catalina
#domain is set to Catalina
$>beans
#domain = Catalina:
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/,j2eeType=Servlet,name=default
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/,j2eeType=Servlet,name=jsp
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/,name=jsp,type=JspMonitor
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Filter,name=SolrRequestFilter
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=Logging
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=SolrServer
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=SolrUpdate
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=default
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=jsp
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,j2eeType=Servlet,name=ping
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,name=jsp,type=JspMonitor
Catalina:J2EEApplication=none,J2EEServer=none,WebModule=//localhost/solr,name=ping,type=JspMonitor
Catalina:J2EEApplication=none,J2EEServer=none,j2eeType=WebModule,name=//localhost/
Catalina:J2EEApplication=none,J2EEServer=none,j2eeType=WebModule,name=//localhost/solr
Catalina:class=org.apache.catalina.UserDatabase,name="UserDatabase",resourcetype=Global,type=Resource
Catalina:host=localhost,name=ErrorReportValve,type=Valve
Catalina:host=localhost,name=StandardContextValve,path=/,type=Valve
Catalina:host=localhost,name=StandardContextValve,path=/solr,type=Valve
Catalina:host=localhost,name=StandardHostValve,type=Valve
Catalina:host=localhost,name=solr/home,path=/solr,resourcetype=Context,type=Environment
Catalina:host=localhost,path=/,resourcetype=Context,type=NamingResources
Catalina:host=localhost,path=/,type=Cache
Catalina:host=localhost,path=/,type=Loader
Catalina:host=localhost,path=/,type=Manager
Catalina:host=localhost,path=/,type=WebappClassLoader
Catalina:host=localhost,path=/solr,resourcetype=Context,type=NamingResources
Catalina:host=localhost,path=/solr,type=Cache
Catalina:host=localhost,path=/solr,type=Loader
Catalina:host=localhost,path=/solr,type=Manager
Catalina:host=localhost,path=/solr,type=WebappClassLoader
Catalina:host=localhost,type=Deployer
Catalina:host=localhost,type=Host
Catalina:name=StandardEngineValve,type=Valve
Catalina:name=common,type=ServerClassLoader
Catalina:name=http-8080,type=GlobalRequestProcessor
Catalina:name=http-8080,type=ThreadPool
Catalina:name=server,type=ServerClassLoader
Catalina:name=shared,type=ServerClassLoader
Catalina:port=8080,type=Connector
Catalina:port=8080,type=Mapper
Catalina:port=8080,type=ProtocolHandler
Catalina:realmPath=/realm0,type=Realm
Catalina:resourcetype=Global,type=NamingResources
Catalina:serviceName=Catalina,type=Service
Catalina:type=Engine
Catalina:type=MBeanFactory
Catalina:type=Server
Catalina:type=StringCache
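
Each bean name in the listing has the form domain:key=value,key=value,... When filtering a saved listing it can help to pull out a single key. The following is a minimal sketch in Bash; bean_domain and bean_property are hypothetical helper names, and the parsing assumes that values contain no commas (quoted values with embedded commas are not handled).

```shell
#!/usr/bin/env bash
# bean_domain NAME
# Prints the domain part of a JMX object name (the text before the first ":").
bean_domain() {
  echo "${1%%:*}"
}

# bean_property NAME KEY
# Prints the value of KEY in the object name, or fails if KEY is absent.
bean_property() {
  local name=$1 key=$2 props pair
  local -a pairs
  props=${name#*:}                      # strip the "domain:" prefix
  IFS=, read -r -a pairs <<< "$props"   # split on commas without globbing
  for pair in "${pairs[@]}"; do
    if [ "${pair%%=*}" = "$key" ]; then
      echo "${pair#*=}"
      return 0
    fi
  done
  return 1
}
```

For example, bean_property "Catalina:host=localhost,path=/solr,type=Manager" path prints /solr.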

The Host bean looks interesting:

$>bean Catalina:host=localhost,type=Host
#bean is set to Catalina:host=localhost,type=Host
$>info
#mbean = Catalina:host=localhost,type=Host
#class name = org.apache.tomcat.util.modeler.BaseModelMBean
# attributes
  %0   - aliases ([Ljava.lang.String;, rw)
  %1   - appBase (java.lang.String, rw)
  %2   - autoDeploy (boolean, rw)
  %3   - children ([Ljavax.management.ObjectName;, rw)
  %4   - configClass (java.lang.String, rw)
  %5   - deployOnStartup (boolean, rw)
  %6   - deployXML (boolean, rw)
  %7   - managedResource (java.lang.Object, rw)
  %8   - modelerType (java.lang.String, r)
  %9   - name (java.lang.String, rw)
  %10  - realm (org.apache.catalina.Realm, rw)
  %11  - unpackWARs (boolean, rw)
  %12  - valveNames ([Ljava.lang.String;, rw)
  %13  - valveObjectNames ([Ljavax.management.ObjectName;, rw)
  %14  - xmlNamespaceAware (boolean, rw)
  %15  - xmlValidation (boolean, rw)
# operations
  %0   - void addAlias(java.lang.String alias)
  %1   - void addChild(org.apache.catalina.Container child)
  %2   - void destroy()
  %3   - [Ljava.lang.String; findAliases()
  %4   - void init()
  %5   - void removeAlias(java.lang.String alias)
  %6   - void start()
  %7   - void stop()
#there's no notifications

Now let’s get a couple of attribute values:

$>get appBase
#mbean = Catalina:host=localhost,type=Host:
appBase = webapps;

$>get children
#mbean = Catalina:host=localhost,type=Host:
children = [ Catalina:j2eeType=WebModule,name=//localhost/,J2EEApplication=none,J2EEServer=none, Catalina:j2eeType=WebModule,name=//localhost/solr,J2EEApplication=none,J2EEServer=none ];
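
The same lookups can be run without an interactive session by feeding jmxterm a list of commands. The sketch below only generates that command list; jmxterm_get_script is a hypothetical helper name. It assumes jmxterm's get command accepts a -b option naming the bean, and that jmxterm's -l (URL) and -n (non-interactive) command-line options behave as described by that project.

```shell
#!/usr/bin/env bash
# jmxterm_get_script BEAN ATTRIBUTE...
# Prints one jmxterm "get" command per ATTRIBUTE of BEAN.
jmxterm_get_script() {
  local bean=$1 attr
  shift
  for attr in "$@"; do
    printf 'get -b %s %s\n' "$bean" "$attr"
  done
}
```

Under those assumptions, a call might look like: jmxterm_get_script Catalina:host=localhost,type=Host appBase children | java -jar jmxterm.jar -l 127.0.0.1:8020 -n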

Check_mk Monitoring

Check_mk provides a layer of functionality over Nagios that simplifies configuration and monitoring of remote machines. The check_mk installation is located at:

and uses the central LDAP for authentication.

Adding a Server to Check_mk

To monitor a new server with check_mk, it is necessary to install check-mk-agent, enable it as a service using xinetd, and ensure that firewalls are set to allow requests from the check_mk server (monitor.dataone.org, 129.237.201.155). By default, the check_mk service listens on TCP port 6556.

For Ubuntu servers, install xinetd and check-mk-agent:

sudo apt-get update
sudo apt-get install xinetd check-mk-agent

Edit the xinetd configuration:

service check_mk
{
    type           = UNLISTED
    port           = 6556
    socket_type    = stream
    protocol       = tcp
    wait           = no
    user           = root
    server         = /usr/bin/check_mk_agent

    # If you use fully redundant monitoring and poll the client
    # from more then one monitoring servers in parallel you might
    # want to use the agent cache wrapper:
    #server         = /usr/bin/check_mk_caching_agent

    # configure the IP address(es) of your Nagios server here:
    #only_from      = 127.0.0.1 10.0.20.1 10.0.20.2
    only_from    = 127.0.0.1 129.237.201.155

    # Don't be too verbose. Don't log every check. This might be
    # commented out for debugging. If this option is commented out
    # the default options will be used for this service.
    log_on_success =

    disable        = no
}

Then restart xinetd and poke a hole through the firewall:

sudo service xinetd restart
sudo ufw allow from 129.237.201.155 to any port 6556

You can check that the agent is running by connecting with telnet from an address listed in the only_from configuration parameter:

telnet MY_HOST 6556

The response should be immediate and verbose.
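
The telnet check can also be automated. The agent output begins with a <<<check_mk>>> section header, so a script only needs to inspect the first line. This is a minimal sketch assuming Bash; is_check_mk_output is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# is_check_mk_output
# Reads agent output on stdin and succeeds when the first line is the
# "<<<check_mk>>>" section header. A trailing carriage return is ignored.
is_check_mk_output() {
  local first
  IFS= read -r first || [ -n "$first" ] || return 1
  [ "${first%$'\r'}" = "<<<check_mk>>>" ]
}
```

From an address listed in only_from, Bash's /dev/tcp redirection can supply the input: is_check_mk_output < /dev/tcp/MY_HOST/6556 && echo "agent responding".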

Add the server to the monitored set of servers by logging in at https://monitor.dataone.org/check_mk, then under WATO | Hosts add a new host to the appropriate group. Check the services and save the configuration; the status should then appear in the monitored servers.