Quarterly Metrics
=================

.. contents::
   :depth: 3


Phase 2, Quarter 1
------------------

=================== ============== ==============
Metric              Q1 Value       Q2 Value\**
=================== ============== ==============
Start Date          2014-10-01     2015-01-01
End Date            2014-12-31     2015-03-31
Data Downloads      13,833         9,915
Data Uploads        516            22
Data Additions\*    53,353         2,278
Num Member Nodes    26             26
Num ITK Tools       6              6
CN Uptime           99.9996%       99.9996%
Num data files      157,500        160,942
Data Total Bytes    1,450,075 MiB  1,609,166 MiB
Num metadata files  212,668        216,770
=================== ============== ==============

  \* *Data Additions* indicates the number of new data objects added to Member
  Nodes as recorded by the Coordinating Nodes through the synchronization
  process. This number differs from the number of Data Uploads because Member
  Nodes may choose to add content to their data repositories through mechanisms
  other than the DataONE service interfaces. By definition this is the case
  for Tier 1 Member Nodes, as they do not support writing content through the
  DataONE service interfaces (an ability supported by Tier 3 and higher).

  \** Preliminary metrics, covering the period from 2015-01-01 through
  2015-02-27 at approximately 09:30 ET.


Gathering Metrics for Monthly and Quarterly Reports
---------------------------------------------------

The metrics to be reported on are listed in the Project Execution Plan and are
summarized in the `Phase 2 Metrics Worksheet`_.

For Phase 2, Q1 the metrics to be reported by CI are::

  Data Downloads
  Data Uploads
  Number of Member Nodes
  Number of tools in Investigator Toolkit
  Uptime of CNs
  Number of data files available
  Total size of data files available
  Number of metadata files available

Additions for Phase 2, Q2 include::

  Search Events
  Number of users that enter queries only
  Number of users that access the data repositories
  

Data Downloads
~~~~~~~~~~~~~~

Interpreted as the number of "get" requests (logged as READ events) made for objects with an object format classified as DATA.

The log aggregation solr end point is::

  https://cn.dataone.org/cn/v1/query/logsolr/select

A query to retrieve the number of READ events for format type of DATA is::

  formatType:DATA 
  AND event:read

A query to retrieve the number of READ events for format type of DATA over the
last 91 days is::

  formatType:DATA 
  AND event:read 
  AND dateLogged:[NOW-91DAY TO NOW]

A query to retrieve the number of READ events for format type of DATA within a
specific time range is::

  formatType:DATA 
  AND event:read 
  AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z]

It is also necessary to ignore requests from systems internal to DataONE such
as the Coordinating Nodes. This can be done by filtering out the IP addresses
of the CNs (128.111.54.80, 160.36.13.150, and 64.106.40.6) by adding the
clause::

  AND -ipAddress:(128.111.54.80 OR 160.36.13.150 OR 64.106.40.6)

The complete query would then be::

  formatType:DATA 
  AND event:read 
  AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z] 
  AND -ipAddress:(128.111.54.80 OR 160.36.13.150 OR 64.106.40.6)


Expressed as a URL::

  https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D%20AND%20-ipAddress%3A(128.111.54.80%20OR%20160.36.13.150%20OR%2064.106.40.6)

CURL::

  curl -k -s  "https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D%20AND%20-ipAddress%3A(128.111.54.80%20OR%20160.36.13.150%20OR%2064.106.40.6)" | xml sel -t -m "//result" -v "@numFound" -n
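
The same count can be retrieved programmatically; a minimal Python sketch
using only the standard library (query and endpoint as above)::

  # Count READ events on DATA objects for a reporting period, excluding
  # requests that originated from the CNs.
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  LOGSOLR = "https://cn.dataone.org/cn/v1/query/logsolr/select"
  CN_IPS = ("128.111.54.80", "160.36.13.150", "64.106.40.6")

  def read_count(start, end):
      q = ("formatType:DATA AND event:read"
           " AND dateLogged:[%s TO %s]"
           " AND -ipAddress:(%s)" % (start, end, " OR ".join(CN_IPS)))
      url = "%s?%s" % (LOGSOLR, urllib.parse.urlencode({"q": q, "rows": "0"}))
      doc = ET.parse(urllib.request.urlopen(url))
      # Solr reports the total hit count in result/@numFound.
      return int(doc.find(".//result[@name='response']").get("numFound"))

  print(read_count("2014-10-01T00:00:00.000Z", "2014-12-31T23:59:59.999Z"))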


Data Uploads
~~~~~~~~~~~~

Interpreted as the number of "create" requests (logged as create events) made
for objects with an object format classified as DATA. Analyzing these log
events indicates the number of data uploads made through the DataONE service
interfaces. Member Nodes may also have alternative mechanisms for populating
their data repositories; content added in this manner is not reflected in the
logs, though it can be determined by querying DataONE for content newly added
over the time period in question.

A query to retrieve the number of CREATE events for format type of DATA within
a specific time range and excluding CNs (and so determine uploads through the
DataONE service interfaces)::

  formatType:DATA 
  AND event:create 
  AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z] 
  AND -ipAddress:(128.111.54.80 OR 160.36.13.150 OR 64.106.40.6)

Alternatively, a query to retrieve the number of new DATA objects added within
a specific time range (querying against the query/solr search end point)::

  dateUploaded:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z] 
  AND formatType:DATA


URL for log records::

  https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Acreate%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D%20AND%20-ipAddress%3A(128.111.54.80%20OR%20160.36.13.150%20OR%2064.106.40.6)

URL for new objects (with optional grouping by node)::

  https://cn.dataone.org/cn/v1/query/solr/?q=dateUploaded:%5B2014-10-01T00:00:00.000Z+TO+2014-12-31T23:59:59.999Z%5D+AND+formatType:DATA&facet=true&facet.field=datasource&rows=0

CURL for log records::

  curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=formatType%3ADATA%20AND%20event%3Acreate%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.999Z%5D%20AND%20-ipAddress%3A(128.111.54.80%20OR%20160.36.13.150%20OR%2064.106.40.6)" | xml sel -t -m "//result" -v "@numFound" -n

CURL for new content::

  curl -k -s "https://cn.dataone.org/cn/v1/query/solr/?q=dateUploaded:%5B2014-10-01T00:00:00.000Z+TO+2014-12-31T23:59:59.999Z%5D+AND+formatType:DATA&facet=true&facet.field=datasource&rows=0" | xml sel -t -m "//result" -v "@numFound" -n
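
The per-node breakdown in the faceted response can likewise be extracted in
Python (same endpoint and parameters as the URL above)::

  # Count new DATA objects per Member Node using the datasource facet.
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  SOLR = "https://cn.dataone.org/cn/v1/query/solr/"
  q = ("dateUploaded:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.999Z]"
       " AND formatType:DATA")
  params = {"q": q, "rows": "0", "facet": "true", "facet.field": "datasource"}
  doc = ET.parse(urllib.request.urlopen(
      "%s?%s" % (SOLR, urllib.parse.urlencode(params))))
  print("Total:", doc.find(".//result[@name='response']").get("numFound"))
  # Facet counts appear as <int name="urn:node:..."> entries under the
  # datasource facet field.
  for node in doc.findall(".//lst[@name='datasource']/int"):
      print("%-22s %s" % (node.get("name"), node.text))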

Number of Member Nodes
~~~~~~~~~~~~~~~~~~~~~~

This is the number of node entries of type "mn" in the node list reported by the CN.

CURL::

  curl -k -s "https://cn.dataone.org/cn/v1/node" | xml sel -t -m "//node[@type='mn']" -v "identifier" -n | wc -l
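
A Python equivalent that parses the node list directly (the tag match is kept
loose in case the elements are namespace qualified)::

  # Count Member Node entries (type="mn") in the CN node list.
  import urllib.request
  import xml.etree.ElementTree as ET

  doc = ET.parse(urllib.request.urlopen("https://cn.dataone.org/cn/v1/node"))
  mns = [n for n in doc.iter()
         if n.tag.split("}")[-1] == "node" and n.get("type") == "mn"]
  print(len(mns))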



Number of tools in Investigator Toolkit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This number is counted manually and should include packages intended for
public use:

- Java libclient
- python libclient
- python CLI
- ONEMercury
- ONE-R
- Member Node Dashboard


Uptime of CNs
~~~~~~~~~~~~~

Currently this is recorded manually with the assumption that operations started on July 1, 2012 at 12:00 UTC.

Uptime is calculated as the percentage of the total duration that the systems
have been operational. This is::

  uptime = operationalPeriod / totalPeriod
         = (totalPeriod - downTime) / totalPeriod

There have been two downtime events logged:

  - 30 seconds: configuration error during an upgrade
  - 5 minutes: undetected switch failure

Thus uptime can be calculated on the command line with (note that the ``date
-j -u -f`` invocation is BSD/OS X syntax; on Linux use ``date -u -d ... +%s``)::

  # Downtime events, in seconds
  EVENTS=(30 300)
  T0=$(date -j -u -f %Y-%m-%d-%H-%M-%S 2012-07-01-12-00-00 +%s)
  T1=$(date -j -u -f %Y-%m-%d-%H-%M-%S 2014-12-31-23-59-59 +%s)
  PERIOD=$( bc <<< "$T1-$T0" )
  # Sum the downtime events by joining them with "+"
  DOWNTIME=$( IFS="+"; bc <<< "${EVENTS[*]}" )
  UPTIME=$( bc <<< "scale=5;100.0*($PERIOD-$DOWNTIME)/$PERIOD")
  echo $UPTIME
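
A portable sketch of the same calculation in Python, using the dates and
downtime events listed above::

  # CN uptime as a percentage of the period since operations began.
  from datetime import datetime, timezone

  T0 = datetime(2012, 7, 1, 12, 0, 0, tzinfo=timezone.utc)      # operations start
  T1 = datetime(2014, 12, 31, 23, 59, 59, tzinfo=timezone.utc)  # period end
  DOWNTIME = 30 + 300  # seconds: configuration error + switch failure
  period = (T1 - T0).total_seconds()
  print("%.5f" % (100.0 * (period - DOWNTIME) / period))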


Number of Data Files
~~~~~~~~~~~~~~~~~~~~

The number of data files can be determined by querying the search index for
the number of objects with formatType of DATA. Note that to get an actual
total, the request needs to be authenticated as a CN or equivalent trusted
entity.

.. note::

  ``curl`` on OS X does not work with client certificates. It is necessary to
  install a newer version or run the command from a Linux system.

Solr query::
  
  formatType:DATA

URL::

    https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0

Command Line for public content::

  curl -k -s "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0" \
  | xml sel -t -m "//result[@name='response']" -v "@numFound" -n

Command line for all content::

  curl -k -s --cert cnode.pem  "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0" \
  | xml sel -t -m "//result[@name='response']" -v "@numFound" -n
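
Where no suitable ``curl`` is available, the authenticated request can also be
made from Python; a sketch, assuming ``cnode.pem`` contains both the client
certificate and its key::

  # Authenticated object count using a client certificate.
  import ssl
  import urllib.request
  import xml.etree.ElementTree as ET

  ctx = ssl.create_default_context()
  ctx.load_cert_chain("cnode.pem")  # certificate and key in one PEM file
  url = "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0"
  doc = ET.parse(urllib.request.urlopen(url, context=ctx))
  print(doc.find(".//result[@name='response']").get("numFound"))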



Total Size of Data Files
~~~~~~~~~~~~~~~~~~~~~~~~

The total size of data files can be determined by querying the search index
for the number of objects with formatType of DATA, including a request for
summary statistics on the size field. Note that to get an actual total, the
request needs to be authenticated as a CN or equivalent trusted entity.

Solr query (the trailing terms are request parameters rather than part of
``q``)::

  formatType:DATA&rows=0&stats=true&stats.field=size

URL::

  https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0&stats=true&stats.field=size

Command line::

  URL="https://cn.dataone.org/cn/v1/query/solr/?q=formatType:DATA&rows=0&stats=true&stats.field=size"
  BYTES=$(curl -k -s --cert cnode.pem "${URL}" \
  | xml sel -t -m "//lst[@name='size']" -v "double[@name='sum']" -n)
  python -c "print('%.0f MiB' % ($BYTES / 1048576))"
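
The stats extraction can be done the same way in Python (certificate handling
as in the previous sketch)::

  # Total size of DATA objects, from the Solr stats component.
  import ssl
  import urllib.request
  import xml.etree.ElementTree as ET

  ctx = ssl.create_default_context()
  ctx.load_cert_chain("cnode.pem")
  url = ("https://cn.dataone.org/cn/v1/query/solr/"
         "?q=formatType:DATA&rows=0&stats=true&stats.field=size")
  doc = ET.parse(urllib.request.urlopen(url, context=ctx))
  # The stats component reports the field total as <double name="sum">.
  total = float(doc.find(".//lst[@name='size']/double[@name='sum']").text)
  print("%.0f MiB" % (total / 1048576))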


Number of Metadata Files
~~~~~~~~~~~~~~~~~~~~~~~~

The number of metadata files can be determined by querying the search index
for the number of objects with formatType of METADATA. Note that to get an
actual total, the request needs to be authenticated as a CN or equivalent
trusted entity.

Solr query::
  
  formatType:METADATA

URL::

    https://cn.dataone.org/cn/v1/query/solr/?q=formatType:METADATA&rows=0

Command Line::

  curl -k -s --cert cnode.pem  "https://cn.dataone.org/cn/v1/query/solr/?q=formatType:METADATA&rows=0" \
  | xml sel -t -m "//result[@name='response']" -v "@numFound" -n



Search Events
~~~~~~~~~~~~~

There are two common entry points for searching the CNs: one is the
ONEMercury interface, the other is the search index via the REST query or
search calls. Since ONEMercury also uses the REST search interface, the log
of those calls should provide a complete representation of search events.


Number of users that enter queries only
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Number of users that access the data repositories
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Defining a "User"
-----------------


Verifying Log Record Counts
---------------------------

The following procedure is used to independently verify the content of the logsolr index.

In summary:

1. Obtain a list of Member Node Base URLs.
2. Retrieve log records from each Member Node for the specified period.
3. Load the log records into an analysis environment that supports the
   necessary queries.
4. Perform queries to retrieve statistics comparable to those provided by the
   logsolr index.

1. Member Node Base URLs
~~~~~~~~~~~~~~~~~~~~~~~~

::

    d1listnodes -b "https://cn.dataone.org/cn"

    urn:node:KNB         https://knb.ecoinformatics.org/knb/d1/mn          Knowledge Network for Biocomplexity
    urn:node:ESA         https://data.esa.org/esa/d1/mn                    ESA Data Registry
    urn:node:SANPARKS    https://dataknp.sanparks.org/sanparks/d1/mn       SANParks Data Repository
    urn:node:USGSCSAS    http://mercury-ops2.ornl.gov/clearinghouse/mn     USGS Core Sciences Clearinghouse
    urn:node:ORNLDAAC    http://mercury-ops2.ornl.gov/ornldaac/mn          ORNL DAAC
    urn:node:LTER        https://tropical.lternet.edu/knb/d1/mn            LTER Network Member Node
    urn:node:CDL         https://merritt.cdlib.org:8084/knb/d1/mn          Merritt Repository
    urn:node:PISCO       https://data.piscoweb.org/catalog/d1/mn           PISCO MN
    urn:node:ONEShare    https://oneshare.unm.edu/knb/d1/mn                ONEShare Repository
    urn:node:mnORC1      https://mn-orc-1.dataone.org/knb/d1/mn            DataONE ORC Dedicated Replica Server
    urn:node:mnUNM1      https://mn-unm-1.dataone.org/knb/d1/mn            DataONE UNM Dedicated Replica Server
    urn:node:mnUCSB1     https://mn-ucsb-1.dataone.org/knb/d1/mn           DataONE UCSB Dedicated Replica Server
    urn:node:TFRI        https://metacat.tfri.gov.tw/tfri/d1/mn            TFRI Data Catalog
    urn:node:USANPN      https://mynpn.usanpn.org/knb/d1/mn                USA National Phenology Network
    urn:node:SEAD        http://seadva.d2i.indiana.edu:8081/sead/rest/mn   SEAD Virtual Archive
    urn:node:GOA         https://goa.nceas.ucsb.edu/goa/d1/mn              Gulf of Alaska Data Portal
    urn:node:KUBI        https://bidataone.nhm.ku.edu/mn                   University of Kansas - Biodiversity Institute
    urn:node:LTER_EUROPE https://data.lter-europe.net/knb/d1/mn            LTER Europe Member Node
    urn:node:DRYAD       https://datadryad.org/mn                          Dryad Digital Repository
    urn:node:CLOEBIRD    https://dataone.ornith.cornell.edu/metacat/d1/mn  Cornell Lab of Ornithology - eBird
    urn:node:EDACGSTORE  https://gstore.unm.edu/dataone/                   EDAC Gstore Repository
    urn:node:IOE         https://data.rcg.montana.edu/catalog/d1/mn        Montana IoE Data Repository
    urn:node:US_MPC      https://dataone-prod.pop.umn.edu/mn               Minnesota Population Center
    urn:node:EDORA       http://mercury-ops2.ornl.gov/EDORA_MN/mn          Environmental Data for the Oak Ridge Area (EDORA)
    urn:node:RGD         http://mercury-ops2.ornl.gov/RGD_MN/mn            Regional and Global biogeochemical dynamics Data (RGD)
    urn:node:GLEON       https://poseidon.limnology.wisc.edu/metacat/d1/mn GLEON Data Repository


2. Retrieve Log Records
~~~~~~~~~~~~~~~~~~~~~~~

For each node (``$NODE`` holds the target Member Node base URL; the flag used
to pass it to ``d1logrecords`` depends on the tool version)::

  export NODE="https://knb.ecoinformatics.org/knb/d1/mn"
  ./d1logrecords -c ../.dataone/cnode.pem \
    -C 999999 \
    -X - \
    -B 2014-10-01T00:00:00.000+00:00 \
    -D 2014-12-31T23:59:59.000+00:00 > KNB.xml
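
Alternatively, the records can be paged directly from the node's
``getLogRecords`` endpoint (``/v1/log``); a sketch, with certificate handling
as in the earlier examples::

  # Page log records from a Member Node for the reporting period.
  import ssl
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  ctx = ssl.create_default_context()
  ctx.load_cert_chain("cnode.pem")
  NODE = "https://knb.ecoinformatics.org/knb/d1/mn"
  params = {"fromDate": "2014-10-01T00:00:00.000+00:00",
            "toDate": "2014-12-31T23:59:59.000+00:00",
            "count": "1000"}
  start, entries = 0, []
  while True:
      params["start"] = str(start)
      url = "%s/v1/log?%s" % (NODE, urllib.parse.urlencode(params))
      log = ET.parse(urllib.request.urlopen(url, context=ctx)).getroot()
      page = log.findall("logEntry")
      entries.extend(page)
      start += len(page)
      # The log element reports the total matching record count.
      if not page or start >= int(log.get("total")):
          break
  print("%d log records retrieved" % len(entries))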


3. Put Records in an Analysis Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The "analysis environment" used was XMLStarlet.

First wrap the logEntry elements in each file in a top level Log element::

  #!/bin/bash
  # Wrap the concatenated logEntry elements in a single <Log> root so the
  # file parses as well-formed XML.
  LOGRECORDS=$1
  echo "<Log>" > tmpxml
  cat "$LOGRECORDS" >> tmpxml
  echo "</Log>" >> tmpxml
  mv tmpxml "$LOGRECORDS"

Now count the read and create events. Note that the DATA format restriction
cannot be applied here, since that would require a join with another data
source::

  #!/bin/bash
  # Report read and create event counts for a wrapped log record file.
  LOGRECORDS=$1
  nread=$(cat "$LOGRECORDS" | xml sel -t -m "//Log" \
          -v "count(logEntry[event/text()='read'])" -n)
  ncreate=$(cat "$LOGRECORDS" | xml sel -t -m "//Log" \
          -v "count(logEntry[event/text()='create'])" -n)
  echo $LOGRECORDS $nread $ncreate

Note that the CNs can be excluded from the counts by using::

  count(logEntry[event/text()='read' \
    and not(contains('128.111.54.80 160.36.13.150 64.106.40.6', ipAddress))])
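
The same per-file counts, including the CN exclusion, can also be produced
with a short Python sketch operating on the wrapped files::

  # Read/create counts for a wrapped log record file, excluding the CNs.
  import sys
  import xml.etree.ElementTree as ET

  CN_IPS = {"128.111.54.80", "160.36.13.150", "64.106.40.6"}

  counts = {"read": 0, "create": 0}
  for entry in ET.parse(sys.argv[1]).findall(".//logEntry"):
      if entry.findtext("ipAddress") in CN_IPS:
          continue
      event = entry.findtext("event")
      if event in counts:
          counts[event] += 1
  print(sys.argv[1], counts["read"], counts["create"])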


With the results (nodes not listed returned 0 for both read and create)::

    NODE         Read    Create
    CDL          33980   16805
    CLOEBIRD     68      0
    DRYAD        101327  0
    EDAC         0       0
    ESA          814     0
    GOA          1388    1
    KNB          109733  19
    LTER         163151  100
    ONEShare     49      0
    PISCO        1995    1
    TFRI         4185    29
    USANPN       23      0
    US_MPC       2597    1032
    mnORC1       1145    28
    mnUCSB1      1235    35
    mnUNM1       1169    29

Equivalent values using logsolr would be found using the query::
 
  nodeId:urn\:node\:CDL AND event:read AND dateLogged:[2014-10-01T00:00:00.000Z TO 2014-12-31T23:59:59.000Z]

URL::

  https://cn.dataone.org/cn/v1/query/logsolr/select?q=nodeId%3Aurn%5C%3Anode%5C%3ACDL%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.000Z%5D

Script::

  #!/bin/bash
  # Derive the node ID from the file name (e.g. "KNB.xml" -> "KNB") and
  # query logsolr for the matching read and create event counts.
  FNAME=$(basename "$1")
  NID="${FNAME%.*}"
  nread=$(curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=nodeId%3Aurn%5C%3Anode%5C%3A${NID}%20AND%20event%3Aread%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.000Z%5D" | xml sel -t -m "//result" -v "@numFound" -n)
  ncreate=$(curl -k -s "https://cn.dataone.org/cn/v1/query/logsolr/select?q=nodeId%3Aurn%5C%3Anode%5C%3A${NID}%20AND%20event%3Acreate%20AND%20dateLogged%3A%5B2014-10-01T00%3A00%3A00.000Z%20TO%202014-12-31T23%3A59%3A59.000Z%5D" | xml sel -t -m "//result" -v "@numFound" -n)
  echo $FNAME $nread $ncreate


Scripts
-------

.. include:: metrics_report.sh
   :literal:


References
----------

.. _Phase 2 Metrics Worksheet: https://docs.google.com/spreadsheets/d/1bRUyK7Xat88ywDkfa5Py03ytMxxTVsn86KCUbir1TlI/edit#gid=0