<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>18.3. Metacat Usage Statistics Service — Metacat 2.8.1 documentation</title> <link rel="stylesheet" href="_static/bootstrap.min.css" type="text/css" /> <link rel="stylesheet" href="_static/font-awesome/css/font-awesome.min.css" type="text/css" /> <link rel="stylesheet" href="_static/pygments.css" type="text/css" /> <link rel="stylesheet" href="_static/metacatui.css" type="text/css" /> <script type="text/javascript"> var DOCUMENTATION_OPTIONS = { URL_ROOT: './', VERSION: '2.8.1', COLLAPSE_MODINDEX: false, FILE_SUFFIX: '.html', HAS_SOURCE: true }; </script> <script type="text/javascript" src="_static/jquery.js"></script> <script type="text/javascript" src="_static/underscore.js"></script> <script type="text/javascript" src="_static/doctools.js"></script> <link rel="top" title="Metacat 2.8.1 documentation" href="index.html" /> <link rel="up" title="18. Appendix: Development Issues" href="development.html" /> <link rel="prev" title="18.2. DOI Management" href="doi.html" /> <link rel="next" title="18.4. 
ORE Model for Derived Data Packages" href="ore-model-expansion.html" /> </head> <body> <div id="metacatDocs"> <div class="banner"> <a href="index.html"><img class="logo" src="_static/metacat-logo-white.png" /></a> <a href="index.html"><h1 class="title">Metacat: Metadata and Data Management Server</h1></a> <img class="logo-right" src="_static/nceas-logo-white.png" /> </div> <div class="related"> <h3>Navigation</h3> <ul> <li class="right"> <span id="searchbox" style="display: none;"> <form class="search" action="search.html" method="get"> <input type="text" name="q" size="18" /> <input type="submit" value="Go" class="icon-search"/> <input type="hidden" name="check_keywords" value="yes" /> <input type="hidden" name="area" value="default" /> </form> </span> </li> <script type="text/javascript">$('#searchbox').show(0);</script> <li class="right"> <a href="genindex.html" title="General Index" accesskey="I">index</a> </li> <li class="right"> <a href="ore-model-expansion.html" title="18.4. ORE Model for Derived Data Packages" accesskey="N">next</a> </li> <li class="right"> <a href="doi.html" title="18.2. DOI Management" accesskey="P">previous</a> </li> <li class="breadcrumb first"><a href="index.html">Metacat 2.8.1 documentation</a> »</li> <li class="breadcrumb"><a href="development.html" accesskey="U">18. Appendix: Development Issues</a> »</li> </ul> </div> <div class="document"> <div class="documentwrapper"> <div class="bodywrapper"> <div class="body"> <div class="section" id="metacat-usage-statistics-service"> <h1>18.3. Metacat Usage Statistics Service<a class="headerlink" href="#metacat-usage-statistics-service" title="Permalink to this headline">¶</a></h1> <div class="section" id="overview"> <h2>18.3.1. 
Overview<a class="headerlink" href="#overview" title="Permalink to this headline">¶</a></h2> <p>This document describes a proposed usage statistics service for Metacat.</p> <p>This new service will provide clients with Metacat usage information about data and metadata access events.</p> </div> <div class="section" id="requirements"> <h2>18.3.2. Requirements<a class="headerlink" href="#requirements" title="Permalink to this headline">¶</a></h2> <p>The statistics service should have an easy-to-learn API that allows query fields to be added and provides reports in XML and JSON.</p> <div class="section" id="provided-statistics"> <h3>18.3.2.1. Provided Statistics<a class="headerlink" href="#provided-statistics" title="Permalink to this headline">¶</a></h3> <p>The service will include the following statistics:</p> <blockquote> <div><ul class="simple"> <li>Dataset views</li> <li>Package downloads</li> <li>Size in bytes of package downloads</li> <li>Citations</li> </ul> </div></blockquote> </div> <div class="section" id="results-filtering"> <h3>18.3.2.2. Results Filtering<a class="headerlink" href="#results-filtering" title="Permalink to this headline">¶</a></h3> <p>It must be possible to filter reports returned by the service by the following fields:</p> <blockquote> <div><ul class="simple"> <li>A PID or list of PIDs</li> <li>Creator or list of creators (DN, or ORCID, or some amalgam – to be discussed)</li> <li>A time range of access event (upload, download, etc.)</li> <li>Spatial location of access event (upload, download, etc.)</li> <li>IP Address</li> <li>Accessor or list of accessors (DN, or ORCID, or some amalgam, needs ACL – to be discussed)</li> </ul> </div></blockquote> </div> <div class="section" id="results-aggregation"> <h3>18.3.2.3. 
Results Aggregation<a class="headerlink" href="#results-aggregation" title="Permalink to this headline">¶</a></h3> <dl class="docutils"> <dt>It must be possible to aggregate reports by the following fields:</dt> <dd><ul class="first last simple"> <li>User (DN, or ORCID, or some amalgam)</li> <li>Time range, aggregated to requested unit (day, week, month, year)</li> <li>Spatial range, aggregated to requested unit</li> </ul> </dd> </dl> </div> <div class="section" id="performance"> <h3>18.3.2.4. Performance<a class="headerlink" href="#performance" title="Permalink to this headline">¶</a></h3> <p>The query service should provide results quickly, as it will be used to construct the user dashboard and possibly other UI elements.</p> </div> </div> <div class="section" id="statistics-service-solr-index"> <h2>18.3.3. Statistics Service Solr Index<a class="headerlink" href="#statistics-service-solr-index" title="Permalink to this headline">¶</a></h2> <p>Currently Metacat writes access information to the table ‘access_log’, which has the following fields:</p> <table border="1" class="docutils"> <colgroup> <col width="29%" /> <col width="71%" /> </colgroup> <tbody valign="top"> <tr class="row-odd"><td>name</td> <td>data type</td> </tr> <tr class="row-even"><td>entryid</td> <td>bigint</td> </tr> <tr class="row-odd"><td>ip_address</td> <td>character varying(512)</td> </tr> <tr class="row-even"><td>user_agent</td> <td>character varying(512)</td> </tr> <tr class="row-odd"><td>principal</td> <td>character varying(512)</td> </tr> <tr class="row-even"><td>docid</td> <td>character varying(250)</td> </tr> <tr class="row-odd"><td>event</td> <td>character varying(512)</td> </tr> <tr class="row-even"><td>date_logged</td> <td>timestamp without time zone</td> </tr> </tbody> </table> <p>In order to provide fast queries, aggregation and faceting of selected fields, access log information will be exported from the current ‘access_log’ table and from the ‘systemmetadata’ table into a new Solr index that 
will be configured in Metacat as a second Solr core. The new Solr index will be based on access events and will contain the fields shown in the following table:</p> <table border="1" class="docutils"> <colgroup> <col width="42%" /> <col width="58%" /> </colgroup> <tbody valign="top"> <tr class="row-odd"><td>name</td> <td>data type</td> </tr> <tr class="row-even"><td>id</td> <td>str</td> </tr> <tr class="row-odd"><td>datetime</td> <td>date</td> </tr> <tr class="row-even"><td>event</td> <td>str</td> </tr> <tr class="row-odd"><td>location</td> <td>location</td> </tr> <tr class="row-even"><td>pid</td> <td>str</td> </tr> <tr class="row-odd"><td>rightsHolder</td> <td>str</td> </tr> <tr class="row-even"><td>principal</td> <td>str</td> </tr> <tr class="row-odd"><td>size</td> <td>int</td> </tr> <tr class="row-even"><td>formatId</td> <td>str</td> </tr> </tbody> </table> <p>An example entry in the new Solr index is shown below:</p> <div class="highlight-python"><div class="highlight"><pre><doc> <str name="id">2E3E8935-364E-4000-9357-6CE4E067D236</str> <date name="datetime">2014-01-01T01:01:01Z</date> <str name="event">read</str> <location name="location">45.17614,-93.87341</location> <str name="pid">sla.2.1</str> <str name="rightsHolder">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str> <str name="principal">uid=williams,o=unaffiliated,dc=ecoinformatics,dc=org</str> <int name="size">52273</int> <str name="formatId">eml://ecoinformatics.org/eml-2.0.1</str> </doc> </pre></div> </div> <p>The second Solr core that will contain usage statistics will require a modification to the existing solr.xml file:</p> <div class="highlight-python"><div class="highlight"><pre><solr persistent="false"> <!-- adminPath: RequestHandler path to manage cores. If 'null' (or absent), cores will not be manageable via request handler --> <cores adminPath="/admin/cores" defaultCoreName="collection1"> <core name="collection1" instanceDir="." 
/> <core name="stats" instanceDir="."/> </cores> </solr> </pre></div> </div> <p>A Java TimerTask will run the import method, which will read event records from the Metacat access_log table, combine each record with data from the systemmetadata table, and write the combined entry to the stats Solr index. Access_log entry types such as ‘synchronization_failed’ and ‘replication’ will be filtered out and will not be written to the Solr index. The time of the last record imported from access_log will be stored so that subsequent imports start from the next unimported event record. If required, the data may be aggregated by time interval, such as week or month.</p> <p>The statistics service will be exposed as a new query engine with a DataONE URL such as:</p> <div class="highlight-python"><div class="highlight"><pre>https://hostname/knb/d1/mn/v1/query/stats/<query> </pre></div> </div> <p>Queries will be passed to the new Solr query engine using the standard Solr query syntax.</p> <p>One new class, StatsQueryService, will be added to Metacat to handle stats queries. Figure 2 shows a call trace for a statistics service query.</p> <div class="figure" id="id1"> <img alt="_images/stats-query-sequence-diagram.png" src="_images/stats-query-sequence-diagram.png" /> <p class="caption"><span class="caption-text">Figure 2. Statistics query sequence diagram.</span></p> </div> <p>The StatsQueryService class will transform the incoming query to Solr parameters, issue the query, and return the query result as a byte stream of text/html content.</p> </div> <div class="section" id="statistics-service-usage"> <h2>18.3.4. Statistics Service Usage<a class="headerlink" href="#statistics-service-usage" title="Permalink to this headline">¶</a></h2> <p>The following sections show some of the queries that will be available through the statistics service.</p> <div class="section" id="usage-of-pids-provided-by-a-specified-rights-holder"> <h3>18.3.4.1. 
Usage of pids provided by a specified rights holder<a class="headerlink" href="#usage-of-pids-provided-by-a-specified-rights-holder" title="Permalink to this headline">¶</a></h3> <p>The following example queries the download volume for pids whose rightsHolder is williams, with download size statistics aggregated by pid:</p> <div class="highlight-python"><div class="highlight"><pre>http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:read&stats=true&stats.field=size&rows=0&stats.facet=pid </pre></div> </div> <p>The following result is returned:</p> <div class="highlight-python"><div class="highlight"><pre><?xml version="1.0" encoding="UTF-8"?> <response> ... <result name="response" numFound="8" start="0"/> <lst name="stats"> <lst name="stats_fields"> <lst name="size"> <double name="min">30.0</double> <double name="max">1000.0</double> <double name="sum">3150.0</double> <long name="count">8</long> <long name="missing">0</long> <double name="sumOfSquares">3004500.0</double> <double name="mean">393.75</double> <double name="stddev">502.0226944215627</double> <lst name="facets"> <lst name="pid"> <lst name="sla.3.1"> <double name="min">1000.0</double> <double name="max">1000.0</double> <double name="sum">3000.0</double> <long name="count">3</long> <long name="missing">0</long> <double name="sumOfSquares">3000000.0</double> <double name="mean">1000.0</double> <double name="stddev">0.0</double> </lst> <lst name="sla.2.1"> <double name="min">30.0</double> <double name="max">30.0</double> <double name="sum">150.0</double> <long name="count">5</long> <long name="missing">0</long> <double name="sumOfSquares">4500.0</double> <double name="mean">30.0</double> <double name="stddev">0.0</double> </lst> </lst> </lst> </lst> </lst> </lst> </response> </pre></div> </div> <p>The previous query can be constrained to a specific period by adding a time range filter, for example:</p> <div class="highlight-python"><div 
class="highlight"><pre>&fq=datetime:[2013-01-01T23:59:59Z TO 2013-04-30T23:59:59Z] </pre></div> </div> </div> <div class="section" id="data-uploads"> <h3>18.3.4.2. Data uploads<a class="headerlink" href="#data-uploads" title="Permalink to this headline">¶</a></h3> <p>The following query shows counts of data uploads by a specified user, grouped by format type:</p> <div class="highlight-python"><div class="highlight"><pre>http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=rightsHolder:uid=williams*&fq=event:create&facet=true&facet.field=formatId&rows=0 </pre></div> </div> <div class="highlight-python"><div class="highlight"><pre><?xml version="1.0" encoding="UTF-8"?> <response> ... <result name="response" numFound="3" start="0"/> <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="formatId"> <int name="BIN">2</int> <int name="eml://ecoinformatics.org/eml-2.1.1">1</int> <int name="text/csv">0</int> </lst> </lst> <lst name="facet_dates"/> <lst name="facet_ranges"/> </lst> </response> </pre></div> </div> </div> <div class="section" id="data-downloads"> <h3>18.3.4.3. Data downloads<a class="headerlink" href="#data-downloads" title="Permalink to this headline">¶</a></h3> <p>The following query shows data download counts by a specific user for each month in 2013:</p> <div class="highlight-python"><div class="highlight"><pre>http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:*williams*&fq=event:read&fq=formatId:BIN&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH </pre></div> </div> <div class="highlight-python"><div class="highlight"><pre><?xml version="1.0" encoding="UTF-8"?> <response> ... 
<lst name="facet_ranges"> <lst name="datetime"> <lst name="counts"> <int name="2013-01-01T01:01:01Z">0</int> <int name="2013-02-01T01:01:01Z">0</int> <int name="2013-03-01T01:01:01Z">0</int> <int name="2013-04-01T01:01:01Z">0</int> <int name="2013-05-01T01:01:01Z">0</int> <int name="2013-06-01T01:01:01Z">2</int> <int name="2013-07-01T01:01:01Z">1</int> <int name="2013-08-01T01:01:01Z">0</int> <int name="2013-09-01T01:01:01Z">0</int> <int name="2013-10-01T01:01:01Z">0</int> <int name="2013-11-01T01:01:01Z">0</int> <int name="2013-12-01T01:01:01Z">0</int> </lst> <str name="gap">+1MONTH</str> <date name="start">2013-01-01T01:01:01Z</date> <date name="end">2014-01-01T01:01:01Z</date> </lst> </lst> </lst> </response> </pre></div> </div> <p>The following query shows EML metadata downloads by a specific user for each month in 2013:</p> <div class="highlight-python"><div class="highlight"><pre>http://myd1host/knb/d1/mn/v1/query/stats/q=*:*&fq=principal:*williams*&fq=event:read&fq=formatId:*eml*&facet=true&facet.field=event&facet.range=datetime&facet.range.start=2013-01-01T01:01:01Z&facet.range.end=2013-12-31T23:59:59Z&facet.range.gap=%2B1MONTH </pre></div> </div> <div class="highlight-python"><div class="highlight"><pre><?xml version="1.0" encoding="UTF-8"?> <response> ... 
<lst name="facet_ranges"> <lst name="datetime"> <lst name="counts"> <int name="2013-01-01T01:01:01Z">0</int> <int name="2013-02-01T01:01:01Z">0</int> <int name="2013-03-01T01:01:01Z">0</int> <int name="2013-04-01T01:01:01Z">1</int> <int name="2013-05-01T01:01:01Z">1</int> <int name="2013-06-01T01:01:01Z">0</int> <int name="2013-07-01T01:01:01Z">2</int> <int name="2013-08-01T01:01:01Z">0</int> <int name="2013-09-01T01:01:01Z">0</int> <int name="2013-10-01T01:01:01Z">0</int> <int name="2013-11-01T01:01:01Z">0</int> <int name="2013-12-01T01:01:01Z">0</int> </lst> <str name="gap">+1MONTH</str> <date name="start">2013-01-01T01:01:01Z</date> <date name="end">2014-01-01T01:01:01Z</date> </lst> </lst> </lst> </response> </pre></div> </div> </div> </div> <div class="section" id="unresolved-issues-questions"> <h2>18.3.5. Unresolved Issues/Questions<a class="headerlink" href="#unresolved-issues-questions" title="Permalink to this headline">¶</a></h2> <blockquote> <div><ol class="arabic simple"> <li>How is the location of an event determined? What do we mean by location?</li> <li>Currently Solr (3.x and 4.x) doesn’t allow the stats component to be faceted by date/time interval, so it isn’t possible to use the stats component to calculate total download volume for each time interval over a time range, such as every month for the last 10 years. Therefore, for calculated amounts, a separate query is required for each time interval.</li> <li>Where will citation info come from? Do we import this into the Solr index?</li> <li>Are there text fields that the statistics service should include; for example, 
do we want to provide statistics for queries such as “how many pids were downloaded that mention kelp?”?</li> </ol> </div></blockquote> </div> </div> </div> </div> </div> <div class="clearer"></div> </div> <div class="footer"> <div class="small-print"> © Copyright 2012, Regents of the University of California. Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.3.1. </div> </div> </div> </body> </html>