Very Large Data Packages
========================

:Document Status:
  ======== ==================================================================
  Status   Comment
  ======== ==================================================================
  DRAFT    (rnahf) committed minor modifications shortly (1hr) after email 
           to developers@dataone.org
  ======== ==================================================================

.. contents::

Synopsis
~~~~~~~~~

While many data packages are of modest size (<100 objects), some large studies 
generate upwards of 100,000 datasets that form a single data package.  These very 
large data packages challenge performance limits in the DataONE data ingest cycle and
can present usability issues in user interfaces not prepared for them.  Both memory
and processor time increase dramatically with the number of data objects and
relationships expressed.

Potential submitters of packages containing large numbers of data objects should be 
mindful that packages with so many objects are likely to be unusable for the majority 
of interested parties, and should consider consolidating and compressing the 
individual objects into fewer objects to allow easier discovery, inspection, and 
download.  This is especially worth considering if the objects in the package would 
not usefully be retrieved individually.

Creation of large resource maps is potentially the most time-consuming activity,
depending on the tool used.  Deserialization is comparatively quick, but its memory
requirements are high, depending on the type of model used during parsing.  At the
indexing stage, the issues are the time needed to process index record updates and
the resulting number of items in certain fields of the solr records.  Finally,
high-level client methods should be able to perform whole-package downloads safely,
but need to be able to detect large data packages that could overwhelm their
ability to handle them.

Below are discussions and test results for the known issues related to very large 
resource maps, presented in the order in which they are encountered in the object 
lifecycle.

Identified Issues
~~~~~~~~~~~~~~~~~

Resource map creation
---------------------
Use of the foresite library for building resource maps includes many checks to make
sure that the map validates.  First, the identifiers of the data and metadata objects 
are added to a graph held in memory; then the graph is serialized to RDF/XML format.
For small packages the overhead of building the graph and performing consistency 
checks is minimal, but both the memory and the time to build appear to scale 
geometrically with the number of objects in the package.
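A rough sketch of that workflow, using the ResourceMapFactory convenience class from 
d1_libclient_java, might look like the following; the class and method names are 
recalled from the library and should be verified against the version in use, the 
identifier values are placeholders, and exception handling is omitted::

  Identifier packageId = new Identifier();          // org.dataone.service.types.v1.Identifier
  packageId.setValue("resourceMap_pid");            // placeholder pid for the map itself
  Identifier metadataId = new Identifier();
  metadataId.setValue("science_metadata_pid");      // placeholder pid for the science metadata

  List<Identifier> dataIds = new ArrayList<Identifier>();
  // ... add one Identifier per data object ...

  Map<Identifier, List<Identifier>> idMap = new HashMap<Identifier, List<Identifier>>();
  idMap.put(metadataId, dataIds);                   // the metadata "documents" these data objects

  // builds the foresite graph in memory, then serializes it to RDF/XML
  ResourceMap rem = ResourceMapFactory.getInstance().createResourceMap(packageId, idMap);
  String rdfXml = ResourceMapFactory.getInstance().serializeResourceMap(rem);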

Test results on different size resource maps are summarized below.  In all cases
there is one metadata object that documents all of the objects.

==============  ================  ===========  ===========
 # of objects    time to build     memory       file size
==============  ================  ===========  ===========
    10                                               7 KB
    33                                              20 KB
   100              2  seconds        45 MB         60 KB
   330                                             192 KB
  1000              6  seconds        20 MB        600 KB
  3300             24  seconds        23 MB          2 MB
 10000            4.5  minutes        30 MB          6 MB
 33000             66  minutes       142 MB         20 MB
==============  ================  ===========  ===========



For creating very large resource maps, generation time using the Java foresite 
toolkit is an issue, and directly writing a serialized resource map is much faster.
For example, using an existing resource map as a template and a short Perl script, 
a 100,000-member resource map was created in approximately 10 seconds, with the only 
memory cost being that of holding the identifier array and any output buffering.
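The same template approach is easy to reproduce in any language.  A minimal Java 
sketch follows, where HEADER, MEMBER_TEMPLATE, and FOOTER are placeholder strings 
copied from an existing (small) resource map, and memberPids is the list of 
identifiers to include::

  // Stream the resource map directly as RDF/XML, skipping graph construction entirely.
  try (PrintWriter out = new PrintWriter("bigResourceMap.rdf", "UTF-8")) {
      out.println(HEADER);                        // rdf:RDF open tag, namespaces, aggregation
      for (String pid : memberPids) {
          // one rdf:Description block per member (two %s slots for the member pid)
          out.printf(MEMBER_TEMPLATE, pid, pid);
      }
      out.println(FOOTER);                        // close the aggregation and rdf:RDF element
  }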



RDF Deserialization
---------------------
Deserialization happens on the client side when downloading resource maps, and on 
Coordinating Nodes both when validating the resource map and when indexing its 
relationships into the solr index.  Performance metrics obtained 
from JUnit tests monitored with Java VisualVM are summarized below.  Fully
expressed resource maps were deserialized using the default simple model,
and again using an OWL model loaded with the ORE schema to enable semantic 
reasoning.  The reasoning model adds an additional 268 triples from the ORE schema. 
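The two parsing approaches correspond roughly to the sketch below, which assumes the 
Jena RDF framework that underlies the foresite toolkit; the package names, the choice 
of OntModelSpec, and the ORE schema location are assumptions rather than details taken 
from the DataONE test code::

  import com.hp.hpl.jena.ontology.OntModel;
  import com.hp.hpl.jena.ontology.OntModelSpec;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import java.io.FileInputStream;

  public class DeserializeResourceMap {
      public static void main(String[] args) throws Exception {
          // "default simple model": a plain in-memory graph, no inference
          Model simple = ModelFactory.createDefaultModel();
          simple.read(new FileInputStream(args[0]), null);
          System.out.println("triples (simple): " + simple.size());

          // "reasoning model": an OWL inference model primed with the ORE schema,
          // so that inverse relationships such as ore:isAggregatedBy can be inferred
          OntModel reasoning = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF);
          reasoning.read("http://www.openarchives.org/ore/terms/");
          reasoning.read(new FileInputStream(args[0]), null);
          System.out.println("triples (reasoning): " + reasoning.size());
      }
  }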


==========  =======  ========  ========  =======  ========  ========
    ..             Default model              Reasoning model
----------  ---------------------------  ---------------------------
 # objects  triples   time      memory   triples   time      memory 
==========  =======  ========  ========  =======  ========  ========
    10           61    1 sec.     9 MB       329    2 sec.    13 MB
    33          176    1 sec.    10 MB       444    2 sec.    13 MB
   100          511    2 sec.    15 MB       779    2 sec.    17 MB
   330         1661    2 sec.    20 MB      1929    3 sec.    17 MB
  1000         5011    2 sec.    17 MB      5279    3 sec.    24 MB
  3300        16511    3 sec.    20 MB     16779    4 sec.    40 MB
 10000        50011    6 sec.    30 MB     50279    8 sec.    90 MB
 33000       165011    7 sec.    51 MB    165279   10 sec.   264 MB
100000       500011   15 sec.   138 MB    500279   26 sec.   792 MB
==========  =======  ========  ========  =======  ========  ========

The same information, listed by model size, shows that memory requirements are not 
a simple function of the number of triples but also depend on the model type.  The 
reasoning model uses more memory per triple than the simple model, and at very large 
sizes (in terms of number of triples) the difference becomes substantial.



========  =======  ========  ==========
triples    time     memory   model type
========  =======  ========  ==========
     61    1 sec.    9 MB      simple
    176    1 sec.   10 MB      simple
    329    2 sec.   13 MB     reasoning
    444    2 sec.   13 MB     reasoning
    511    2 sec.   15 MB      simple
    779    2 sec.   17 MB     reasoning
   1661    2 sec.   20 MB      simple
   1929    3 sec.   17 MB     reasoning
   5011    2 sec.   17 MB      simple
   5279    3 sec.   24 MB     reasoning
  16511    3 sec.   20 MB      simple
  16779    4 sec.   40 MB     reasoning
  50011    6 sec.   30 MB      simple
  50279    8 sec.   90 MB     reasoning
 165011    7 sec.   51 MB      simple
 165279   10 sec.  264 MB     reasoning
 500011   15 sec.  138 MB      simple
 500279   26 sec.  792 MB     reasoning
========  =======  ========  ==========

The impact of this is that applications that deserialize RDF files, especially 
automated ones such as the index processor, will need to be able to detect when 
they are dealing with a resource map that could exceed available system resources.

It also seems wise, given that memory is a larger concern than RDF file size,
to specify that resource maps with more than 50,000 triples must fully express 
their relationships, instead of relying on reasoning models to infer the 
semantically-defined inverse relationships.  This implies that if DataONE allows 
resource maps to sparsely populate their relationships, there should also be tools 
to tell whether an RDF document fully expresses its relationships or will rely on 
semantic reasoning.
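One way to perform such a check, sketched here against a simple (non-reasoning) Jena 
Model (called 'simple', as in the sketch above, with Property and RDFNode from the 
same rdf.model package), is to compare the number of forward ore:aggregates statements 
with the number of inverse ore:isAggregatedBy statements.  This heuristic is offered 
as an illustration, not as an existing DataONE tool::

  // A fully expressed map should state the inverse ore:isAggregatedBy
  // for every ore:aggregates triple that it contains.
  String ORE = "http://www.openarchives.org/ore/terms/";
  Property aggregates = simple.createProperty(ORE, "aggregates");
  Property isAggregatedBy = simple.createProperty(ORE, "isAggregatedBy");
  int forward = simple.listStatements(null, aggregates, (RDFNode) null).toList().size();
  int inverse = simple.listStatements(null, isAggregatedBy, (RDFNode) null).toList().size();
  boolean fullyExpressed = (forward > 0 && forward == inverse);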

 
Indexing
---------
When resource maps are synchronized, the map is read and, once all of the package
members are indexed, the relationships in the map are added to the index records
of the data members.  A 10,000-member package will trigger the update of 10,000
index records, adding the metadata object's pid to the 'isDocumentedBy' field.  
Additionally, both the 'contains' field of the resource map's index record and the 
'documents' field of the metadata object's index record will be updated with the 
pids of the 10,000 members.  Such many-valued fields are difficult or impossible 
to display, and are time-consuming to search when queried.

Indexing is by necessity a single-threaded process, one that can update on the 
order of 100 records per minute.  Therefore, a package containing 100,000 members 
will take about 1,000 minutes, or roughly 17 hours, to index.  During this time, 
no other updates will be processed.

Working around this issue requires a redesign of the index processor so that a 
large resource map does not delay other items in the indexing queue.  Ultimately,
the solution would be to implement a different search engine for tracking package 
relationships, expose another search endpoint using SPARQL 
(http://en.wikipedia.org/wiki/SPARQL), and probably hide the query details
behind new DataONE API methods to spare end users from having to learn another
query language to interact with DataONE.


Whole-Package Download
------------------------
The high-level DataPackage.download(packageID) method in d1_libclient implementations
by default downloads the entire collection of data package objects for local use.  
For these very large data packages, the total package size is likely to be gigabytes
of information.  To better support such convenience features, there need to be ways 
to determine the number of members of a package prior to download. 

This would also help in situations where the number of package members is small,
but the individual data objects are large. 


Mitigations
~~~~~~~~~~~~
It is useful for applications to know when a given data package is too large for 
them to work with, or will require special handling.  Ideally, this could be 
determined before deserializing the RDF/XML, and for some clients even prior to 
downloading the resource map itself.

Indexing performance is a function of member count, while deserialization performance
is a function of the number of triples. Download performance is a function of 
total file size.


Determining Member Count
------------------------

For indexed resource maps, the easiest way to get the member count is with the 
query below; the count is returned in the 'numFound' attribute of the solr result, 
without retrieving any of the member records::

  cn/v1/query/solr/?q=resourceMap:{pid}&rows=0
  
For unindexed resource maps, counting the number of occurrences of the term 
"ore:isAggregatedBy" in the RDF file will suffice.

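A minimal sketch of that counting approach, assuming the serialization uses the 
conventional 'ore' prefix and states the inverse relationship exactly once per 
member (for example, as a self-closing element)::

  // Hypothetical helper: count package members by counting "ore:isAggregatedBy" occurrences.
  static int countMembers(java.io.File resourceMap) throws java.io.IOException {
      int count = 0;
      try (java.io.BufferedReader in =
               new java.io.BufferedReader(new java.io.FileReader(resourceMap))) {
          String line;
          while ((line = in.readLine()) != null) {
              for (int idx = line.indexOf("ore:isAggregatedBy"); idx >= 0;
                       idx = line.indexOf("ore:isAggregatedBy", idx + 1)) {
                  count++;
              }
          }
      }
      return count;
  }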

Determining total package size for download
-------------------------------------------
To get the total size of the package, the following solr queries can be used::

  # returns only sizes of package members
  cn/v1/query/solr/?q=resourceMap:{pid}&fl=id,size  

  # returns sizes for package members and the resource map itself
  cn/v1/query/solr/?q=resourceMap:{pid} OR id:{pid}&fl=id,size 

from which the client could calculate the sum of the sizes returned.
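For example, a rough sketch that sums the 'size' values from the default solr XML 
response; the response layout, the 'long' element used for the size field, and the 
need to raise the 'rows' parameter above its default are assumptions that should be 
verified against the CN, cnBaseUrl and pid are placeholders, and the pid may need 
quoting or escaping in the query::

  String query = cnBaseUrl + "/v1/query/solr/?q=resourceMap:" + pid + "&fl=size&rows=200000";
  org.w3c.dom.Document response = javax.xml.parsers.DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().parse(new java.net.URL(query).openStream());
  long totalBytes = 0;
  org.w3c.dom.NodeList fields = response.getElementsByTagName("long");
  for (int i = 0; i < fields.getLength(); i++) {
      org.w3c.dom.Element field = (org.w3c.dom.Element) fields.item(i);
      if ("size".equals(field.getAttribute("name"))) {
          totalBytes += Long.parseLong(field.getTextContent());
      }
  }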

To get the size of the resource map itself (useful for estimating memory requirements)::
  
  # returns size of only the resource map
  cn/v1/query/solr/?q=id:{pid}&fl=id,size   




Determining Memory Requirements for deserialization
---------------------------------------------------
It is the number of triples and the type of model used, more than the number of 
package members, that determines the graph model's memory requirement, so any 
additional triples expressed for each member multiply the model size.  The use of 
ORE proxies or the inclusion of provenance information, for example, are situations 
where this would be the case.  DataONE *is* planning for the inclusion of provenance 
statements in resource maps, so users and developers alike should take this into 
consideration.

The number of triples in an RDF/XML file can be determined either by parsing the
XML, or by estimating from the resource map's byte count.  Parsing the XML amounts
to counting the sub-elements of all of the "rdf:Description" elements.  For example,
in Java, given the file already parsed (with namespaces enabled) into a DOM Document
named 'doc', and with RDF_NS holding the RDF namespace URI
(http://www.w3.org/1999/02/22-rdf-syntax-ns#)::

  // one triple per predicate-object child element of each rdf:Description element
  int tripleCount = 0;
  NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
  for (int i = 0; i < descriptions.getLength(); i++) {
      NodeList children = descriptions.item(i).getChildNodes();
      for (int j = 0; j < children.getLength(); j++) {
          if (children.item(j) instanceof Element) tripleCount++;
      }
  }

To estimate from the file size, an upper limit on the number of triples can be deduced.
RDF/XML organizes triples as predicate-object sub-elements under an rdf:Description
element for each subject. If the ratio of subjects to triples is low, then the number
of bytes per triple is dominated by the length of the predicate-object sub-element.
For a 30-character identifier, that sub-element is about 100 characters, and so::


  upper limit on the number of triples = file size (bytes) / 100 bytes-per-triple
  
So for example, a 5 MB resource map has at most about 50,000 triples, assuming an 
average identifier length of 30 characters (URL encoded).

For a point of reference, a resource map for 1 metadata object documenting 1000 
objects, expressing the 'ore:aggregates', 'ore:isAggregatedBy', 'cito:documents', 
'cito:isDocumentedBy', and 'cito:identifier' predicates, creates 5005 triples using 
1003 subjects and produced a 600 KB file.  Applying the upper-limit approximation 
(600 KB / 100 = 6 K) gives 6000 triples, an over-estimate of roughly 1000 triples, 
close to the number of subjects.

Also note that long identifiers, and identifiers dominated by non-ASCII characters 
that are percent-encoded in the file (3 bytes per character), can make the upper 
limit a larger over-estimate than expected; conversely, short identifiers in the 
resource map make the upper limit less robust, since the estimate may undercount 
the triples.


Determining the memory requirement from the number of triples can be done either
by interpolating from the tables above, or by equation.  Curve-fits of the 
deserialization performance tests using polynomial equations gave the following:: 

 simple model memory (MB) ~ 2.6E-15 * triples^3 - 1.7E-09 * triples^2 + 0.00044 * triples + 12.7
 (R^2 = 0.99466)

 reasoning model memory (MB) ~ 1.25E-10 * triples^2 + 0.0015 * triples + 14.3
 (R^2 = 0.99997)
 
Note that the simple model required (rightly or wrongly) a third-order equation 
to achieve a curve fit with R^2 > 0.9, whereas the reasoning model data were highly
correlated with a quadratic (second-order) equation.

Expressed as a function of file size (bytes)::

 simple model memory (MB) ~ 2.6E-21 * size^3 - 1.7E-13 * size^2 + 4.4E-06 * size + 12.7

 reasoning model memory (MB) ~ 1.25E-14 * size^2 + 1.5E-05 * size + 14.3
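For convenience, the fitted curves can be wrapped in small helper functions; the 
coefficients below are copied from the fits above and apply only to the test 
configuration described in this document::

  // Estimated deserialization memory (MB) as a function of triple count,
  // using the curve fits above; valid only for the tested setup.
  static double simpleModelMemoryMB(long triples) {
      return 2.6e-15 * Math.pow(triples, 3) - 1.7e-9 * Math.pow(triples, 2)
              + 0.00044 * triples + 12.7;
  }

  static double reasoningModelMemoryMB(long triples) {
      return 1.25e-10 * Math.pow(triples, 2) + 0.0015 * triples + 14.3;
  }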