Spatial Search and Plotting Using Geohashes
-------------------------------------------

Introduction
------------

A geohash is an encoded character string that is computed from geographic coordinates. For example
the approximate latitude and longitude of
The National Center For Ecological Analysis and Synthesis is 34.419279, -119.698472 from which the geohash of *9q4gu1y4z* can be derived.
The geohash algorithm is bidirectional, so
geographic coordinates can be encoded into geohashes and geohashes can be decoded to obtain coordinates.
 
Geohashes have the property that characters can be incrementally removed from the right side of the geohash to represent
a geographic location less precisely. A geohash is an approximation of a point, where each length of the geohash 
corresponds to a rectangle (a geohash tile) that is an approximation of the original encoded geographic coordinate.

This feature of geohashes can be useful for searching and plotting at different resolutions.

Table 1 shows the relationship between geohash length and the size of the rectangle 
represented by that geohash at the equator.

.. table:: **Table 1** Geohash Tile Sizes   

   ======= =====================
   Length  Tile Size
   ======= =====================
    1      5,009.4km x 4,992.6km
    2      1,252.3km x 624.1km
    3      156.5km x 156km
    4      39.1km x 19.5km
    5      4.9km x 4.9km
    6      1.2km x 609.4m
    7      152.9m  x 152.4m
    8      38.2m x 19m
    9      4.8m x 4.8m
   10      1.2m x 59.5cm
   11      14.9cm x 14.9cm
   12      3.7cm x 1.9cm 
   ======= =====================


Table 2 shows the relationship between a geohash and the resulting latitude and longitude decoded from the different
length geohashes. As characters are removed from the original geohash '9q4gu1y4z' the bounding rectangle and the accuracy
of the decoded geohash becomes less precise. The decoded geohash corresponds to the centroid of the bounding rectangle.

.. table:: **Table 2.** Geohash length vs Accuracy

  ========== =========================== =============================================
  Geohash    Tile Center lat, long       Tile minlat, minlong, maxlat, maxLong
  ========== =========================== =============================================
  9          22.5, -112.5                0, -135, 45, -90
  9q         36.5625, -118.125           33.75, -123.75, 39.375, -112.5
  9q4        34.45312, -120.23437        33.75, -120.9375, 35.15625, -119.53125
  9q4g       34.36523, -119.70703        34.27734, -119.88281, 34.45312, -119.53125
  9q4gu      34.43115, -119.68505        34.40917, -119.70703, 34.45312, -119.66308
  9q4gu1     34.41741, -119.70153        34.41467, -119.70703, 34.42016, -119.69604
  9q4gu1y    34.41947, -119.69810        34.41879, -119.69879, 34.42016, -119.69741
  9q4gu1y4   34.41922, -119.69861        34.41913, -119.69879, 34.41930, -119.69844
  9q4gu1y4z  34.41928, -119.69846        34.41926, -119.69849, 34.41930, -119.69844
  ========== =========================== =============================================
    
Geohashes comprise a nested spatial indexing system with each level of geohashes tile containing 32 tiles of the next
smaller tile size. The level one
geohashes (length=1) divide the earth into 32 tiles. Each of these 32 tiles is then subdivided into 32 level 2 tiles
and so on.

Geohashes also have the property that all smaller tiles within the enclosing geohash tile begin with the same leading
characters, therefor for the level 1 tile '9', all level 2 sub-tiles begin with '9': '90', '91', '93',..., '9z'.

For example the level 3 geohash tile that encloses much of Santa Barbara County is '9q4'. Also
contained in this bounding rectangle is the city center of Santa Maria (geohash 9q4qg7j2hmdz), Goleta (geohash 9q4gckb5jxu7)
and Santa Barbara (geohash 9q4gu4n7y5b7) all of which begin with the characters '9q4' and fall within the
'9q4' geohash rectangle. This property is very useful for searching and sorting datastores that contain geohashes.

DataONE Search Index and Geohashes
----------------------------------

The DataONE search index contains geohashes that have been computed for each geographic coverage associated with a PID containing
geographic coverage information, which currently includes metadata objects in EML and FGDC formats. The search index
is described here: `<SearchMetadata.html>`_. Each PID in the search index has a geohash computed at nine different resolutions,
corresponding to the geohash lengths shown *Table 3*. The field names are appended with the geohash length, so for example the
field *geohash_1* has a string length of one and corresponds to the largest tile size in *Table 1*. 
Geohashes are added to the search index at different lengths to allow for searching and plotting at different resolutions.

For EML documents, the geohashes are computed by determining the centroid of the XML elements *northBoundingCoordinate, southBoundingCoordinatem, 
eastBoundingCoordinate, westBoundingCoordinate* which are child element of *//dataset/coverage/geographicCoverage/boundingCoordinates*. 
Because any number
of coverages may be defined with the EML format, the geohashes for these coverages are stored in a Solr multi-valued field. EML allows for
the four bounding coordinates to specify a single coordinate (i.e. westBoudingCoordinate=eastBoundingCoordinate and 
northBoundingCoordinate=southBoundingCoordinate), in which case this location is used to compute a geohash.

For FGDC documents, the XML elements *northBoundingCoordinate, southBoundingCoordinate, eastBoundingCoordinate, westBoundingCoordinate*
(parent element //metadata/idinfo/spdom/bounding) are used to compute the geohash using the same method as for EML.

Using Geohashes for plotting
----------------------------
Geohashes can be used to efficiently plot the location of the geographic coverages. The following examples show different
search and plotting strategies that are possible using geohashes. Several public domain Javascript libraries are available for assisting
in developing web clients that could use geohashes. For example, the Javascript library *node-geohash* (available at https://github.com/sunng87/node-geohash)
contains routines to encode and decode geohashs in addition to other spatial operators using geohashes. This library will be used for the examples that follow.

**Example: Retrieve geohashes as facets in Solr search**

In this example the Solr search index is queried for a particular field of interest, with the associated geohash counts being returned as
a field facet. Used in this way, the facet field of geohashes becomes a spatial bin, with the size of the geographic area and the spatial
resolution of the binning selected by the geohash level. 

For example, if we are interested in plotting the location of PIDs that have some associated with kelp, we could query the search index with the Solr query:

..

  https://cn.dataone.org/cn/v1/query/solr?q=kelp&facet=true&facet.field=geohash_5&facet.mincount=1&rows=0

The portion of the response that we are interested in are the facet counts:

::

  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="geohash_5">
        <int name="9q4qx">5</int>
        <int name="9q4ce">4</int>
        <int name="9q4cf">4</int>
        <int name="9q4ey">4</int>
        <int name="9q4ge">4</int>
        <int name="9q4gx">4</int>
        <int name="9q4kj">4</int>
        <int name="9q4s4">4</int>
        <int name="9q4ez">3</int>
        <int name="9q4g8">3</int>
        <int name="9q4gb">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
  
To display these search results each geohash can be decoded to obtain the latitude, longitude of the geohash. 
For example, 
we can obtain the coordinates of the first geohash returned from the search as shown in the following code
fragment:

::

  // Use the node-geohash Javascript library
  var geohashLib= require('ngeohash');
  
  // Return [minlat, minlon, maxlat, maxlon] of geohash tile
  var coords = geohashLib.decode("9q4qx");

The variable *coords* now contains the latitude, longitude (coords.latitude, coords.logitude) of the decoded geohash, which is center point of the geohash tile,
in this case for level 5 geohash tiles. We could then place a marker with counts at these coordinates to indicate how many hits occured in this 
geohash tile.

Care must be taken in selecting the right geohash tile level for the Solr query, with the consideration of smaller geohash tiles
providing more accurate spatial results, but returning a greater number of facet results as each greater resolution tile
covers a smaller geographic area.

Using Geohashes for searching
-----------------------------

Geohashes in the search index are multi-valued, so that geohashes have been computed for each geographic coverage for a PID. Since
the geohashes are indexed at different resolutions, you can search all coverages at different spatial resolutions.

**Example: Search using a bounding box**

One appraach to using geohashes for search is to retrieve all PIDs with geohashes that overlap a search box. 
First determine which geohashes overlap a bounding rectangle, in this case the bounding rectangle that encompasses
Santa Cruz Island in the Santa Barbara Channel:

::

  // Use the node-geohash Javascript library
  var geohashLib= require('ngeohash');

  // Search for all geohashes within a geographic bounding box
  // which might be the current browser viewport or alternatively a
  // region of interest

  // Santa Cruz Island bounding coordinates (approximate)
  // lower left

  minlat = 33.959878;
  minlon = -119.914398;

  // upper right
  maxlat = 34.075341;
  maxlon = -119.520264;

  // Find all geohashes that overlap the specified bounding box.
  var geohashes = geohashLib.bboxes (minlat, minlon, maxlat, maxlon, precision=4);

The geohashes that overlap the search bounding box are returned. These geohashes can then be used
to find PIDs that have a coverage that is within these geohash tiles:

::

  https://cn.dataone.org/cn/v1/query/solr?q=*:*&q.op=OR&fq=geohash_4:(9q4c 9q49 9q51)&fl=id
  
This Solr filter query will find all entries for which one of the level 4 geohashes matches any of the specified
geohashes. Because the geohash_* fields are indexed as Solr multivalued fields, all coverages for a PID are
compared to see if they match.

Geohash algorithm
-----------------

A description of the geohash algorith is outside the scope of this document, but an excellent description of the
can be found at http://en.wikipedia.org/wiki/Geohash.
 
One caveat of the geohash algorith that may be of interest to end users however, is that because of the tile
ordering ("Z" ordering) where tile geohashes are incremented in a "Z" pattern and not strictly by
row, column, it is not gauranteed that adjancent tiles have similar geohashes, for example, the level 1
geohashes at the equator, starting from the International Date Line, are named "8", "9", "d", "e", "s", "t"
and so on.