16. Enabling Web Searches: Sitemaps

Sitemaps are XML files that tell search engines - such as Google, which is discussed in this section - which URLs on your websites are available for crawling. Currently, the only way for a search engine to crawl and index Metacat so that individual metadata entries are available via Web searches is with a sitemap. Metacat automatically creates sitemaps for all public documents in the repository that meet these criteria:

  • Is publicly readable
  • Is metadata
  • Is the newest version in a version chain
  • Is not archived

However, you must register the sitemaps with the search engine before it will take effect.

16.1. Configuration

Metacat’s sitemaps functionality is controlled by three properties in metacat.properties.

######## Sitemap section ######################################### # Sitemap Interval (in milliseconds) between rebuilding the sitemap sitemap.interval=86400000 # Base part of the URLs for the location of the sitemap files themselves. # Either full URL or absolute path. Trailing slash optional. sitemap.location.base=/metacatui # Base part of the URLs for the location entries in the sitemaps which should # be the base URL of the dataset landing page. # Either full URL or absolute path. Trailing slash optional. sitemap.entry.base=/metacatui/view

  • sitemap.interval: Controls the interval, in milliseconds, between rebuilding the sitemap index and sitemap files.
  • sitemap.location.base: Controls the URL pattern used in the sitemap_index.xml file. You can use either a full URL (e.g., https://example.com/some_path) or a URL relative to your server (e.g., /some_path). This is different than the sitemap.entry.base property (see directly below).
  • sitemap.entry.base: Controls the URL pattern used for the entires in the individual sitemap files (e.g., sitemap1.xml). You can use either a full URL (e.g., https://example.com/some_path) or a URL relative to your server (e.g., /some_path).

16.2. Creating a Sitemap

Metacat automatically generates a sitemap file for all public documents in the repository on a daily basis. The sitemap file(s) must be available via the Web on your server, and must be registered with Google before they take effect. For information on the sitemap protocol, please refer to the Google page on using the sitemap protocol. You can view Metacat’s sitemap files at:

<your_web_context>/sitemaps

The directory contains an index file:

sitemap_index.xml

and one or more sitemap XML files named:

sitemap<X>.xml

where <X> is a number (e.g., 1 or 2) used to increment each sitemap file. Because Metacat limits the number of sitemap entries in each sitemap file to 50,000, the servlet creates an additional sitemap file for each group of 50,000 entries.

Verify that your sitemap files are available to the Web by browsing to:

<your_web_context>/sitemaps/sitemap<X>.xml
(e.g., https://example.org/metacat/sitemaps/sitemap1.xml)

16.3. Registering a Sitemap

Before Google will begin indexing the public files in your Metacat, you must register the sitemaps. To register your sitemaps and ensure that they are up to date:

  1. Register for a Google Webmaster Tools account, and add your Metacat site to the Dashboard.
  2. From your Google Webmaster Tools site account, register your sitemaps. See the Google help site for more information about how to register sitemaps. Note: Register the full URL path to your sitemap files, including the http:// (or https://) headers.

Once the sitemaps are registered, Google will begin to index the public documents in your Metacat repository.

NOTE: As you add more publicly accessible data to Metacat, you will need to periodically revisit the Google Webmaster Tools utility to refresh your sitemap registration.