Sitemaps 5

Add-on Sitemaps: Documentation

Learn how to administer the Sophora Sitemaps add-on, which generates XML compliant with the Sitemaps protocol for search engine optimization.

Sitemaps protocol

The open Sitemaps protocol (https://sitemaps.org/protocol.html) is a human- and machine-readable XML format for describing the structure of a website. Search engines read a sitemap to discover a site's pages together with metadata such as each page's last modification date. Because all modern search engines understand the Sitemaps protocol, providing a sitemap helps them crawl your site more completely, which benefits your SEO. The Sophora Sitemaps add-on supports version 0.90 of the protocol. It generates the XML automatically from your website structure and uses customizable Java mapping classes to provide the metadata.

Google extensions

In addition to the open Sitemaps standard, this add-on supports the Google extensions for news (version 0.9), images (version 1.1), and videos (version 1.1). These sitemap extensions, along with further links, are explained on the Google support pages (https://support.google.com/webmasters/answer/183668?hl=en&ref_topic=4581190#extensions).

Sitemap XML that declares the namespaces of these extensions looks like this:

<urlset
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
    xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
    <url>
        <loc>
            http://my-site.subshell.com:8080/live/demosite/chronicle/2010/index.html
        </loc>
        <lastmod>2016-06-06T13:22:34.653+02:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.9</priority>
    </url>
    ...
</urlset>
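
To make the structure of a standard entry concrete, the following minimal sketch writes a urlset with a single url element using only the JDK's StAX API. It is not part of the add-on, which generates this XML for you; the class and method names (SitemapEntrySketch, singleEntry) are hypothetical and exist only for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Illustrative sketch only; the add-on generates this XML for you. */
final class SitemapEntrySketch {

    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Builds a urlset with one url entry; all values are passed in as strings. */
    static String singleEntry(String loc, String lastMod, String changeFreq, String priority) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("urlset");
            xml.writeDefaultNamespace(SITEMAP_NS);
            xml.writeStartElement("url");
            writeElement(xml, "loc", loc);
            writeElement(xml, "lastmod", lastMod);
            writeElement(xml, "changefreq", changeFreq);
            writeElement(xml, "priority", priority);
            xml.writeEndElement(); // closes url
            xml.writeEndElement(); // closes urlset
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write sitemap XML", e);
        }
        return out.toString();
    }

    private static void writeElement(XMLStreamWriter xml, String name, String value)
            throws XMLStreamException {
        xml.writeStartElement(name);
        xml.writeCharacters(value); // escapes &, < and > in the value
        xml.writeEndElement();
    }
}
```

Note that special characters in the location, such as the ampersand in a query string, must be escaped in the XML; writeCharacters takes care of this in the sketch.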

Project Setup

The Sophora Sitemaps add-on is a separate Maven dependency for your delivery. Installing it provides the Sitemaps servlet, which you then have to register in your web.xml and templates.xml.

Maven dependency: pom.xml

<dependency>
	<groupId>com.subshell.sophora</groupId>
	<artifactId>sophora-sitemaps</artifactId>
	<version>4.0.0</version>
</dependency>

Servlet mapping: web.xml

<servlet>
	<servlet-name>SitemapServlet</servlet-name>
	<servlet-class>com.subshell.sophora.delivery.sitemap.SitemapServlet</servlet-class>
</servlet>
<servlet-mapping>
	<servlet-name>SitemapServlet</servlet-name>
	<url-pattern>/system/servlet/sitemap.servlet</url-pattern>
</servlet-mapping>

Sophora template mapping: templates.xml

<nodetype name="sophora-nt:structureNode">
	<templateset>
		...
		<template type="sitemap">/system/servlet/sitemap.servlet</template> 
	</templateset>
</nodetype>

Preparing a Solr Core

This add-on generates the entries for your sitemap by reading all relevant documents from a specific Solr core. Converters (the mappers described below) create those entries from the Solr documents. Technically any Solr core will do for this purpose, but we recommend a dedicated Solr core that contains only the documents you want in your sitemap.

Custom Sophora Sitemap mapping classes

To use custom Sophora Sitemap mapping classes, implement the com.subshell.sophora.delivery.sitemap.api.IMapperFactory interface. The class should be located in the package specified by the property sophora.delivery.sitemap.basePackage.

The mapper factory has two main purposes:

  1. It defines the Solr core from which documents are read and converted to entries in your sitemap.
  2. It provides your mapper implementations. There are four mappers to provide: one each for the Google news, video, and image extensions, and one for all other documents.

Each mapper then provides properties based on a Solr document, and the XML is generated from this mapping. For convenience, you can extend the com.subshell.sophora.delivery.sitemap.impl.AbstractMapper class:

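import java.math.BigDecimal;
import java.util.Map;
import java.util.Objects;

import com.subshell.sophora.delivery.sitemap.impl.AbstractMapper;
// The imports for IUrlMapper, ChangeFreq and DateTime are omitted here;
// add them according to the add-on and date library versions you use.

public class CustomUrlMapper extends AbstractMapper implements IUrlMapper {

	// Solr fields that hold the document URL and its modification date
	public static final String SOLR_FIELD_URL = "url_s";
	public static final String SOLR_FIELD_LAST_MOD = "sophora_modificationDate_dt";

	public CustomUrlMapper(Map<String, Object> solrDocument) {
		super(solrDocument);
	}

	// This mapper accepts every Solr document.
	@Override
	public boolean isApplicable() {
		return true;
	}

	// Note: Objects.toString returns the string "null" if the field is missing.
	@Override
	public String getLocation() {
		return Objects.toString(getSolrDocument().get(SOLR_FIELD_URL));
	}

	@Override
	public DateTime getLastMod() {
		return parseDate(getSolrDocument(), SOLR_FIELD_LAST_MOD);
	}

	@Override
	public ChangeFreq getChangefreq() {
		return ChangeFreq.DAILY;
	}

	// No explicit priority is provided for these entries.
	@Override
	public BigDecimal getPriority() {
		return null;
	}
}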

A DefaultMapperFactory is also available; it provides default mapper implementations.

Properly writing custom mappers

For every Solr document there should be only one mapper implementation that is applicable. The easiest way to achieve this is to distinguish clearly by the Solr document's node type. The default implementation, for example, always creates default URL entries and never any Google extensions.
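
The rule above can be checked with a small sketch. It does not use the add-on's mapper interfaces: SimpleMapper is a hypothetical, simplified stand-in, the Solr field name nodeType_s and the node type example-nt:news are assumptions, and the real field depends on your Solr schema. The point is only that the catch-all mapper excludes exactly the node types that a specialized mapper claims, so that one mapper applies to each document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; SimpleMapper is NOT the add-on's mapper interface. */
final class MapperSelectionSketch {

    // Assumed Solr field holding the node type; use the field from your own schema.
    static final String NODE_TYPE_FIELD = "nodeType_s";

    /** Hypothetical, simplified stand-in for a mapper. */
    interface SimpleMapper {
        String name();
        boolean isApplicable(Map<String, Object> solrDocument);
    }

    /** A mapper that applies only to the given node types. */
    static SimpleMapper onlyFor(final String name, String... nodeTypes) {
        return create(name, Arrays.asList(nodeTypes), true);
    }

    /** A catch-all mapper that applies to everything except the given node types. */
    static SimpleMapper forAllExcept(final String name, String... nodeTypes) {
        return create(name, Arrays.asList(nodeTypes), false);
    }

    private static SimpleMapper create(final String name, final List<String> types,
            final boolean applicableIfListed) {
        return new SimpleMapper() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public boolean isApplicable(Map<String, Object> doc) {
                boolean listed = types.contains(String.valueOf(doc.get(NODE_TYPE_FIELD)));
                return listed == applicableIfListed;
            }
        };
    }

    /** Returns the names of all mappers that consider themselves applicable. */
    static List<String> applicableMappers(List<SimpleMapper> mappers, Map<String, Object> doc) {
        List<String> names = new ArrayList<>();
        for (SimpleMapper mapper : mappers) {
            if (mapper.isApplicable(doc)) {
                names.add(mapper.name());
            }
        }
        return names;
    }
}
```

A check like applicableMappers is useful in a unit test for your own mappers: for each representative Solr document, the returned list should contain a single name.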

Generating the sitemap xml

After configuring the add-on as described above, requesting an index document with the template type "sitemap" creates the sitemap index and returns the link to the generated sitemap XML.

Pregeneration

Generating sitemaps per request can be slow, since the sitemap for a big site may refer to thousands of documents. Therefore, the Sitemaps module comes with a pregeneration feature that periodically writes the sitemaps for all sites (except the system site).

This is done by an asynchronous task that runs at a fixed interval, given in minutes, after startup. The interval is configured with the property sophora.delivery.sitemap.cacheUpdateInterval and defaults to 30.
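
As a rough illustration of such a periodic task, the sketch below reads the interval from a Properties object and schedules a Runnable with the JDK's ScheduledExecutorService. It is not the add-on's implementation: the class name is hypothetical, and waiting one full interval before the first run is an assumption, since the exact start-up behavior is not specified here.

```java
import java.util.Properties;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; not the add-on's pregeneration code. */
final class PregenerationSchedulerSketch {

    static final String INTERVAL_PROPERTY = "sophora.delivery.sitemap.cacheUpdateInterval";
    static final long DEFAULT_INTERVAL_MINUTES = 30;

    /** Reads the interval in minutes, falling back to the documented default of 30. */
    static long intervalMinutes(Properties config) {
        String value = config.getProperty(INTERVAL_PROPERTY,
                String.valueOf(DEFAULT_INTERVAL_MINUTES));
        return Long.parseLong(value.trim());
    }

    /**
     * Schedules the task repeatedly. The initial delay of one interval is an
     * assumption of this sketch. The interval must be greater than zero.
     */
    static ScheduledFuture<?> schedule(ScheduledExecutorService executor, Runnable task,
            long intervalMinutes) {
        return executor.scheduleWithFixedDelay(task, intervalMinutes, intervalMinutes,
                TimeUnit.MINUTES);
    }
}
```

The sketch uses scheduleWithFixedDelay so that a slow generation run cannot overlap with the next one; whether the add-on makes the same choice is not stated in this documentation.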

The pregeneration only generates sitemaps for the sites, that is, the top-level structure nodes in your Sophora repository. You can also generate sitemaps for individual non-site structure nodes, but this happens only on request and is most likely not needed.

Cache-Directories for the pregeneration

The pregeneration has to put the generated files into the same cache directories in which a request-driven generation would have put them. The specific file names depend on how your Apache rewrites requests to Tomcat.

By default, the cache root directory sophora.delivery.cache.directory/htdocs is used, with a subdirectory for each site.

If the path for a site within this cache directory is not identical to the site's name, you can override it with the property sophora.delivery.sitemap.<SITENAME>.cachesubroot.

For example, if the proper cache directory for the site "demosite" is "demo/site", add this to your configuration:

sophora.delivery.sitemap.demosite.cachesubroot=demo/site
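
The lookup described in this section can be sketched as follows. The property names are the ones documented on this page; the class and method names are hypothetical, and the code only illustrates how the three properties combine into a directory, not how the add-on implements it.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;
import java.util.Properties;

/** Illustrative sketch only; not the add-on's implementation. */
final class SitemapCacheDirSketch {

    /** Resolves the directory that receives the pregenerated sitemaps of a site. */
    static Path cacheDirFor(String site, Properties config) {
        String cacheRoot = Objects.requireNonNull(
                config.getProperty("sophora.delivery.cache.directory"),
                "sophora.delivery.cache.directory must be set");
        // Sub-path below the cache root; documented default is "htdocs".
        String htdocs = config.getProperty("sophora.delivery.sitemap.htdocs", "htdocs");
        // Per-site override; falls back to the site's name.
        String subRoot = config.getProperty(
                "sophora.delivery.sitemap." + site + ".cachesubroot", site);
        return Paths.get(cacheRoot, htdocs).resolve(subRoot);
    }
}
```

With the configuration line above and a cache directory of, say, cache (an example value), cacheDirFor("demosite", config) resolves to cache/htdocs/demo/site; without the override it resolves to cache/htdocs/demosite.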

Paging

This add-on supports paging via the URL parameter p. A typical URL with paging looks like this (here, the fourth page, at index 3, is requested):
http://my-site.subshell.com/live/demosite/trendcities/copenhagen/index~sitemap_p-3.xml
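
Since the parameter is a zero-based index, client code that inspects such URLs has to add 1 to obtain the page number. The following sketch extracts the index with a regular expression. The class name is hypothetical, and treating a URL without the p parameter as the first page (index 0) is an assumption of this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; parses the paging parameter from a sitemap URL. */
final class SitemapPagingSketch {

    // Matches the URL parameter p as it appears in the example URL: "_p-3".
    private static final Pattern PAGE_PARAM = Pattern.compile("_p-(\\d+)");

    /** Returns the zero-based page index, or 0 if the URL has no p parameter (assumption). */
    static int pageIndex(String url) {
        Matcher matcher = PAGE_PARAM.matcher(url);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : 0;
    }

    /** Returns the one-based page number for display purposes. */
    static int pageNumber(String url) {
        return pageIndex(url) + 1;
    }
}
```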

Properties

Property
    Description

sophora.delivery.sitemap.basePackage
    Base Java package in which to search for the implementation of IMapperFactory. (Default: com.subshell.sophora.delivery.sitemap)

sophora.delivery.sitemap.cacheUpdateInterval
    Update interval in minutes after which the XML is invalidated and regenerated. (Default: 30)

sophora.delivery.sitemap.formatXML
    If set to true, the XML output is formatted. Otherwise the XML is written on a single line. (Default: true)

sophora.delivery.sitemap.writeNamespacesFor
    Controls which namespace declarations are written at the start of the generated XML file. You can use it to filter out the namespace declarations for Google's sitemap extensions. Possible values are:
      • news (http://www.google.com/schemas/sitemap-news/0.9)
      • video (http://www.google.com/schemas/sitemap-video/1.1)
      • image (http://www.google.com/schemas/sitemap-image/1.1)
    You can use several values separated by commas. The default is news,image,video.
    You should not use this property unless your custom mapper classes for some of those types never generate entries (that is, their isApplicable method always returns false).
    The namespace declaration for the Sitemaps standard itself is always written and is not affected by this property.

sophora.delivery.cache.directory
    This property is part of the delivery configuration anyway, but it is also used for the pregeneration.

sophora.delivery.sitemap.htdocs
    The sub-path to the actual XML file root within your cache directory. Set to htdocs by default.

sophora.delivery.sitemap.<SITENAME>.cachesubroot
    Use this property for each site (that is, each top-level structure node) in your repository to specify a directory below the htdocs directory. All pregenerated sitemap files for this site are put there. By default, the site's name is used.
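
To illustrate the effect of sophora.delivery.sitemap.writeNamespacesFor, the sketch below maps a property value to the namespace declarations that would appear on the urlset element. The class and method names are hypothetical, and silently ignoring unknown values is a choice of this sketch, not documented behavior of the add-on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not the add-on's implementation. */
final class NamespaceDeclarationsSketch {

    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Namespace URIs of the Google extensions, keyed by the documented property values.
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();
    static {
        EXTENSIONS.put("news", "http://www.google.com/schemas/sitemap-news/0.9");
        EXTENSIONS.put("image", "http://www.google.com/schemas/sitemap-image/1.1");
        EXTENSIONS.put("video", "http://www.google.com/schemas/sitemap-video/1.1");
    }

    /** Maps a comma-separated property value to attribute names and namespace URIs. */
    static Map<String, String> declarationsFor(String writeNamespacesFor) {
        Map<String, String> declarations = new LinkedHashMap<>();
        // The namespace of the Sitemaps standard is always declared.
        declarations.put("xmlns", SITEMAP_NS);
        for (String raw : writeNamespacesFor.split(",")) {
            String key = raw.trim();
            String uri = EXTENSIONS.get(key);
            if (uri != null) { // unknown or empty values are ignored in this sketch
                declarations.put("xmlns:" + key, uri);
            }
        }
        return declarations;
    }
}
```

For the default value news,image,video the result contains four declarations; for the value news it contains only xmlns and xmlns:news.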

Last modified on 10/16/20

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
