SEO Check | Version 4

SEO Check Administration

Learn how to install, configure and run SEO Check.

Overview

The backend of the SEO Check is a Spring Boot Java app with an embedded Tomcat. It primarily offers a REST interface at http://localhost:8080/rest/...

The frontend of the SEO Check at http://localhost:8080/... is a React app that fetches the data from the backend via REST.

Requirements

SEO Check needs Java 11 (other versions, or the use of an embedded Java, are available upon request). It also depends on Sophora 4 or higher (other versions upon request).

Installing and Running

SEO Check consists of a single JAR file. Both frontend and backend are integrated into the file. To start SEO Check, run the following:

java -jar JARFILE

The configuration file application.yaml must be placed in the current working directory.
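
Because the configuration file is looked up in the current working directory, it can help to verify its presence before starting the JAR. The following is a minimal Python sketch of such a pre-flight check; the helper name build_launch_command is hypothetical and not part of SEO Check.

```python
from pathlib import Path


def build_launch_command(jar_file, working_dir):
    """Return the 'java -jar' command after checking for application.yaml.

    working_dir is the directory the process will be started from.
    This helper is illustrative only and not part of SEO Check.
    """
    config = Path(working_dir) / "application.yaml"
    if not config.is_file():
        raise FileNotFoundError(f"Missing configuration file: {config}")
    # The command itself is the one given above: java -jar JARFILE
    return ["java", "-jar", jar_file]
```

The returned list can be passed to subprocess.run together with cwd=working_dir, so that the process starts in the directory that holds application.yaml.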

Configuration

The following is an example application.yaml file:

# Settings for the Sophora server connection.
sophora:
  client:
    server-connection:
      urls: https://sophora.example.com:1196
      username: user
      password: pass

# Settings for the SEO Check.
seo-check:
  # Settings for suggested documents.
  suggested-documents:
    # A list of nodetypes. Only documents matching this list will be suggested.
    nodetypes: subshell-content-nt:article

  # Settings for suggested images.
  suggested-images:
    # A list of nodetypes. Only documents matching this list will be suggested.
    nodetypes: subshell-content-nt:imageObject

  # Settings for suggested keywords.
  suggested-keywords:
    # A map of language codes to file system paths of stop word lists.
    stop-word-files:
      en: src/main/example-resources/suggested-keywords/stop-words/stop-words_en.txt
      de: src/main/example-resources/suggested-keywords/stop-words/stop-words_de.txt

    # File system path of a deny list of words for the German word splitter.
    #
    # For example, the word "Teletext" is usually split into "Tele" and "Text".
    # To prevent this, add "Teletext" to the deny list.
    german-split-word-deny-file: src/main/example-resources/suggested-keywords/split-words-deny/split-words-deny_de.txt

    # Settings for extracting keywords from a document's copy text.
    copytext:
      # A list of paragraph styles. Only paragraphs using these styles will be considered.
      paragraph-styles: intro, paragraph, paragraphDocArticle, h2, h3, h4, h5, h6, h7

  # Optional settings for Google.
  google:
    # Settings for the Search Console API.
    search-console:
      # The web site's property name in Google Search Console.
      property-name: sc-domain:example.com
      # File system path to a JSON key file for a service account that can access the Google Cloud project.
      service-account-key-file: key-123456-1234567890ab.json

  # Optional settings for Bing.
  bing:
    # Settings for the Webmaster Tools.
    webmaster-tools:
      # API key to access the Webmaster Tools API.
      api-key: 1234567890abcdef1234567890abcdef

  links:
    # Optional list of 'internal' domains. Links to these domains will be considered as internal links
    # instead of external links.
    internal-domains: example.com, example2.com, example3.com

    # Optional setting specifying where to show links from link documents (sophora-extension-nt:link node type)
    # that don't contain a URL.
    #
    # Possible values:
    #
    # ignore - Link documents without a URL will be ignored. (default)
    # internal - Link documents without a URL will be considered as internal links.
    # external - Link documents without a URL will be considered as external links.
    missing-url: ignore

    # Optional text to display instead of a link document's URL when the link document doesn't contain a URL.
    # This overrides the default text.
    #
    # Does not apply when missing-url=ignore.
    missing-url-alt-text: URL missing

  # A list of configurations for the SEO Check.
  #
  # Each configuration consists of a "match" section, an "input" section, and a "checks" section.
  # If a document matches a configuration's "match" section, it will be checked according to the
  # configuration's "checks" section.
  configurations:
    # The "match" section configures the set of documents this configuration is for.
    - match:
        # A list of nodetypes.
        nodetypes: subshell-content-nt:article
        # A list of structure nodes.
        structure-nodes: /subshell-de, /sophora-docs-de

      # The "input" section configures which properties/child nodes of the document to consider.
      input:
        # The language of all documents in this configuration. Must be an ISO 639-1 language code,
        # such as "de" or "en".
        language: de

        # A list of referenced image document configurations.
        images:
          # An arbitrary unique ID for this image.
          - id: main
            # An arbitrary label for this image.
            label: Main Image

            # An XPath expression that will dereference the image document.
            # Evaluation of this expression starts at the document referencing the image document.
            document-path: /@subshell-content:mainImage/sophora:deref
            # An XPath expression that will return the image's caption property.
            # Evaluation of this expression starts at the referenced image document.
            caption-path: /@subshell-content:caption
            # An XPath expression that will return the image's alternative text property.
            # Evaluation of this expression starts at the referenced image document.
            alt-text-path: /@subshell-content:altText

        # A list of property configurations.
        properties:
          # An arbitrary unique ID for this property.
          - id: headline
            # An arbitrary label for this property.
            label: Headline

            # An XPath expression that will return the property.
            # Evaluation of this expression starts at the document containing the property.
            path: /@subshell-content:headline

        # The copytext configuration.
        copytext:
          # An XPath expression that returns the copytext child node.
          # Evaluation of this expression starts at the document containing the copytext.
          path: /subshell-content:articleBody

          # Paragraph configuration.
          paragraph:
            # An optional list of referenced document configurations regarded as 'links'.
            # These are considered in addition to regular text links in the copytext.
            links:
              # A single referenced document configuration.
              #
              # 'path' is an XPath expression that will dereference the link document.
              # Evaluation of this expression starts at the paragraph child nodes.
              #
              # 'text-path' is an XPath expression that returns the property containing the 'link text'.
              # Evaluation of this expression starts at the referenced document.
              - path: /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:eventRef']/@sophora:reference/sophora:deref
                text-path: /@subshell-content:title
              - path: /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:articleRef']/@sophora:reference/sophora:deref
                text-path: /@subshell-content:headline

            # Paragraph image configuration.
            image:
              # An arbitrary unique ID for paragraph images. "copytext" is usually fine.
              id: copytext

              # An XPath expression that will dereference paragraph images.
              # Evaluation of this expression starts at the paragraph child nodes.
              document-path: /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:imageObjectRef']/@sophora:reference/sophora:deref

              # Alternatively, you can use document-paths to specify multiple expressions:
              #document-paths:
              #  - /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:imageObjectRef']/@sophora:reference/sophora:deref
              #  - /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:galleryRef']/@sophora:reference/sophora:deref/subshell-content:imageObject/@sophora:reference/sophora:deref

              # An XPath expression that will return the image's caption property.
              # Evaluation of this expression starts at the referenced image document.
              caption-path: /@subshell-content:caption

              # An XPath expression that will return the image's alternative text property.
              # Evaluation of this expression starts at the referenced image document.
              alt-text-path: /@subshell-content:altText

            # Paragraph video configuration.
            video:
              # An XPath expression that will dereference paragraph videos.
              # Evaluation of this expression starts at the paragraph child nodes.
              document-path: /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:videoObjectRef']/@sophora:reference/sophora:deref
              # An XPath expression that will return the video's caption property.
              # Evaluation of this expression starts at the referenced video document.
              caption-path: /@subshell-content:headline

            # Paragraph audio configuration.
            audio:
              # An XPath expression that will dereference paragraph audios.
              # Evaluation of this expression starts at the paragraph child nodes.
              document-path: /*[@jcr:primaryType='subshell-content-nt:container']/*[@jcr:primaryType='subshell-content-nt:audioObjectRef']/@sophora:reference/sophora:deref
              # An XPath expression that will return the audio's caption property.
              # Evaluation of this expression starts at the referenced audio document.
              caption-path: /@subshell-content:headline

        # Settings for referencing documents.
        referencing-documents:
          # A list of nodetypes. Only documents matching this list will be regarded when looking for referencing documents.
          nodetypes: subshell-content-nt:article

        # Settings for the document's title.
        title:
          # Settings for unique document titles.
          unique:
            # A list of nodetypes. Only documents matching this list will be regarded when looking for documents having the same title.
            nodetypes: subshell-content-nt:article
            # The name of the property to use to compare titles (optional).
            # If not specified, the property configured in the document types will be used.
            property: subshell-content:title

      # The "checks" section configures the various SEO checks.
      checks:
        # A filesystem path pointing to a file containing a list of stop words.
        stop-words-file: stopwords/stopwords_de.txt
        # A classpath path pointing to a file containing a list of transition words.
        transition-words-file: transition-words/transition-words_de.txt
        # A filesystem path pointing to a file containing a list of blocked words.
        blocked-words-file: blocked-words.txt
        # A filesystem path pointing to a file containing a list of synonyms.
        synonyms-file: synonyms.txt

        # Configures whether to apply stemming or not.
        use-stemming: true

        # Disabled checks configuration.
        #
        #disabled:
        #  keywords:
        #    - keyphrase.stopWords
        #  copytext:
        #    - readability

        # Check weights configuration.
        #
        # The default weight for all checks is 1. If a check's weight is configured to be greater
        # than the median weight of all checks, it will be marked as "high priority."
        weights:
          # Every check is part of a group of checks. This is the "keywords" group.
          keywords:
            # A regular check weight configuration. This configures the check "keyphrase.length" in the
            # "keywords" group with a weight of 2.
            keyphrase.length: 2

          # Check weight configurations for the "document" group.
          document:
            # A property check weight configuration. The IDs of property checks follow the form
            # "property.<property ID>". The <property ID> is one of the unique property IDs configured
            # in the "input" section.
            property.headline: 10

          # Check weight configurations for the "copytext" group.
          copytext:
            # This configures the check "length" in the "copytext" group with a weight of 10.
            length: 10

        # Various configurations used by SEO checks.
        document:
          # For each check that checks the number of matching keywords, this configures the number of required keywords.
          #
          # This is a map consisting of entries of the form:
          #
          #   <number of keywords in keyword set>: <minimum>
          #
          # <minimum> can be an integer specifying the minimum number of required keywords. For example,
          # the entry
          #
          #   '3': 2
          #
          # specifies that for a keyword set containing at least 3 keywords, at least 2 of those keywords must match.
          #
          # <minimum> can alternatively be a percentage, such as "80%". For example, the entry
          #
          #   '5': 80%
          #
          # specifies that for a keyword set containing at least 5 keywords, at least 80% of those keywords must match.
          #
          # If there is no map entry for the exact number of keywords in the keyword set, entries for smaller
          # numbers of keywords will be checked in descending order. If no entry matches, <minimum> defaults
          # to "100%".
          #
          # NOTE: Keys of this map must be strings instead of plain integers.
          min-required-keywords:
            '1': 1
            '2': 2
            '4': 3
            '5': 80%

          # Configuration for checking the keyphrase itself.
          keyphrase:
            # Configures the acceptable number of words in the keyphrase (minimum and maximum.)
            words:
              min: 1
              max: 4

          # Configuration for checking the ID stem of a document.
          id-stem:
            # Configures the acceptable number of characters in the ID stem (maximum only.)
            characters:
              max: 40

          # Configuration for checking properties of a document.
          property:
            # Configuration for a single property. This must be one of the unique property IDs
            # configured in the "input" section.
            headline:
              # Configuration for the property length.
              length:
                # Configures the acceptable number of characters in the property (minimum and maximum.)
                chars:
                  min: 10
                  max: 50

          # Configuration for checking the copytext of a document.
          copytext:
            # Configuration for headlines in the copytext.
            headline:
              # A list of paragraph styles used in headline paragraphs.
              styles: h2, h3, h4, h5, h6, h7

            # Configures the acceptable number of words in the copytext (minimum only.)
            words:
              min: 300

            # Configuration for sentences in the copytext.
            sentence:
              # Configuration for words in copytext sentences.
              words:
                # Configures the acceptable number of words in copytext sentences (maximum only.)
                max: 30

                # Configures the acceptable fraction of sentences considered "long" (maximum only.)
                # Range: 0.0-1.0
                long-fraction:
                  max: 0.25

                # Configures the acceptable fraction of sentences that are allowed to start with
                # the same words (maximum only.)
                # Range: 0.0-1.0
                start-fraction:
                  max: 0.1

              # Configures the acceptable fraction of sentences that should contain a transition word (minimum only.)
              # Range: 0.0-1.0
              transition-fraction:
                min: 0.2

            # Configuration for paragraphs in the copytext.
            paragraph:
              # Configures the acceptable number of words in copytext paragraphs (maximum only.)
              words:
                max: 100

              # Configuration for the first paragraph in the copytext.
              first:
                # Optional list of acceptable paragraph styles for the 'first' paragraph.
                # For example, this can be used if the real first paragraph is a headline that should not be considered.
                styles: paragraph, paragraphDocArticle

            # Configures the acceptable score for the Flesch reading ease test (minimum only.)
            flesch-score:
              min: 60.0

            # Configuration for copytext sections. A section is the text between two headlines.
            section:
              # Configures the acceptable number of words in a copytext section (maximum only.)
              words:
                max: 300

            # Configuration for the appearance of the keyphrase in the copytext.
            keyphrase:
              # Configures the acceptable keyphrase density in the copytext (minimum and maximum.)
              density:
                min: 0.005
                max: 0.03

            # Configuration for audios and videos in copytext paragraphs.
            audio-video:
              # Configures the acceptable number of audios and videos in paragraphs (minimum only.)
              count:
                min: 1

            # Configuration for audios in copytext paragraphs.
            audio:
              # Configures the acceptable number of audios in paragraphs (maximum only.)
              count:
                max: 1

              # Configuration for captions of audios.
              caption:
                # Configuration for message texts used in the SEO Check user interface.
                messages:
                  captions: Captions

            # Configuration for videos in copytext paragraphs.
            video:
              # Configures the acceptable number of videos in paragraphs (maximum only.)
              count:
                max: 1

              # Configuration for captions of videos.
              caption:
                # Configuration for message texts used in the SEO Check user interface.
                messages:
                  captions: Captions

          # Configuration for referencing documents.
          referencing-documents:
            # Configures the acceptable number of referencing documents (minimum only.)
            documents:
              min: 1

          # Configuration for images.
          images:
            # Configures the acceptable number of images (minimum and maximum.)
            count:
              min: 1
              max: 3

            # Configures whether any or all images must have keywords in their caption.
            # Possible values: 'any' (default), 'all'
            keywords-caption-match: any

            # Configures whether any or all images must have keywords in their alt text.
            # Possible values: 'any' (default), 'all'
            keywords-alt-text-match: any

            # Configures whether any or all images must have keywords in their Sophora ID.
            # Possible values: 'any' (default), 'all'
            keywords-sophora-id-match: any

          # Configuration for document titles.
          title:
            # Configuration for uniqueness of document titles.
            unique:
              # Configuration for message texts used in the SEO Check user interface.
              messages:
                title: Document title
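
The lookup rule described in the comments for min-required-keywords can be sketched as follows. This is a minimal Python illustration and not SEO Check's actual implementation: the function name required_matches is hypothetical, and rounding percentages up to the next whole keyword is an assumption, since the configuration comments do not say how fractional results are rounded.

```python
import math
from fractions import Fraction


def required_matches(min_required, keyword_count):
    """Resolve how many keywords of a keyword set must match.

    min_required maps string keys ("number of keywords in keyword set")
    to either an integer or a percentage string such as "80%".
    """
    entries = {int(key): value for key, value in min_required.items()}
    # Use the exact entry if present, otherwise the next smaller one.
    applicable = [size for size in entries if size <= keyword_count]
    # Documented default when no entry applies: "100%".
    minimum = entries[max(applicable)] if applicable else "100%"

    if isinstance(minimum, str) and minimum.endswith("%"):
        fraction = Fraction(minimum[:-1]) / 100
        # Assumption: fractional results are rounded up.
        return math.ceil(fraction * keyword_count)
    return int(minimum)


# The example map from the configuration above.
example = {"1": 1, "2": 2, "4": 3, "5": "80%"}
```

With the example map, a keyword set of 3 keywords falls back to the '2' entry and requires 2 matches, while a keyword set of 5 keywords uses the '5' entry and requires 4 matches.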

List of Checks

The following is a complete list of all checks and their IDs along with their groups:

Group           Check ID                        Checks for ...
keywords        keyphrase.length                Number of Keywords in Keyword Set
keywords        keyphrase.stopWords             Stop Words in Keyword Set
keywords        keyphrase.inUse                 Uniqueness of Keyphrase
document        sophoraID.stem.length           Length of ID Stem
document        sophoraID                       Keywords in Sophora ID
document        sophoraID.block                 Blocked Keywords in Sophora ID
document        sophoraID.stem.unique           Uniqueness of ID Stem
document        title.unique                    Uniqueness of title
document        property.<property ID>          Keywords in Property
document        property.length.<property ID>   Length of Property
copytextLinks   links.internal                  Internal Text Links
copytextLinks   links.external                  External Text Links
copytextLinks   links.internal.keyphrase        Keywords in Internal Text Links
copytextLinks   links.external.keyphrase        Keywords in External Text Links
copytextLinks   referencingDocuments            References in Other Documents
copytext        length                          Length of Text
copytext        sentence.length                 Length of Sentences
copytext        paragraph.length.words          Length of Paragraphs
copytext        readability                     Readability
copytext        section.length                  Length of Sections
copytext        density                         Keyword Density
copytext        paragraph.firstParagraph        Keywords in First Paragraph
copytext        headlines                       Keywords in Headlines
copytext        sentence.transitions            Transitions in Sentences
copytext        sentence.start                  Start of Sentences
copytext        strong                          Keywords in Bold Text
images          image.<image ID>                Keywords in Caption of Image
images          image.altText.<image ID>        Keywords in Alternative Text of Image
images          images.copytext                 Keywords in Captions of Images in Copy Text
images          images.altText.copytext         Keywords in Alternative Texts of Images in Copy Text
images          images.sophoraId                Keywords in Sophora IDs of Images in the Document
images          images.count                    Number of Images in the Document
medias          audioVideo.count.copytext       Number of Audios or Videos in Copy Text
medias          audio.count.copytext            Number of Audios in Copy Text
medias          video.count.copytext            Number of Videos in Copy Text
medias          audio.sophoraID.copytext        Keywords in Sophora IDs of Audios in Copy Text
medias          video.sophoraID.copytext        Keywords in Sophora IDs of Videos in Copy Text
medias          audio.caption.copytext          Keywords in Captions of Audios in Copy Text
medias          video.caption.copytext          Keywords in Captions of Videos in Copy Text
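
The groups and check IDs listed here are the keys used in the weights section of application.yaml. The "high priority" rule described in the configuration comments (default weight 1, high priority if the weight is greater than the median) can be sketched in Python as follows. This is an illustration under assumptions, not the actual implementation: the function name is hypothetical, checks are represented as (group, check ID) pairs, and the median is assumed to be taken over all checks passed in rather than per group.

```python
from statistics import median


def high_priority_checks(all_checks, configured_weights):
    """Return the checks whose weight is greater than the median weight."""
    # Every check starts with the documented default weight of 1.
    weights = {check: configured_weights.get(check, 1) for check in all_checks}
    # Assumption: one median across all checks, not one per group.
    threshold = median(weights.values())
    return sorted(check for check, weight in weights.items() if weight > threshold)


# A small subset of checks, using the weights from the example configuration.
checks = [
    ("keywords", "keyphrase.length"),
    ("keywords", "keyphrase.stopWords"),
    ("document", "property.headline"),
    ("copytext", "length"),
    ("copytext", "readability"),
]
weights = {
    ("keywords", "keyphrase.length"): 2,
    ("document", "property.headline"): 10,
    ("copytext", "length"): 10,
}
```

For this subset the weights are 2, 1, 10, 10 and 1, so the median is 2. Only the two checks weighted 10 exceed it; keyphrase.length, with a weight equal to the median, does not.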

Configuring the Preview

To use the SEO Check module, configure a preview with the name "SEO Check" (or your preferred name) in the administration view of the Sophora DeskClient. The URL should look something like this:

http://seocheck.example.com:8080/document/${sophora:id}/?hl=en

To display the SEO Check module in German, change the URL parameter from hl=en to hl=de.
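
To illustrate what the preview URL looks like once the ${sophora:id} placeholder has been filled in, here is a small Python sketch. The function name preview_url and the sample document ID are hypothetical; the placeholder substitution itself is presumably performed by the DeskClient, so this code only mirrors the URL pattern shown above.

```python
from urllib.parse import quote, urlencode


def preview_url(base_url, sophora_id, language="en"):
    """Build a preview URL following the /document/<id>/?hl=<lang> pattern."""
    # Percent-encode the ID so that it is safe as a single path segment.
    path = f"/document/{quote(sophora_id, safe='')}/"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'hl': language})}"


# Hypothetical document ID, German user interface.
url = preview_url("http://seocheck.example.com:8080", "example-article-100", "de")
```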

Example Word Lists in German and English

Last modified on 11/24/21

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
