Importer 4

Importer: Custom Preprocessing Before Importing

Imported data can be preprocessed with custom Java or Groovy code.

A file to be imported can be transformed into Sophora XML using custom Java or Groovy code, which we call a "preprocessor". When using a preprocessor, the input file can contain any content, e.g. XML, JSON or even a JPG image. It is the responsibility of the preprocessor to transform the input file into Sophora XML, which will then be imported as usual.


Any preprocessor must implement the interface com.subshell.sophora.importer.preprocessing.IPreProcessing, which contains the following methods:

/**
 * Pre-processing step of the importer. The implementation reads from the input and writes to the output.
 * @throws PreProcessingException in case some unexpected exception occurs.
 */
void execute(InputStream input, OutputStream output, IErrorTracker errorTracker, Map<String, String> params) throws PreProcessingException;

void setSophoraClient(ISophoraClient sophoraClient);

For convenience, an abstract base implementation (com.subshell.sophora.importer.preprocessing.AbstractPreProcessing) is provided. It wraps all kinds of exceptions into a PreProcessingException and gives access to the Sophora client. The method to implement is:

void executeInternally(...) throws Exception;


A different preprocessor can be configured for every importer instance, so the configuration is part of the instance configuration. The parameter preprocessing.scriptFolder sets the path to the folder containing Groovy scripts, and the parameter preprocessing.className names the class that implements the IPreProcessing interface. For a pure Java implementation, or if you provide compiled Groovy class files, the script folder can be omitted. In this case the classes used for preprocessing must be added to the classpath by putting a jar file into the "additionalLibs" folder. Using compiled Java or Groovy class files is recommended for best performance, as files in the scriptFolder are recompiled for each import.

Example configuration in the application.yml:

    - name: Common
      key: common
      transform: skipTransform
      preprocessing:
        # The class which implements the IPreProcessing interface.
        className: DummyPreprocessor
        # Folder containing Groovy preprocessing scripts.
        # Can be left undefined if the preprocessor class is on the classpath.
        scriptFolder: /cms/importer/common/groovy


Groovy scripts are compiled and reloaded automatically without a restart of the importer. A minimal implementation, which does not transform the file, looks like this:

import com.subshell.sophora.importer.core.utils.IErrorTracker
import com.subshell.sophora.importer.preprocessing.AbstractPreProcessing
// FileUtils is assumed to be the Apache Commons IO class.
import org.apache.commons.io.FileUtils

class DummyPreprocessor extends AbstractPreProcessing {

	// Copies the input file to the output file unchanged.
	public void executeInternally(File inputfile, File outputfile, IErrorTracker errorTracker, Map<String, String> params) {
		FileUtils.copyFile(inputfile, outputfile)
	}
}

It is also possible to compile Groovy classes, assemble them into a jar file and put that jar file into the "additionalLibs" folder. This is recommended for best performance, as scripts are otherwise recompiled for each import.


If written in Java, the preprocessor class must be added to the classpath by putting a jar file into the "additionalLibs" folder.


A file to be imported may not yet be completely written into the watchfolder by the time the import starts. To handle this situation, there is a retry mechanism that tries to detect incomplete XML and triggers another attempt to read the file after a short delay. This mechanism is described in more detail in the importer documentation.

When a preprocessing script is used, deciding whether a file should be regarded as incomplete is the responsibility of the script. To trigger the retry mechanism, the script must call errorTracker.setParseError().

This mechanism is only enabled for imports via watchfolder. When the web service is used, this call just lets the import fail and does not trigger retries.
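
As an illustration of how a preprocessor might decide whether an input file is incomplete, the following sketch checks whether the content parses as well-formed XML. It is only one possible heuristic and applies to XML input only. The class name CompletenessCheck and the method isWellFormedXml are hypothetical helpers and not part of the importer API. Only the JDK's built-in XML parser is used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the Sophora importer API.
public class CompletenessCheck {

    /**
     * Returns true if the given bytes parse as well-formed XML.
     * A truncated (still being written) or empty file makes the parser fail,
     * in which case false is returned.
     */
    public static boolean isWellFormedXml(byte[] content) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(content));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }
}
```

A preprocessor could read the input file, call such a check, and invoke errorTracker.setParseError() when it returns false, so that a watchfolder import is retried after a short delay. For non-XML input such as JSON or images, a different completeness test would be needed.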

Last modified on 10/16/20

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.