
Importer: Import Via Watchfolder

The import process: How to import documents using a watchfolder.

The importer reads files from the watchfolder

The import process starts when one or more files are copied to the configured watchfolder of the importer instance. The watchfolder is determined by the sophora.importer.directory.watchfolder property in the instance configuration file (see Properties in the instance configuration file(s) 'sophora-importer_instance-NNN.properties').

The importer instance processes the incoming files in ascending alphabetical order. By default, only files ending with '.xml' are handled, and files ending with '.config.xml' or '.bin.xml' are ignored. The property sophora.importer.watchfolder.regex.filesToImport can be used to change this default behaviour.
Files in subfolders are processed if the parameter sophora.importer.watchfolder.includeSubfolder is set to true. Files in subfolders are generally processed after files in the main folder. If the file name contains the string pattern "ID" (where ID is a placeholder for an integer) and this ID is assigned the attribute ignore="true" in the site mapping, the file is not imported.
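
The default selection and ordering rules described above can be sketched as follows. This is only an illustration and not the importer's actual implementation: the function names are hypothetical, plain code point ordering stands in for "alphabetical", and neither the filesToImport regex nor the site mapping check is modelled.

```python
from pathlib import Path

# Suffixes that are ignored by default, as described in the text.
IGNORED_SUFFIXES = (".config.xml", ".bin.xml")


def is_import_candidate(name: str) -> bool:
    """Return True for '*.xml' files that are not '*.config.xml' or '*.bin.xml'."""
    return name.endswith(".xml") and not name.endswith(IGNORED_SUFFIXES)


def list_import_candidates(watchfolder: Path, include_subfolders: bool = False) -> list[Path]:
    """List candidate files: main folder first, then (optionally) subfolders."""
    top_level = sorted(
        (p for p in watchfolder.iterdir() if p.is_file() and is_import_candidate(p.name)),
        key=lambda p: p.name,
    )
    if not include_subfolders:
        return top_level
    nested = sorted(
        (
            p
            for p in watchfolder.rglob("*")
            if p.is_file() and p.parent != watchfolder and is_import_candidate(p.name)
        ),
        key=lambda p: p.relative_to(watchfolder).as_posix(),
    )
    # Assumption: subfolder files are simply appended after the main folder's files.
    return top_level + nested
```

Calling list_import_candidates(Path("/foo/incoming"), include_subfolders=True) would return the main folder's '.xml' files in name order, followed by those found in subfolders.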

Importing a file

When importing a file, the Importer checks whether the file is non-empty and contains valid XML. If you are using a preprocessor script, that script can indicate a parse error by calling errorTracker.setParseError(). If the content is not valid and the modification date of the file is within the last ten seconds, the importer assumes that the file is still being written and retries the import after ten seconds.

The importer makes a new attempt to import the file every ten seconds until one of the following two conditions is met:

  • The modification date of the file is older than the date of the last attempt to import the file.
  • The file is non-empty and contains syntactically valid XML.

If the file has not changed within the last ten seconds and is still syntactically incorrect, the importer no longer assumes that it is being written, and the import ultimately fails. If, on the other hand, the content of the file is valid XML, the import proceeds.
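
The retry rule can be summarised as a single decision per attempt. The sketch below is a simplified model rather than the importer's code: the names decide and Decision are hypothetical, well-formedness is checked with Python's standard XML parser, and "changed recently" is approximated by comparing the modification time with the current time.

```python
import xml.etree.ElementTree as ET
from enum import Enum

RETRY_INTERVAL_SECONDS = 10  # interval named in the text


class Decision(Enum):
    IMPORT = "import"  # file is non-empty, well-formed XML
    RETRY = "retry"    # file may still be being written
    FAIL = "fail"      # file is invalid and no longer changing


def is_well_formed(content: bytes) -> bool:
    """Return True if the content is non-empty and parses as XML."""
    if not content.strip():
        return False
    try:
        ET.fromstring(content)
        return True
    except ET.ParseError:
        return False


def decide(content: bytes, modified_at: float, now: float) -> Decision:
    """Decide what to do with a file at one import attempt (times in seconds)."""
    if is_well_formed(content):
        return Decision.IMPORT
    if now - modified_at <= RETRY_INTERVAL_SECONDS:
        return Decision.RETRY
    return Decision.FAIL
```

A caller would invoke decide once per attempt and wait ten seconds whenever it returns Decision.RETRY.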

The Importer checks whether the XML at hand is valid Sophora XML. If that is the case, it starts importing the document. If not, the Importer may execute an XSL transformation to generate valid Sophora XML from the source XML (see XSL Transformation Before Importing).

Next, the (previously generated) Sophora XML is parsed (see Composition of the Import XML). Based on the extracted information, new documents are created in the Sophora repository or existing ones are updated.

Success or failure

If the import process fails, the source file will be moved to the failure directory (configured by the sophora.importer.directory.failure property in the instance configuration file - see Properties in the instance configuration file(s) 'sophora-importer_instance-NNN.properties'). Additionally, this folder will contain an error protocol file named in the same way as the source file plus a timestamp.

If the import process finishes successfully, the source file is moved to the successful directory as configured by the sophora.importer.directory.successful property in the instance configuration file - see Properties in the instance configuration file(s) 'sophora-importer_instance-NNN.properties'. If binary files were involved in the import process, they are moved to the successful directory as well. Nonetheless, during a successful import there may be minor problems which did not prevent the import. Such problems will be logged to an error protocol file named in the same way as the source file plus a timestamp.

Automatically deleting old files

Old files from the success and failure folders can be deleted automatically using a configuration such as this:

sophora.importer.cleanupFolders.cron=0 0 0 * * ? * 
sophora.importer.cleanupFolders.successful.maxAge=20
sophora.importer.cleanupFolders.failure.maxAge=20

With this example, the importer recursively deletes all files from the success and failure folders of all importer instances that are older than 20 days. Empty subfolders will also be deleted. This process will happen each day at midnight, as specified by the cron expression.

The maxAge properties can be set globally in the sophora-importer.properties or in each instance. The configuration in the instance overrides the global one. Set the maxAge to 0 in an instance to disable deleting old files for this instance.
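
To make the cleanup behaviour concrete, here is a minimal sketch of an age-based recursive cleanup. It is an illustration under assumptions, not the importer's implementation: the function name cleanup_folder is hypothetical, file age is taken from the modification time, and scheduling via the cron expression is left out.

```python
import time
from pathlib import Path
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def cleanup_folder(folder: Path, max_age_days: int, now: Optional[float] = None) -> list[Path]:
    """Delete files older than max_age_days below folder and remove empty subfolders.

    A max_age_days of 0 disables the cleanup, mirroring the maxAge setting.
    Returns the list of deleted files.
    """
    if max_age_days <= 0:
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    deleted = []
    # Visit the deepest paths first so that a directory is checked after its contents.
    for path in sorted(folder.rglob("*"), key=lambda p: len(p.parts), reverse=True):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
        elif path.is_dir() and not any(path.iterdir()):
            path.rmdir()
    return deleted
```

With the example configuration, this would correspond to calling cleanup_folder on the success and failure folders with max_age_days=20 once a day.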

When this feature is combined with patterns in the configuration of the success and failure folders, there are some considerations. Let's say an importer instance uses this configuration:

sophora.importer.directory.watchfolder=/foo/incoming
sophora.importer.directory.successful=/foo/${date;yyyy.MM}/success/
sophora.importer.directory.failure=/foo/${date;yyyy.MM}/failure
sophora.importer.directory.temp=/foo/temp

This configuration creates a new directory each month, containing that month's successful and failed imports. When patterns are used, the deletion process cannot know which folders in /foo were created by the patterns. Therefore, the deletion process takes the part of the path before the first pattern as the folder to clean. In the example above, all files and folders in /foo are processed, except for the incoming and temp folders, which the deletion process explicitly ignores. If someone were to create the file /foo/Readme.txt, it would be deleted after 20 days.
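
The rule "take the part before the first pattern" can be sketched as follows. Both function names are hypothetical, and how the importer treats a pattern that starts in the middle of a path segment is not stated above; this sketch assumes only complete segments before the pattern are kept.

```python
from pathlib import PurePosixPath
from typing import Sequence


def cleanup_root(configured_dir: str) -> PurePosixPath:
    """Return the folder to clean: the path before the first '${...}' pattern."""
    head, marker, _ = configured_dir.partition("${")
    if not marker:
        # No pattern: the configured directory itself is cleaned.
        return PurePosixPath(configured_dir)
    # Assumption: keep only complete path segments in front of the pattern.
    return PurePosixPath(head[: head.rfind("/") + 1] or ".")


def is_cleaned(path: str, root: PurePosixPath, ignored: Sequence[str]) -> bool:
    """Return True if path lies below root and outside every ignored folder."""
    candidate = PurePosixPath(path)
    if root not in candidate.parents:
        return False
    for folder in ignored:
        ignored_path = PurePosixPath(folder)
        if candidate == ignored_path or ignored_path in candidate.parents:
            return False
    return True
```

For the configuration above, cleanup_root("/foo/${date;yyyy.MM}/success/") yields /foo, and with the watchfolder and temp folder passed as ignored, /foo/Readme.txt is subject to cleanup while /foo/incoming/a.xml is not.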

Last modified on 7/26/19

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
