Importer | Version 4

Importer: Import Via Watchfolder

The import process: How to import documents using a watchfolder.

The import process starts when one or more files are copied to the configured watchfolder of the importer instance.

The importer instance processes the incoming files in ascending alphabetical order. By default, only files ending with '.xml' are handled, and files ending with '.config.xml' or '.bin.xml' are ignored. The configuration option watchFilesRegex changes this default behaviour.

Files in subfolders are processed only if the configuration option watchRecursive is set to true. Files in subfolders are generally processed after files in the main folder.
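The selection and ordering rules above can be sketched in Python as follows. This is an illustration only, not the importer's implementation: the regular expression is an assumed approximation of the documented default (the actual default value of watchFilesRegex is not given here), the function names are hypothetical, and plain string ordering stands in for "alphabetical", whose exact collation is not specified.

```python
import re
from pathlib import PurePosixPath

# Assumed approximation of the documented default: accept '*.xml',
# but reject '*.config.xml' and '*.bin.xml'.
DEFAULT_WATCH_FILES = re.compile(r"^(?!.*\.(config|bin)\.xml$).*\.xml$")


def is_import_candidate(file_name):
    """Return True if the file name would be picked up under the default rule."""
    return DEFAULT_WATCH_FILES.match(file_name) is not None


def processing_order(relative_paths, recursive=False):
    """Order paths (relative to the watchfolder) the way the text describes.

    Files in the main folder come first, in ascending order. Files in
    subfolders follow, and only when 'recursive' is True.
    """
    paths = [PurePosixPath(p) for p in relative_paths]
    candidates = [p for p in paths if is_import_candidate(p.name)]
    top_level = sorted(p for p in candidates if len(p.parts) == 1)
    nested = sorted(p for p in candidates if len(p.parts) > 1) if recursive else []
    return [str(p) for p in top_level + nested]
```

For example, with watchRecursive enabled, the files b.xml, sub/a.xml, a.xml and a.config.xml would be handled in the order a.xml, b.xml, sub/a.xml, and a.config.xml would be skipped.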

Importing a file

Importing a file comprises the following steps:

  • The file is given to the preprocessor script, if configured for this instance.
  • The result is given to the XSL transformation, if configured for this instance.
  • The importer then checks whether the result is valid Sophora XML.
  • Based on the Sophora XML, documents in the Sophora server will be created or updated.
  • When the import is finished, the input file, along with intermediate files created by the preprocessor or XSLT, and additional files referenced from the import file, are moved to the success or failure folder.
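The order of these steps can be summarised in a short Python sketch. All names here are hypothetical placeholders: the callables stand in for the preprocessor script, the XSL transformation, the Sophora XML validation and the server update, none of which are shown on this page. The sketch only illustrates the sequence and the success/failure outcome.

```python
def import_file(content, preprocess=None, transform=None,
                validate=lambda xml: True, store=lambda xml: None):
    """Run the documented steps in order and return the target folder name.

    'preprocess' and 'transform' are optional, mirroring "if configured
    for this instance". Any error in any step leads to 'failure'.
    """
    try:
        if preprocess is not None:
            content = preprocess(content)   # step 1: preprocessor script
        if transform is not None:
            content = transform(content)    # step 2: XSL transformation
        if not validate(content):           # step 3: valid Sophora XML?
            raise ValueError("result is not valid Sophora XML")
        store(content)                      # step 4: create/update documents
    except Exception:
        return "failure"                    # step 5: files go to the failure folder
    return "success"                        # step 5: files go to the success folder
```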

Check for incomplete files

The importer uses heuristics to determine whether a file in the watchfolder is incomplete and still being written to by another process. Before importing a file, and if the instance does not use a preprocessor script, the importer checks whether the file is non-empty and contains valid XML. If you are using a preprocessor script, that script can indicate a parse error by calling errorTracker.setParseError(). If the content is not valid and the modification date of the file is within the last ten seconds, the importer assumes that the file is still being written and retries the import after ten seconds.

The importer makes a new attempt to import the file every ten seconds until one of the following two conditions is met:

  • The modification date of the file is older than the date of the last attempt to import the file.
  • The file is non-empty and contains syntactically valid XML.

If the file has not changed within the last ten seconds and is still syntactically incorrect, it is no longer regarded as being written and the import ultimately fails. On the other hand, if the content of the file is valid XML, or the preprocessor script does not indicate a parse error, the import proceeds.
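The decision described in this section can be sketched in Python for the case without a preprocessor script. This is a simplified model under stated assumptions, not the importer's code: the function names are hypothetical, timestamps are plain seconds, and Python's standard XML parser stands in for whatever well-formedness check the importer actually performs.

```python
import xml.etree.ElementTree as ET

RETRY_INTERVAL_SECONDS = 10  # the ten-second window named in the text


def is_well_formed(data):
    """Return True if 'data' (bytes) is non-empty, syntactically valid XML."""
    if not data.strip():
        return False
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


def decide(data, modified_at, now):
    """Return 'import', 'retry' or 'fail' for a file in the watchfolder.

    'modified_at' and 'now' are timestamps in seconds.
    """
    if is_well_formed(data):
        return "import"   # content is complete, proceed
    if now - modified_at <= RETRY_INTERVAL_SECONDS:
        return "retry"    # recently modified, probably still being written
    return "fail"         # unchanged and still invalid
```

For instance, a truncated file such as `<a>` that was modified five seconds ago yields 'retry', while the same content left unchanged for twenty seconds yields 'fail'.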

Success or failure

If the import process fails, the source file will be moved to the failure directory. Additionally, this folder will contain an error protocol file named in the same way as the source file plus a timestamp.

If the import process finishes successfully, the source file is moved to the success directory. During a successful import there may be minor problems which did not prevent the import. Such problems will be logged to an error protocol file named in the same way as the source file plus a timestamp.

Automatically deleting old files

Old files from the success and failure folders can be deleted automatically using a configuration such as this:

importer:  
  cleanupFoldersCron: "0 0 0 * * ? *"
  cleanupFoldersSuccessfulMaxAge: 20
  cleanupFoldersFailureMaxAge: 20

With this example, the importer recursively deletes all files older than 20 days from the success and failure folders of all importer instances. Empty subfolders are also deleted. This process runs each day at midnight, as specified by the cron expression.

The max-age properties can be set globally or in each instance. The configuration in the instance overrides the global one. Set the max-age to 0 in an instance configuration to disable deleting old files for this instance.
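The override rule can be illustrated with a small Python sketch. The function names are hypothetical, and two details are assumptions not stated on this page: that a missing or zero global value also means "no deletion", and that a file's age is measured from a single timestamp such as its modification time.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def effective_max_age_days(global_days, instance_days=None):
    """Resolve the max-age for one instance; None means cleanup is disabled.

    The instance value, when set, overrides the global one. A value of 0
    disables deletion for the instance.
    """
    days = global_days if instance_days is None else instance_days
    if days is None or days <= 0:
        return None
    return days


def is_expired(modified_at, now, max_age_days):
    """Return True if a file is older than the resolved max-age."""
    if max_age_days is None:
        return False
    return now - modified_at > max_age_days * SECONDS_PER_DAY
```

With a global value of 20, an instance that sets 5 gets files removed after 5 days, and an instance that sets 0 keeps its files indefinitely.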

When this feature is combined with patterns in the configuration of the success and failure folders, there are some considerations. Let's say an importer instance uses this configuration:

folders:
  watch: /foo/incoming
  temp: /foo/temp
  success: /foo/${date;yyyy.MM}/success/
  failure: /foo/${date;yyyy.MM}/failure

This configuration creates a new directory each month, containing the successful and failed imports for that month. When patterns are used, the deletion process cannot know which subfolders in /foo were created due to patterns and which were created by someone else. Therefore, the deletion process takes the part of the path up to the first pattern as the folder to clean. In the example above, this means that all files and folders in /foo are processed, except for the incoming and temp folders, which the deletion process explicitly ignores. If someone were to create the file /foo/Readme.txt, it would be deleted after 20 days.
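How the folder to clean follows from a pattern can be sketched in Python. The helper names are hypothetical, and the sketch assumes that the cut happens at the first path segment containing a `${...}` pattern. How the importer treats a pattern in the middle of a segment is not described here.

```python
from pathlib import PurePosixPath


def cleanup_root(folder_setting):
    """Return the part of the path before the first segment with a pattern."""
    kept = []
    for part in PurePosixPath(folder_setting).parts:
        if "${" in part:
            break
        kept.append(part)
    return PurePosixPath(*kept)


def is_protected(path, watch, temp):
    """Return True if 'path' is the watch or temp folder, or lies inside one."""
    candidate = PurePosixPath(path)
    for folder in (PurePosixPath(watch), PurePosixPath(temp)):
        if candidate == folder or folder in candidate.parents:
            return True
    return False
```

Applied to the configuration above, `cleanup_root("/foo/${date;yyyy.MM}/success/")` gives /foo, so /foo/Readme.txt is subject to deletion, while anything under /foo/incoming or /foo/temp is skipped.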

Last modified on 10/16/20

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
