Server 4

Archival Storage

Learn about Sophora's extensive options for archiving old documents.

The goal of the archival storage is to remove older versions of documents from the actual repository while keeping them available in a separate archive, which reduces the load on the repository. This is not the same as the archive function within the delivery (which might be "Show me all stories that are 30 days old").

Certain document types (especially homepages) may have thousands of versions. This is not only a matter of disk space but also a crucial factor when publishing these documents (the more versions a document has, the longer it takes to publish it again).

Archived versions of documents are kept in a separate repository (namely repository_archive) so that older versions can be restored easily.

Implementation

The following is an example configuration for enabling an archive repository within the sophora.properties file. You can find a full list of possible properties and their explanations here.

sophora.archive.enabled=true
sophora.archive.deleteOldVersionsAfterDays=370
sophora.archive.activeOnStartup=true

The repository for the archive content will be located next to the main repository (see the directory structure).

The Archive Repository

The archive repository contains only archived content, i.e. older versions of documents. It contains neither user information nor node type configurations. As a consequence, the archive repository cannot be turned into a 'real' main repository used by a Sophora server for daily business. The only data needed from the archived documents' node types are the CND specifications. Without them, it is not possible to copy nodes from the main repository to the archive. Therefore, the CNDs have to be announced at the Sophora Primary Server repository, and changes in these files have to be committed explicitly.

The Procedure

Old versions of a document are moved from the main repository to the archive repository when the total number of document versions in the main repository exceeds the defined threshold. This threshold is given by the property sophora.archive.maxVersionsToRetain. Hence, the main repository retains the latest X versions of the document, where X equals this threshold.
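The retention rule can be illustrated with a minimal Python sketch. This is not Sophora code: the function name split_versions and the representation of a document's versions as a list ordered from oldest to newest are assumptions made purely for illustration.

```python
def split_versions(versions, max_versions_to_retain):
    """Split a document's versions (ordered oldest to newest) into
    (to_archive, retained) according to the threshold rule.

    Hypothetical helper for illustration; not part of the Sophora API.
    """
    if max_versions_to_retain < 0:
        raise ValueError("max_versions_to_retain must not be negative")
    excess = len(versions) - max_versions_to_retain
    if excess <= 0:
        # Threshold not exceeded: nothing is archived.
        return [], list(versions)
    # The oldest `excess` versions are archived; the latest X are retained.
    return list(versions[:excess]), list(versions[excess:])


to_archive, retained = split_versions(["1.0", "1.1", "1.2", "1.3", "1.4"], 3)
print(to_archive)  # ['1.0', '1.1']
print(retained)    # ['1.2', '1.3', '1.4']
```

With a threshold of 3 and five existing versions, the two oldest versions are candidates for the archive and the three latest remain in the main repository.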

The actual work of the archival storage is done by the archive worker, which is described below.

Handling References

When a document version is archived, it is likely that documents referenced by this version are not archived yet, i.e. are not available in the archive repository. Therefore, it is not always possible to store a document's version exactly as it appeared in the main repository.

To solve this problem, each such reference is replaced by an external reference. In more detail: When a document version is read from the main repository and about to be archived, all existing references to other documents are converted into external references. This way, no problems occur when saving this version in the archive repository. When an old version is retrieved from the archive repository, these external references need to be resolved again.
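The conversion can be pictured with a small Python sketch. The data model here (Reference, Version, a dictionary standing in for a repository) is hypothetical and does not reflect Sophora's internal classes; in particular, resolving against a plain dictionary of known documents is an assumption, since the text does not specify how resolution is carried out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    target_uuid: str        # UUID of the referenced document
    external: bool = False  # True once converted for the archive


@dataclass
class Version:
    document_uuid: str
    label: str
    references: list = field(default_factory=list)


def to_archivable(version):
    """Return a copy of the version in which every reference is external,
    so it can be stored even if the targets are missing from the archive."""
    converted = [Reference(r.target_uuid, external=True) for r in version.references]
    return Version(version.document_uuid, version.label, converted)


def resolve(version, known_documents):
    """Map each referenced UUID to a known document, or None if unavailable.
    `known_documents` is a hypothetical stand-in for a repository lookup."""
    return {r.target_uuid: known_documents.get(r.target_uuid) for r in version.references}


original = Version("doc-1", "1.0", [Reference("doc-2"), Reference("doc-3")])
archived = to_archivable(original)
resolved = resolve(archived, {"doc-2": "Story about elections"})
```

In this sketch, the archived copy carries only external references, and a reference whose target cannot be found resolves to None instead of causing an error.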

Accessing Archived Versions

Displayed Versions in the DeskClient

The document versions shown in the Sophora DeskClient are retrieved from both the main repository and the archive repository. Hence, the entire history of a document can be viewed there.

Repository Transactions

To access the archive repository, you need a separate Sophora session and your own transaction. Transactions on the main repository are generated by annotations within the source code. With this approach, you cannot specify the transaction manager that should be used. Transactions on the archive repository are therefore defined in the Spring configuration using AOP pointcut statements.

The Archive Worker

When a document is published, its UUID is added to the first-in-first-out queue of the archive worker, which moves each version separately. The archive worker takes all document versions exceeding the previously defined threshold and copies them to the archive repository.
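The queue-based behaviour can be sketched in Python as follows. The class ArchiveWorkerSketch, its method names, and the use of dictionaries as stand-ins for the two repositories are assumptions for illustration only; the real archive worker runs inside the Sophora server.

```python
from collections import deque


class ArchiveWorkerSketch:
    """Illustrative model of a FIFO-driven archive worker (not Sophora code)."""

    def __init__(self, max_versions_to_retain):
        self.max_versions_to_retain = max_versions_to_retain
        self.queue = deque()  # FIFO queue of document UUIDs
        self.archive = {}     # uuid -> archived versions, oldest first

    def on_published(self, uuid):
        # Publishing a document enqueues its UUID.
        self.queue.append(uuid)

    def run_once(self, main_repository):
        """Process the oldest queued UUID; return it, or None if idle."""
        if not self.queue:
            return None
        uuid = self.queue.popleft()
        versions = main_repository.get(uuid, [])
        excess = len(versions) - self.max_versions_to_retain
        # Move each excess version separately, oldest first.
        for _ in range(max(excess, 0)):
            self.archive.setdefault(uuid, []).append(versions.pop(0))
        return uuid


main = {"a": [1, 2, 3, 4], "b": [1]}
worker = ArchiveWorkerSketch(max_versions_to_retain=2)
worker.on_published("a")
worker.on_published("b")
first = worker.run_once(main)
second = worker.run_once(main)
third = worker.run_once(main)
```

Document "a" is processed first because it was published first; its two oldest versions end up in the archive, while document "b" stays untouched because it does not exceed the threshold.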

Set the maximum number of versions in the archive repository for a single document

If you want to set a maximum number of versions to keep for a single document in the archive repository, you can use the following property which is configured in the mixin sophora-mix:document:

- sophora:maxVersionsToKeep (long)

Every 5 minutes, the archive worker checks for documents that have this property set and deletes the versions of each of these documents that exceed the given number. The archive worker removes versions only if a smaller number for the remaining versions (sophora.archive.maxVersionsToRetain) is configured in the repository.

As an example, if maxVersionsToRetain is set to 20 and the document property maxVersionsToKeep has a value of 5, 20 versions of the document are still retained. The number refers only to the versions in the archive repository.

Removal

The archive worker deletes old versions from the archive if they are older than the number of days specified by the sophora.archive.deleteOldVersionsAfterDays parameter in the sophora.properties file. However, the archiving process may keep a certain number of versions for all documents, regardless of how old the versions are, if the parameter sophora.archive.preserveNumberOfVersionsInArchiveRepository is set.

More detailed information about how Sophora handles deleted documents can be found in the documentation on deleted documents and trash.
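How the two parameters interact can be sketched in Python. The function versions_to_delete and its arguments are hypothetical, and the sketch assumes that the preserved versions are the newest ones in the archive, which the text does not state explicitly.

```python
from datetime import datetime, timedelta


def versions_to_delete(archived, now, delete_after_days, preserve_count=0):
    """Return the labels of archived versions that would be deleted.

    `archived` is a list of (label, created_at) tuples ordered oldest to
    newest. Hypothetical helper for illustration; not part of Sophora.
    """
    cutoff = now - timedelta(days=delete_after_days)
    # Assumption: the newest `preserve_count` versions are always kept.
    deletable = max(len(archived) - preserve_count, 0)
    return [label for label, created_at in archived[:deletable] if created_at < cutoff]


now = datetime(2024, 1, 1)
archived = [
    ("1.0", datetime(2021, 1, 1)),
    ("1.1", datetime(2022, 1, 1)),
    ("1.2", datetime(2023, 12, 1)),
]
without_preserve = versions_to_delete(archived, now, delete_after_days=370)
with_preserve = versions_to_delete(archived, now, delete_after_days=370, preserve_count=2)
print(without_preserve)  # ['1.0', '1.1']
print(with_preserve)     # ['1.0']
```

Without a preserved count, both versions older than 370 days are removed; with two versions preserved, only the oldest one is deleted even though version 1.1 is also past the age limit.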

Last modified on 10/16/20

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
