Cloud Binary Store

The Cloud Binary Store makes it possible to store binary files more efficiently in an S3 cloud data store.

Configuration

The S3 Cloud Binary Store allows you to store the binary data in any S3-compatible cloud storage, as opposed to keeping it in the DataStore.

These include Google Cloud Storage, Amazon S3, the self-hostable MinIO, or any other compatible product.

This shared binary store is intended to store binary data in a location that is accessible by all Sophora servers. This way, no binary data has to be transferred during replication, which reduces replication time significantly.

sophora.binarystore=s3
sophora.binarystore.s3.credentials.accessKeyId=<accessKeyId>
sophora.binarystore.s3.credentials.secretAccessKey=<secretAccessKey>
sophora.binarystore.s3.host=<host>
sophora.binarystore.s3.bucketName=<bucketName>

It is also possible to configure the accessKeyId and the secretAccessKey via the environment variables SOPHORA_BINARYSTORE_S3_CREDENTIALS_ACCESS_KEY_ID and SOPHORA_BINARYSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY. This way you don't have to store these secrets in your sophora.properties file.
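For example, on a Linux host or in a container environment, the credentials could be provided like this (a minimal sketch; the placeholder values are yours to fill in):

export SOPHORA_BINARYSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=<accessKeyId>
export SOPHORA_BINARYSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=<secretAccessKey>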

The settings from the sophora.properties file take precedence over the environment variables.

Cache size

If S3 is used as the data store, binary data will be cached in the file system. The size of this cache can be configured with the property sophora.binarystore.cacheSize.

sophora.binarystore.cacheSize=1GB

If this property is not set, the size of the cache defaults to 4GB. Examples of valid values are 5GB, 1MB, 3KB, 1000B or 1000 (1000 bytes). Units must be written in capital letters. Numbers with fractional parts, e.g. 1.5GB, are not valid.

General recommendations for cloud storage bucket settings

Most cloud storage providers offer buckets with different possible options regarding location type, storage classes, encryption and access control.

It is important to choose location type and storage class based on SLAs and privacy policy requirements.

  • Most cloud storage providers offer storage locations in Europe and Germany.
  • Sophora accesses the data in the storage frequently. Most default storage classes should be well suited for this.

Specific recommendations for Google Cloud Storage

  • Location type: There are three location types available: region, multi-region and dual-region. The first one has an availability SLA of 99.9%, the others 99.95%. Because of data privacy policies, region might be the best choice, as it allows you to choose a specific location, e.g. europe-west3 (Frankfurt); see the bucket-creation sketch after this list.
  • Storage class: Because the data in the data store is accessed frequently, standard is the best option.
  • Access control: uniform, because the service accounts need the same permissions on each file.
  • Encryption: A Google-managed key is the most convenient option. If you have specific compliance needs, choose a customer-managed key.
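
The recommendations above can be combined when creating the bucket. A minimal sketch using the gsutil CLI; the project ID and bucket name are placeholders, and the flags reflect the choices discussed in this list (location europe-west3, storage class standard, uniform bucket-level access):

gsutil mb -p <projectId> -l europe-west3 -c standard -b on gs://<bucketName>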

Enabling the S3 API in Google Cloud Storage

In addition to Google's own API, Google Cloud Storage offers an S3-compatible API, which needs to be enabled. To do this, open the Google Cloud Platform Console, navigate to Storage and open the settings.

There you need to look for Interoperability and enable the Interoperability API. Then generate the accessKeyId and secretAccessKey required by Sophora and note the request endpoint.
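
Based on this, the Sophora configuration for Google Cloud Storage might look like the following sketch; the host is the request endpoint shown on the Interoperability page (typically storage.googleapis.com), and the credentials are the generated interoperability (HMAC) key pair:

sophora.binarystore=s3
sophora.binarystore.s3.credentials.accessKeyId=<accessKeyId>
sophora.binarystore.s3.credentials.secretAccessKey=<secretAccessKey>
sophora.binarystore.s3.host=storage.googleapis.com
sophora.binarystore.s3.bucketName=<bucketName>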

Usage of cloud data store and filesystem binary store together

It is possible to use the file system binary store on a Sophora Replica Server or Sophora Staging Server while the Sophora Primary Server uses the Cloud Binary Store.

In this case, the Sophora Primary Server has to send additional events that include the binary data. To avoid spending these resources when they are not needed, this mode has to be enabled explicitly.

This is done by adding the following configuration parameter to the configuration of the Sophora Primary Server:

sophora.binarystore.s3.mixed.mode=true

Migration of binary data from Filesystem Binary Store to Cloud Binary Store and vice versa

Filesystem Store to Cloud Binary Store

In this scenario a Sophora Server exists with the DataStore activated and the binary data stored in the file system.

Copy the content of the base directory of your filesystem store into your S3 bucket. No adjustment to the files is needed, since the data is stored in exactly the same structure and format.
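
If you prefer to script the copy, a minimal sketch using the boto3 Python library (not a Sophora tool) could look like this; the endpoint, credentials, bucket name and base directory are placeholders for your own values:

import os
import boto3

# Placeholders: fill in your own endpoint, credentials and bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<host>",
    aws_access_key_id="<accessKeyId>",
    aws_secret_access_key="<secretAccessKey>",
)

base_dir = "/path/to/filesystem-binarystore"  # base directory of the fs store

# Upload every file, using its path relative to the base directory as the
# object key, so the structure in the bucket matches the file system store.
for root, _, files in os.walk(base_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, base_dir).replace(os.sep, "/")
        s3.upload_file(path, "<bucketName>", key)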

Cloud Binary Store to Filesystem Binary Store

There might be circumstances where you want to download your binary data stored in the cloud storage to a local file system.

For example, to set up local test environments or when a Sophora Staging Server should keep the data locally as a backup (remember to enable the mixed mode in this case!).

Download all folders and their files from your S3 bucket into a local directory without any adjustments and configure this directory as the base directory of your binary store. After the download has finished, you can start the server configured with sophora.binarystore=fs.
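
The download can also be scripted; a minimal sketch using boto3, with placeholder endpoint, credentials, bucket name and target directory:

import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<host>",
    aws_access_key_id="<accessKeyId>",
    aws_secret_access_key="<secretAccessKey>",
)

target_dir = "/path/to/filesystem-binarystore"  # future base directory of the fs store

# Download every object, recreating the folder structure locally.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="<bucketName>"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        path = os.path.join(target_dir, *key.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        s3.download_file("<bucketName>", key, path)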

Purge job

When documents or document versions are removed from the main repository, the binary data in the Binary Store is unaffected.

An asynchronous job checks all binary data in the Binary Store and removes any binary data that is no longer referenced. This purge job can be controlled by the following properties:

sophora.binarystore.purgeJob.cronTriggerExpression=0 0 2 * * ?
sophora.binarystore.purgeJob.delayMs=10
sophora.binarystore.purgeJob.durationMinutes=18080
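
For orientation: the cron trigger expression follows the six-field Quartz-style format (seconds, minutes, hours, day of month, month, day of week), so the example above would start the purge job every day at 2:00 a.m.:

# run the purge job daily at 02:00
sophora.binarystore.purgeJob.cronTriggerExpression=0 0 2 * * ?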
