Cloud Binary Store

The Cloud Binary Store makes it possible to store binary files more efficiently in an S3-compatible cloud data store.

Configuration

The S3 Cloud Data Store allows binary data to be stored in any S3-compatible cloud storage instead of the DataStore.

Compatible products include Google Cloud Storage, Amazon S3, the self-hostable MinIO and any other S3-compatible product.

This shared binary store is intended to keep binary data in a location that is accessible by all Sophora Servers. This way, no binary data has to be transferred during replication, which reduces replication time significantly. The store is enabled with the following properties:

sophora.binarystore=s3
sophora.binarystore.s3.credentials.accessKeyId=<accessKeyId>
sophora.binarystore.s3.credentials.secretAccessKey=<secretAccessKey>
sophora.binarystore.s3.host=<host>
sophora.binarystore.s3.bucketName=<bucketName>

It is also possible to configure the accessKeyId and the secretAccessKey via the environment variables SOPHORA_BINARYSTORE_S3_CREDENTIALS_ACCESS_KEY_ID and SOPHORA_BINARYSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY. This way you don't have to store these secrets in your sophora.properties file.
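The lookup of the two credentials can be pictured with the following minimal sketch. It assumes that a value from sophora.properties wins over the matching environment variable. The function `resolve_credential` and the `ENV_NAMES` table are hypothetical illustrations and not part of Sophora.

```python
import os

# Hypothetical table: property name -> environment variable name,
# using the names mentioned in the text above.
ENV_NAMES = {
    "sophora.binarystore.s3.credentials.accessKeyId":
        "SOPHORA_BINARYSTORE_S3_CREDENTIALS_ACCESS_KEY_ID",
    "sophora.binarystore.s3.credentials.secretAccessKey":
        "SOPHORA_BINARYSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY",
}


def resolve_credential(name, properties, environ=None):
    """Return the value from the properties if present, else from the environment."""
    environ = os.environ if environ is None else environ
    if name in properties:
        # Assumption: the properties file takes precedence.
        return properties[name]
    return environ.get(ENV_NAMES[name])
```

With both sources set, the sketch returns the value from the properties. With only the environment variable set, it returns that value.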

The settings from sophora.properties take precedence over the environment variables.

Cache size

If S3 is used as the data store, binary data will be cached in the file system. The size of this cache can be configured with the property sophora.binarystore.cacheSize.

sophora.binarystore.cacheSize=1GB

If this property is not set, the cache size defaults to 4GB. Examples of valid values are 5GB, 1MB, 3KB, 1000B or 1000 (1000 bytes). Units must be written in capital letters. Numbers with fractional parts, such as 1.5GB, are not valid.
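The format rules above can be checked with a small validator. This is a sketch and not Sophora's actual parser. The function name `parse_cache_size` is hypothetical, and the factor of 1024 per unit step is an assumption, since the text does not say whether 1KB means 1000 or 1024 bytes.

```python
import re

# Assumption: binary multiples (1KB = 1024 bytes); the documentation
# does not state which base Sophora uses.
UNIT_FACTORS = {"": 1, "B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

# Whole number, optionally followed by an upper-case unit.
_SIZE_PATTERN = re.compile(r"([0-9]+)(B|KB|MB|GB)?")


def parse_cache_size(value):
    """Return the size in bytes, or raise ValueError for an invalid value."""
    match = _SIZE_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"invalid cache size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_FACTORS[unit or ""]
```

Values such as 1.5GB or 1gb are rejected with a ValueError, because the pattern accepts neither fractional numbers nor lower-case units.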

General recommendations for cloud storage bucket settings

Most cloud storage providers offer buckets with different possible options regarding location type, storage classes, encryption and access control.

It is important to choose location type and storage class based on SLAs and privacy policy requirements.

  • Most cloud storage providers offer storage locations in Europe and Germany.
  • Sophora accesses the data in the storage frequently. Most default storage classes should be well suited for this.

Specific recommendations for Google Cloud Storage

  • Location type: Three location types are available: region, multi-region and dual-region. The first has an availability SLA of 99.9%, the others 99.95%. Because of data privacy policies, region might be the best choice, as it allows you to choose a specific location, e.g. europe-west3 (Frankfurt).
  • Storage class: Because the data in the data store is accessed frequently, standard is the best option.
  • Access control: uniform, because the service accounts need the same permissions on each file.
  • Encryption: A Google-managed key is the most convenient option. If you have specific compliance needs, choose a customer-managed key.

Enabling the S3 API in Google Cloud Storage

In addition to Google's own API, Google Cloud Storage offers an S3-compatible API, which needs to be enabled. To do this, open the Google Cloud Platform Console, navigate to Storage and open the settings.

There, look for Interoperability and enable the Interoperability API. Then generate the required accessKeyId and secretAccessKey for Sophora and note the request endpoint.

Usage of cloud data store and other binary store types together

It is possible to use other binary store types (e.g. file system or database) in a Sophora Replica Server or Sophora Staging Server while the Sophora Primary Server uses the Cloud Binary Store.

In this case the Sophora Primary Server has to send additional events that include the binary data. So that these resources are only spent when they are actually needed, this mode has to be enabled explicitly.

This is done by adding the following configuration parameter to the configuration of the Sophora Primary Server:

sophora.binarystore.s3.mixed.mode=true

Migration of binary data to cloud storage

Migration of the filesystem store

In this scenario a Sophora Server exists with the DataStore activated and the binary data stored in the file system.

Copy the content of the base directory of your filesystem store into your S3 bucket. No adjustment to the files is needed, since the data is stored in exactly the same structure and format.
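The copy can be done with any S3 client. The following sketch uses Python with the third-party boto3 library and assumes that "the same structure" means each file's path relative to the base directory becomes its object key. The function names and parameters are hypothetical and the script is not part of Sophora.

```python
import os
from pathlib import Path


def object_keys(base_dir):
    """Yield (local_path, object_key) pairs for every file below base_dir."""
    base = Path(base_dir)
    for root, _dirs, files in os.walk(base):
        for name in sorted(files):
            path = Path(root) / name
            # Assumption: the relative path is used unchanged as the key.
            yield path, path.relative_to(base).as_posix()


def upload_all(base_dir, bucket, endpoint_url, access_key_id, secret_access_key):
    """Upload every file below base_dir to the given bucket."""
    import boto3  # third-party; imported here so object_keys works without it

    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    for path, key in object_keys(base_dir):
        client.upload_file(str(path), bucket, key)
```

Calling `object_keys` on its own first gives a dry run that lists what would be uploaded under which key.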

Migration of the database store

To migrate binary data from a DataStore with an activated database store to S3-compatible cloud storage, use the tool com.subshell.sophora.binarydata.migrator.

The migrator needs the connection configuration of the database from which the data is extracted and the configuration of the target S3-compatible bucket in which the data is stored.

The configuration is done inside the application.yml file and should look like this:

database:
    datasource:
        driver: <databaseDriverName>
        url: <databaseUrl>
        username: <databaseUsername>
        password: <databasePassword>
s3:
    bucket: <bucketName>
    accessKeyId: <accessKeyId>
    secretAccessKey: <secretAccessKey>
    host: <host>

The migration status is stored in a file named migratedIds.txt in the execution directory of the application. The file is created if it doesn't exist.

The file contains the IDs of already migrated binary data (one ID per line). IDs can be added or removed manually in order to skip or repeat migration of certain binary data.
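Editing the file by hand works for a few IDs. For larger batches, a small helper such as the following sketch may be convenient. It assumes only what is stated above (plain text, one ID per line). The function names are hypothetical, and the helper should only be run while the migrator is stopped.

```python
from pathlib import Path


def load_migrated_ids(path="migratedIds.txt"):
    """Return the IDs listed in the file, or an empty list if it does not exist."""
    file = Path(path)
    if not file.exists():
        return []
    return [line.strip() for line in file.read_text().splitlines() if line.strip()]


def forget_ids(ids_to_repeat, path="migratedIds.txt"):
    """Remove the given IDs so that their binary data is migrated again."""
    drop = set(ids_to_repeat)
    kept = [i for i in load_migrated_ids(path) if i not in drop]
    # Rewrite the file with the remaining IDs in their original order.
    Path(path).write_text("".join(f"{i}\n" for i in kept))
    return kept
```

Adding IDs to skip their migration works the same way in reverse: append one ID per line to the file.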

The migrator is provided as an executable jar file named com.subshell.sophora.binarydata.migrator-<version>.jar. Copy that jar into a directory and place a config folder next to it.

The config folder must contain the application.yml in which the source and target of the migration are configured.

Migration without a data store

If no binary store was used before, you should migrate to the Sophora DataStore first, for example by synchronizing a new Sophora Replica Server that has a binary store configured. You can then perform the migration as described above. The previously used Sophora Primary Server can only continue to be used if its repository is replaced by a new one (including synchronizing the content back from the mentioned Sophora Replica Server).

Migration of binary data from cloud storage

There might be circumstances where you want to download your binary data from the cloud storage to a local file system, for example to set up local test environments or when a Sophora Staging Server should keep the data locally as a backup (remember to enable the mixed mode in this case).

Download all folders and the files in these folders from your S3 bucket into a local folder without any adjustments, and configure this folder as the base directory of your binary store. After the download has finished, you can start the server configured with sophora.binarystore=fs.
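The download can be scripted along these lines. This is a sketch that uses Python with the third-party boto3 library and assumes that each object key maps directly to a relative path below the base directory. The function names and parameters are hypothetical and not part of Sophora.

```python
from pathlib import Path


def local_path_for(base_dir, key):
    """Map an object key such as 'ab/cd/file' to a path below base_dir."""
    return Path(base_dir).joinpath(*key.split("/"))


def download_all(bucket, base_dir, endpoint_url, access_key_id, secret_access_key):
    """Download every object of the bucket into base_dir, keeping the structure."""
    import boto3  # third-party; imported here so local_path_for works without it

    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for(base_dir, key)
            target.parent.mkdir(parents=True, exist_ok=True)
            client.download_file(bucket, key, str(target))
```

The paginator is used because a single listing request returns only a limited number of objects.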

Purge job

When document versions or whole documents are removed from the repository, the binary data in the data store is unaffected.

An asynchronous job checks all binary data in the data store. When no reference to a binary is found (neither in the main repository nor in the archive repository), the binary data is removed from the data store. This purge job can be controlled by the following properties:

sophora.binarystore.purgeJob.cronTriggerExpression=0 0 2 * * ?
sophora.binarystore.purgeJob.delayMs=10
sophora.binarystore.purgeJob.durationMinutes=18080
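The decision the purge job makes can be pictured as a set difference. The following sketch is a simplified illustration and not Sophora's implementation. The function name is hypothetical, and scheduling, delay and duration limits are left out.

```python
def find_unreferenced(stored_ids, main_refs, archive_refs):
    """Return the IDs in the data store that neither repository references."""
    # A binary is kept if the main or the archive repository still refers to it.
    referenced = set(main_refs) | set(archive_refs)
    return set(stored_ids) - referenced
```

For a store holding a, b and c, where the main repository references a and the archive repository references c, only b would be purged.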

Last modified on 5/10/22

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
