
Data Store Guide

The Data Store add-on makes it possible to store binary files more efficiently, either in a cloud binary store, in a database or in the file system.

Archived documentation for Sophora 3 (end-of-support date for this version: 7/25/21). For the current version, see the documentation for Sophora 4.

Normally, binary content (e.g. pictures) is stored in binary properties in the repository.

This content is duplicated when versions of a Sophora document are created or when documents are copied to the archive repository. To avoid this duplicated binary content, a dedicated binary data store is introduced in which each piece of binary data is stored only once.

Even if a picture is added twice (e.g. by cloning a document or by uploading the same picture to different Sophora documents), the binary content is stored only once. To identify duplicate binary content, a hash code is computed and used as the primary key.
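To illustrate this content-addressed approach, the following minimal sketch stores a file under its hash so that a second upload of identical content does not create a second copy. The hash algorithm and target layout are assumptions for illustration only, not Sophora's actual implementation:

#!/bin/bash
# Minimal sketch: the file's hash acts as the primary key, so identical
# content is stored only once (algorithm and layout are assumptions).
store_binary() {
        local file="$1"
        local hash
        hash=$(sha256sum "${file}" | cut -d' ' -f1)
        local target="binaries/${hash}.bin"
        mkdir -p binaries
        if [ ! -e "${target}" ]; then
                cp "${file}" "${target}"   # first occurrence: store the content
        fi
        echo "${hash}"                     # duplicate uploads return the same key
}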

When the binary data store is activated, new content is stored in the file system or in a dedicated database. Even if a binary data store is configured, the Sophora Server still reads binary data that is stored in the repository, so a migration is not necessary.

Configuration of the data stores

You can choose between three binary store options:

  1. A file system data store, which stores the binary files on each server's hard disk,
  2. a database-backed binary store, which stores the binary files in a per-server database,
  3. and a shared, cloud-based, S3-compatible binary store (e.g. Google Cloud Storage, Amazon S3 or a self-hosted MinIO cluster).

While the cloud storage is shared among all Sophora servers, the other two data stores exist on a per-server basis.

Configuration of the file system data store

If the file system data store is used, the individual files are stored on the hard disk or mounted volume separately from the repository.

sophora.binarystore=fs

By default the binary data is stored in the directory ${sophora.home}/repository/binaries.

If you want to save the binary data in a different directory, you can use the server property sophora.binarystore.fs.baseDirectory. The configured path is relative to sophora.home.
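For example, a sophora.properties configuration with a custom directory could look like this (the directory name is only an example):

sophora.binarystore=fs
# Relative to sophora.home; "repository/binarydata" is an example value
sophora.binarystore.fs.baseDirectory=repository/binarydata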

Configuration of the database data store

The database data store allows you to store the binary data in a MySQL, Derby or Oracle database. The necessary tables are created automatically if the user has the necessary permissions.

sophora.binarystore=db
# Possible values: mysql, derby or oracle
sophora.binarystore.db.databaseType=mysql
sophora.binarystore.db.url=jdbc:mysql://mysql.subshell.com:3306/binarystore
sophora.binarystore.db.user=sophora
sophora.binarystore.db.password=sophora
sophora.binarystore.db.driver=com.mysql.jdbc.Driver
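For comparison, a configuration using an embedded Derby database could look like this. The URL format and driver class are standard Derby values, not Sophora-specific defaults:

sophora.binarystore=db
sophora.binarystore.db.databaseType=derby
# Embedded Derby database in a local directory; ";create=true" creates it if missing
sophora.binarystore.db.url=jdbc:derby:binarystore;create=true
sophora.binarystore.db.user=sophora
sophora.binarystore.db.password=sophora
sophora.binarystore.db.driver=org.apache.derby.jdbc.EmbeddedDriver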

Configuration of the S3 cloud data store

The S3 Cloud Data Store allows you to store the binary data in any S3-compatible cloud storage.

These include Google Cloud Storage, Amazon S3, the self-hostable minio or any other compatible product.

This shared binary store is intended to store binary data in a location that is accessible by all Sophora servers. This way, no binary data has to be transferred during replication, which reduces replication time immensely.

sophora.binarystore=s3
sophora.binarystore.s3.credentials.accessKeyId=<accessKeyId>
sophora.binarystore.s3.credentials.secretAccessKey=<secretAccessKey>
sophora.binarystore.s3.host=<host>
sophora.binarystore.s3.bucketName=<bucketName>

It is also possible to configure the accessKeyId and the secretAccessKey via the environment variables SOPHORA_BINARYSTORE_S3_CREDENTIALS_ACCESS_KEY_ID and SOPHORA_BINARYSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY. This way you don't have to store these secrets in your sophora.properties file.

The settings from the sophora.properties file take precedence over the environment variables.
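For example, the credentials could be provided to the server process like this (a sketch; how you set environment variables depends on how you start the Sophora Server):

export SOPHORA_BINARYSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=<accessKeyId>
export SOPHORA_BINARYSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=<secretAccessKey>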

General recommendations for cloud storage bucket settings

Most cloud storage providers offer buckets with different possible options regarding location type, storage classes, encryption and access control.

It is important to choose location type and storage class based on SLAs and privacy policy requirements.

  • Most cloud storage providers offer storage locations in Europe and Germany.
  • Sophora accesses the data in the storage frequently. Most default storage classes should be suited well for this.

Specific recommendations for Google Cloud Storage

  • Location type: There are three location types available: region, multi-region and dual-region. The first one has an availability SLA of 99.9%, the others 99.95%. Because of data privacy policies, region might be the best choice, as it allows you to choose a specific location, e.g. europe-west3 (Frankfurt).
  • Storage class: Because the data in the data store is accessed frequently, standard is the best option.
  • Access control: Uniform, because the service accounts need the same permissions on each file.
  • Encryption: A Google-managed key is the most convenient option. If you have specific compliance needs, choose a customer-managed key.
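As a sketch, these recommendations could be applied when creating the bucket with gsutil (the bucket name is a placeholder; verify the flags against your provider's current documentation):

# Create a bucket in Frankfurt with standard storage class and uniform bucket-level access
gsutil mb -l europe-west3 -c standard -b on gs://<bucket name>/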

Enabling the S3 API in Google Cloud Storage

In addition to Google's own API, Google Cloud Storage offers an S3-compatible API which needs to be enabled. To do this, open the Google Cloud Platform Console, navigate to Storage and open the settings.

There you need to look for Interoperability and enable the Interoperability API. Then generate the required accessKey and secretAccessKey for Sophora and note the request endpoint.
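As an alternative to the console, HMAC keys for a service account can also be created with the gsutil CLI (a sketch; the service account e-mail is a placeholder):

# Creates an HMAC accessKey/secretAccessKey pair for the given service account
gsutil hmac create <service-account-email>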

Usage of cloud data store and other binary store types together

It is possible to use other binary store types (e.g. file system or database) on a Sophora Replica (Slave) Server or Sophora Staging Server while the Sophora Primary (Master) Server uses the Cloud Binary Store.

In this case the Sophora Primary (Master) Server has to send additional events which include the binary data. To avoid spending these resources when they are not actually needed, this mode has to be enabled explicitly.

This can be done by adding the following configuration parameter to the configuration of the Sophora Primary (Master) Server:

sophora.binarystore.s3.mixed.mode=true

Migration of binary data to cloud storage

Migration of the filesystem store

In this scenario a Sophora Server exists with binary data stored in the file system.

To copy all files to your S3 cloud storage, you can use a script like this one (this example uses Google Cloud Storage and needs to be adjusted for other vendors):

#!/bin/bash

# First run: copy all *.bin files.
# Subsequent runs: copy only files newer than the timestamp of the previous run.
if [ ! -s "date.txt" ]
then
        list="$(find . -name \*.bin)"
else
        list=$(find . -newermt "$(cat date.txt)" -name \*.bin)
fi

# Record the start time before copying, so files created while the copy is
# running are picked up by the next run.
date -R > date.txt

# Copy each file to the bucket, using only the id (the file name without the
# .bin ending) as the object name, which flattens the directory structure.
for file in ${list}
do
        gsutil cp ${file} gs://<bucket name>/$(echo ${file} | sed -e 's/.*\/\(.*\)\.bin/\1/g')
done

The script has to be executed in the folder where the binary files are located (i.e. repository/binaries). It copies all files to the bucket <bucket name>, keeps the id (the part in front of the file ending .bin) and flattens the directory structure.

Furthermore, it stores the current date in date.txt, which allows files to be copied incrementally: since the copy process may take a while, you can execute the script multiple times after the first run to copy only those files which were created after the previous execution.

When you want to take the binary store live, stop the server and execute the script one last time. After this final execution all binaries have been copied to the bucket and the cloud binary data store may be used.

Important when using Google Cloud Storage: When running the script in a compute engine VM it needs at least read/write access to the Storage API.

To edit an instance's API access scopes after creation, open it in the GCP console, stop it, click "Edit" and, under "Cloud API access scopes", set "Storage" to "Read Write" (or full access).
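The same change can also be made with the gcloud CLI, for example like this (a sketch; instance name, zone and service account are placeholders, and the instance must be stopped first):

# Grant the read/write storage scope to a stopped instance
gcloud compute instances set-service-account <instance-name> \
        --zone <zone> \
        --service-account <service-account-email> \
        --scopes storage-rw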

Migration of the database store

For the migration of binary data from a database store to the S3-compatible cloud storage the tool com.subshell.sophora.binarydata.migrator is available.

The migrator needs a configuration of the connection to the database from which data should be extracted and a configuration of the target S3-compatible cloud bucket to store data in.

The configuration is done inside the application.yml file and should look like this:

database:
  datasource:
    driver: <databaseDriverName>
    url: <databaseUrl>
    username: <databaseUsername>
    password: <databasePassword>
s3:
  bucket: <bucketName>
  accessKeyId: <accessKeyId>
  secretAccessKey: <secretAccessKey>
  host: <host>

The migration status is stored in a file named migratedIds.txt and will be generated if it doesn't exist (inside the execution directory of the application).

The file contains the IDs of already migrated binary data (one ID per line). IDs can be added or removed manually in order to skip or repeat migration of certain binary data.

The migrator is provided as an executable JAR file named com.subshell.sophora.binarydata.migrator-<version>.jar. Copy that JAR into a directory and place a config folder next to it.

The folder must contain the application.yml in which the source and target of the migration are configured as described above.
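The resulting layout and start command could look like this (a sketch; the start command assumes the usual way of running an executable JAR):

# Directory layout:
#   migration/
#     com.subshell.sophora.binarydata.migrator-<version>.jar
#     config/application.yml
cd migration
java -jar com.subshell.sophora.binarydata.migrator-<version>.jar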

Migration without a data store

If no binary store was used before, you should first migrate to one of the other two binary stores and can then perform the migration as described above. This can happen, for example, by synchronizing a new Sophora Replica (Slave) Server that has a binary store configured. The previously used Sophora Primary (Master) Server can only be used further on if its repository is replaced by a new one (including synchronizing the content back from the mentioned Sophora Replica (Slave) Server).

Migration of binary data from cloud storage

There might be circumstances where you want to download your binary data stored in the cloud storage to a local file system.

For example, to set up local test environments, or when a Sophora Staging Server should keep the data locally as a backup (remember to enable the mixed mode in this case!).

To download all data from a bucket change to a Sophora server's repository/binaries/ folder and execute a script like this (again, this example is suitable for Google Cloud Storage, adjustments needed for different vendors):

#!/bin/bash

bucketName=<bucket name>

# List all objects on the root level of the bucket (subfolders are ignored)
function listBucket {
        gsutil ls "gs://$bucketName" | grep "[^\/]$"
}

# Extract the object id (the last path segment of the object URL)
function getObjectId {
        echo $1 | grep -Eo "\w+[^\/]$"
}

function downloadObject {
        gsutil cp $1 $2
}

objectList=$(listBucket)
echo "Downloading $(echo $objectList | wc -w) objects..."
for objectUrl in ${objectList}
do
        objectId=$(getObjectId $objectUrl)
        # The file system data store expects each file in a two-level subfolder
        # structure derived from the first four characters of its id: ab/cd/abcd....bin
        subFolder1=${objectId:0:2}
        subFolder2=${objectId:2:2}
        mkdir -p "$subFolder1/$subFolder2"
        targetPath="$subFolder1/$subFolder2/$objectId.bin"
        echo "$objectUrl -> $targetPath"
        downloadObject $objectUrl $targetPath
done

This will download all objects from the root level of the bucket <bucket name>. It ignores subfolders, as it expects the structure created by the cloud data store, which is simply flat.

The objects are downloaded into the folder structure needed by the Filesystem Data Store. After the download is finished you can start the server configured with sophora.binarystore=fs.

Purge job

When document versions or whole documents are removed from the repository, the binary data in the data store is unaffected.

An asynchronous job checks all binary data in the data store. When no reference to the binary is found (neither in the main repository nor in the archive repository), the binary data is removed from the data store. This purge job can be controlled by the following properties:

# Quartz-style cron expression: in this example the purge job runs daily at 2:00 a.m.
sophora.binarystore.purgeJob.cronTriggerExpression=0 0 2 * * ?
sophora.binarystore.purgeJob.delayMs=10
sophora.binarystore.purgeJob.durationMinutes=18080

Last modified on 10/26/20

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
