We're going to provide an efficient backup and restore capability in ModeShape 3.x (see MODE-1581). This will work at the repository level, meaning the backups will contain all of the content in all of the workspaces of a single repository. This will be useful for times when a repository needs to be recovered to an earlier state (due to a corruption, hardware failure, etc.), and it will also be part of our solution for migrating ModeShape 2.x repositories to 3.x.
Note: ModeShape 3.x can rely upon an Infinispan cache that is replicated (every node is stored on every process in the Infinispan cluster) or distributed (every node is stored on a fixed number of processes, e.g., 3 or 5, in the Infinispan cluster, but where the copies are balanced across the cluster to maximize availability). Both approaches increase availability even in the event of machine failures, and in many respects reduce the need for backups (since the cache essentially actively maintains its own backups via copies of each node). However, backups are far more important for very small clusters and single-process configurations.
In ModeShape 3 there are two places where content can be stored: representations of the nodes are stored in the Infinispan cache, and (larger) binary values are stored in the BinaryStore. ("Smaller" binary values are stored with the regular node representations; the size limit for storing with the regular nodes or in the binary store is specified in the repository configuration, and defaults to 4kB.) The Infinispan cache and the BinaryStore each have multiple persistence options, and ModeShape's backup mechanism should be independent of these persistence mechanisms.
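To make the size-based split concrete, here is a minimal sketch of the routing decision. The class, enum, and method names are hypothetical and are not ModeShape API. The sketch assumes that "4kB" means 4096 bytes and that values exactly at the limit go to the binary store; the text does not say which side of the boundary is inclusive.

```java
/** Hypothetical helper illustrating size-based placement of binary values. */
final class BinaryPlacement {
    /** Default limit from the text (4 kB), assumed here to mean 4096 bytes. */
    static final long DEFAULT_THRESHOLD_BYTES = 4 * 1024;

    /** Where a binary value ends up. */
    enum Target { NODE_DOCUMENT, BINARY_STORE }

    private BinaryPlacement() {
    }

    /**
     * Assumption: values strictly smaller than the threshold are stored with
     * the node representation; everything else goes to the binary store.
     */
    static Target targetFor(long sizeInBytes, long thresholdBytes) {
        if (sizeInBytes < 0 || thresholdBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        return sizeInBytes < thresholdBytes ? Target.NODE_DOCUMENT : Target.BINARY_STORE;
    }
}
```

A backup has to cover both targets, which is why the algorithm below handles node documents and binary values separately.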
Basic backup and restore algorithm
The initial design is that the backup process will simply extract the node representations and binary values and write them to files in a directory on the file system. The node representations are actually Schematic documents, which are in-memory documents that have all the capability of both JSON and BSON, and can easily be written out in either format without loss of information. During a backup, the following steps will be performed:
- Iterate over the Infinispan cache entries, appending a number (e.g., 1000) of node documents in JSON format to a file. Whenever the maximum number of entries per backup file is reached, the file will be closed, a new file will be created, and the appending will continue. Note that we'll compress the files; the JSON format will compress quite well. We'll also experiment to determine an acceptable and practical value for the maximum number of entries per backup file, since a larger number will result in fewer but larger files, and a smaller number will result in a greater number of smaller files. UPDATE: By default, 100K nodes will be exported to a single backup file. So, if each node required about 200 bytes (compressed), the resulting files will be about 19 MB in size.
- Write out each of the binary values to a separate file.
We'll use a naming convention and organization within a single directory so that the restore process can simply process all of these files and load them into the new repository's Infinispan cache and binary store.
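The file-rolling step can be sketched as follows. This is only an illustration under stated assumptions, not the actual ModeShape implementation: the class name, the "documents_NNNNNN.json.gz" naming scheme, and the one-document-per-line layout are all hypothetical, and each document is passed in as an already-serialized JSON string that contains no raw line breaks.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical writer that rolls over to a new gzip file every N documents. */
final class ChunkedBackupWriter implements AutoCloseable {
    private final Path directory;
    private final int maxDocumentsPerFile;
    private final List<Path> files = new ArrayList<>();
    private BufferedWriter current;
    private int countInCurrent;

    ChunkedBackupWriter(Path directory, int maxDocumentsPerFile) throws IOException {
        if (maxDocumentsPerFile < 1) {
            throw new IllegalArgumentException("maxDocumentsPerFile must be at least 1");
        }
        this.directory = Files.createDirectories(directory);
        this.maxDocumentsPerFile = maxDocumentsPerFile;
    }

    /** Appends one JSON document (assumed to be a single line) to the current file. */
    void write(String jsonDocument) throws IOException {
        if (current == null || countInCurrent == maxDocumentsPerFile) {
            rollOver();
        }
        current.write(jsonDocument);
        current.write('\n');
        countInCurrent++;
    }

    /** Closes the current file, if any, and opens the next one in sequence. */
    private void rollOver() throws IOException {
        closeCurrent();
        // Hypothetical naming convention: a zero-padded sequence number keeps files ordered.
        Path file = directory.resolve(String.format("documents_%06d.json.gz", files.size() + 1));
        current = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8));
        files.add(file);
        countInCurrent = 0;
    }

    /** Returns the backup files created so far, in creation order. */
    List<Path> files() {
        return Collections.unmodifiableList(new ArrayList<>(files));
    }

    private void closeCurrent() throws IOException {
        if (current != null) {
            current.close();
            current = null;
        }
    }

    @Override
    public void close() throws IOException {
        closeCurrent();
    }

    /** Restore side: reads every document back from one backup file. */
    static List<String> readDocuments(Path file) throws IOException {
        List<String> documents = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                documents.add(line);
            }
        }
        return documents;
    }
}
```

Writing five documents with a limit of two per file produces three files holding two, two, and one document. A restore would list the directory, sort the names, and pass each file through something like readDocuments before loading the results into the cache.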
We think that we can easily make the process work even when the repository is in use, and this will greatly increase the value and lower the invasiveness of the backup process. This will be ideal for normal backup operations, when you simply want to periodically obtain a consistent backup copy; if anything goes wrong, you can restore your repository to the state of the last backup. Of course, if you're migrating a repository (e.g., from 2.x to 3.0, or from a repository with one Infinispan cache store to another repository with a different cache store), you will want to suspend all other repository users so that the backup accurately represents the current state.
Migration from 2.x
We're planning on providing a separate utility in 2.8.x that can read an existing configuration and write all of the content into a 3.0-compatible backup format. This utility would be run while the 2.x repository is no longer being used, and the resulting backup files can then be used to restore the 2.x content into a 3.0 repository. There's also one huge advantage to this approach: if the 2.x backup fails, simply clean up the files on the file system and start the backup process again. (If we were to directly write the 2.x content into a running 3.0 repository without an intermediate format, any failure would mean the 3.0 repository would also need to be cleaned up.)
Where possible, we'd like the actual running repository to be able to perform the backup and restore processes. That makes it far easier for configurations that use external configurations for data sources, etc., and it also means that managing such operations can be included in the general administrative and monitoring activities of the repository.
Although the JCR API doesn't have such repository-level operations, it does place similar bulk operations on session-scoped interfaces (e.g., the import methods on javax.jcr.Workspace and the export methods on javax.jcr.Session). This means that users must first authenticate by obtaining a javax.jcr.Session, and the regular session-based authorization mechanism can be used to ensure that only privileged users can invoke the Workspace methods.
We think that this is a good approach for introducing a backup API. ModeShape already provides an org.modeshape.jcr.api.Workspace interface that extends javax.jcr.Workspace (for adding re-indexing methods), so adding a new "backup" method here seems to make sense. This is also compatible with the ability for the backups to be performed while the repository is in use (see above).
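As a rough sketch of what a session-scoped "backup" method with a privilege check might look like, consider the following. Everything here is hypothetical: the interface stands in for org.modeshape.jcr.api.Workspace so the example compiles without the JCR jars, and the method signature, the "backup" privilege name, and the use of SecurityException are assumptions rather than anything decided in this proposal.

```java
import java.nio.file.Path;
import java.util.Set;

/**
 * Hypothetical stand-in for the workspace interface. In ModeShape the method
 * would presumably be declared on org.modeshape.jcr.api.Workspace instead.
 */
interface BackupCapableWorkspace {
    /** Writes a backup of the whole repository into the given directory. */
    void backup(Path backupDirectory);
}

/** Wraps a workspace and rejects callers whose session lacks the privilege. */
final class PrivilegeCheckingWorkspace implements BackupCapableWorkspace {
    /** Hypothetical privilege name; the real one would come from the repository. */
    static final String BACKUP_PRIVILEGE = "backup";

    private final Set<String> sessionPrivileges;
    private final BackupCapableWorkspace delegate;

    PrivilegeCheckingWorkspace(Set<String> sessionPrivileges, BackupCapableWorkspace delegate) {
        this.sessionPrivileges = sessionPrivileges;
        this.delegate = delegate;
    }

    @Override
    public void backup(Path backupDirectory) {
        // The check runs before any work starts, so an unprivileged call has no side effects.
        if (!sessionPrivileges.contains(BACKUP_PRIVILEGE)) {
            throw new SecurityException("session lacks the '" + BACKUP_PRIVILEGE + "' privilege");
        }
        delegate.backup(backupDirectory);
    }
}
```

Because the method hangs off an object that only exists after login, no separate authentication path is needed for administrative operations.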
However, the mechanism to restore a repository is a bit more challenging, primarily because the restore process will be completely replacing all content over a period of time, and therefore there should be no users of the repository. Also, if the repository is not empty, all of the existing content will need to be removed. I see two options that might work:
Option 1: Specify an "initializeFromBackup" field in the configuration. Currently, when a repository starts up, it looks for some key pieces of information (e.g., repository metadata) and, if they are not there, initializes the repository metadata, "/jcr:system" content, indexes, etc. This initialization mechanism could be altered to simply load the content from the backup. Advantages are that the restore fits pretty cleanly into the initialization mechanism, and the repository is not considered "ready for use" until initialization is complete. One disadvantage is that having the "initializeFromBackup" in the configuration file seems a bit forced and unnatural. Another disadvantage is that this would really only work well if the repository were already empty, making it harder for cases where a running repository needs to be restored to a previously backed-up state.
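The startup decision that Option 1 implies could look roughly like this. The class, enum, and parameter names are hypothetical, and the sketch assumes that the presence of repository metadata is the only signal used to tell an empty repository from a populated one.

```java
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical sketch of the startup decision under Option 1. */
final class StartupInitializer {
    enum Action { USE_EXISTING_CONTENT, RESTORE_FROM_BACKUP, INITIALIZE_FRESH }

    private StartupInitializer() {
    }

    /**
     * @param repositoryMetadataPresent whether the key repository metadata was found at startup
     * @param initializeFromBackup      the configured backup directory, if any
     */
    static Action decide(boolean repositoryMetadataPresent, Optional<Path> initializeFromBackup) {
        if (repositoryMetadataPresent) {
            // A populated repository is left alone, which is the limitation noted above.
            return Action.USE_EXISTING_CONTENT;
        }
        return initializeFromBackup.isPresent()
                ? Action.RESTORE_FROM_BACKUP
                : Action.INITIALIZE_FRESH;
    }
}
```

The first branch shows the second disadvantage directly: once the repository holds content, the configuration field has no effect.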
Option 2: Add a "restore" method to the org.modeshape.jcr.api.Workspace interface, along with the "backup" method discussed above. Like "backup", the repository would only allow users with proper privileges to invoke the method. Advantages are that it's fairly symmetric with "backup", it works with the existing privilege mechanism, and it's easy to first clean out any existing content. The primary disadvantage is that the restore cannot be done while there are other users logged in, so we'd need to add a different repository state (other than "running", "starting", "stopping", and "stopped").
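One way the extra state for Option 2 might work is sketched below. The class, the "RESTORING" state name, and the rule that only the invoking user's own session may remain logged in are all assumptions made for illustration.

```java
/** Hypothetical lifecycle with an extra state that blocks logins during a restore. */
final class RepositoryLifecycle {
    enum State { STARTING, RUNNING, RESTORING, STOPPING, STOPPED }

    private State state = State.RUNNING;
    private int activeSessions;

    synchronized State state() {
        return state;
    }

    /** Registers a new session; only allowed while the repository is running. */
    synchronized void login() {
        if (state != State.RUNNING) {
            throw new IllegalStateException("logins are rejected while " + state);
        }
        activeSessions++;
    }

    synchronized void logout() {
        if (activeSessions > 0) {
            activeSessions--;
        }
    }

    /**
     * Runs the restore work with the repository in the RESTORING state.
     * Assumption: the caller holds the only remaining session.
     */
    void restore(Runnable restoreWork) {
        synchronized (this) {
            if (state != State.RUNNING) {
                throw new IllegalStateException("cannot restore while " + state);
            }
            if (activeSessions > 1) {
                throw new IllegalStateException("other sessions are still logged in");
            }
            state = State.RESTORING;
        }
        try {
            // The lock is released here so that concurrent logins fail fast instead of blocking.
            restoreWork.run();
        } finally {
            synchronized (this) {
                state = State.RUNNING;
            }
        }
    }
}
```

The sketch refuses a restore while other sessions exist. An alternative design would forcibly log those sessions out, which is a policy question the proposal leaves open.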
Are there other possibilities? How would you want to restore a repository to a previous state, if it were necessary? Thoughts?