This is tracked by ISPN-3351.
Purpose
Recently it has been more and more a requirement from the users to be able to:
- controlled shutdown a cluster and flush data to persistent storage
- restart the whole cluster and preload the data from the storage (no data loss)
Design
Here's a suggested design to achieve this functionality:
Controlled shutdown
- when a *ClusterShutdown* operation (JMX) is triggered:
- the coordinator doesn't initiate any rebalance on nodes leaving
- each node:
- stops accepting client requests (in future we might get smarter and have a timeout for ongoing transactions)
- flushes whatever state it has in memory to the cache store: e.g. if async store waits for it to stop, if SingleFileCacheStore is used directly forces a sync
- writes a "successfulShutdown" flag to the *LocallyRegistry*
- process exits
Restart
- each individual node is restarted with a "ClusterRestart" option indicating they want to restore cluster state
- each node:
- reads the "successfulShutdown" and "shutDownView" information from the *LocalRegistry* and sends it to the coordinator
- if the node doesn't have have the "successfulShutdown" flag present it wipes out the cache store assuming the data is stale
- the coordinator doesn't start a rebalance on nodes joining, but:
- the coordinator waits for all nodes from the previously stored shutDownView to join again - ignoring new nodes
- when each of the expected nodes have rejoined the coordinator informs the cluster to preload
- after each nodes finished preloading, the coordinator install the "shutDownView" in all the nodes in the cluster (no state transfer yet if there are new nodes as well)
- enable new nodes to join (enable normal state transfer)
- in order to support the case in which not all nodes are being restarted the following JMX operations are available
- restartStatus - provides the following information
- list of the expected cluster nodes: "shutDownView"
- current cluster state: how many nodes have joined up to this moment, which ones are we still waiting on
- noDataLost: a boolean indicating weather enough nodes joined so that no state is lost (e.g. numOwners = 3, if shutDownView.size -2 nodes have joined; this flag could become smarter in future by looking at the actual segment distribution)
- forceRestart - allows an explicit restart even if not all nodes from the "shutDownView" are present
- the nodes that have currently joined preload the state from cache store
- after all have preloaded a rebalance is triggered in which old topology == "shutDownView" and thenew topology == current existing topology based on the available nodes
LocalRegistry
Implemented either as a file on the local disk (backed by the SingleFileCacheStore)
<global> <localRegistry store="file:/path/to/my/local/regitry"/> <global>
Or in environments where a local store is not available (e.g. clouds):
<global> <localRegistry store="cahe:/myCache/myCacheStore"/> <global> ... <namedCache name="myCache"> <loaders> <loader name="myCacheStore" class="..."> ... </loader> </loaders> </namedCache>
Comments