I'm trying to help with Farm deployment and would like to start a forum discussion on issue JBCLUSTER-26. Ben Wang and I discussed the need for atomic deployment support last week; the information below is partly based on our conversation and my understanding of the task. These are my words and not his (okay, disclosure is complete :-)
Deployment should be atomic, or atomic-like: deployment should only complete on any machine if the application can be copied to all nodes and deployed on all nodes. If the deployment fails on any one machine, every machine should roll back to its previous state. If a user farm-deploys a newer version of an application that is already farm-deployed, the new application replaces the old one, unless we have to roll back (in which case we stick with the old version). The user that initiates the deployment should be considered the administrator for the deployment operation, and results from the cluster nodes should be delivered to the administrator machine in some form (something like "CrimePortalBeans.jar successfully copied to node1, CrimePortalBeans.jar deployed on node1, CrimePortalBeans.jar failed to be copied to node2, rollback..."). The results for each node should also be logged locally on each node to help with troubleshooting.
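To make the copy-then-deploy-then-rollback semantics concrete, here is a minimal sketch in Java. The Node interface and the copy/deploy/rollback method names are my own illustrative inventions, not real Farm APIs; the status messages follow the format suggested above.

```java
import java.util.ArrayList;
import java.util.List;

public class AtomicFarmDeploy {
    // Hypothetical view of a cluster node; not a real JBoss interface.
    interface Node {
        boolean copy(String archive);    // copy archive into the farm folder
        boolean deploy(String archive);  // hot-deploy the copied archive
        void rollback(String archive);   // restore the node's previous state
    }

    /** Returns per-node status lines for the administrator console. */
    public static List<String> deploy(String archive, List<Node> nodes) {
        List<String> report = new ArrayList<>();
        List<Node> deployedOn = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            Node n = nodes.get(i);
            String name = "node" + (i + 1);
            if (!n.copy(archive)) {
                report.add(archive + " failed to be copied to " + name + ", rollback...");
                for (Node d : deployedOn) d.rollback(archive);
                return report;
            }
            report.add(archive + " successfully copied to " + name);
            if (!n.deploy(archive)) {
                report.add(archive + " failed to deploy on " + name + ", rollback...");
                for (Node d : deployedOn) d.rollback(archive);
                n.rollback(archive);  // undo the copy on the failing node too
                return report;
            }
            deployedOn.add(n);
            report.add(archive + " deployed on " + name);
        }
        return report;
    }
}
```

The key property is that a failure at any node stops the roll-out and rolls back every node touched so far, so the cluster never ends up half-deployed.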
Does this sound right? Should we go with a two-phase approach backed by a transaction log, or take a lighter approach and tighten up the current support?
Some interesting cases might be:
1. Reboot a server during farm deploy (after the file is copied into the farm folder); rollback should occur after the reboot. The same applies to every node in the cluster.
2. Same as #1, but a previous copy of the application already exists and needs to be restored to the farm folder without appearing as a new update.
3. Read-only farm folders (e.g. an HTTP URL) may present some challenges; I'm not sure how they fit into the atomic model.
4. Start cluster node1 with the app already in the farm folder, then start node2, and make sure application changes deploy in the right direction. This is difficult because the available information is ambiguous (should the app be removed from node1 or added to node2?). I propose that the decision should always be to add the app to the cluster rather than remove it from the cluster. If you want to remove an application from a cluster, you will have to remove it from a "live" cluster node (a node that is currently part of the cluster).
5. If we use a transaction log (which would contain changes to the cluster), how does the user manage it?
6. The rollback operation itself may fail; how do we handle rollback failures?
A third approach might be to support farm deployment from a source control system. This might be nice, as you would know exactly what is in use on the cluster and have a nice history of deployed archives. This solves different problems than atomic deployment, but I wanted to mention it in case others thought it would help. The objective would be to maintain a "single truth" as to what should be running on the cluster.
This discussion really belongs in the deployment forum, so I will move it there.
It also deals with a number of "cross cutting concerns" so I will deal with
each individually and give a sort of road-map of our thinking for JBoss5.
First here's a critique of the farm service.
The major concern with the farm service is that it lacks the ability to stage deployments.
1) It just distributes the deployments on a best-effort basis across the cluster,
with no reconciliation of whether the deployment works on each peer.
2) There is the potential that every node in the cluster could be redeploying the
application, making it generally unavailable.
3) There are race conditions where two nodes could deploy things in different orders
based on farm deployments to different nodes "at the same time".
The first direction we will be taking in JBoss5 is to make notions such as
farm, singleton, replication, federation, etc. aspects of each deployment, rather than
having different folders for these notions.
This allows the features to be "mixed and matched" more easily.
e.g. being able to farm a singleton where you deploy it to one node, it is distributed
to all nodes but only activated on one.
This is our "aspectized" deployment framework; here are the original requirements
docs if you are not familiar with it:
These features will be configurable on individual services, not just whole deployments
so you can have for example an EAR that has some components as singletons
and others doing what farm does today.
The second direction is versioned deployment which will eventually lead
to "atomic deployments".
The original aim of this feature is to allow a reference machine (not a production
machine) to be configured from a known production configuration (e.g. version 5).
When the admin is happy with the new config, he can make it the production config (version 6).
At any point, if problems are found, the admin can "rollback" to a known config version.
On Atomic Deployments. The aim here is to replace
redeploy == undeploy/deploy
with something as near-atomic and undisruptive as possible.
The redeploy will become something like:
* See whether the deployment can be constructed
* If it works, wait for current requests to complete and hold new requests (this is called the "valve")
* Once current requests are done, switch from the old to the new deployment, e.g. flip
the jndi bindings and other outward-facing references - this may also require some
handoff of state, e.g. handing over cache/session objects
* Remove the old/replaced deployments
NOTE: Because of classloading requirements, the cache/session handover
will almost certainly require some form of serialization: either passivation of the
sessions to disk or replication from the other cluster nodes.
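The valve/drain/flip sequence above can be sketched in a few lines of Java. This is a minimal illustration under my own assumptions: the Deployment interface and the class are hypothetical, and a real implementation would also cover timeouts and the state handover mentioned in the NOTE.

```java
public class ValveRedeploy {
    // Hypothetical handle on a constructed deployment.
    interface Deployment { void undeploy(); }

    private Deployment active;
    private boolean valveClosed = false;
    private int inFlight = 0;

    public synchronized Deployment beginRequest() throws InterruptedException {
        while (valveClosed) wait();      // the "valve": hold new requests during a switch
        inFlight++;
        return active;
    }

    public synchronized void endRequest() {
        inFlight--;
        notifyAll();                     // wake a redeploy waiting for the drain
    }

    public synchronized void redeploy(Deployment candidate) throws InterruptedException {
        // Constructing the candidate happened before this call; if construction
        // failed we never get here and the old deployment keeps serving.
        valveClosed = true;              // hold new requests
        while (inFlight > 0) wait();     // wait for current requests to complete
        Deployment old = active;
        active = candidate;              // flip the outward-facing reference
        if (old != null) old.undeploy(); // remove the old/replaced deployment
        valveClosed = false;
        notifyAll();                     // reopen the valve
    }
}
```

Because the candidate is fully built before the valve closes, the window where requests are held is only the drain plus the reference flip, not a full undeploy/deploy cycle.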
On Transactional Deployment.
This is where you want the deployment to work on all nodes or none at all.
In general this will require the atomic deployment described above, but it also
needs to take staging and recovery features into account; both should be pluggable.
The most obvious solution is to take each machine to the point where we
know the new deployment will work, but then flip each one individually in a round-robin fashion.
This means only one node is not serving requests and state can be retrieved
from the rest of the cluster.
This is a bit more complicated, but again a simple solution is to use the
elected cluster "co-ordinator" as a reference (assuming you don't have a reference
node as described above in the version processing).
If the co-ordinator can deploy something and another node fails, this will generally
mean that the failing node has got out of sync or has some other problem.
The obvious solution then is for that machine to be automatically restarted so
it can recover its deployment state from the correct versioned deployment,
either from the co-ordinator or the reference node.
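A rough Java sketch of that prepare-everywhere-then-flip-one-at-a-time scheme, with the reference (or co-ordinator) node flipping first and out-of-sync nodes restarted, might look like this. The ClusterNode type and its method names are illustrative assumptions, not real clustering APIs.

```java
import java.util.List;

public class RoundRobinFlip {
    // Hypothetical view of a cluster member.
    interface ClusterNode {
        boolean prepare(int version);   // construct the new deployment, don't activate it
        boolean flip(int version);      // switch from the old to the new version
        void restart(int version);      // restart and resync from the reference node
    }

    /** Returns true if the cluster ends up on the new version. */
    public static boolean rollOut(int version, ClusterNode reference, List<ClusterNode> others) {
        // Phase 1: take every machine to the point where we know the deployment works.
        if (!reference.prepare(version)) return false;
        for (ClusterNode n : others) {
            if (!n.prepare(version)) return false;  // abort before anyone flips
        }
        // Phase 2: the reference node flips first...
        if (!reference.flip(version)) return false;
        // ...then the rest, one at a time, so at most one node is out of
        // service. A failure here means that node is out of sync, so it is
        // restarted to recover its state from the reference node.
        for (ClusterNode n : others) {
            if (!n.flip(version)) n.restart(version);
        }
        return true;
    }
}
```

Since only one node is mid-flip at any moment, requests can fail over to the rest of the cluster and session state remains retrievable, as described above.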
Like I said, a lot of this should be subject to policy.
e.g. in a cluster you don't need to hold new requests while the switch is taking place,
instead you can force them to failover to a machine that is not currently making a
switch between deployment versions.
e.g. you may not be bothered about some state surviving redeployment
Finally, the cross-cutting concerns and deployment versioning must eventually allow
for node-specific configuration where the nodes are heterogeneous.
In this case, some nodes will need their own local configuration
overrides because they are not as capable as other nodes.
You can also imagine nodes co-operating more to manage resources.
e.g. if one node has idle db connections while another is running out,
the idle connections could be closed in favour of the busier node.
But we are getting slightly off the topic of deployment here. :-)
All of these features are going to be available in JBoss 5.
However, we may need something simple(r) for 3.x and 4.x.
I'm proposing the following:
- On deployment, call _deploy() across the cluster
This returns a list of values; if one of them is an exception, showing that the deployment was not successful, we call _undeploy() across the cluster
- This is configurable, e.g. through an attribute "atomicDeployments" on the FarmService, which is off by default
This solution would be trivial to implement. Do you guys think this adds value to Farming ?
Without proper versioning, though, _deploy can be problematic as well, since there is no guarantee that _undeploy will succeed afterwards if the results are not all successful.
We will need _rollback instead. Of course, without implementing the full feature, the trade-off is deciding where to draw the line.
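For discussion, here is roughly what that simple(r) proposal amounts to in code. The _deploy/_undeploy names mirror the post; the FarmMember type, the constructor flag, and the exception-in-result-list convention are my illustrative assumptions, and as noted above the compensating _undeploy is best-effort only.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleAtomicFarm {
    // Hypothetical view of a farm member; not a real FarmService interface.
    interface FarmMember {
        Object _deploy(String archive);   // returns a result, or an Exception on failure
        void _undeploy(String archive);
    }

    private final boolean atomicDeployments;  // off by default, as proposed

    public SimpleAtomicFarm(boolean atomicDeployments) {
        this.atomicDeployments = atomicDeployments;
    }

    /** Returns true if the deployment stands on all members. */
    public boolean deploy(String archive, List<FarmMember> members) {
        List<Object> results = new ArrayList<>();
        for (FarmMember m : members) {
            results.add(m._deploy(archive));
        }
        boolean failed = results.stream().anyMatch(r -> r instanceof Exception);
        if (failed && atomicDeployments) {
            // Best effort only: _undeploy itself may fail, which is why the
            // thread argues for a real _rollback backed by versioning.
            for (FarmMember m : members) m._undeploy(archive);
            return false;
        }
        return !failed;
    }
}
```

This is trivial to implement, which is its whole appeal; the open question is whether a compensating _undeploy without versioning gives enough of a guarantee to be worth the switch.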
I would like to help with the "aspectized" deployment framework implementation, can you assign some tasks to me?
The initial tasks are here:
I know Dimitris/Scott started on the infrastructure, but they have been (like me)
working on higher-priority stuff for JBoss4, so I don't know what state
this is in? Probably nonfunctional?
I plan to do the cutover to the new MicroKernel in JBoss5 over the next
month, so this should start dropping into place.
http://jira.jboss.com/jira/browse/JBAS-1841 + others
The key part that really needs doing is the
VFS - virtual file system
VDF - virtual deployment framework
to better integrate the classloading and deployment framework.
See the discussions in the POJO Server forum, Scott might have
some extra thoughts?
Or knowing him, an uncommitted prototype lying around on his disk :-)