The problem is that the read/writeExternal change in the TreeNode is a major optimisation to prevent a recursive writing of the entire tree structure when only specific nodes needed writing.
Yes, the old Externalizable implementation was very inefficient. I wouldn't advocate going back to it as the default way of doing state transfer. But the StateTransferVersion config option and the related factory pattern for generating/integrating transferred state make that unnecessary. If a > 1.2.3 instance of TreeCache wants to do a state transfer, I think it should by default use the more efficient approach. Only if the cache has its StateTransferVersion set to 123 will it use the old inefficient externalization technique. This would be something users would have to specifically configure to allow interoperability, at the cost of performance.
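To make the version-keyed selection concrete, here is a minimal Java sketch of the idea. Every type and method name in it (StateTransferGenerator, DefaultGenerator, LegacyGenerator, StateTransferFactory, forVersion) is hypothetical and not taken from the JBossCache source; the only detail from this thread is that a StateTransferVersion of 123 selects the old format and anything else gets the efficient default.

```java
// Hypothetical sketch: these names are illustrative, not the JBossCache API.
interface StateTransferGenerator {
    // Returns a label for the wire format this generator would produce.
    String formatName();
}

// Stand-in for the efficient format used by default.
class DefaultGenerator implements StateTransferGenerator {
    public String formatName() {
        return "default";
    }
}

// Stand-in for the old recursive Externalizable format, kept only for
// interoperability with 1.2.3 peers.
class LegacyGenerator implements StateTransferGenerator {
    public String formatName() {
        return "legacy-1.2.3";
    }
}

final class StateTransferFactory {
    // The value users would configure to interoperate with 1.2.3.
    static final short LEGACY_VERSION = 123;

    private StateTransferFactory() {
    }

    // Only an explicit 123 selects the legacy format; every other value
    // falls through to the efficient default.
    static StateTransferGenerator forVersion(short stateTransferVersion) {
        if (stateTransferVersion == LEGACY_VERSION) {
            return new LegacyGenerator();
        }
        return new DefaultGenerator();
    }
}
```

The point of routing everything through one factory is that the legacy code path has a single, clearly marked entry point, which should make it harder for other callers to start depending on it.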
I looked at the current TreeNode, and it doesn't implement Externalizable (or even Serializable). So, there is no other code in JBossCache that is depending on a particular serialization format for this class. If the old inefficient read/writeExternal were restored, the only use of it would be for 1.2.3-compatible state transfer. (But we should comment the hell out of the method to ensure no other usage creeps in!).
My purpose in starting this thread was to document the technical issues related to version incompatibility in order to facilitate policy discussions about what to do about it. But I'll go ahead and post a "policy" related comment :)
I initially wasn't particularly in favor of trying to restore compatibility, since the fix is fairly ugly and doing it would leave the 1.2.4 releases "marooned", incompatible with the versions both before and after.
But, if the 4.0 series is like 3.2, it could be up to another couple of years before we stop cutting releases on that branch. Having JBossCache stuck at version 1.2.3 in those releases would IMHO be a very bad thing for the health of the project and for sure will drive up support costs. If it was just a matter of waiting another few months until 5.0 comes out and then the issue goes away, I'd feel differently.
We have to support 3.2.x for 2 years after the release of 5.0. The first step is to get a wire format that will allow for evolution while maintaining backward compatibility. If this can be done and restore backward compatibility, great. If it cannot, we can consider a one-time incompatibility to allow for improved behavior and supportability.
Scott, when you talk about this one-time incompatibility deal, you are not referring specifically to 3.2.x only, right? Just need clarification, since we are not planning to update 3.2.x with a new JBossCache release.
Perhaps a 'one-time' incompatibility to break the readExternal/writeExternal wire formats in 1.2.3? And since this would break anyway, stick with the Node being an Interface change as well? I don't foresee either of these changing again for the foreseeable future, but perhaps this is what we should discuss here as well.
I have added this as an item to be discussed during our Neuchatel meeting in 2 weeks. I will schedule a conf call after the meeting to discuss our findings and suggest a strategy going forward.
I was able to check out 1.2.4SP1 and resolve the four issues listed in my first post. Ran the "all-functionaltests" target from the test suite and saw no regressions. I then did some (successful) manual interoperability testing in JBossAS as follows:
1) Start an instance of 4.0.3SP1; it uses 220.127.116.11.
2) Drop the new jboss-cache.jar in /server/all/lib in my Branch_4_0 build module
3) Update the build module's tc5-cluster-service.xml to set attribute "StateTransferVersion" to "123".
4) Run the http session replication unit tests in the test suite with REPL_ASYNC. This launches two 4.0.4RC1 instances, which saw and formed a cluster with the 4.0.3SP1 instance. Tests passed, saw no errors in the logs of the 4.0.3SP1 server.
5) Reconfigured and re-ran with REPL_SYNC. All OK.
6) 4.0.3SP1 server was still running, so started another 4.0.4RC1 instance. From the 4.0.3SP1 server it successfully received the left-over state from the unit tests.
7) Restarted the 4.0.3SP1 server. It successfully received state from the 4.0.4 instance.
This isn't a full, formal interoperability test, but it shows the basic functions working fine between the versions.
Great! When you're done, can we get doco on this (in DocBook and wiki format)?
This is great!
1. Have you tried to run the AS compatibility JUnit test?
2. Regarding the flag: since it also covers the Fqn read/writeExternal, can we use a more generic name, say, "ReplicationVersion"?
1) Not yet, although I expect no problems there.
2) Good idea, although the flag already exists in 1.2.4SP1. But I guess there is no harm in renaming it. Right now values are string versions of shorts: "123", "124", "1241", "130" etc. Could replace w/ something more string-like, "1_2_3", "1_2_4_SP1" etc., and convert to short internally. Let me know if you want that.
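As a rough illustration of the "convert to short internally" idea, here is a Java sketch. The class and method names (VersionCodec, toShort) are hypothetical, and the parsing rules are an assumption inferred only from the example values above: three single-digit components, with an optional "SPn" suffix appended as a fourth digit. It is one possible encoding, not an agreed convention.

```java
// Hypothetical helper: name and parsing rules are assumptions for illustration.
final class VersionCodec {
    private VersionCodec() {
    }

    // Compacts "1_2_3" to 123 and "1_2_4_SP1" to 1241.
    static short toShort(String version) {
        String[] parts = version.split("_");
        if (parts.length < 3 || parts.length > 4) {
            throw new IllegalArgumentException("Unsupported version: " + version);
        }
        StringBuilder digits = new StringBuilder();
        // Major, minor and micro must each fit in one decimal digit.
        for (int i = 0; i < 3; i++) {
            int n = Integer.parseInt(parts[i]);
            if (n < 0 || n > 9) {
                throw new IllegalArgumentException("Component out of range: " + version);
            }
            digits.append(n);
        }
        // An optional service pack suffix such as "SP1" becomes a fourth digit.
        if (parts.length == 4) {
            String qualifier = parts[3];
            if (!qualifier.startsWith("SP")) {
                throw new IllegalArgumentException("Unknown qualifier: " + version);
            }
            int sp = Integer.parseInt(qualifier.substring(2));
            if (sp < 1 || sp > 9) {
                throw new IllegalArgumentException("Service pack out of range: " + version);
            }
            digits.append(sp);
        }
        // At most four digits, so the result always fits in a short.
        return Short.parseShort(digits.toString());
    }
}
```

With these rules, toShort("1_2_3") gives 123, toShort("1_2_4_SP1") gives 1241 and toShort("1_3_0") gives 130, matching the values in use today. Non-SP qualifiers and multi-digit components are rejected rather than guessed at.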
See the updated versioning conventions:
We should be using a string that is compatible with these conventions so that any version manipulation utilities can be applied. How a version string gets compacted to a short is one such utility function that needs consistent handling.