Version 9

    Introduction

     

    The aim of this wiki is to explain different strategies for upgrading the JGroups jar file and/or JGroups stack in AS 4.0.x servers without a complete cluster shutdown. Rolling upgrades allow individual cluster nodes to be shut down, upgraded and restarted without affecting the remaining cluster nodes. As nodes are upgraded and restarted, they should see each other and form a new cluster that runs in parallel with the old one. Once all nodes have been upgraded, they should all be members of a single cluster again.

     

    The wiki is structured around different use cases, e.g. with or without a Gossip Router, UDP or TCP, etc.

     

    Scenario 1 - 2.2.7 to 2.4.3.GA upgrade with TCP, TCPGOSSIP and Gossip Router in 4.0.4.GA with EJB2 deployments only.

     

    Assumptions

    The upgrade instructions below make the following assumptions and use them as a starting point:

     

    • Assumption 1. There are two JGroups 2.2.7 Gossip Router instances running on a separate machine (IP address x.x.x.x) using ports 20082 and 20083:

    # Note: Linux/Unix style script
    CLASSPATH="./jg227/dist/jgroups-all.jar:./jg227/lib/commons-logging.jar:./jg227/lib/log4j-1.2.6.jar"
    java -cp $CLASSPATH org.jgroups.stack.GossipRouter -port 20082 &
    java -cp $CLASSPATH org.jgroups.stack.GossipRouter -port 20083 &

     

    • Assumption 2. The cluster consists of 3 nodes (node01, node02 and node03), although it could be any number, and their current JGroups section in deploy/cluster-service.xml looks something like this:

    <Config>
       <TCP loopback="true"></TCP>
       <TCPGOSSIP initial_hosts="x.x.x.x[20082],x.x.x.x[20083]" gossip_refresh_rate="10000" num_initial_members="3"></TCPGOSSIP>
       <MERGE2 min_interval="5000" max_interval="10000"></MERGE2>
       <FD shun="true" timeout="10000" max_tries="5" up_thread="true" down_thread="true" ></FD>
       <VERIFY_SUSPECT timeout="8000" down_thread="false" up_thread="false" ></VERIFY_SUSPECT>
       <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100" retransmit_timeout="3000"></pbcast.NAKACK>
       <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" ></pbcast.STABLE>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"
                   down_thread="true" up_thread="true"></pbcast.GMS>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"></pbcast.STATE_TRANSFER>
    </Config>

     

    • Assumption 3. Clients perform JNDI lookups on these EJB2 beans using the partition auto discovery feature explained in the HAJNDI usage in cluster wiki. It is also assumed that the current cluster uses the default partition name, DefaultPartition.

     

    Upgrade Steps

     

    The aim of the upgrade is to get these nodes using a more up to date JGroups jar (2.4.3.GA) and a more up to date JGroups stack:

     

    Step 1. Download the latest JGroups 2.4.x distribution (at the time of writing 2.4.3.GA) and unzip it on the cluster nodes and on the Gossip Router machine.

     

    Step 2. Start two new Gossip Router instances that run the JGroups 2.4.x libraries on ports 20182 and 20183. Example:

    # Note: Linux/Unix style script
    CLASSPATH="./jg243/dist/jgroups-all.jar:./jg243/lib/commons-logging.jar:./jg243/lib/log4j-1.2.6.jar:./jg243/lib/concurrent.jar"
    java -cp $CLASSPATH org.jgroups.stack.GossipRouter -port 20182 &
    java -cp $CLASSPATH org.jgroups.stack.GossipRouter -port 20183 &
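
Before moving any node over, it can help to confirm that both new routers are accepting connections. The following is a minimal sketch and not part of the original procedure: `port_open` is a hypothetical helper, `ROUTER_HOST` stands in for the x.x.x.x address used above, and the check assumes that bash's `/dev/tcp` redirection and the coreutils `timeout` command are available.

```shell
#!/usr/bin/env bash
# Placeholder for the Gossip Router machine (x.x.x.x in the text); an assumption.
ROUTER_HOST="${ROUTER_HOST:-127.0.0.1}"

# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Report the state of the two new Gossip Router ports.
for port in 20182 20183; do
    if port_open "$ROUTER_HOST" "$port"; then
        echo "Gossip Router port $port: listening"
    else
        echo "Gossip Router port $port: NOT listening"
    fi
done
```

An open port only shows that something is listening there; it does not prove that the process is a 2.4.x Gossip Router.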

     

    Step 3. Stop node01 and modify the <Config> section in deploy/cluster-service.xml to look like the following. Note that this configuration uses a different start_port, 8000, instead of the default, 7800. This is highly recommended if nodes that join different clusters will be started on the same machine:

    <Config>
       <TCP loopback="true" 
            recv_buf_size="20000000"
            send_buf_size="640000"
            discard_incompatible_packets="true"
            use_incoming_packet_handler="true"
            use_outgoing_packet_handler="false"
            down_thread="false" up_thread="false"
            use_send_queues="false"
            sock_conn_timeout="300"
            skip_suspected_members="true"
            start_port="8000"></TCP>
       <TCPGOSSIP initial_hosts="x.x.x.x[20182],x.x.x.x[20183]" 
                  gossip_refresh_rate="10000" num_initial_members="3"
                  down_thread="false" up_thread="false"></TCPGOSSIP>
       <MERGE2 min_interval="5000" max_interval="10000"
               down_thread="false" up_thread="false"></MERGE2>
       <FD_SOCK down_thread="false" up_thread="false"></FD_SOCK>
       <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"></FD>
       <VERIFY_SUSPECT timeout="8000" down_thread="false" up_thread="false"></VERIFY_SUSPECT>
       <pbcast.NAKACK max_xmit_size="60000"
                      use_mcast_xmit="false" gc_lag="0"
                      retransmit_timeout="300,600,1200,2400,4800"
                      down_thread="false" up_thread="false"
                      discard_delivered_msgs="true"></pbcast.NAKACK>
       <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                      down_thread="false" up_thread="false"
                      max_bytes="400000"></pbcast.STABLE>
       <pbcast.GMS print_local_addr="true" join_timeout="3000"
                   down_thread="false" up_thread="false"
                   join_retry_timeout="2000" shun="true"
                   view_bundling="true"></pbcast.GMS>
       <FRAG2 frag_size="60000" down_thread="false" up_thread="false"></FRAG2>
       <pbcast.STATE_TRANSFER down_thread="false" up_thread="false"></pbcast.STATE_TRANSFER>
    </Config>
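
Since the edited stack must point at the new routers and must not mention the old ones, a quick textual check before restarting the node can catch a missed edit. This is only a sketch: `check_stack_config` is a hypothetical helper, the file path in the usage comment is an assumption about the server layout, and the check is purely textual, so an old port left in an XML comment would also be flagged.

```shell
# Hypothetical helper: sanity-check an edited cluster-service.xml before restart.
# Prints a message and returns non-zero on the first problem found.
check_stack_config() {
    cfg="$1"
    # Both new Gossip Router ports must appear in the file.
    for port in 20182 20183; do
        if ! grep -qF "[$port]" "$cfg"; then
            echo "new Gossip Router port $port not found in $cfg"
            return 1
        fi
    done
    # The old router ports must no longer be referenced.
    if grep -qF -e '[20082]' -e '[20083]' "$cfg"; then
        echo "old Gossip Router port still referenced in $cfg"
        return 1
    fi
    # The new stack moves TCP from the default start_port 7800 to 8000.
    if ! grep -qF 'start_port="8000"' "$cfg"; then
        echo "start_port=\"8000\" not found in $cfg"
        return 1
    fi
    echo "$cfg looks consistent with the new stack"
}

# Usage (path is an assumption about the server layout):
# check_stack_config "$JBOSS_HOME/server/all/deploy/cluster-service.xml"
```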

     

    Step 4. Overwrite jgroups.jar in the server's lib directory with jgroups-all.jar from the 2.4.x distribution.
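
The jar swap can be sketched as below, keeping a copy of the 2.2.7 jar in case a node has to be rolled back. `replace_jgroups_jar` is a hypothetical helper and all paths in the usage comment are assumptions. The backup is deliberately written outside the lib directory so that only one JGroups jar is present there.

```shell
# Hypothetical helper: keep a copy of the old jar, then overwrite it.
#   $1 = server lib directory, $2 = new jgroups-all.jar, $3 = backup directory
replace_jgroups_jar() {
    lib_dir="$1"; new_jar="$2"; backup_dir="$3"
    if [ ! -f "$new_jar" ]; then
        echo "new jar not found: $new_jar" >&2
        return 1
    fi
    mkdir -p "$backup_dir" || return 1
    # Back up outside lib/ so the old jar cannot end up on the server classpath.
    if [ -f "$lib_dir/jgroups.jar" ]; then
        cp -p "$lib_dir/jgroups.jar" "$backup_dir/jgroups.jar" || return 1
    fi
    # Keep the file name jgroups.jar; only the contents change.
    cp "$new_jar" "$lib_dir/jgroups.jar"
}

# Usage (paths are assumptions; run only while the node is stopped):
# replace_jgroups_jar "$JBOSS_HOME/server/all/lib" ./jg243/dist/jgroups-all.jar "$HOME/jgroups-2.2.7-backup"
```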

     

    Step 5. Add -Djgroups.marshalling.compatible=true to the startup system properties and restart node01, passing a partition name different from the one used by the old cluster (e.g. -g DefaultPartition2). The logs should show that node01 is the only node in the new cluster, while node02 and node03 should still form the old cluster. Any clients wanting to use the new cluster need to adjust their JNDI settings to point to the new partition. This allows clients to be migrated progressively to the new cluster.
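
As an illustration, the restart could look like the command printed by the sketch below. The install path and the `all` server configuration are assumptions, and `start_cmd` is a hypothetical helper that only prints the command so it can be reviewed before being run.

```shell
# Assumed install location and server configuration; adjust to the environment.
JBOSS_HOME="${JBOSS_HOME:-/opt/jboss-4.0.4.GA}"
SERVER_CONFIG="${SERVER_CONFIG:-all}"

# Hypothetical helper: print the start command for a node joining the given partition.
start_cmd() {
    printf '%s/bin/run.sh -c %s -g %s -Djgroups.marshalling.compatible=true\n' \
        "$JBOSS_HOME" "$SERVER_CONFIG" "$1"
}

# Print the command for review instead of executing it.
start_cmd DefaultPartition2
```

On the client side, lookups that rely on auto discovery would presumably select the new partition through the `jnp.partitionName` JNDI property (for example `jnp.partitionName=DefaultPartition2`); treat this as an assumption to verify against the HAJNDI documentation for the AS version in use.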

     

    Step 6. Repeat steps 3, 4 and 5 with node02. The logs should show node01 and node02 as part of the new cluster and node03 as the only remaining node in the old cluster.

     

    Step 7. Repeat steps 3, 4 and 5 with node03. The logs should show node01, node02 and node03 as part of the new cluster.
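
The log checks in steps 5 to 7 can be partly scripted. The sketch below assumes that view changes appear in the server log as lines containing "New cluster view for partition <name> (", which should be confirmed against an actual log before relying on it. `last_cluster_view` is a hypothetical helper, and the log path in the usage comment is an assumption.

```shell
# Hypothetical helper: print the most recent cluster view logged for a partition.
#   $1 = server log file, $2 = partition name
# Assumes view changes are logged as "New cluster view for partition <name> (...".
last_cluster_view() {
    # The trailing " (" keeps DefaultPartition from also matching DefaultPartition2.
    grep -F "New cluster view for partition $2 (" "$1" | tail -n 1
}

# Usage (log path is an assumption about the server layout):
# last_cluster_view "$JBOSS_HOME/server/all/log/server.log" DefaultPartition2
```

After step 7, the last view printed for DefaultPartition2 on any node should list all three nodes.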

     

    Step 8. Stop Gossip Router instances on ports 20082 and 20083.