4 Replies Latest reply on Jul 1, 2010 5:51 AM by chrismeadows

    Unexpected failover behaviour in distributed transaction

    chrismeadows

      JBoss AS 5.1.0 GA

      JDK 1.6.0  update 10 64 bit

      Ubuntu 8.0.4 server 64 bit

       

      Hi there,

       

      I have a  two node JBoss cluster, each node on a separate server. I have a  non-clustered SLSB JMS client inside the container sending a JMS message  to a queue that is consumed by a clustered MDB. The MDB inserts a row  into a database. The JMS delivery and database insertion are governed by  CMP and CMT with an XA datasource. This all works fine, I see the MDBs  loadbalancing the queue consumption across the nodes in the cluster.

       

      However, if  I kill -9 one of the nodes, the failover doesn't behave as I would  expect. That is, I would expect transactions that hadn't completed the  atomic action (of delivering a message and inserting a row into the  database) would failover to the remaining node and complete  successfully. This is not what I see.

       

      If I send (say) 500 messages, then  initiate failover, the JBoss logs detect the node failure and indicate  that failover is complete. But not all the messages failover, and I  don't end up with the expected number of rows inserted in the database.  The JBM_MSG and JBM_MSG_REF tables still have data in them.

       

      If I saw no  failover at all, I'd suspect my own understanding of what I'm doing. But  the fact that some messages/databases inserts failover and some don't  lead me to believe that I've either got some screwy JB config, or hit a  JB defect. The XA concept appears fine (that is, I don't see situations  where a message is delivered and a db insert doesn't occur or vice  versa)

       

      The  JMSClient uses the jmsXA ConnectionFactory. Each JBoss node has a different ServerPeerID (0 and 99)

       

      I can upload a full sample maven project if  need be, but here are the relevant code portions.

       

      The JMS Client is on node 1 and is not clustered

       

       

      {code}

      @Stateless

      @LocalBinding(jndiBinding  = "Sender/local")

      @RemoteBinding(jndiBinding = "Sender/remote")

      public  class Sender implements SenderLocal, SenderRemote {


          ConnectionFactory connectionFactory;

          Queue queue;


          private  Log log = LogFactory.getLog(getClass());


          public Sender() {

          }


          public  void requestProduct(int iMessage) throws Exception {

              Context c = getInitialContext();

              connectionFactory =    (ConnectionFactory)c.lookup(JNDIConstants.JMS_CONNECTION_FACTORY);

              queue = (Queue)  c.lookup(JNDIConstants.MDBCLUSTERTEST_QUEUE_JNDI);

              requestProductImpl(iMessage);

          }

       

          private Context  getInitialContext() {

              InitialContext ic = null;

              try {

                  Properties p = new Properties();

                  p.put(Context.INITIAL_CONTEXT_FACTORY,   "org.jnp.interfaces.NamingContextFactory");

                  p.put(Context.URL_PKG_PREFIXES,  "org.jboss.naming:org.jnp.interfaces");

                  p.put(Context.PROVIDER_URL, "10.3.94.132" + ":1100"  +  "," +  "10.3.94.122" + ":1100");

                  ic = new InitialContext(p);

              } catch (NamingException e) {

                  e.printStackTrace();

              }

              return ic;

          }

       

          public void  requestProductImpl(int iMessage) throws Exception {

              Connection connection = null;

              Session session = null;

              MessageProducer sender = null;

              try {

                  connection = connectionFactory.createConnection();

                  session = connection.createSession(true,  Session.AUTO_ACKNOWLEDGE);

                  ObjectMessage m = session.createObjectMessage();

                  FooMessage message = new FooMessage();

                  message.sMessage = "Message: " + iMessage;

                  m.setObject(message);

                  sender =  session.createProducer(queue);

                  sender.send(m);

                  log.info(message.sMessage + " sent to queue");

              }

              catch (RuntimeException e) {

                   log.error("RuntimeException caught at sender: " +  e.getMessage(), e );

                   throw e;

              }

              catch (Exception e) {

                   log.error("Exception caught at sender: " +  e.getMessage(), e );

                   throw e;

              }

              finally {

                  if(  sender != null ) {

                      sender.close();

                  }

                  if( session != null ) {

                       session.close();

                  }

                  if( connection !=  null ) {

                      connection.close();

                  }

               }

          }

      }

      {code}


       

      and the MDB receiver is on both nodes and is clustered (via the queue and postoffice) config

       

      {code}
      @MessageDriven(activationConfig = {
              @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
              @ActivationConfigProperty(propertyName = "destination", propertyValue = JNDIConstants.MDBCLUSTERTEST_QUEUE_JNDI) })
      public class Receiver implements MessageListener {

          static int iCount = 0;
          private static Log log = LogFactory.getLog(Receiver.class);

          @Resource
          private MessageDrivenContext mdc;
         
          @EJB(mappedName = JNDIConstants.ACCOUNTING_LOG_SERVICE_LOCAL_JNDI)
          AccountingLogService acclog;
         
          @TransactionAttribute(TransactionAttributeType.REQUIRED)
          public void onMessage(Message m) {
              String sMessage = null;
              try {
                  ObjectMessage tm = (ObjectMessage) m;

                  if (!(tm.getObject() instanceof FooMessage)) {
                      throw new IllegalArgumentException("message is not a FooMessage: " + tm.getObject().getClass());
                  }

                  sMessage = ((FooMessage)tm.getObject()).sMessage;
                 
                  log.info("Received: " + sMessage + " iCount = " + ++iCount );

                  acclog.log(new AccountingLogEntryBean(sMessage));
                  log.info("Persisted: " + sMessage + " iCount = " + iCount );                    
              } catch (Exception e) {
                  mdc.setRollbackOnly();
                  log.info("Rolling Back receiver: " + sMessage + " : " + e.getMessage(), e);
                  throw new RuntimeException(e);
              }
          }
      }
      {code}

       


      The queue is nothing unusual

       

       

      {code}

      <mbean code="org.jboss.jms.server.destination.QueueService

            name="jboss.messaging.destination:service=Queue,name=MDBClusterTestQueue"

            xmbean-dd="xmdesc/Queue-xmbean.xml">

            <depends optional-attribute-name="ServerPeer">jboss.messaging:service=ServerPeer</depends>

            <depends>jboss.messaging:service=PostOffice</depends>

            <attribute name="DLQ">jboss.messaging.destination:service=Queue,name=MDBClusterTestQueueDLQ</attribute>

            <attribute name="ExpiryQueue">jboss.messaging.destination:service=Queue,name=MDBClusterTestQueueExpiryQueue</attribute>

            <attribute name="RedeliveryDelay">5000</attribute>

            <attribute name="Clustered">true</attribute>
      </mbean>

      {code}

       

       

      The  AccountingLog is just a simple DAO facade

       

       

      {code}

      @Stateless

      @LocalBinding(jndiBinding = JNDIConstants.ACCOUNTING_LOG_SERVICE_LOCAL_JNDI)

      @RemoteBinding(jndiBinding = JNDIConstants.ACCOUNTING_LOG_SERVICE_REMOTE_JNDI)

      public class AccountingLogServiceBean implements AccountingLogServiceRemote, AccountingLogServiceLocal {

       

          @PersistenceContext(unitName = "AccountingLogService")

          private EntityManager em;

       

          static Log log = LogFactory.getLog(AccountingLogServiceBean.class);

       

          public AccountingLogServiceBean() {

          }

       

          @SuppressWarnings("unchecked")

          @TransactionAttribute(TransactionAttributeType.REQUIRED)

          public void log(AccountingLogEntryBean entry)  {

              em.persist(entry);

          }
      }

      {code}


      The  persistence xml file uses JTA and an XA datasource

       

      {code}

      <persistence xmlns="http://java.sun.com/xml/ns/persistence"

          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

          xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"

          version="1.0">

          <persistence-unit name="AccountingLogService" transaction-type="JTA">

              <jta-data-source>java:/OracleXADS</jta-data-source>

              <class>com.foo.mdbcluster.common.entity.AccountingLogEntryBean</class>

              <properties>

                  <property name="hibernate.dialect"

                      value="org.hibernate.dialect.Oracle10gDialect" />

                  <property name="hibernate.hbm2ddl.auto" value="update" />

              </properties>

          </persistence-unit>
      </persistence>

      {code}

       

       

      The XA  datasource itself is

       

       

      {code}

      <datasources>

        <xa-datasource>

          <jndi-name>OracleXADS</jndi-name>

          <track-connection-by-tx></track-connection-by-tx>

          <isSameRM-override-value>false</isSameRM-override-value>

          <xa-datasource-property name="URL">jdbc:oracle:oci:@(DESCRIPTION=(LOAD_BALANCE=YES)(FAILOVER=ON)(ADDRESS=(PROTOCOL=TCP)(HOST=10.3.94.135)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=10.3.94.136)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=FOO_TAF)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15))))</xa-datasource-property>

          <xa-datasource-class>oracle.jdbc.xa.client.OracleXADataSource</xa-datasource-class>

          <xa-datasource-property name="User">xxxx</xa-datasource-property>

          <xa-datasource-property name="Password">xxxx</xa-datasource-property>

          <no-tx-separate-pools></no-tx-separate-pools>

          <prepared-statement-cache-size>100</prepared-statement-cache-size>

          <share-prepared-statements>true</share-prepared-statements>

          <track-statements>warn</track-statements>

          <min-pool-size>140</min-pool-size>

          <max-pool-size>140</max-pool-size>

          <blocking-timeout-millis>180000</blocking-timeout-millis>

          <transaction-isolation>TRANSACTION_READ_COMMITTED</transaction-isolation>

          <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
          <!-- corresponding type-mapping in the standardjbosscmp-jdbc.xml -->

          <metadata>

            <type-mapping>Oracle9i</type-mapping>

          </metadata>

        </xa-datasource>

      </datasources>

      {code}

       

       

      The JMS oracle-persistence-service is plumbed to the same XA datasource, and is using a clustered post office.

       

      For testing convenience, I have a Service  MBean for invoking the JMS Client

       

      {code}

      @Service( objectName = JNDIConstants.TESTER_SERVICE)

      @Management(MDBClusterTestManagement.class)

      public class MDBClusterTest implements MDBClusterTestManagement {

       

          private Log log = LogFactory.getLog(getClass());

          @EJB( mappedName = "Sender/local")

          SenderLocal sender;

          Context c = null;


          ConnectionFactory connectionFactory;   

          public void startTest(int iNumMessages) {

              c = getInitialContext();

              try {

                  connectionFactory =  (ConnectionFactory)c.lookup(JNDIConstants.JMS_CONNECTION_FACTORY);

                  for( int i = 0; i < iNumMessages; i++) {

                      try {

                          sender.requestProduct(i);

                      } catch (Exception e) {

                          log.error("Exception at tester level:" + e.getMessage(), e);

                      }
                  }      

              } catch (NamingException e) {

                  e.printStackTrace();

              }

          }

       

          private Context getInitialContext() {

       

              InitialContext ic = null;

              try {

                  Properties p = new Properties();

                  p.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");

                  p.put(Context.URL_PKG_PREFIXES, "org.jboss.naming:org.jnp.interfaces");

                  p.put(Context.PROVIDER_URL, "10.3.94.132" + ":1100"  + "," + "10.3.94.122" + ":1100");

                  ic = new InitialContext(p);

              } catch (NamingException e) {

                  e.printStackTrace();

              }

              return ic;

          }
      }

      {code}


       

      So I have node 1 with an MBean that invokes a JMS client 500 times on node 1. This sends 500 messages load balanced across node 1 and node 2.  During the message consumption, I kill node 2 with kill -9

       

      A sample log  file of node 1 (that remains up) is

       

       

      {code}

      2010-05-10 13:36:06,837 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 0 sent to queue
      2010-05-10 13:36:06,848 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 1 sent to queue
      2010-05-10 13:36:06,851 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 2 sent to queue
      2010-05-10 13:36:06,854 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 3 sent to queue
      ...
      2010-05-10 13:36:07,629 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 497 sent to queue
      2010-05-10 13:36:07,631 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 498 sent to queue
      2010-05-10 13:36:07,632 INFO  [com.foo.mdbcluster.Sender] (http-10.3.94.132-8080-1) Message: 499 sent to queue
      2010-05-10 13:36:09,113 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-9) Received: Message: 6 iCount = 1
      2010-05-10 13:36:09,137 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-15) Received: Message: 12 iCount = 3
      2010-05-10 13:36:09,141 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-14) Received: Message: 11 iCount = 11
      2010-05-10 13:36:09,139 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-10) Received: Message: 7 iCount = 6
      2010-05-10 13:36:09,141 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-4) Received: Message: 1 iCount = 12
      2010-05-10 13:36:09,137 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-13) Received: Message: 10 iCount = 5
      2010-05-10 13:36:09,139 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-17) Received: Message: 14 iCount = 9
      2010-05-10 13:36:09,137 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-6) Received: Message: 3 iCount = 4
      ...
      2010-05-10 13:36:12,764 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-81) Received: Message: 78 iCount = 78
      2010-05-10 13:36:12,766 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-81) Persisted: Message: 78 iCount = 78
      2010-05-10 13:36:12,769 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-82) Received: Message: 79 iCount = 79
      2010-05-10 13:36:12,771 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-82) Persisted: Message: 79 iCount = 79
      2010-05-10 13:36:12,799 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] (VERIFY_SUSPECT.TimerThread,DefaultPartition,10.3.94.132:34771) Suspected member: 10.3.94.122:47594
      2010-05-10 13:36:12,814 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-83) Received: Message: 80 iCount = 80
      2010-05-10 13:36:12,816 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-83) Persisted: Message: 80 iCount = 80
      2010-05-10 13:36:12,818 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-84) Received: Message: 81 iCount = 81
      2010-05-10 13:36:12,818 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-85) Received: Message: 82 iCount = 82
      2010-05-10 13:36:12,821 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-84) Persisted: Message: 81 iCount = 82
      2010-05-10 13:36:12,823 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-85) Persisted: Message: 82 iCount = 82
      2010-05-10 13:36:12,829 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-86) Received: Message: 83 iCount = 83
      2010-05-10 13:36:12,831 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-86) Persisted: Message: 83 iCount = 83
      2010-05-10 13:36:12,847 INFO  [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] (Incoming-8,10.3.94.132:34771) New cluster view for partition DefaultPartition (id: 2, delta: -1) : [10.3.94.132:1099]
      2010-05-10 13:36:12,847 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) I am (10.3.94.132:1099) received membershipChanged event:
      2010-05-10 13:36:12,847 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) Dead members: 1 ([10.3.94.122:1099])
      2010-05-10 13:36:12,847 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) New Members : 0 ([])
      2010-05-10 13:36:12,847 INFO  [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) All Members : 1 ([10.3.94.132:1099])
      2010-05-10 13:36:12,883 INFO  [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@32c98179 got new view [10.3.94.132:34771|2] [10.3.94.132:34771], old view is [10.3.94.132:34771|1] [10.3.94.132:34771, 10.3.94.122:47594]
      2010-05-10 13:36:12,884 INFO  [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) I am (10.3.94.132:34771)
      2010-05-10 13:36:12,903 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-19,10.3.94.132:34771) Received new cluster view: [10.3.94.132:34771|2] [10.3.94.132:34771]
      2010-05-10 13:36:12,904 ERROR [org.jgroups.blocks.ConnectionTable] (ViewHandler,MessagingPostOffice-DATA,10.3.94.132:7900) failed sending data to 10.3.94.122:7900: java.net.SocketException: Socket closed
      2010-05-10 13:36:12,946 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-87) Received: Message: 84 iCount = 84
      2010-05-10 13:36:12,976 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-89) Received: Message: 86 iCount = 85
      2010-05-10 13:36:12,976 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-88) Received: Message: 85 iCount = 85
      2010-05-10 13:36:12,985 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-90) Received: Message: 87 iCount = 86
      2010-05-10 13:36:12,985 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-88) Persisted: Message: 85 iCount = 86
      2010-05-10 13:36:12,985 INFO  [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] (Incoming-15,10.3.94.132:34771) JBoss Messaging is failing over for failed node 99. If there are many messages to reload this may take some time...
      2010-05-10 13:36:12,987 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-90) Persisted: Message: 87 iCount = 86
      2010-05-10 13:36:12,987 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-89) Persisted: Message: 86 iCount = 86
      2010-05-10 13:36:12,987 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-87) Persisted: Message: 84 iCount = 86
      2010-05-10 13:36:12,989 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-91) Received: Message: 88 iCount = 87
      ...
      2010-05-10 13:36:13,781 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-6) Persisted: Message: 123 iCount = 121
      2010-05-10 13:36:13,781 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-28) Persisted: Message: 122 iCount = 121
      2010-05-10 13:36:13,923 INFO  [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] (Incoming-15,10.3.94.132:34771) JBoss Messaging failover completed
      2010-05-10 13:36:13,961 INFO  [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) Dead members: 1 ([10.3.94.122:47594])
      2010-05-10 13:36:13,961 INFO  [org.jboss.messaging.core.impl.postoffice.GroupMember] (Incoming-15,10.3.94.132:34771) All Members : 1 ([10.3.94.132:34771])
      2010-05-10 13:36:14,014 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-23) Received: Message: 124 iCount = 122
      2010-05-10 13:36:14,014 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-33) Received: Message: 125 iCount = 123
      2010-05-10 13:36:14,015 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-33) Persisted: Message: 125 iCount = 123
      2010-05-10 13:36:14,015 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-23) Persisted: Message: 124 iCount = 123
      ...
      2010-05-10 13:36:25,240 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-54) Received: Message: 499 iCount = 450
      2010-05-10 13:36:25,240 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-48) Received: Message: 498 iCount = 451
      2010-05-10 13:36:25,240 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-11) Persisted: Message: 497 iCount = 451
      2010-05-10 13:36:25,241 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-54) Persisted: Message: 499 iCount = 451
      2010-05-10 13:36:25,241 INFO  [com.foo.mdbcluster.Receiver] (WorkManager(2)-48) Persisted: Message: 498 iCount = 451
      2010-05-10 13:36:36,594 WARN  [org.jboss.jms.server.connectionmanager.SimpleConnectionManager] (Timer-3) A problem has been detected with the connection to remote client a32m3e-vq2wtx-g91c4dzt-1-g91c4vwl-ab, jmsClientID=null. It is possible the client has exited without closing its connection(s) or the network has failed. All associated connection resources will be cleaned up.
      {code}

       

      That's the  end of the log. After node 1 has reached this quiescent state, I will typically see 4-10 messages not being failed over.

       

      Is there something of interest here worth investigating?

       

      Thanks for any help/guidance

       

      Chris