1 Reply Latest reply on Jun 24, 2006 10:24 PM by Ben Wang

    (another) XAConnectionFactory not bound

    Brandon Vallade Newbie

      Hello all,
      I am having intermittent trouble with JMS provider failover. I am setting up a number of JBoss instances (4.0.1sp1), each running the 'all' configuration, on separate boxes. They are all deployed under the default partition. My goal is to have a uniform server configuration and to provide JMS failover.
      I have been running a failover test in production, where we have 50 boxes/nodes. I bring the boxes down in groups of 10-20 and check that the master node switches correctly and that each instance is able to rejoin the cluster and successfully redeploy each of the MDBs.
      *Sometimes* all the boxes go down and come back fine. (I see a bunch of errors about not being able to contact the JMS provider once the master node goes down, but I understand that these are to be changed to debug level. The errors go away once the next node in line becomes the coordinator.) Other times I continually get errors such as:
      org.jboss.ejb.plugins.jms.DLQHandler - Initialization failed DLQHandler
      javax.jms.JMSException: Error creating the dlq connection: XAConnectionFactory not bound
      Then, for each of my MDBs, I get:
      2006-06-20 10:37:29,556 ERROR- [JMSContainerInvoker(MyMDB) Reconnect] org.jboss.ejb.plugins.jms.DLQHandler - Initialization failed DLQHandler
      javax.jms.JMSException: Error creating the dlq connection: XAConnectionFactory not bound
      2006-06-20 10:37:29,556 WARN - [JMSContainerInvoker(MyMDB) Reconnect] org.jboss.ejb.plugins.jms.JMSContainerInvoker - JMS provider failure detected:
      javax.jms.JMSException: Error creating the dlq connection: XAConnectionFactory not bound

      Throughout the startup process I also see these scattered through server.log:
      ERROR- [UpHandler (GMS)] org.jgroups.protocols.pbcast.ClientGmsImpl - suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
      and
      WARN - [DownHandler (GMS)] org.jgroups.protocols.pbcast.ClientGmsImpl - handleJoin(fsf-pw148:49881 (additional data: 16 bytes)) failed, retrying
      and
      WARN - [UpHandler (NAKACK)] org.jgroups.protocols.pbcast.NAKACK - [xxxx:49881 (additional data: 16 bytes)] discarded message from non-member yyyy:37995 (additional data: 16 bytes)

      In these cases it is as if the next node in line does not recognize that it is supposed to become the coordinator (I never see the log entries showing that it is deploying the destinations). I can sometimes remedy this by successively taking down the node that was supposed to become the coordinator until one finally does.

      I have applied the fix to hajndi-jms-ds.xml to avoid looking for XAConnectionFactory in the local JVM (removing the java: prefix). I have also moved off hsqldb as the JMS datasource. Could this be related to the number of nodes in the cluster? Would it help to switch from FD to FD_SOCK failure detection? If so, would it still give me reliable JMS failover?
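
      For anyone trying to reproduce this, here is a minimal sketch of how the binding could be checked from outside the server while a node is failing. It is only an illustration, not something taken from my setup: the class name JndiBindingCheck is made up, the default URL jnp://localhost:1100 assumes HA-JNDI is listening on its usual port on the local box, and it needs the JBoss client jars on the classpath to talk to a real server.

```java
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Hypothetical diagnostic class; not part of JBoss.
public class JndiBindingCheck {

    /** Returns true if the name resolves to a non-null object in the given context. */
    public static boolean isBound(Context ctx, String name) {
        try {
            return ctx.lookup(name) != null;
        } catch (NamingException e) {
            // NameNotFoundException, CommunicationException and similar
            // all mean the factory is not usable right now.
            System.out.println(name + " not bound: " + e);
            return false;
        }
    }

    public static void main(String[] args) throws NamingException {
        // Assumption: HA-JNDI on port 1100; pass another URL as the first argument.
        String providerUrl = args.length > 0 ? args[0] : "jnp://localhost:1100";

        Properties env = new Properties();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        env.put(Context.PROVIDER_URL, providerUrl);

        Context ctx = new InitialContext(env);
        try {
            // No "java:" prefix: this asks the global (HA-JNDI) namespace,
            // not the local JVM namespace.
            System.out.println("XAConnectionFactory bound: "
                    + isBound(ctx, "XAConnectionFactory"));
        } finally {
            ctx.close();
        }
    }
}
```

      Running something like this against each node while the errors are occurring should show whether any node in the cluster actually has XAConnectionFactory bound at that moment.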

      Any insight would be greatly appreciated. If any additional info is needed, please let me know.