4 Replies Latest reply on May 14, 2009 10:19 AM by victorstar

Messages getting stuck in delivery on a clustered queue

victorstar May 12, 2009 11:01 AM

Hi guys,

We are doing a pilot project that adds JBM to an existing EJB application.
The configuration is as follows:
JBoss 4.2.3, JBM 1.4.2.GA-SP1, JBossRemoting 2.5.1
EJB3 application is running on a cluster of 7 servers.
All servers have the same application farmed across them.
We have a clustered queue and are using a clustered connection factory.
Initially everything seemed to work fine.
But we noticed that during peak server load (nightly) some messages get "stuck" in the queues with MessageCount=DeliveringCount=non-zero. These numbers stay non-zero until the node is restarted. The symptoms are very similar to the open Jira issue https://jira.jboss.org/jira/browse/JBMESSAGING-1456
This issue links to the JBoss Remoting issue https://jira.jboss.org/jira/browse/JBREM-1112 which seems to be fixed in the 2.5.1 release.
We've upgraded to this release but the issue is still there.

After reading the Jira issues above and looking at the symptoms, it seems that the issue may be caused by network timeouts. So one thing we've done is bump up the following values in the remoting-bisocket-service.xml file:

<attribute name="clientLeasePeriod" isParam="true">15000</attribute>
<attribute name="validatorPingPeriod" isParam="true">15000</attribute>
<attribute name="validatorPingTimeout" isParam="true">10000</attribute>
<attribute name="timeout" isParam="true">120000</attribute>

This (or maybe the upgrade to the latest JBoss Remoting) seemed to help somewhat. If the previous night saw us losing all messages in the queue, last night had just 10% of them stuck.
And actually it's a bit different tonight. We only see 4 messages stuck with MessageCount=DeliveringCount=non-zero, but on one of the nodes we see a bunch of messages just sitting there with MessageCount=non-zero, DeliveringCount=0 and ConsumerCount=0. These messages are just piling up. Checking the other nodes, four of them have the same ConsumerCount of 0, two nodes have ConsumerCount=2 and one node has ConsumerCount=6 (the first one).
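
The two symptoms described above can be told apart mechanically from the three counters. The following is a minimal sketch only: the class `QueueSnapshot`, its `State` enum and the sample numbers are hypothetical and not part of JBoss Messaging, and it assumes the three values have already been read from the queue MBean. A single snapshot where every message is in delivery can also be perfectly normal under load, so the "stuck" case is only meaningful if the same state persists across repeated samples.

```java
// Hypothetical helper, not part of JBoss Messaging: classifies one sample of
// the MessageCount / DeliveringCount / ConsumerCount values of a queue.
final class QueueSnapshot {

    enum State {
        EMPTY,            // nothing in the queue
        ALL_IN_DELIVERY,  // MessageCount == DeliveringCount > 0 ("stuck" if it persists)
        NO_CONSUMERS,     // messages waiting, none in delivery, no consumers attached
        NORMAL            // anything else
    }

    static State classify(int messageCount, int deliveringCount, int consumerCount) {
        if (messageCount < 0 || deliveringCount < 0 || consumerCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        if (messageCount == 0) {
            return State.EMPTY;
        }
        if (deliveringCount == messageCount) {
            return State.ALL_IN_DELIVERY;
        }
        if (deliveringCount == 0 && consumerCount == 0) {
            return State.NO_CONSUMERS;
        }
        return State.NORMAL;
    }

    public static void main(String[] args) {
        // Illustrative values only: 4 messages all in delivery on a node with consumers,
        // and a node with a backlog (120 is made up) and no consumers at all.
        System.out.println(classify(4, 4, 2));
        System.out.println(classify(120, 0, 0));
    }
}
```

Sampling each node every few minutes and flagging a node only when `ALL_IN_DELIVERY` or `NO_CONSUMERS` repeats over several samples would avoid false alarms during normal peak load.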

With all nodes having the same code deployed, with an MDB listening on the queue, why would the ConsumerCount be zero? And more importantly: why would the messages go to a node that doesn't have any consumers?
On this part I think we might just be missing something in our configuration. I'll post our config below and hopefully you guys can point out what we're doing wrong.

As for the "stuck" messages, I'll try bumping the timeouts even higher, but this doesn't seem like a proper solution. I would definitely appreciate some good advice here.

Our config:
The clustered queue is defined in destinations-service.xml like this:

<mbean code="org.jboss.jms.server.destination.QueueService"
       name="jboss.messaging.destination:service=Queue,name=/queue/recommendationSentQueue"
       xmbean-dd="xmdesc/Queue-xmbean.xml">
  <depends optional-attribute-name="ServerPeer">jboss.messaging:service=ServerPeer</depends>
  <depends>jboss.messaging:service=PostOffice</depends>
  <attribute name="Clustered">true</attribute>
  <attribute name="RedeliveryDelay">1000</attribute>
</mbean>

The MDB is annotated like this:

@MessageDriven(activationConfig =
{
 @ActivationConfigProperty(propertyName="destinationType", propertyValue="javax.jms.Queue"),
 @ActivationConfigProperty(propertyName="destination", propertyValue="/queue/recommendationSentQueue"),
 @ActivationConfigProperty(propertyName="DLQMaxResent", propertyValue="10")
})

The JmsXA connection factory that we're using in the code points to ClusteredXAConnectionFactory in hajndi-jms-ds.xml:

<mbean code="org.jboss.jms.jndi.JMSProviderLoader"
       name="jboss.messaging:service=JMSProviderLoader,name=HAJNDIJMSProvider">
  <attribute name="ProviderName">DefaultJMSProvider</attribute>
  <attribute name="ProviderAdapterClass">
    org.jboss.jms.jndi.JNDIProviderAdapter
  </attribute>
  <!-- The combined connection factory -->
  <attribute name="FactoryRef">ClusteredXAConnectionFactory</attribute>
  <!-- The queue connection factory -->
  <attribute name="QueueFactoryRef">ClusteredXAConnectionFactory</attribute>
  <!-- The topic factory -->
  <attribute name="TopicFactoryRef">ClusteredXAConnectionFactory</attribute>
  <!-- Access JMS via HAJNDI -->
  <attribute name="Properties">
    java.naming.factory.initial=org.jnp.interfaces.NamingContextFactory
    java.naming.factory.url.pkgs=org.jboss.naming:org.jnp.interfaces
    java.naming.provider.url=${jboss.bind.address:localhost}:1100
    jnp.disableDiscovery=false
    jnp.partitionName=${jboss.partition.name:DefaultPartition}
    jnp.discoveryGroup=${jboss.partition.udpGroup:230.0.0.4}
    jnp.discoveryPort=1102
    jnp.discoveryTTL=16
    jnp.discoveryTimeout=5000
    jnp.maxRetries=1
  </attribute>
</mbean>

<!-- The server session pool for Message Driven Beans -->
<mbean code="org.jboss.jms.asf.ServerSessionPoolLoader"
       name="jboss.messaging:service=ServerSessionPoolMBean,name=StdJMSPool">
  <depends optional-attribute-name="XidFactory">jboss:service=XidFactory</depends>
  <attribute name="PoolName">StdJMSPool</attribute>
  <attribute name="PoolFactoryClass">
    org.jboss.jms.asf.StdServerSessionPoolFactory
  </attribute>
</mbean>

<!-- JMS XA Resource adapter, use this to get transacted JMS in beans -->
<tx-connection-factory>
  <jndi-name>JmsXA</jndi-name>
  <xa-transaction/>
  <rar-name>jms-ra.rar</rar-name>
  <connection-definition>org.jboss.resource.adapter.jms.JmsConnectionFactory</connection-definition>
  <config-property name="SessionDefaultType" type="java.lang.String">javax.jms.Topic</config-property>
  <config-property name="JmsProviderAdapterJNDI" type="java.lang.String">java:/DefaultJMSProvider</config-property>
  <max-pool-size>20</max-pool-size>
  <security-domain-and-application>JmsXARealm</security-domain-and-application>
</tx-connection-factory>

Thank you guys!
Looking forward to hearing any ideas or advice.

Victor

1. Re: Messages getting stuck in delivery on a clustered queue

gaohoward May 12, 2009 11:46 AM (in response to victorstar)

Hi, I think your issue is the same as JBMESSAGING-1456/JBREM-1112.

I think you need to wait for the fix, which should be soon.

It's a mixed issue of JBM and JBR. This issue causes not only stuck messages due to client failover, but also incorrect consumer/message counts. So as long as client failover happens often, there is a good chance of messages getting stuck.
2. Re: Messages getting stuck in delivery on a clustered queue

victorstar May 12, 2009 11:58 AM (in response to victorstar)

Thanks.

As I said, the JBR issue is marked as fixed in the 2.5.1 release. We've put it in but are still seeing the issue.

Now, accepting that there is a problem here, is there anything we could do as a temporary workaround? Like I said, timeouts etc.?
One thing we're considering is making those queues local rather than clustered until the issue is fixed. But this wouldn't work for topics, which is a problem.

And also about the messages piling up on the node without consumers: do you think it's the same issue? I've noticed that all of those messages are coming from one specific node (different from the one where they are piling up). So why would 4 out of 7 nodes have no consumers on the queue even though they have MDBs deployed? I thought MDBs would listen to the queue just on the node where they are deployed (or this is the impression I got from reading the docs).

And thanks again for helping!
3. Re: Messages getting stuck in delivery on a clustered queue

gaohoward May 14, 2009 7:21 AM (in response to victorstar)

This issue will be fixed soon. Until then, there is no good workaround that avoids stuck messages. Those stuck messages won't be delivered until you restart the node.

About the messages piling up on one node without consumers: did you configure the message sucker? The message sucker will pull messages from a node that has no consumers on it.
4. Re: Messages getting stuck in delivery on a clustered queue

victorstar May 14, 2009 10:19 AM (in response to victorstar)
Yes, the sucker is configured:
<attribute name="ClusterPullConnectionFactoryName">jboss.messaging.connectionfactory:service=ClusterPullConnectionFactory</attribute>

and we did change the password. It's the same config on all nodes.

Regarding the issue with stuck messages:
I've bumped the timeouts up even higher, and for the last couple of nights we didn't see any messages get stuck. But we are waiting for a proper fix of course. Let me know when we can try it out.

Thanks!