7 Replies Latest reply on Jan 21, 2009 2:12 PM by brian.stansberry

discarded message from non-member

bmelloni Jan 16, 2009 5:46 PM

As far as I can tell, I set up a cluster following all the instructions. But as soon as I start the 2nd server, even without any of my apps deployed, I get the error "discarded message from non-member" several times on the first server. Things get worse when I try to deploy a simple webapp through the farm folder: I see lots of errors, and although the file does arrive at the other server (horrendously slowly), the cluster is unusable. The documentation and troubleshooting FAQs haven't helped.

Does anyone have any clues about the possible causes?

Configuration for both boxes:
- JDK 1.6
- JBoss EAP 4.3
- Windows XP
- Firewall disabled for test
- Ran all the JGroups standalone troubleshooting tests successfully (problem occurs only when using full JBoss)
- Using the 'all' configuration
- Reduced the network to a Linksys router and 2 boxes

Symptoms:

- JGroups seems to start correctly and recognize both boxes:
15:50:45,995 INFO [DefaultPartition] I am (127.0.0.1:1099) received membershipChanged event:
15:50:45,995 INFO [DefaultPartition] Dead members: 0 ([])
15:50:45,995 INFO [DefaultPartition] New Members : 0 ([])
15:50:45,995 INFO [DefaultPartition] All Members : 2 ([127.0.0.1:1099, 127.0.0.1:1099])
15:51:07,386 INFO [TreeCache] viewAccepted(): [192.168.11.103:2591|1] [192.168.11.103:2591, 192.168.11.102:1136]
15:51:09,730 INFO [TreeCache] viewAccepted(): [192.168.11.103:2595|1] [192.168.11.103:2595, 192.168.11.102:1141]

- Then the first server complains about 9 times like:
15:51:24,808 WARN [NAKACK] 192.168.11.103:2600] discarded message from non-member 192.168.11.102:1147, my view is [192.168.11.103:2600|0] [192.168.11.103:2600]

My guess is that I missed configuring something. Because of the jGroups tests I am reasonably confident that multicast works OK.

I searched this forum (no results), then googled and found lots of similar posts, but no answers. Any help would be very welcome.

1. Re: discarded message from non-member

brian.stansberry Jan 19, 2009 12:38 PM (in response to bmelloni)

First, if you are a support customer (you're using EAP), please open a case via the Customer Support Portal. There's no SLA via the forums.

Otherwise,

1) Are you actually using UDP multicast? The ports shown in your logs seem more like what would be used by a TCP-based JGroups config. (Could be multicast though; depends on your config).

2) I need to understand what channels are using 192.168.11.102:1147 and 192.168.11.103:2600. Please find the logging that looks like this on the two nodes and post the area around it:

-------------------------------------------------------
GMS: address is 192.168.11.102:1147
-------------------------------------------------------

or

-------------------------------------------------------
GMS: address is 192.168.11.103:2600
-------------------------------------------------------

3) The following will likely cause problems, although AFAIR not the NAKACK issue you are reporting:

15:50:45,995 INFO [DefaultPartition] All Members : 2 ([127.0.0.1:1099, 127.0.0.1:1099])

That tells me you have JBoss bound to 127.0.0.1 on both nodes. That would occur either by starting JBoss with -b 127.0.0.1 on both nodes, or by not setting -b and leaving the 127.0.0.1 default. The AS clustering code uses the bind address and JNDI port to form a unique cluster-wide id for each node. Works fine, except when you bind JBoss to 127.0.0.1 or 0.0.0.0 on more than one machine. If you *want* to use 127.0.0.1 or 0.0.0.0 as the -b value on more than one node, you should edit the server/all/deploy/cluster-service.xml's ClusterPartition mbean and either change

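<attribute name="NodeAddress">${jboss.bind.address}</attribute>

to something unique per server, like

<attribute name="NodeAddress">192.168.11.102</attribute>

or, explicitly configure a String "NodeName" attribute with a unique value per node:

<attribute name="NodeName">node1</attribute>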

Bottom line, you don't want duplicates in the "[DefaultPartition] All Members" logging.
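
A quick way to spot this in a server log is to scan the "All Members" lines for repeated ids. The following is only a rough sketch in Python, not anything shipped with JBoss; the function name `duplicate_members` and the regular expression are made up for illustration and assume the log line format shown above.

```python
import re
from collections import Counter

# Matches e.g. "[DefaultPartition] All Members : 2 ([a:1099, b:1099])".
# The pattern is an assumption based on the log line quoted in this thread.
ALL_MEMBERS = re.compile(r"\[DefaultPartition\] All Members : \d+ \(\[(.*?)\]\)")


def duplicate_members(log_text):
    """Return node ids that appear more than once in any 'All Members' line."""
    dupes = set()
    for match in ALL_MEMBERS.finditer(log_text):
        members = [m.strip() for m in match.group(1).split(",") if m.strip()]
        counts = Counter(members)
        dupes.update(node for node, n in counts.items() if n > 1)
    return sorted(dupes)


sample = (
    "15:50:45,995 INFO [DefaultPartition] All Members : 2 "
    "([127.0.0.1:1099, 127.0.0.1:1099])"
)
print(duplicate_members(sample))
```

For the line quoted above, this reports 127.0.0.1:1099 as a duplicate; a healthy two-node view with distinct bind addresses should produce an empty list.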
2. Re: discarded message from non-member

bmelloni Jan 20, 2009 11:17 AM (in response to bmelloni)

Yes, we are a support customer. I am a new employee for the company and I just requested from my boss the info needed to open a ticket.

Thank you for helping until I am able to open the formal ticket.

Your suggestion (3) to start with -b took care of the discarded message. But I still get some errors. After starting .103 first and .102 second, the following is still happening:

A) I see these errors on .102 at about a 2 minute interval:
09:02:22,093 WARN [ConnectionTable] peer closed connection, trying to re-send msg
09:02:22,093 ERROR [ConnectionTable] 2nd attempt to send data failed too
B) Deployment after placing a WAR in the farm folder is horrendously slow (as if it were failing a lot, timing out, and recovering). I see the WAR file being placed in the all/tmp folder, but the byte count goes up at a crawl. In both servers' logs I see quite a few debug statements for TORecoveryModule and XARecoveryModule. Once the push finally finished (after 30-60 min!), the application worked on both servers.

Here are the details you requested in your previous email:

1) I am using the default clustering configuration, since the instructions say you should get default clustering by just starting in the 'all' configuration. If that is UDP multicast, then yes.

I believe the only changes I made to the defaults are:
a) What is indicated in the post-installation instructions (i.e., enabling the admin accounts so that I can get to the web pages).
b) Starting with '-c all' to get default clustering.
c) Since I noticed that with the defaults I couldn't access the server by IP, after I captured the logs I posted, I changed the start command to include '-b '.

2)
=====================
Log snippet from .103:
=====================
08:51:58,433 INFO [ServerInfo] Java version: 1.6.0_11,Sun Microsystems Inc.
08:51:58,433 INFO [ServerInfo] Java VM: Java HotSpot(TM) Server VM 11.0-b16,Sun Microsystems Inc.
08:51:58,433 INFO [ServerInfo] OS-System: Windows XP 5.1,x86
08:51:58,824 INFO [Server] Core system initialized
08:52:02,621 INFO [WebService] Using RMI server codebase: http://192.168.11.103:8083/
08:52:02,621 INFO [Log4jService$URLWatchTimerTask] Configuring from URL: resource:jboss-log4j.xml
08:52:03,058 INFO [TransactionManagerService] JBossTS Transaction Service (JTA version) - JBoss Inc.
08:52:03,058 INFO [TransactionManagerService] Setting up property manager MBean and JMX layer
08:52:03,168 INFO [TransactionManagerService] Starting recovery manager
08:52:03,215 INFO [TransactionManagerService] Recovery manager started
08:52:03,215 INFO [TransactionManagerService] Binding TransactionManager JNDI Reference
08:52:07,996 INFO [EJB3Deployer] Starting java:comp multiplexer
08:52:09,840 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.103:1733
-------------------------------------------------------
08:52:11,871 INFO [TreeCache] viewAccepted(): [192.168.11.103:1733|0] [192.168.11.103:1733]
08:52:11,918 INFO [TreeCache] TreeCache local address is 192.168.11.103:1733
08:52:11,918 INFO [TreeCache] State could not be retrieved (we are the first member in group)
08:52:11,918 INFO [TreeCache] parseConfig(): PojoCacheConfig is empty
08:52:12,074 INFO [STDOUT] no object for null
08:52:12,074 INFO [STDOUT] no object for null
08:52:12,121 INFO [STDOUT] no object for null
08:52:12,137 INFO [STDOUT] no object for {urn:jboss:bean-deployer}supplyType
08:52:12,137 INFO [STDOUT] no object for {urn:jboss:bean-deployer}dependsType
08:52:16,480 INFO [NativeServerConfig] JBoss Web Services - Native
08:52:16,496 INFO [NativeServerConfig] jbossws-native-2.0.1.SP2 (build=200710210837)
08:52:18,090 INFO [SnmpAgentService] SNMP agent going active
08:52:18,433 INFO [DefaultPartition] Initializing
08:52:18,465 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.103:1738
-------------------------------------------------------
08:52:20,480 INFO [DefaultPartition] Number of cluster members: 1
08:52:20,480 INFO [DefaultPartition] Other members: 0
08:52:20,480 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
08:52:20,480 INFO [DefaultPartition] State could not be retrieved (we are the first member in group)
08:52:20,543 INFO [HANamingService] Started ha-jndi bootstrap jnpPort=1100, backlog=50, bindAddress=/192.168.11.103
08:52:20,558 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /192.168.11.103:1102, group=230.0.0.4, HA-JNDI address=192.168.11.103:1100
08:52:20,933 INFO [TreeCache] No transaction manager lookup class has been defined. Transactions cannot be used
08:52:21,027 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.103:1742
-------------------------------------------------------
08:52:23,043 INFO [TreeCache] viewAccepted(): [192.168.11.103:1742|0] [192.168.11.103:1742]
08:52:23,043 INFO [TreeCache] TreeCache local address is 192.168.11.103:1742
08:52:23,324 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.103:1746
-------------------------------------------------------
08:52:25,324 INFO [TreeCache] viewAccepted(): [192.168.11.103:1746|0] [192.168.11.103:1746]
08:52:25,324 INFO [TreeCache] TreeCache local address is 192.168.11.103:1746
================================

Snippet from .102:
================================
09:01:43,031 INFO [ServerInfo] Java version: 1.6.0_10,Sun Microsystems Inc.
09:01:43,031 INFO [ServerInfo] Java VM: Java HotSpot(TM) Server VM 11.0-b15,Sun Microsystems Inc.
09:01:43,031 INFO [ServerInfo] OS-System: Windows XP 5.1,x86
09:01:43,531 INFO [Server] Core system initialized
09:01:45,359 INFO [WebService] Using RMI server codebase: http://192.168.11.102:8083/
09:01:45,359 INFO [Log4jService$URLWatchTimerTask] Configuring from URL: resource:jboss-log4j.xml
09:01:45,750 INFO [TransactionManagerService] JBossTS Transaction Service (JTA version) - JBoss Inc.
09:01:45,750 INFO [TransactionManagerService] Setting up property manager MBean and JMX layer
09:01:45,921 INFO [TransactionManagerService] Starting recovery manager
09:01:45,968 INFO [TransactionManagerService] Recovery manager started
09:01:45,968 INFO [TransactionManagerService] Binding TransactionManager JNDI Reference
09:01:47,781 INFO [EJB3Deployer] Starting java:comp multiplexer
09:01:49,296 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.102:1577
-------------------------------------------------------
09:01:51,515 INFO [TreeCache] viewAccepted(): [192.168.11.103:1733|1] [192.168.11.103:1733, 192.168.11.102:1577]
09:01:51,578 INFO [TreeCache] TreeCache local address is 192.168.11.102:1577
09:01:51,640 INFO [TreeCache] received the state (size=1024 bytes)
09:01:51,656 INFO [TreeCache] state was retrieved successfully (in 78 milliseconds)
09:01:51,656 INFO [TreeCache] parseConfig(): PojoCacheConfig is empty
09:01:51,703 INFO [STDOUT] no object for null
09:01:51,703 INFO [STDOUT] no object for null
09:01:51,718 INFO [STDOUT] no object for null
09:01:51,750 INFO [STDOUT] no object for {urn:jboss:bean-deployer}supplyType
09:01:51,765 INFO [STDOUT] no object for {urn:jboss:bean-deployer}dependsType
09:01:53,000 INFO [NativeServerConfig] JBoss Web Services - Native
09:01:53,000 INFO [NativeServerConfig] jbossws-native-2.0.1.SP2 (build=200710210837)
09:01:53,453 INFO [SnmpAgentService] SNMP agent going active
09:01:53,687 INFO [DefaultPartition] Initializing
09:01:53,718 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.102:1583
-------------------------------------------------------
09:02:00,562 INFO [DefaultPartition] Number of cluster members: 2
09:02:00,562 INFO [DefaultPartition] Other members: 1
09:02:00,562 INFO [DefaultPartition] Fetching state (will wait for 30000 milliseconds):
09:02:00,750 INFO [DefaultPartition] state was retrieved successfully (in 188 milliseconds)
09:02:00,953 INFO [HANamingService] Started ha-jndi bootstrap jnpPort=1100, backlog=50, bindAddress=/192.168.11.102
09:02:00,953 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /192.168.11.102:1102, group=230.0.0.4, HA-JNDI address=192.168.11.102:1100
09:02:01,218 INFO [TreeCache] No transaction manager lookup class has been defined. Transactions cannot be used
09:02:01,312 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.102:1589
-------------------------------------------------------
09:02:03,578 INFO [TreeCache] viewAccepted(): [192.168.11.103:1742|1] [192.168.11.103:1742, 192.168.11.102:1589]
09:02:03,640 INFO [TreeCache] TreeCache local address is 192.168.11.102:1589
09:02:03,734 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.102:1594
-------------------------------------------------------
09:02:06,031 INFO [TreeCache] viewAccepted(): [192.168.11.103:1746|1] [192.168.11.103:1746, 192.168.11.102:1594]
09:02:06,093 INFO [TreeCache] TreeCache local address is 192.168.11.102:1594
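
To match each "GMS: address is ..." banner in snippets like the ones above to the service that owns the channel, one option is to pair the address with the logger category of the next timestamped line. This is a heuristic sketch in Python, assuming the log layout shown in this post; `channels_by_address` and both patterns are hypothetical names, and the "next line belongs to the owning service" rule is an assumption, not something JGroups guarantees.

```python
import re

# The banner line printed by GMS when print_local_addr="true".
GMS_LINE = re.compile(r"^GMS: address is (\S+)")
# A timestamped log4j line such as "08:52:11,871 INFO [TreeCache] ...".
LOG_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s+\w+\s+\[([^\]]+)\]")


def channels_by_address(log_text):
    """Pair each GMS address with the logger of the next timestamped line."""
    result = []
    pending = None  # address seen in a banner but not yet paired
    for raw in log_text.splitlines():
        line = raw.strip()
        gms = GMS_LINE.match(line)
        if gms:
            pending = gms.group(1)
            continue
        entry = LOG_LINE.match(line)
        if entry and pending is not None:
            result.append((pending, entry.group(1)))
            pending = None
    return result


sample = """\
08:52:18,433 INFO [DefaultPartition] Initializing
08:52:18,465 INFO [STDOUT]
-------------------------------------------------------
GMS: address is 192.168.11.103:1738
-------------------------------------------------------
08:52:20,480 INFO [DefaultPartition] Number of cluster members: 1
"""
print(channels_by_address(sample))
```

Run against a full server log, this gives a short address-to-service list that makes it easier to tell which channel a later NAKACK warning refers to.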
3. Re: discarded message from non-member

brian.stansberry Jan 20, 2009 1:09 PM (in response to bmelloni)

Make sure when you open a support ticket that you reference this thread so the support team can see the background.

Please post the contents of your deploy/cluster-service.xml file.

Your farming issue for sure sounds like a communication issue; i.e. lost messages, lots of retries. Not bad enough that the cluster falls apart, but bad enough that RPCs around the cluster take forever.
4. Re: discarded message from non-member

bmelloni Jan 20, 2009 2:10 PM (in response to bmelloni)

Here is cluster-service.xml for both servers.

It should be 'untouched' from the original install (although I remember having to change 'somewhere' - maybe in this file or another file - a value from 0 to 1 to avoid the nodes fighting each other for the same identity).

.103 (the first server I start):
==================

<?xml version="1.0" encoding="UTF-8"?>












${jboss.partition.name:DefaultPartition}


${jboss.bind.address}


False


30000





<UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}"
mcast_port="${jboss.hapartition.mcast_port:45566}"
tos="8"
ucast_recv_buf_size="20000000"
ucast_send_buf_size="640000"
mcast_recv_buf_size="25000000"
mcast_send_buf_size="640000"
loopback="false"
discard_incompatible_packets="true"
enable_bundling="false"
max_bundle_size="64000"
max_bundle_timeout="30"
use_incoming_packet_handler="true"
use_outgoing_packet_handler="false"
ip_ttl="${jgroups.udp.ip_ttl:2}"
down_thread="false" up_thread="false"/>
<PING timeout="2000"
down_thread="false" up_thread="false" num_initial_members="3"/>
<MERGE2 max_interval="100000"
down_thread="false" up_thread="false" min_interval="20000"/>
<FD_SOCK down_thread="false" up_thread="false"/>
<FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
<pbcast.NAKACK max_xmit_size="60000"
use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
down_thread="false" up_thread="false"
discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"
down_thread="false" up_thread="false"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
down_thread="false" up_thread="false"
max_bytes="400000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
down_thread="false" up_thread="false"
join_retry_timeout="2000" shun="true"
view_bundling="true"/>
<FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
<pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>




jboss:service=Naming





jboss:service=Naming

<depends optional-attribute-name="ClusterPartition"
proxy-type="attribute">jboss:service=${jboss.partition.name:DefaultPartition}

/HASessionState/Default

0






<depends optional-attribute-name="ClusterPartition"
proxy-type="attribute">jboss:service=${jboss.partition.name:DefaultPartition}

${jboss.bind.address}

1100

1101

50

<depends optional-attribute-name="LookupPool"
proxy-type="attribute">jboss.system:service=ThreadPool


false

${jboss.bind.address}

${jboss.partition.udpGroup:230.0.0.4}
1102

16

org.jboss.ha.framework.interfaces.RoundRobin








jboss:service=TransactionManager
<depends optional-attribute-name="Connector"
proxy-type="attribute">jboss.remoting:service=Connector,transport=socket
jboss:service=${jboss.partition.name:DefaultPartition}

${jboss.bind.address}
4447

jboss:service=Naming



1
300
300
60000
${jboss.bind.address}
4448
${jboss.bind.address}
0
false
<depends optional-attribute-name="TransactionManagerService">jboss:service=TransactionManager
jboss:service=Naming








<depends optional-attribute-name="ClusterPartition"
proxy-type="attribute">jboss:service=${jboss.partition.name:DefaultPartition}
jboss.cache:service=InvalidationManager
jboss.cache:service=InvalidationManager
DefaultJGBridge

.102 (the second server):
===============

<?xml version="1.0" encoding="UTF-8"?>












${jboss.partition.name:DefaultPartition}


${jboss.bind.address}


False


30000





<UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}"
mcast_port="${jboss.hapartition.mcast_port:45566}"
tos="8"
ucast_recv_buf_size="20000000"
ucast_send_buf_size="640000"
mcast_recv_buf_size="25000000"
mcast_send_buf_size="640000"
loopback="false"
discard_incompatible_packets="true"
enable_bundling="false"
max_bundle_size="64000"
max_bundle_timeout="30"
use_incoming_packet_handler="true"
use_outgoing_packet_handler="false"
ip_ttl="${jgroups.udp.ip_ttl:2}"
down_thread="false" up_thread="false"/>
<PING timeout="2000"
down_thread="false" up_thread="false" num_initial_members="3"/>
<MERGE2 max_interval="100000"
down_thread="false" up_thread="false" min_interval="20000"/>
<FD_SOCK down_thread="false" up_thread="false"/>
<FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
<pbcast.NAKACK max_xmit_size="60000"
use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
down_thread="false" up_thread="false"
discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"
down_thread="false" up_thread="false"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
down_thread="false" up_thread="false"
max_bytes="400000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
down_thread="false" up_thread="false"
join_retry_timeout="2000" shun="true"
view_bundling="true"/>
<FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
<pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>




jboss:service=Naming





jboss:service=Naming

<depends optional-attribute-name="ClusterPartition"
proxy-type="attribute">jboss:service=${jboss.partition.name:DefaultPartition}

/HASessionState/Default

0






<depends optional-attribute-name="ClusterPartition"
proxy-type="attribute">jboss:service=${jboss.partition.name:DefaultPartition}

${jboss.bind.address}

1100

1101

50

<depends optional-attribute-name="LookupPool"
proxy-type="attribute">jboss.system:service=ThreadPool


false

${jboss.bind.address}

${jboss.partition.udpGroup:230.0.0.4}
1102

16

org.jboss.ha.framework.interfaces.RoundRobin








jboss:service=TransactionManager
<depends optional-attribute-name="Connector"
proxy-type="attribute">jboss.remoting:service=Connector,transport=socket
jboss:service=${jboss.partition.name:DefaultPartition}

${jboss.bind.address}
4447

jboss:service=Naming



1
300
300
60000
${jboss.bind.address}
4448
${jboss.bind.address}
0
false
<depends optional-attribute-name="TransactionManagerService">jboss:service=TransactionManager
jboss:service=Naming








<depends optional-attribute-name="ClusterPartition"
proxy-type="attribute">jboss:service=${jboss.partition.name:DefaultPartition}
jboss.cache:service=InvalidationManager
jboss.cache:service=InvalidationManager
DefaultJGBridge
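
Many values in the files above use the `${name:default}` placeholder form, so the effective multicast address and port depend on which system properties are set at startup. As a rough illustration only, the Python sketch below resolves the two forms that appear in these files (`${name}` and `${name:default}`); it is a simplified stand-in, not JBoss's own property replacement code, and the helper name `resolve` is invented for this example.

```python
import re

# Handles only ${name} and ${name:default}; other forms are out of scope here.
PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(value, props):
    """Substitute placeholders using props, falling back to inline defaults."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is not None:
            return default
        return match.group(0)  # leave unresolved placeholders untouched

    return PLACEHOLDER.sub(substitute, value)


# With no properties set, the inline defaults from the posted file apply.
print(resolve("${jboss.partition.udpGroup:228.1.2.3}", {}))
print(resolve("${jboss.hapartition.mcast_port:45566}", {}))
# With a bind address supplied, the NodeAddress-style value becomes concrete.
print(resolve("${jboss.bind.address}", {"jboss.bind.address": "192.168.11.103"}))
```

Note that `jboss.partition.udpGroup` appears twice in the file with different defaults (228.1.2.3 for the UDP protocol and 230.0.0.4 for HA-JNDI discovery), so setting that one property would change both values.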
5. Re: discarded message from non-member

brian.stansberry Jan 21, 2009 11:10 AM (in response to bmelloni)
Your "ConnectionTable" logging:

09:02:22,093 WARN [ConnectionTable] peer closed connection, trying to re-send msg
09:02:22,093 ERROR [ConnectionTable] 2nd attempt to send data failed too

is coming from the JBoss Messaging Data Channel. That channel uses TCP unicast for sending messages, unlike the other channels that use UDP multicast.

Farming doesn't use that channel; it uses a different one, the UDP multicast-based one from cluster-service.xml.

So, two separate channels using different underlying protocols are experiencing problems, which sounds to me like a network or host configuration problem. Hard to say what; if resolving the firewall issues you raise in a separate thread makes it go away, there's your answer.

See also http://www.jboss.org/community/docs/DOC-12375
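
For an independent check of the multicast path between the two hosts, outside JBoss and JGroups, something like the following Python sketch could be used. It is not the JGroups test tooling mentioned earlier in the thread, just a plain-socket probe; the function names are hypothetical, and the group and port are simply the defaults from the posted cluster-service.xml. It says nothing about the TCP-based channel.

```python
import socket

GROUP = "228.1.2.3"  # default mcast_addr in the posted cluster-service.xml
PORT = 45566         # default mcast_port in the same file


def membership_request(group, iface="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address followed by interface."""
    return socket.inet_aton(group) + socket.inet_aton(iface)


def open_receiver(group=GROUP, port=PORT, iface="0.0.0.0", timeout=10.0):
    """Return a UDP socket that has joined the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(
        socket.IPPROTO_IP,
        socket.IP_ADD_MEMBERSHIP,
        membership_request(group, iface),
    )
    sock.settimeout(timeout)
    return sock


def send_probe(payload=b"probe", group=GROUP, port=PORT, ttl=2):
    """Send one datagram to the multicast group (ttl=2 mirrors ip_ttl above)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

On one box, call `open_receiver()` and then `recvfrom(1024)` on the returned socket; on the other, call `send_probe()`. It is probably best to do this with both JBoss instances stopped (or with a different port), so the probe packets don't land on a live JGroups channel.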
6. Re: discarded message from non-member

bmelloni Jan 21, 2009 12:41 PM (in response to bmelloni)

This Connection issue does not go away with the firewall turned off. The two posts are independent problems.

Let's table this problem. I should get my commercial license credentials today or tomorrow and will use phone support to start over from scratch and reinstall both servers in the cluster according to their instructions instead of what the documentation seems to say.

A suggestion:
- The server config guide is a good reference, but it is worthless as a cluster installation guide unless you are already a JBoss configuration expert.
- There is a need for a simple, step-by-step guide for installing a basic cluster.
- I might even write it myself and contribute it back after talking to support. Nobody else should suffer through this installation nightmare.

Thanks for trying to help.
7. Re: discarded message from non-member

brian.stansberry Jan 21, 2009 2:12 PM (in response to bmelloni)

"bmelloni" wrote:
This Connection issue does not go away with the firewall turned off. The two posts are independent problems.

Let's table this problem. I should get my commercial license credentials today or tomorrow and will use phone support to start over from scratch and reinstall both servers in the cluster according to their instructions instead of what the documentation seems to say.

OK. The support team is much better equipped to handle issues that are specific to a particular environment.

"bmelloni" wrote:
A suggestion:
- The server config guide is a good reference, but it is worthless as a cluster installation guide unless you are already a JBoss configuration expert.
- There is a need for a simple, step-by-step guide for installing a basic cluster.
- I might even write it myself and contribute it back after talking to support. Nobody else should suffer through this installation nightmare.

Thanks for the input. I've heard similar things before, and basically agree. I'd certainly welcome any contributions, particularly on AS 4.x. I'm rewriting the Clustering Guide for AS 5 and have added some of what you are talking about. A draft of that can be found attached to http://www.jboss.org/community/docs/DOC-12928; comments are welcome. (Note: it's the attached document at the bottom of the page, not the links at the top. I won't bore you with the details as to why.)