
    Random 1.6GB object allocation attempt when using TCPPING

    youngm

      We are using JBoss Cache 1.4.1.SP8 with JGroups 2.4.1.SP4 on WebSphere 6.1 (IBM JDK on AIX).

      This is our config string:

      <config>
       <TCP start_port="58000" sock_conn_timeout="500" send_buf_size="150000"
            recv_buf_size="80000" loopback="false" use_send_queues="false" />
       <TCPPING timeout="2000" down_thread="false" up_thread="false"
            initial_hosts="host1[58000],host2[58000]" port_range="100" num_initial_members="1" />
       <MERGE2 min_interval="10000" max_interval="20000" />
       <FD_SOCK />
       <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192"
            up_thread="false" down_thread="false" />
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false" />
       <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
       <FRAG frag_size="8192" down_thread="false" up_thread="false" />
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
      </config>
      


      Our current configuration contains only 2 nodes.

      We are seeing a problem where, after about a week of normal operation, random 1.6GB byte[] allocation attempts occur. One of our apps attempted to allocate 33 of these 1.6GB byte[]s, which hung the app. We took a thread dump and GC dump from one of our frozen applications and noticed the following.

      1. We had 33 JGroups ConnectionTable threads hung. 32 of these threads had the following stack trace:

      3XMTHREADINFO "ConnectionTable.Connection.Sender [10.98.111.61:58001 - 10.98.111.61:58001]" (TID:0x36E07400, sys_thread_t:0x37379850, state:CW, native ID:0x001D20B5) prio=5
      4XESTACKTRACE at java/lang/Object.wait(Native Method)
      4XESTACKTRACE at java/lang/Object.wait(Object.java:199(Compiled Code))
      4XESTACKTRACE at org/jgroups/util/Queue.remove(Queue.java:257(Compiled Code))
      4XESTACKTRACE at org/jgroups/blocks/BasicConnectionTable$Connection$Sender.run(BasicConnectionTable.java:686(Compiled Code))
      4XESTACKTRACE at java/lang/Thread.run(Thread.java:810(Compiled Code))
      


      The other thread was:

      3XMTHREADINFO "ConnectionTable.Connection.Receiver [10.98.111.61:58000 - 10.98.111.62:52906]" (TID:0x36CEAA00, sys_thread_t:0x36D12D08, state:R, native ID:0x00125049) prio=5
      4XESTACKTRACE at java/net/SocketInputStream.socketRead0(Native Method)
      4XESTACKTRACE at java/net/SocketInputStream.read(SocketInputStream.java:155(Compiled Code))
      4XESTACKTRACE at java/io/BufferedInputStream.fill(BufferedInputStream.java:229(Compiled Code))
      4XESTACKTRACE at java/io/BufferedInputStream.read1(BufferedInputStream.java:267(Compiled Code))
      4XESTACKTRACE at java/io/BufferedInputStream.read(BufferedInputStream.java:324(Compiled Code))
      4XESTACKTRACE at java/io/DataInputStream.readFully(DataInputStream.java:202(Compiled Code))
      4XESTACKTRACE at java/io/DataInputStream.readInt(DataInputStream.java:380(Compiled Code))
      4XESTACKTRACE at org/jgroups/blocks/BasicConnectionTable$Connection.run(BasicConnectionTable.java:575)
      4XESTACKTRACE at java/lang/Thread.run(Thread.java:810)
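      For what it's worth, the receiver stack trace above ends in DataInputStream.readInt(), which we assume is where the connection reads a length prefix off the wire and then allocates a buffer of that size. Below is a rough sketch of how we read that receive loop; the class and structure are illustrative only, not the actual BasicConnectionTable source:

       // Illustrative sketch of a length-prefixed receive loop, based on the
       // readInt() frame in the trace above; NOT the actual JGroups code.
       import java.io.BufferedInputStream;
       import java.io.DataInputStream;
       import java.io.IOException;
       import java.net.Socket;

       class ReceiveLoopSketch {
           static void receiveLoop(Socket sock) throws IOException {
               DataInputStream in = new DataInputStream(
                       new BufferedInputStream(sock.getInputStream()));
               while (true) {
                   int len = in.readInt();      // length prefix read off the wire
                   byte[] buf = new byte[len];  // a garbage length (e.g. ~1.6GB) makes this
                                                // single allocation hang or exhaust the heap
                   in.readFully(buf, 0, len);
                   // dispatching the payload up the protocol stack would happen here
               }
           }
       }

      If something other than a real JGroups peer connects to one of those ports, or the stream somehow gets out of sync, then whatever bytes arrive first would be interpreted as the length, which might explain the huge allocation attempts we are seeing.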
      


      The 32 sender threads are associated with sequential ports 58001-58033, so it appears JGroups is scanning those ports to determine whether there are any new nodes in the cluster?
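      If TCPPING is indeed probing ports, then our port_range="100" gives it a lot of ports to try per host. A minimal sketch of how we understand initial_hosts plus port_range expands into the candidate list that gets pinged (again illustrative, not the actual TCPPING source):

       // Illustrative expansion of one initial host with a port_range; the real
       // TCPPING implementation may differ, this is just how we read the setting.
       import java.util.ArrayList;
       import java.util.List;

       class PortRangeSketch {
           static List<String> candidates(String host, int startPort, int portRange) {
               List<String> result = new ArrayList<String>();
               // the configured port plus port_range additional ports are probed
               for (int port = startPort; port <= startPort + portRange; port++) {
                   result.add(host + "[" + port + "]");
               }
               return result;
           }

           public static void main(String[] args) {
               // With port_range="100", host1[58000] expands to 101 candidates,
               // host1[58000] .. host1[58100]; a smaller port_range shrinks this list.
               System.out.println(candidates("host1", 58000, 100));
           }
       }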

      We have not been able to duplicate this problem when using MPING instead of TCPPING for member discovery; however, we are not allowed to use multicast in our production environment.

      We are going to try changing our port_range to a smaller number to see if that helps. Does anyone on the board have any other ideas?

      Mike