8 Replies Latest reply on Dec 2, 2010 8:56 AM by yelin666

JGroups exception over bridged network

yelin666 Nov 23, 2010 2:30 PM

I am running two cache instances - one on a VM managed by VMware, and the other on the host machine, and they are connected by the bridged network. Is this supported by JGroups? I am using TCP configuration with attached JGroups config files. I tested with different scenarios, and got the following observations. Please suggest what's the possible problems:

1. If I start the CacheManager and get a cache on the host machine first, then when I start the CacheManager and get a cache on the VM, get the following warning & exception on the VM instance:

2010-11-23 11:52:43,166 -0500 | INFO | JChannel        | apache.log4j.Category                              |   849 | JGroups version: 2.10.0.GA
2010-11-23 11:52:45,577 -0500 | WARN | FD_SOCK         | apache.log4j.Category                              |   846 | I (localhost-47185) was suspected by T00696119-61197; ignoring the SUSPECT message
2010-11-23 11:52:45,629 -0500 | INFO | GlobalComponentRegistry | apache.log4j.Category                              |   849 | Infinispan version: Infinispan 'Radegast' 4.1.0.FINAL
2010-11-23 11:52:45,741 -0500 | WARN | STREAMING_STATE_TRANSFER | apache.log4j.Category                              |   846 | State reader socket thread spawned abnormaly
java.net.SocketException: No such device
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:529)
        at org.jgroups.util.Util.connect(Util.java:247)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:512)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:464)
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:225)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
        at org.jgroups.protocols.FC.up(FC.java:474)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
        at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
        at org.jgroups.protocols.FD.up(FD.java:266)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:270)
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
        at org.jgroups.protocols.Discovery.up(Discovery.java:281)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
        at org.jgroups.protocols.TP.access$100(TP.java:56)
        at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1549)
        at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1531)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2010-11-23 11:52:45,744 -0500 | WARN | STREAMING_STATE_TRANSFER | apache.log4j.Category                              |   846 | Could not connect to state provider. Closing socket...

2. If I start the CacheManager and get a cache on the VM first, then start the CacheManager and get a cache on the host machine, sometimes the data replication works. However, even when it works, I get the following warnings:

Sometimes replication failed with the following exception on the VM:

2010-11-23 10:21:13,464 -0500 | WARN | TCPConnectionMap | apache.log4j.Category                              |   846 | Could not accept connection from peer
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:485)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
        at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
        at java.lang.Thread.run(Thread.java:619)
2010-11-23 10:21:17,405 -0500 | WARN | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
2010-11-23 10:21:17,405 -0500 | WARN | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
2010-11-23 10:22:26,575 -0500 | WARN | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
2010-11-23 10:22:29,862 -0500 | WARN | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
2010-11-23 10:22:45,702 -0500 | WARN | TCPConnectionMap | apache.log4j.Category                              |   846 | Could not accept connection from peer
java.net.SocketException: ConnectionTable.Connection.readPeerAddress(): cookie read by 3.96.157.49:7800 does not match own cookie; terminating connection
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:487)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
        at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
        at java.lang.Thread.run(Thread.java:619)
2010-11-23 10:23:26,832 -0500 | WARN | TCPConnectionMap | apache.log4j.Category                              |   846 | Could not accept connection from peer
java.net.SocketException: ConnectionTable.Connection.readPeerAddress(): cookie read by 3.96.157.49:7800 does not match own cookie; terminating connection
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:487)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
        at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
        at java.lang.Thread.run(Thread.java:619)

(10.96.157.49 is the VM IP address)

jgroups-tcp_VM.xml 2.4 KB
jgroups-tcp_Host.xml 2.4 KB

1. Re: JGroups exception over bridged network

belaban Nov 24, 2010 3:24 AM (in response to yelin666)

#1 Make sure the system property -Djgroups.bind_addr is set correctly. I saw 'localhost' in the logs, and this is certainly incorrect, as it usually resolves to 127.0.0.1

#2 You need to either set bind_addr in FD_SOCK and STREAMING_STATE_TRANSFER as well, or - even better - use -Djgroups.bind_addr. It looks as if STREAMING_STATE_TRANSFER picks the wrong bind address for its server socket

As an alternative, you could set use_default_transport=true in STREAMING_STATE_TRANSFER, so SST wouldn't create its own server socket
Actions
2. Re: JGroups exception over bridged network

yelin666 Nov 24, 2010 9:52 AM (in response to belaban)

Bela, thanks for the response. I need some clarifications:

#1 I believe the 'localhost' is coming from 'hostname' call. I know the hostname was screwed up in our lab machine, and got into the image used by the VM. I can fix that. However, I have the following set up in my JGroups config files. I assume the address specified there would be used for binding if -Djgroups.bind_addr is not specified. And I did test with "telnet 3.96.157.49 7800" from 3.96.157.50 and "telnet 3.96.157.50 7800" from 3.96.157.49, and it works both ways. So it seems the binding works at least. Is -Djgroups.bind_addr required to be set, or optional? Also, based on your description, this property is used in many places, could you please give more details on all the places this property is being referenced? And if the property is not set, what's the default value being used in those places? I couldn't find this information in JGroups manual.
<TCP bind_addr="${jgroups.bind_addr:3.96.157.49}" [on VM]
bind_port="${jgroups.tcp.bind_port:7800}"
<TCP bind_addr="${jgroups.bind_addr:3.96.157.50}" [on Host Machine]
bind_port="${jgroups.tcp.bind_port:7800}"

#2 You mentioned I can set use_default_transport=true in STREAMING_STATE_TRANSFER, can I do the same for FD_SOCK? Also, I got warning log for NAKACK, how should I set that up?

Thanks in advance.
Actions
3. Re: JGroups exception over bridged network

belaban Nov 24, 2010 10:20 AM (in response to yelin666)

So you need to add bind_addr=x.x.x.x to both FD_SOCK and STREAMING_STATE_TRANSFER, or use -Djgroups.bind_addr=x.x.x.x. The latter overrides the former.

The jgroups.bind_addr system property is optional, but then JGroups doesn't know which address to bind to and picks the first non-loopback address, which is almost certainly something you want to avoid.

#2 No, you cannot do the same for FD_SOCK
Actions
4. Re: JGroups exception over bridged network

yelin666 Nov 29, 2010 2:44 PM (in response to belaban)
Bela, thanks for the information. I corrected the hostname on the VM, and added bind_addr for both FD_SOCK & STREAMING_STATE_TRANSFER for testing purpose as it's config change only without coding change. It's clustering and the data replication works as well. However, I am still getting the following FD_SOCK WARN message. Could you please suggest what I miss? I also attached my JGroups config file for your review.

2010-11-29 13:19:38,267 -0500 | WARN | FD_SOCK | apache.log4j.Category | 846 | I (linuxvm-666) was suspected by T00696119-43462; ignoring the SUSPECT message

jgroups-tcp.xml 2.5 KB
Actions
5. Re: JGroups exception over bridged network

belaban Nov 30, 2010 2:47 AM (in response to yelin666)

I suggest enable TRACE for org.jgroups.protocols.FD_SOCK, to see what's going on. You could also use probe.sh (e.g. probe.sh jmx=FD_SOCK) to get info about FD_SOCK in your running system
Actions
6. Re: JGroups exception over bridged network

yelin666 Dec 1, 2010 10:18 AM (in response to belaban)

I followed your advice to enable TRACE. I run it twice, in both cases I started the datagrid on the 50 machine first, then 49. I assume you would be able to get some hints from the following logs. Please suggest what's the problem.

The first time, I only turn on TRACE on 49, and here is the log:
2010-12-01 09:54:19,864 -0500 | INFO | JChannel        | JGroups version: 2.10.0.GA
2010-12-01 09:54:20,197 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.49:37493
2010-12-01 09:54:20,652 -0500 | TRACE | FD_SOCK         | who-has-sock linuxvm-11495
2010-12-01 09:54:20,655 -0500 | TRACE | FD_SOCK         | i-have-sock: T00696119-40724 --> 3.96.157.50:4223 (cache is {T00696119-40724=3.96.157.50:4223})
2010-12-01 09:54:20,663 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-40724, linuxvm-11495]
2010-12-01 09:54:20,668 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-11495 --> 3.96.157.49:37493 (cache is {linuxvm-11495=3.96.157.49:37493, T00696119-40724=3.96.157.50:4223})
2010-12-01 09:54:20,671 -0500 | TRACE | FD_SOCK         | got cache from T00696119-40724: cache is {linuxvm-11495=3.96.157.49:37493, T00696119-40724=3.96.157.50:4223}
2010-12-01 09:54:20,671 -0500 | TRACE | FD_SOCK         | pinger_thread started
2010-12-01 09:54:20,672 -0500 | DEBUG | FD_SOCK         | ping_dest is T00696119-40724, pingable_mbrs=[T00696119-40724, linuxvm-11495]
2010-12-01 09:54:20,673 -0500 | TRACE | FD_SOCK         | ping_dest=T00696119-40724, ping_sock=Socket[addr=/3.96.157.50,port=4223,localport=54927], cache={linuxvm-11495=3.96.157.49:37493, T00696119-40724=3.96.157.50:4223}
2010-12-01 09:54:21,654 -0500 | TRACE | FD_SOCK         | [SUSPECT] hdr=SUSPECT, mbrs=[linuxvm-11495]
2010-12-01 09:54:21,655 -0500 | WARN | FD_SOCK         | I (linuxvm-11495) was suspected by T00696119-40724; ignoring the SUSPECT message
2010-12-01 09:54:24,147 -0500 | WARN | TCPConnectionMap | Could not accept connection from peer
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:485)
        at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
        at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
        at java.lang.Thread.run(Thread.java:619)
2010-12-01 10:05:43,476 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [linuxvm-11495]
2010-12-01 10:05:43,478 -0500 | DEBUG | FD_SOCK         | peer null closed socket gracefully
2010-12-01 10:05:43,793 -0500 | TRACE | FD_SOCK         | pinger thread terminated

The second time, the log on 49 looks ok as follows:
2010-12-01 10:07:11,824 -0500 | INFO | JChannel        | JGroups version: 2.10.0.GA
2010-12-01 10:07:12,143 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.49:51502
2010-12-01 10:07:12,236 -0500 | TRACE | FD_SOCK         | who-has-sock linuxvm-44223
2010-12-01 10:07:12,239 -0500 | TRACE | FD_SOCK         | i-have-sock: T00696119-51985 --> 3.96.157.50:1065 (cache is {T00696119-51985=3.96.157.50:1065})
2010-12-01 10:07:12,240 -0500 | TRACE | FD_SOCK         | accepted connection from /3.96.157.50:1135
2010-12-01 10:07:12,251 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-51985, linuxvm-44223]
2010-12-01 10:07:12,252 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.49:51502
2010-12-01 10:07:12,256 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-44223 --> 3.96.157.49:51502 (cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502})
2010-12-01 10:07:12,259 -0500 | TRACE | FD_SOCK         | got cache from T00696119-51985: cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502}
2010-12-01 10:07:12,259 -0500 | TRACE | FD_SOCK         | pinger_thread started
2010-12-01 10:07:12,259 -0500 | DEBUG | FD_SOCK         | ping_dest is T00696119-51985, pingable_mbrs=[T00696119-51985, linuxvm-44223]
2010-12-01 10:07:12,261 -0500 | TRACE | FD_SOCK         | ping_dest=T00696119-51985, ping_sock=Socket[addr=/3.96.157.50,port=1065,localport=33448], cache={T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502}

On 50 machine, I got the following log:
2010-12-01 10:05:42,968 -0500 | TRACE | FD_SOCK         | problem sending signal NORMAL_TERMINATION
java.net.SocketException: Socket is not connected
    at java.net.Socket.getOutputStream(Socket.java:830)
    at org.jgroups.protocols.FD_SOCK.sendPingSignal(FD_SOCK.java:501)
    at org.jgroups.protocols.FD_SOCK.sendPingTermination(FD_SOCK.java:493)
    at org.jgroups.protocols.FD_SOCK.stopPingerThread(FD_SOCK.java:487)
    at org.jgroups.protocols.FD_SOCK.stop(FD_SOCK.java:165)
    at org.jgroups.stack.ProtocolStack.stopStack(ProtocolStack.java:847)
    at org.jgroups.JChannel.stopStack(JChannel.java:1908)
    at org.jgroups.JChannel.disconnect(JChannel.java:585)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.stop(JGroupsTransport.java:183)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:170)
    at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:852)
    at org.infinispan.factories.AbstractComponentRegistry.internalStop(AbstractComponentRegistry.java:701)
    at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:595)
    at org.infinispan.factories.GlobalComponentRegistry.stop(GlobalComponentRegistry.java:149)
    at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:477)
    at com.ge.energy.ssi.core.datagrid.infinispan.InfinispanDataGridManager.shutdown(InfinispanDataGridManager.java:79)
    at com.ge.energy.ssi.core.datagrid.osgi.Activator.stop(Activator.java:72)
    at org.eclipse.osgi.framework.internal.core.BundleContextImpl$2.run(BundleContextImpl.java:843)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.eclipse.osgi.framework.internal.core.BundleContextImpl.stop(BundleContextImpl.java:836)
    at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:474)
    at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:546)
    at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1098)
    at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:593)
    at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:261)
    at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:216)
    at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:266)
    at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:685)
    at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:583)
    at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:243)
    at java.lang.Thread.run(Thread.java:619)
2010-12-01 10:06:26,803 -0500 | INFO | JChannel        | JGroups version: 2.10.0.GA
2010-12-01 10:06:27,021 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.50:1065
2010-12-01 10:06:30,037 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-51985]
2010-12-01 10:07:11,731 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-51985, linuxvm-44223]
2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | i-have-sock: T00696119-51985 --> 3.96.157.50:1065 (cache is {T00696119-51985=3.96.157.50:1065})
2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | pinger_thread started
2010-12-01 10:07:11,731 -0500 | DEBUG | FD_SOCK         | ping_dest is linuxvm-44223, pingable_mbrs=[T00696119-51985, linuxvm-44223]
2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-44223 --> 3.96.157.49:51502 (cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502})
2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | ping_dest=linuxvm-44223, ping_sock=Socket[addr=/3.96.157.49,port=51502,localport=1135], cache={T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502}
2010-12-01 10:07:11,747 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-44223 --> 3.96.157.49:51502 (cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502})
2010-12-01 10:07:11,762 -0500 | TRACE | FD_SOCK         | accepted connection from /3.96.157.49:33448
2010-12-01 10:07:11,762 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.50:1065
Actions
7. Re: JGroups exception over bridged network

belaban Dec 2, 2010 8:49 AM (in response to yelin666)

You're probably shutting down both instances concurrently, so the second instance cannot tell the first instance that it is shutting down gracefully.
Hard to say, but it looks like this isn't causing any issues, correct ?
Actions
8. Re: JGroups exception over bridged network

yelin666 Dec 2, 2010 8:56 AM (in response to belaban)

Yeah, it was clustering and replicating data ok. However, I got "2010-12-01 09:54:21,655 -0500 | WARN | FD_SOCK | I (linuxvm-11495) was suspected by T00696119-40724; ignoring the SUSPECT message" from time to time, so I am not sure if there is any problem in my configuration...
Actions

Go to original post