8 Replies Latest reply on Dec 2, 2010 8:56 AM by yelin666

    JGroups exception over bridged network

    yelin666

      I am running two cache instances - one on a VM managed by VMware, and the other on the host machine, and they are connected by the bridged network. Is this supported by JGroups? I am using TCP configuration with attached JGroups config files. I tested with different scenarios, and got the following observations. Please suggest what's the possible problems:

       

      1. If I start the CacheManager and get a cache on the host machine first, then when I start the CacheManager and get a cache on the VM, get the following warning & exception on the VM instance:

      2010-11-23 11:52:43,166 -0500 | INFO  | JChannel        | apache.log4j.Category                              |   849 | JGroups version: 2.10.0.GA
      2010-11-23 11:52:45,577 -0500 | WARN  | FD_SOCK         | apache.log4j.Category                              |   846 | I (localhost-47185) was suspected by T00696119-61197; ignoring the SUSPECT message
      2010-11-23 11:52:45,629 -0500 | INFO  | GlobalComponentRegistry | apache.log4j.Category                              |   849 | Infinispan version: Infinispan 'Radegast' 4.1.0.FINAL
      2010-11-23 11:52:45,741 -0500 | WARN  | STREAMING_STATE_TRANSFER | apache.log4j.Category                              |   846 | State reader socket thread spawned abnormaly
      java.net.SocketException: No such device
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
              at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
              at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
              at java.net.Socket.connect(Socket.java:529)
              at org.jgroups.util.Util.connect(Util.java:247)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER.java:512)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER.java:464)
              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:225)
              at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
              at org.jgroups.protocols.FC.up(FC.java:474)
              at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
              at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
              at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
              at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
              at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
              at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
              at org.jgroups.protocols.FD.up(FD.java:266)
              at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:270)
              at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
              at org.jgroups.protocols.Discovery.up(Discovery.java:281)
              at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
              at org.jgroups.protocols.TP.access$100(TP.java:56)
              at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1549)
              at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1531)
              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
              at java.lang.Thread.run(Thread.java:619)
      2010-11-23 11:52:45,744 -0500 | WARN  | STREAMING_STATE_TRANSFER | apache.log4j.Category                              |   846 | Could not connect to state provider. Closing socket...

       

      2. If I start the CacheManager and get a cache on the VM first, then start the CacheManager and get a cache on the host machine, sometimes the data replication works. However, even when it works, I get the following warnings:

      2010-11-23 10:38:59,005 -0500 | WARN  | NAKACK          | apache.log4j.Category                              |   846 | T00696119-29820: dropped message from localhost-39737 (not in xmit_table), keys are [T00696119-29820], view=[T00696119-29820|0] [T00696119-29820] (on host machine)

      2010-11-23 10:38:55,348 -0500 | WARN  | NAKACK          | apache.log4j.Category                              |   846 | localhost-39737: dropped message from T00696119-29820 (not in xmit_table), keys are [localhost-39737], view=[localhost-39737|0] [localhost-39737] (on VM)

       

      Sometimes replication failed with the following exception on the VM:

      2010-11-23 10:21:13,464 -0500 | WARN  | TCPConnectionMap | apache.log4j.Category                              |   846 | Could not accept connection from peer
      java.net.SocketException: Connection reset
              at java.net.SocketInputStream.read(SocketInputStream.java:168)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
              at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
              at java.io.DataInputStream.readFully(DataInputStream.java:178)
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:485)
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
              at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
              at java.lang.Thread.run(Thread.java:619)
      2010-11-23 10:21:17,405 -0500 | WARN  | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
      2010-11-23 10:21:17,405 -0500 | WARN  | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
      2010-11-23 10:22:26,575 -0500 | WARN  | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
      2010-11-23 10:22:29,862 -0500 | WARN  | NAKACK          | apache.log4j.Category                              |   846 | localhost-28269: dropped message from T00696119-15155 (not in xmit_table), keys are [localhost-28269], view=[localhost-28269|0] [localhost-28269]
      2010-11-23 10:22:45,702 -0500 | WARN  | TCPConnectionMap | apache.log4j.Category                              |   846 | Could not accept connection from peer
      java.net.SocketException: ConnectionTable.Connection.readPeerAddress(): cookie read by 3.96.157.49:7800 does not match own cookie; terminating connection
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:487)
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
              at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
              at java.lang.Thread.run(Thread.java:619)
      2010-11-23 10:23:26,832 -0500 | WARN  | TCPConnectionMap | apache.log4j.Category                              |   846 | Could not accept connection from peer
      java.net.SocketException: ConnectionTable.Connection.readPeerAddress(): cookie read by 3.96.157.49:7800 does not match own cookie; terminating connection
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:487)
              at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
              at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
              at java.lang.Thread.run(Thread.java:619)

      (10.96.157.49 is the VM IP address)

        • 1. Re: JGroups exception over bridged network
          belaban

          #1 Make sure the system property -Djgroups.bind_addr is set correctly. I saw 'localhost' in the logs, and this is certainly incorrect, as it usually resolves to 127.0.0.1

           

          #2 You need to either set bind_addr in FD_SOCK and STREAMING_STATE_TRANSFER as well, or - even better - use -Djgroups.bind_addr. It looks as if STREAMING_STATE_TRANSFER picks the wrong bind address for its server socket

           

          As an alternative, you could set use_default_transport=true in STREAMING_STATE_TRANSFER, so SST wouldn't create its own server socket

          • 2. Re: JGroups exception over bridged network
            yelin666

            Bela, thanks for the response. I need some clarifications:

             

            #1 I believe the 'localhost' is coming from 'hostname' call. I know the hostname was screwed up in our lab machine, and got into the image used by the VM. I can fix that. However, I have the following set up in my JGroups config files. I assume the address specified there would be used for binding if -Djgroups.bind_addr is not specified. And I did test with "telnet 3.96.157.49 7800" from 3.96.157.50 and "telnet 3.96.157.50 7800" from 3.96.157.49, and it works both ways. So it seems the binding works at least. Is -Djgroups.bind_addr required to be set, or optional? Also, based on your description, this property is used in many places, could you please give more details on all the places this property is being referenced? And if the property is not set, what's the default value being used in those places? I couldn't find this information in JGroups manual.

            <TCP bind_addr="${jgroups.bind_addr:3.96.157.49}" [on VM]
                    bind_port="${jgroups.tcp.bind_port:7800}"

            <TCP bind_addr="${jgroups.bind_addr:3.96.157.50}" [on Host Machine]
                    bind_port="${jgroups.tcp.bind_port:7800}"

             

            #2 You mentioned I can set use_default_transport=true in STREAMING_STATE_TRANSFER, can I do the same for FD_SOCK? Also, I got warning log for NAKACK, how should I set that up?

             

            Thanks in advance.

            • 3. Re: JGroups exception over bridged network
              belaban

              So you need to add bind_addr=x.x.x.x to both FD_SOCK and STREAMING_STATE_TRANSFER, or use -Djgroups.bind_addr=x.x.x.x. The latter overrides the former.

               

              The jgroups.bind_addr system property is optional, but then JGroups doesn't know which address to bind to and picks the first non-loopback address, which is almost certainly something you want to avoid.

               

              #2 No, you cannot do the same for FD_SOCK

              • 4. Re: JGroups exception over bridged network
                yelin666

                Bela, thanks for the information. I corrected the hostname on the VM, and added bind_addr for both FD_SOCK & STREAMING_STATE_TRANSFER for testing purpose as it's config change only without coding change. It's clustering and the data replication works as well. However, I am still getting the following FD_SOCK WARN message. Could you please suggest what I miss? I also attached my JGroups config file for your review.

                 

                2010-11-29 13:19:38,267 -0500 | WARN  | FD_SOCK         | apache.log4j.Category                              |   846 | I (linuxvm-666) was suspected by T00696119-43462; ignoring the SUSPECT message

                • 5. Re: JGroups exception over bridged network
                  belaban

                  I suggest enable TRACE for org.jgroups.protocols.FD_SOCK, to see what's going on. You could also use probe.sh (e.g. probe.sh jmx=FD_SOCK) to get info about FD_SOCK in your running system

                  • 6. Re: JGroups exception over bridged network
                    yelin666

                    I followed your advice to enable TRACE. I run it twice, in both cases I started the datagrid on the 50 machine first, then 49. I assume you would be able to get some hints from the following logs. Please suggest what's the problem.

                     

                    The first time, I only turn on TRACE on 49, and here is the log:

                    2010-12-01 09:54:19,864 -0500 | INFO  | JChannel        | JGroups version: 2.10.0.GA
                    2010-12-01 09:54:20,197 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.49:37493
                    2010-12-01 09:54:20,652 -0500 | TRACE | FD_SOCK         | who-has-sock linuxvm-11495
                    2010-12-01 09:54:20,655 -0500 | TRACE | FD_SOCK         | i-have-sock: T00696119-40724 --> 3.96.157.50:4223 (cache is {T00696119-40724=3.96.157.50:4223})
                    2010-12-01 09:54:20,663 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-40724, linuxvm-11495]
                    2010-12-01 09:54:20,668 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-11495 --> 3.96.157.49:37493 (cache is {linuxvm-11495=3.96.157.49:37493, T00696119-40724=3.96.157.50:4223})
                    2010-12-01 09:54:20,671 -0500 | TRACE | FD_SOCK         | got cache from T00696119-40724: cache is {linuxvm-11495=3.96.157.49:37493, T00696119-40724=3.96.157.50:4223}
                    2010-12-01 09:54:20,671 -0500 | TRACE | FD_SOCK         | pinger_thread started
                    2010-12-01 09:54:20,672 -0500 | DEBUG | FD_SOCK         | ping_dest is T00696119-40724, pingable_mbrs=[T00696119-40724, linuxvm-11495]
                    2010-12-01 09:54:20,673 -0500 | TRACE | FD_SOCK         | ping_dest=T00696119-40724, ping_sock=Socket[addr=/3.96.157.50,port=4223,localport=54927], cache={linuxvm-11495=3.96.157.49:37493, T00696119-40724=3.96.157.50:4223}
                    2010-12-01 09:54:21,654 -0500 | TRACE | FD_SOCK         | [SUSPECT] hdr=SUSPECT, mbrs=[linuxvm-11495]
                    2010-12-01 09:54:21,655 -0500 | WARN  | FD_SOCK         | I (linuxvm-11495) was suspected by T00696119-40724; ignoring the SUSPECT message
                    2010-12-01 09:54:24,147 -0500 | WARN  | TCPConnectionMap | Could not accept connection from peer
                    java.net.SocketException: Connection reset
                            at java.net.SocketInputStream.read(SocketInputStream.java:168)
                            at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                            at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
                            at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
                            at java.io.DataInputStream.readFully(DataInputStream.java:178)
                            at org.jgroups.blocks.TCPConnectionMap$TCPConnection.readPeerAddress(TCPConnectionMap.java:485)
                            at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:375)
                            at org.jgroups.blocks.TCPConnectionMap$ConnectionAcceptor.run(TCPConnectionMap.java:246)
                            at java.lang.Thread.run(Thread.java:619)
                    2010-12-01 10:05:43,476 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [linuxvm-11495]
                    2010-12-01 10:05:43,478 -0500 | DEBUG | FD_SOCK         | peer null closed socket gracefully
                    2010-12-01 10:05:43,793 -0500 | TRACE | FD_SOCK         | pinger thread terminated

                     

                    The second time, the log on 49 looks ok as follows:

                    2010-12-01 10:07:11,824 -0500 | INFO  | JChannel        | JGroups version: 2.10.0.GA
                    2010-12-01 10:07:12,143 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.49:51502
                    2010-12-01 10:07:12,236 -0500 | TRACE | FD_SOCK         | who-has-sock linuxvm-44223
                    2010-12-01 10:07:12,239 -0500 | TRACE | FD_SOCK         | i-have-sock: T00696119-51985 --> 3.96.157.50:1065 (cache is {T00696119-51985=3.96.157.50:1065})
                    2010-12-01 10:07:12,240 -0500 | TRACE | FD_SOCK         | accepted connection from /3.96.157.50:1135
                    2010-12-01 10:07:12,251 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-51985, linuxvm-44223]
                    2010-12-01 10:07:12,252 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.49:51502
                    2010-12-01 10:07:12,256 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-44223 --> 3.96.157.49:51502 (cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502})
                    2010-12-01 10:07:12,259 -0500 | TRACE | FD_SOCK         | got cache from T00696119-51985: cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502}
                    2010-12-01 10:07:12,259 -0500 | TRACE | FD_SOCK         | pinger_thread started
                    2010-12-01 10:07:12,259 -0500 | DEBUG | FD_SOCK         | ping_dest is T00696119-51985, pingable_mbrs=[T00696119-51985, linuxvm-44223]
                    2010-12-01 10:07:12,261 -0500 | TRACE | FD_SOCK         | ping_dest=T00696119-51985, ping_sock=Socket[addr=/3.96.157.50,port=1065,localport=33448], cache={T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502}

                     

                    On 50 machine, I got the following log:

                    2010-12-01 10:05:42,968 -0500 | TRACE | FD_SOCK         | problem sending signal NORMAL_TERMINATION
                    java.net.SocketException: Socket is not connected
                        at java.net.Socket.getOutputStream(Socket.java:830)
                        at org.jgroups.protocols.FD_SOCK.sendPingSignal(FD_SOCK.java:501)
                        at org.jgroups.protocols.FD_SOCK.sendPingTermination(FD_SOCK.java:493)
                        at org.jgroups.protocols.FD_SOCK.stopPingerThread(FD_SOCK.java:487)
                        at org.jgroups.protocols.FD_SOCK.stop(FD_SOCK.java:165)
                        at org.jgroups.stack.ProtocolStack.stopStack(ProtocolStack.java:847)
                        at org.jgroups.JChannel.stopStack(JChannel.java:1908)
                        at org.jgroups.JChannel.disconnect(JChannel.java:585)
                        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.stop(JGroupsTransport.java:183)
                        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                        at java.lang.reflect.Method.invoke(Method.java:597)
                        at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:170)
                        at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:852)
                        at org.infinispan.factories.AbstractComponentRegistry.internalStop(AbstractComponentRegistry.java:701)
                        at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:595)
                        at org.infinispan.factories.GlobalComponentRegistry.stop(GlobalComponentRegistry.java:149)
                        at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:477)
                        at com.ge.energy.ssi.core.datagrid.infinispan.InfinispanDataGridManager.shutdown(InfinispanDataGridManager.java:79)
                        at com.ge.energy.ssi.core.datagrid.osgi.Activator.stop(Activator.java:72)
                        at org.eclipse.osgi.framework.internal.core.BundleContextImpl$2.run(BundleContextImpl.java:843)
                        at java.security.AccessController.doPrivileged(Native Method)
                        at org.eclipse.osgi.framework.internal.core.BundleContextImpl.stop(BundleContextImpl.java:836)
                        at org.eclipse.osgi.framework.internal.core.BundleHost.stopWorker(BundleHost.java:474)
                        at org.eclipse.osgi.framework.internal.core.AbstractBundle.suspend(AbstractBundle.java:546)
                        at org.eclipse.osgi.framework.internal.core.Framework.suspendBundle(Framework.java:1098)
                        at org.eclipse.osgi.framework.internal.core.StartLevelManager.decFWSL(StartLevelManager.java:593)
                        at org.eclipse.osgi.framework.internal.core.StartLevelManager.doSetStartLevel(StartLevelManager.java:261)
                        at org.eclipse.osgi.framework.internal.core.StartLevelManager.shutdown(StartLevelManager.java:216)
                        at org.eclipse.osgi.framework.internal.core.InternalSystemBundle.suspend(InternalSystemBundle.java:266)
                        at org.eclipse.osgi.framework.internal.core.Framework.shutdown(Framework.java:685)
                        at org.eclipse.osgi.framework.internal.core.Framework.close(Framework.java:583)
                        at org.eclipse.osgi.framework.internal.core.InternalSystemBundle$1.run(InternalSystemBundle.java:243)
                        at java.lang.Thread.run(Thread.java:619)
                    2010-12-01 10:06:26,803 -0500 | INFO  | JChannel        | JGroups version: 2.10.0.GA
                    2010-12-01 10:06:27,021 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.50:1065
                    2010-12-01 10:06:30,037 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-51985]
                    2010-12-01 10:07:11,731 -0500 | DEBUG | FD_SOCK         | VIEW_CHANGE received: [T00696119-51985, linuxvm-44223]
                    2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | i-have-sock: T00696119-51985 --> 3.96.157.50:1065 (cache is {T00696119-51985=3.96.157.50:1065})
                    2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | pinger_thread started
                    2010-12-01 10:07:11,731 -0500 | DEBUG | FD_SOCK         | ping_dest is linuxvm-44223, pingable_mbrs=[T00696119-51985, linuxvm-44223]
                    2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-44223 --> 3.96.157.49:51502 (cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502})
                    2010-12-01 10:07:11,731 -0500 | TRACE | FD_SOCK         | ping_dest=linuxvm-44223, ping_sock=Socket[addr=/3.96.157.49,port=51502,localport=1135], cache={T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502}
                    2010-12-01 10:07:11,747 -0500 | TRACE | FD_SOCK         | i-have-sock: linuxvm-44223 --> 3.96.157.49:51502 (cache is {T00696119-51985=3.96.157.50:1065, linuxvm-44223=3.96.157.49:51502})
                    2010-12-01 10:07:11,762 -0500 | TRACE | FD_SOCK         | accepted connection from /3.96.157.49:33448
                    2010-12-01 10:07:11,762 -0500 | TRACE | FD_SOCK         | waiting for client connections on /3.96.157.50:1065

                    • 7. Re: JGroups exception over bridged network
                      belaban

                      You're probably shutting down both instances concurrently, so the second instance cannot tell the first instance that it is shutting down gracefully.

                      Hard to say, but it looks like this isn't causing any issues, correct ?

                      • 8. Re: JGroups exception over bridged network
                        yelin666

                        Yeah, it was clustering and replicating data ok. However, I got "2010-12-01  09:54:21,655 -0500 | WARN  | FD_SOCK         | I (linuxvm-11495) was suspected  by T00696119-40724; ignoring the SUSPECT message" from time to time, so I am not  sure if there is any problem in my configuration...