6 Replies Latest reply on Mar 9, 2014 8:51 PM by vbchin2

    TransportException when Hot Rod connector is bound to a XSITE cache-container

    vbchin2

      ISSUE SCENARIO:

       

      1. If the hotrod-connector is bound to a cache-container meant for cross-datacenter replication, I get TransportExceptions on the (Hot Rod) client side when one of the nodes goes down
      2. If the hotrod-connector is left bound to the clustered cache-container, the detection of a failed node is handled gracefully (transport invalidated), but no cache operations can be performed and the desired cache (labCache) is reported as not found


      05:29:26,559 WARN  [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Thread-82) ISPN004022: Unable to invalidate transport for server: /127.0.0.1:11922
      05:29:26,563 ERROR [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory] (Thread-80) ISPN004017: Could not fetch transport: org.infinispan.client.hotrod.exceptions.TransportException:: Could not connect to server: /127.0.0.1:11922
          at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:74) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:35) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:16) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.apache.commons.pool.impl.GenericKeyedObjectPool.borrowObject(GenericKeyedObjectPool.java:1220) [commons-pool-1.6-redhat-4.jar:1.6-redhat-4]
          at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory.borrowTransportFromPool(TcpTransportFactory.java:287) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory.getTransport(TcpTransportFactory.java:165) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.infinispan.client.hotrod.impl.operations.StatsOperation.getTransport(StatsOperation.java:30) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.infinispan.client.hotrod.impl.operations.RetryOnFailureOperation.execute(RetryOnFailureOperation.java:45) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
          at org.infinispan.client.hotrod.impl.RemoteCacheImpl.stats(RemoteCacheImpl.java:194) [infinispan-client-hotrod-6.0.1.Final-redhat-2.jar:6.0.1.Final-redhat-2]
      

       

      CONFIG for ISSUE SCENARIO #1

       

      <subsystem xmlns="urn:infinispan:server:endpoint:6.0">
          <hotrod-connector socket-binding="hotrod" cache-container="xsite">
              <topology-state-transfer lazy-retrieval="false" lock-timeout="1000" replication-timeout="5000"/>
          </hotrod-connector>
          ...
      </subsystem>
      <subsystem xmlns="urn:infinispan:server:core:6.0" default-cache-container="clustered">
          ....
          <cache-container name="xsite" default-cache="default" statistics="true">
              <transport executor="infinispan-transport" lock-timeout="60000" cluster="site-1" stack="relay"/>
              <distributed-cache name="labCache" mode="SYNC" owners="2" remote-timeout="30000" start="EAGER">
                  <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                  <transaction mode="NONE"/>
                  <backups>
                      <backup site="site-2" failure-policy="WARN" enabled="true" strategy="SYNC" timeout="10000"/>
                  </backups>
              </distributed-cache>
          </cache-container>
          ...
      </subsystem>
      
      

       

      CONFIG for ISSUE SCENARIO #2: Same as above, but with the following change:

       

          <hotrod-connector socket-binding="hotrod" cache-container="clustered">
      

       

      Please let me know if there is a quick fix for this, or whether I need to provide detailed TRACE logs for org.infinispan and possibly org.jgroups.


        • 1. Re: TransportException when Hot Rod connector is bound to a XSITE cache-container
          vbchin2

          Adding logs from the server with TRACE enabled on org.infinispan.

           

          Adding some additional context here:

      1. The Hot Rod client is being used from a webapp on EAP 6 connecting to a grid of nodes.
      2. The exception is thrown specifically when this call is made: cacheManager.getCache().stats(); (see the sketch below)
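
          For reference, a minimal sketch of that failing call (not from the original post; the host/port values and the choice of the default cache are placeholders for the grid nodes' hotrod socket-bindings):

          import org.infinispan.client.hotrod.RemoteCache;
          import org.infinispan.client.hotrod.RemoteCacheManager;
          import org.infinispan.client.hotrod.ServerStatistics;
          import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

          public class StatsProbe {
              public static void main(String[] args) {
                  // Placeholder addresses; the webapp would point these at the grid nodes.
                  ConfigurationBuilder builder = new ConfigurationBuilder();
                  builder.addServer().host("127.0.0.1").port(11222)
                         .addServer().host("127.0.0.1").port(11322)
                         .addServer().host("127.0.0.1").port(11922);

                  RemoteCacheManager cacheManager = new RemoteCacheManager(builder.build());
                  try {
                      // The call from item 2 above: stats() on the default remote cache.
                      RemoteCache<String, String> cache = cacheManager.getCache();
                      ServerStatistics stats = cache.stats();
                      System.out.println("currentNumberOfEntries = "
                            + stats.getStatistic(ServerStatistics.CURRENT_NR_OF_ENTRIES));
                  } finally {
                      cacheManager.stop();
                  }
              }
          }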
          • 2. Re: TransportException when Hot Rod connector is bound to a XSITE cache-container
            rvansa

            The second configuration makes no sense, as you are pointing the Hot Rod requests at a container that is not configured.

             

            Regarding the error logs polluting the application log: these are expected; something failed, so an error has been logged. Still, I agree with you that it may be confusing for the user. I've created [1] to reduce the verbosity.

            The WARN message is logged because we invalidate a transport that has not been created yet. That's certainly a bug (although without any functional consequences); the JIRA is [2].

             

            [1] [ISPN-4082] TcpTransportFactory shouldn't log error when it cannot fetch transport - JBoss Issue Tracker

            [2] [ISPN-4083] Do not invalidate null Transport - JBoss Issue Tracker

            • 3. Re: Re: TransportException when Hot Rod connector is bound to a XSITE cache-container
              vbchin2

              It's my bad! I was trying to be concise with the amount of XML I pasted; the clustered cache-container was supposed to be hidden in the "..." sections above. For the sake of clarity, below is the full configuration of the Infinispan subsystems for ISSUE SCENARIO #2.

               

              I respectfully disagree that the failure to invalidate the transport should be considered harmless, since it ends up affecting the list of servers the RemoteCacheManager maintains internally via the associated TcpTransportFactory. The same node failure is handled gracefully when the hotrod-connector is bound to the clustered cache-container (SCENARIO #2).

               

              So let's assume we started the Hot Rod client connected to 3 nodes and one of the nodes goes down; the server count in the TcpTransportFactory associated with the RemoteCacheManager is then (see the probe sketch after this list):

              • In case of SCENARIO #1: 3
              • In case of SCENARIO #2: 2
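
              If it helps to reproduce that comparison, here is a rough way one could dump the client's current server list. This relies on client internals rather than public API: the transportFactory field name and the TcpTransportFactory.getServers() accessor are assumptions about the 6.0.x code and may differ between versions.

              import java.lang.reflect.Field;
              import java.net.SocketAddress;
              import java.util.Collection;

              import org.infinispan.client.hotrod.RemoteCacheManager;
              import org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory;

              public class ServerListProbe {
                  // Returns the servers the client currently knows about; the expectation from the
                  // two scenarios above is that this stays at 3 in SCENARIO #1 and drops to 2 in SCENARIO #2.
                  static Collection<SocketAddress> knownServers(RemoteCacheManager cacheManager) throws Exception {
                      Field f = RemoteCacheManager.class.getDeclaredField("transportFactory"); // assumed internal field name
                      f.setAccessible(true);
                      TcpTransportFactory factory = (TcpTransportFactory) f.get(cacheManager);
                      return factory.getServers(); // assumed internal accessor
                  }
              }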

               

              I will try to pinpoint exactly where the issue could be happening if I can.

               

              <subsystem xmlns="urn:infinispan:server:endpoint:6.0">
                  <hotrod-connector socket-binding="hotrod" cache-container="clustered">
                      <topology-state-transfer lazy-retrieval="false" lock-timeout="1000" replication-timeout="5000"/>
                  </hotrod-connector>
                  <memcached-connector socket-binding="memcached" cache-container="clustered"/>
                  <rest-connector virtual-server="default-host" cache-container="xsite"/>
              </subsystem>
              <subsystem xmlns="urn:infinispan:server:core:6.0" default-cache-container="clustered">
                  <cache-container name="clustered" default-cache="default" statistics="true">
                      <transport executor="infinispan-transport" lock-timeout="60000"/>
                      <distributed-cache name="default" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
                          <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                          <transaction mode="NONE"/>
                      </distributed-cache>
                      <distributed-cache name="memcachedCache" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
                          <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                          <transaction mode="NONE"/>
                      </distributed-cache>
                      <distributed-cache name="namedCache" mode="SYNC" start="EAGER"/>
                  </cache-container>
                  <cache-container name="xsite" default-cache="default" statistics="true">
                      <transport executor="infinispan-transport" lock-timeout="60000" cluster="site-1" stack="relay"/>
                      <distributed-cache name="labCache" mode="SYNC" owners="2" remote-timeout="30000" start="EAGER">
                          <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                          <transaction mode="NONE"/>
                          <backups>
                              <backup site="site-2" failure-policy="WARN" enabled="true" strategy="SYNC" timeout="10000"/>
                          </backups>
                      </distributed-cache>
                  </cache-container>
                  <cache-container name="security"/>
              </subsystem>
              
              
              • 4. Re: TransportException when Hot Rod connector is bound to a XSITE cache-container
                rvansa

                It seems that you may have a problem with the server list being updated. I have gone through the trace logs you posted, and the client really does not receive the new topology.

                In Hot Rod, the server list is not updated when the client cannot connect to a server, but when the other servers cannot connect to that server. Then a new topology is installed and the clients are notified about it. You can see that in the server logs as an INFO-level message.

                So it seems that the problem is not in Hot Rod but in the cluster. What JGroups configuration do you use, the default one?

                 

                I think we need trace logs from the server as well, to see what's happening there and why the other nodes do not detect that the server has crashed. By the way, do you just kill the server (kill -9), or do you use something different to simulate the crash?

                • 5. Re: Re: TransportException when Hot Rod connector is bound to a XSITE cache-container
                  vbchin2

                  I can work on getting those logs for you. JGroups is using UDP multicast for local cluster discovery, and a TCP relay configured with MPING as the discovery protocol for cross-site. I am attaching both sites' XML files for reference (they have the hotrod-connector bound to the xsite cache-container). I took both approaches, graceful and forceful, but the outcome was the same.

                   

                  Though you will eventually see this in the logs, the cluster discovers the failure of the node pretty quickly.

                  • 6. Re: Re: TransportException when Hot Rod connector is bound to a XSITE cache-container
                    vbchin2

                    QUICK UPDATE:

                     

                    I was able to get the cross-site datacenter replication functionality AND a cluster-topology-aware Hot Rod client working by eliminating the need for a separate cache container altogether. Below is the XML snippet of the configuration file, including the Infinispan Server Core and Server Endpoint subsystems, that finally worked.

                     

                    Though I have my path clear, I believe we should continue to look into, when time permits, why a separate cache container configured with a separate transport cannot coexist with the default clustered cache-container while also providing correct topology information to the Hot Rod client.

                     

                    <subsystem xmlns="urn:infinispan:server:endpoint:6.0">
                        <hotrod-connector socket-binding="hotrod" cache-container="clustered">
                            <topology-state-transfer lazy-retrieval="false" lock-timeout="1000" replication-timeout="5000"/>
                        </hotrod-connector>
                        <memcached-connector socket-binding="memcached" cache-container="clustered"/>
                        <rest-connector virtual-server="default-host" cache-container="clustered"/>
                    </subsystem>
                    <subsystem xmlns="urn:infinispan:server:core:6.0" default-cache-container="clustered">
                        <cache-container name="clustered" default-cache="default" statistics="true">
                            <transport executor="infinispan-transport" lock-timeout="60000" cluster="site-1" stack="relay"/>
                            <distributed-cache name="default" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
                                <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                                <transaction mode="NONE"/>
                            </distributed-cache>
                            <distributed-cache name="memcachedCache" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
                                <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                                <transaction mode="NONE"/>
                            </distributed-cache>
                            <distributed-cache name="namedCache" mode="SYNC" start="EAGER"/>
                            <distributed-cache name="labCache" mode="SYNC" owners="2" remote-timeout="30000" start="EAGER">
                                <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
                                <transaction mode="NONE"/>
                                <backups>
                                    <backup site="site-2" failure-policy="WARN" enabled="true" strategy="SYNC" timeout="10000"/>
                                </backups>
                            </distributed-cache>
                        </cache-container>
                        <cache-container name="security"/>
                    </subsystem>