I have increased timeouts, and it has helped. However, I still experience the same problem, but less frequently.
I suggest separating the connection liveness check into another thread and increasing the timeouts further. Would that work? I can write a patch, but I want to be sure this is sensible and will not break anything inside Remoting.
Help from a Remoting guru would be much appreciated.
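To make the idea concrete, here is a rough sketch of a liveness check that runs on its own thread. It is plain Java and does not use any JBoss Remoting API: the class name BackgroundLivenessChecker, the ping supplier, and the failure threshold are all hypothetical names made up for illustration. The peer is only considered dead after several consecutive failed pings.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of JBoss Remoting. */
final class BackgroundLivenessChecker implements AutoCloseable {
    // Dedicated daemon thread, so slow pings never block the calling threads.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "liveness-checker");
                t.setDaemon(true);
                return t;
            });
    private final AtomicInteger consecutiveFailures = new AtomicInteger();
    private final int failureThreshold;

    BackgroundLivenessChecker(BooleanSupplier ping, long periodMillis, int failureThreshold) {
        this.failureThreshold = failureThreshold;
        // Fixed delay: the next ping starts periodMillis after the previous one ends.
        scheduler.scheduleWithFixedDelay(() -> {
            boolean ok;
            try {
                ok = ping.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a throwing ping counts as a failed ping
            }
            if (ok) {
                consecutiveFailures.set(0);
            } else {
                consecutiveFailures.incrementAndGet();
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** True until failureThreshold pings in a row have failed. */
    boolean isConsideredAlive() {
        return consecutiveFailures.get() < failureThreshold;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Callers would just read isConsideredAlive() instead of pinging inline, so a single slow answer from a loaded slave does not immediately count as a failure.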
Sorry for the delay in responding. Have you solved your problem, or is it still outstanding?
No, I did not resolve the problem completely. However, I now understand its cause better.
As I said, there are two kinds of servers: coordinators and slaves. The slaves run with a very high concurrency factor, i.e. the load average is around 6-7, so sometimes they are unable to answer ping requests quickly enough.
To make things work, I patched the JBoss Remoting source to increase the connection checker timeout and the number of tries. This decreased the failure rate to an acceptable level. Now we see a failure event once or twice per hour.
However, it would be good to improve JBoss Remoting by making the server failure detection parameters configurable. Moreover, I think ConnectionValidator should pause between validity checks. Currently it fires all validations in a row, so an overloaded slave server may miss all of them.
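Roughly, the behaviour I have in mind looks like the sketch below. This is not the real ConnectionValidator code: PacedValidator, maxAttempts, pauseMillis, and the ping supplier are hypothetical names used only to illustrate configurable retries with a pause between them.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not part of JBoss Remoting. */
final class PacedValidator {
    private final int maxAttempts;
    private final long pauseMillis;

    PacedValidator(int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1 || pauseMillis < 0) {
            throw new IllegalArgumentException("maxAttempts >= 1 and pauseMillis >= 0 required");
        }
        this.maxAttempts = maxAttempts;
        this.pauseMillis = pauseMillis;
    }

    /** Returns true as soon as one ping succeeds, false once every attempt has failed. */
    boolean isAlive(BooleanSupplier ping) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (ping.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    // Give an overloaded server time to recover before the next try.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag set
                    return false;
                }
            }
        }
        return false;
    }
}
```

With, say, 3 attempts and a pause of a few seconds, the checks are spread over a longer window, so a short load spike on a slave is less likely to swallow all of them.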
It would also be a good idea to optimize the JNDI detector somehow. Currently, every server checks the liveness of all the others. With a large number of nodes, if just one server fails to connect to another for whatever reason, the global record in JNDI is updated, even though the other servers may still be able to see that node. As the number of cluster nodes grows, the problem will appear more frequently.