-
1. Re: TcpDelegatingCacheLoader
manik Dec 16, 2008 4:45 AM (in response to lovelyliatroim)No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.
-
2. Re: TcpDelegatingCacheLoader
lovelyliatroim Dec 16, 2008 4:49 AM (in response to lovelyliatroim)Reason im asking is this, i didnt have time yesterday to go into detail.
I ran some load tests on my app in a 3 node cluster. One node in particular takes more of the load than the others and this node also hosts the TcpCacheServer. After a while lets call it node3 ran into difficulties with GC. The whole node froze for quite a while. Now the problem is this, after a while the other 2 nodes froze as well which is not what i expected. When node3 recovered from its GC problems the other 2 nodes recovered as well.
Now there is 2 ways the other 2 nodes are connected to node3. One way is that it communicates via UDP, so i dont see this as a problem. The 2nd way is via the TcpDelegatingLoader. Now unfortunately these tests where running on our production or will be production system and i dont have write access and no-one was there from operations to take a dump for me on this so i cant say for sure what exactly happened.
I am using just the _get method of the delegatorprotected Map<Object, Object> _get(Fqn name) throws Exception { synchronized (out) { out.reset(); out.writeByte(TcpCacheOperations.GET); out.writeObject(name); out.flush(); Object retval = in.readObject(); if (retval instanceof Exception) { throw (Exception) retval; } return (Map) retval; } }
However my theory on it is this, node 1 and node2 after a period of time after node3 became totally unresponsive, it couldnt take any more requests. Only after node3 recovered did the other 2 recover so it looked like the complete thread pool of node1 and node2 got exhausted, node1 and node2 were still receiving requests for data that is located on node3, after a while the thread pool gets exhausted because every thread is waiting for a lock on the outputstream, this will never happen because we have one thread who is asking node3 for data but is blocked forever because it doesnt timeout on the read. When node1 and node2 keep receiving requests for data on node3 then eventually every thread will end up waiting for a lock that will never come.
Like I said, its a theory, very hard to say what happened for sure without a dump but my original question stands, why isnt there a timeout on the read??
Thanks guys,
LL -
3. Re: TcpDelegatingCacheLoader
lovelyliatroim Dec 16, 2008 4:56 AM (in response to lovelyliatroim)
No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.
Sorry didnt see this comment when I posted the last one. Where do i submit patches to?? Is there a link on it??
Thanks,
LL -
4. Re: TcpDelegatingCacheLoader
manik Dec 16, 2008 4:56 AM (in response to lovelyliatroim)Yeah, unfortunately without a proper thread dump it is hard to tell where the problem is. Can you not simulate this in a test or staging environment, by loading one server (which runs the TcpCacheServer) artificially (maybe even give it less memory so it has to perform GC more often) and try and lock up the other servers to get a thread dump?
In general though, the TcpCacheServer/TcpDelegatingCacheLoader code is due for an overhaul:
https://jira.jboss.org/jira/browse/JBCACHE-1262
https://jira.jboss.org/jira/browse/JBCACHE-789
https://jira.jboss.org/jira/browse/JBCACHE-1259 -
5. Re: TcpDelegatingCacheLoader
manik Dec 16, 2008 4:58 AM (in response to lovelyliatroim)"lovelyliatroim" wrote:
Sorry didnt see this comment when I posted the last one. Where do i submit patches to?? Is there a link on it??
There isn't one - you'll need to create a JIRA and attach it there. -
6. Re: TcpDelegatingCacheLoader
lovelyliatroim Dec 17, 2008 7:08 AM (in response to lovelyliatroim)
No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.
Created one for the 2.2.1.GA version
https://jira.jboss.org/jira/browse/JBCACHE-1451 -
7. Re: TcpDelegatingCacheLoader
lovelyliatroim Dec 17, 2008 7:25 AM (in response to lovelyliatroim)
Can you not simulate this in a test or staging environment, by loading one server (which runs the TcpCacheServer) artificially (maybe even give it less memory so it has to perform GC more often) and try and lock up the other servers to get a thread dump?
Will try but it wont happen this side of xmas. It will be the new year when i get a chance to look at this again. -
8. Re: TcpDelegatingCacheLoader
lovelyliatroim Jan 5, 2009 7:38 AM (in response to lovelyliatroim)Just a comment on the TcpCacheServer as well.
I dont like thisThread serverThread = new Thread("TcpCacheServer") { @Override public void run() { try { while (running) { Socket client_sock = srv_sock.accept(); Connection conn = new Connection(client_sock, cache); conns.add(conn); conn.start(); } } catch (SocketException se) { if (!running) { // this is because of the stop() lifecycle method being called. // ignore. log.info("Shutting down TcpCacheServer"); } else { log.error("Caught exception! Shutting down server thread.", se); } } catch (IOException e) { log.error("Caught exception! Shutting down server thread.", e); } } };
The reason i dont like it, is that as soon as there is a problem or an exception thrown the TcpCacheServer listening thread shutsdown, so that means no other nodes are no longer able to connect and thus a restart of this node is needed in order to recover.
Any reason why this is this way?? Should be catching inside the loop and trying to recover, not just stop listening completely.
Cheers,
LL -
9. Re: TcpDelegatingCacheLoader
manik Jan 5, 2009 7:40 AM (in response to lovelyliatroim)Yes, this is all a part of the planned overhaul to this code.
I suppose a simplistic retry could make it into the next minor release though. Feel like creating a JIRA? -
10. Re: TcpDelegatingCacheLoader
lovelyliatroim Jan 5, 2009 8:00 AM (in response to lovelyliatroim)
Yes, this is all a part of the planned overhaul to this code.
I suppose a simplistic retry could make it into the next minor release though. Feel like creating a JIRA?
Done.
https://jira.jboss.org/jira/browse/JBCACHE-1457 -
11. Re: TcpDelegatingCacheLoader
lovelyliatroim Jan 12, 2009 4:56 AM (in response to lovelyliatroim)"lovelyliatroim" wrote:
No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.
Created one for the 2.2.1.GA version
https://jira.jboss.org/jira/browse/JBCACHE-1451
Hi Manik,
You might have to revert that patch or take a new one.
Here is the problem.protected transient Object invokeWithRetries(Method m, Object params[]) { long endTime = System.currentTimeMillis() + (long)config.getTimeout(); _L2: return m.invoke(this, params); IllegalAccessException e; e; log.error("Should never get here!", e); continue; /* Loop/switch isn't completed */ e; if(e.getCause() instanceof IOException) { try { if(log.isDebugEnabled()) { log.debug("Caught IOException. Retrying.", e); } Thread.sleep(config.getReconnectWaitTime()); restart(); } catch(IOException e1) { } catch(InterruptedException e1) { } } else { throw new CacheException("Problems invoking method call!", e); } if(System.currentTimeMillis() < endTime) goto _L2; else goto _L1 _L1: throw new CacheException((new StringBuilder()).append("Unable to communicate with TCPCacheServer(").append(config.getHost()).append(":").append(config.getPort()).append(") after ").append(config.getTimeout()).append(" millis, with reconnects every ").append(config.getReconnectWaitTime()).append(" millis.").toString()); }
Problem is this, if we have a timeout,which the patch allows you to do, it throws an IOExcpeption. When we get an IOEXception we will call "restart" which tears down and up the socket even though its not needed. We shouldnt be tearing up and down the socket if we timeout.
2nd problem it is not thread safe. Multiple threads can call restart if we timeout.
So you either revert it or take a 2nd patch. I have already adopted my version so patch is available if needed.
Just so you know about it.
Cheers,
LL -
12. Re: TcpDelegatingCacheLoader
lovelyliatroim Jan 12, 2009 12:07 PM (in response to lovelyliatroim)
Problem is this, if we have a timeout,which the patch allows you to do, it throws an IOExcpeption. When we get an IOEXception we will call "restart" which tears down and up the socket even though its not needed. We shouldnt be tearing up and down the socket if we timeout.
Ignore that I am wrong, tried to reset the input stream on the socket and it aint supported.
You get
java.io.IOException: mark/reset not supported
at java.io.InputStream.reset(InputStream.java:334)
at org.jboss.cache.loader.TcpDelegatingCacheLoader.invokeWithRetries(TcpDelegatingCacheLoader.java:152)
But thread safety is still an issue
Cheers,
LL -
13. Re: TcpDelegatingCacheLoader
lovelyliatroim Jan 14, 2009 9:44 AM (in response to lovelyliatroim)Hi Manik,
Just seen your entry here to resolved
[ https://jira.jboss.org/jira/browse/JBCACHE-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Had a look at your fix here
http://fisheye.jboss.org/browse/JBossCache/core/trunk/src/main/java/org/jboss/cache/loader/TcpDelegatingCacheLoader.java?r=7378
Is that the fixed version??
If so you there is a bug.
Let me explainprotected Map<Object, Object> _get(Fqn name) throws Exception { synchronized (this) { out.reset(); out.writeByte(TcpCacheOperations.GET); out.writeObject(name); out.flush(); Object retval = in.readObject(); if (retval instanceof Exception) { throw (Exception) retval; } return (Map) retval; } }
Lets take this sample here, your locking is better but you make the mistake in throwing the IOException to the calling method. Once you throw it you loose your lock and when catching the IOException you dont synch on the restart, so picture this
T1 gets lock,
T2 waits for lock
T1 Times out or server socket is down and throws exception
T2 now has lock and times out
T2 throws exception
T1 calls restart from invokeWithRetries
T3 acquires lock and writes but doesnt read
T2 calls restart from invokeWithRetries
T3 now attempts to read response
You need to clean up in the _get() while you still have the lock
Something like thisprotected Map<Object, Object> _get(Fqn name) throws Exception { synchronized (this){ try { out.reset(); out.writeByte(TcpCacheOperations.GET); out.writeObject(name); out.flush(); Object retval = in.readObject(); if(log.isDebugEnabled()){ log.debug("Path=" +name +" Return Val"+name); } if (retval instanceof Exception) { throw (Exception) retval; } return (Map) retval; } catch (IOException e) { log.info("IOEXception occurred. Will attempt to re-create connection",e); try{ restart(); }catch(IOException ex){ log.error("Failed to reconnect to remote server, it might be down??"+ex); } } } return null; }
That way only one thread calls the restart and once reset it should be ok for the next thread.
Next thing i see is the start methodpublic void start() throws IOException { try { sock = new Socket(config.getHost(), config.getPort()); sock.setSoTimeout(config.getReadTimeout()); out = new ObjectOutputStream(new BufferedOutputStream(sock.getOutputStream())); out.flush(); in = new ObjectInputStream(new BufferedInputStream(sock.getInputStream())); } catch (ConnectException ce) { log.info("Unable to connect to TCP socket on interface " + config.getHost() + " and port " + config.getPort()); throw ce; } }
I would change this also, I know i wrote it originally but I would change it to something like thispublic void start() throws IOException { log.info("Attempting to start a connection to server"); InetSocketAddress address = new InetSocketAddress(config.getHost(),config.getPort()); sock = new Socket(); sock.setSoTimeout(config.getReadTimeout()); sock.connect(address, config.getReadTimeout()); out = new ObjectOutputStream(new BufferedOutputStream(sock.getOutputStream())); out.flush(); in = new ObjectInputStream(new BufferedInputStream(sock.getInputStream())); }
Reason being I dont like thissock = new Socket(config.getHost(), config.getPort());
This constructor from the API says that
Creates a stream socket and connects it to the specified port number at the specified IP address.
Now at this stage we have no timeout set, so does that mean we dont timeout and can wait to establish a connection for no matter how long it takes??? Other way is safer, we set the timeout before we make the connection.
Just my 2 cents on it, hope it helps. Ignore me if I am looking at the wrong file but would appreciate if you posted the right link for me so I can look at the fix for it.
Regards,
LL -
14. Re: TcpDelegatingCacheLoader
manik Jan 14, 2009 9:58 AM (in response to lovelyliatroim)Hi there
Nope, you were looking at the wrong check-in. That doesn't contain the fix. Have a look at this commit log.
Basically, I have delegated the restart process to a separate class, which ensures that concurrent restarts do not happen, nor do "back-to-back" restarts. It involves using a CAS to detect concurrent restarts, and then queueing waiters so that they wait for the concurrent restart before returning and being able to use a successfully restarted resource.
Re: your changes to start(), I have not included these yet. These do make sense though, so I will make those changes accordingly.
Cheers
Manik