1 2 Previous Next 22 Replies Latest reply on Jan 15, 2009 10:45 AM by lovelyliatroim

    TcpDelegatingCacheLoader

    lovelyliatroim

      Hi Guys,
      Just wondering why the SO_TIMEOUT isnt set on the socket for TcpDelegato or isnt even configurable?? Is there a reason for this?

      I originally thought this was what the timeout parameter is/was for but it looks like it is used for something else.

      Thanks,
      LL

        • 1. Re: TcpDelegatingCacheLoader
          manik

          No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.

          • 2. Re: TcpDelegatingCacheLoader
            lovelyliatroim

            Reason im asking is this, i didnt have time yesterday to go into detail.


            I ran some load tests on my app in a 3 node cluster. One node in particular takes more of the load than the others and this node also hosts the TcpCacheServer. After a while lets call it node3 ran into difficulties with GC. The whole node froze for quite a while. Now the problem is this, after a while the other 2 nodes froze as well which is not what i expected. When node3 recovered from its GC problems the other 2 nodes recovered as well.

            Now there is 2 ways the other 2 nodes are connected to node3. One way is that it communicates via UDP, so i dont see this as a problem. The 2nd way is via the TcpDelegatingLoader. Now unfortunately these tests where running on our production or will be production system and i dont have write access and no-one was there from operations to take a dump for me on this so i cant say for sure what exactly happened.

            I am using just the _get method of the delegator

             protected Map<Object, Object> _get(Fqn name) throws Exception
             {
             synchronized (out)
             {
             out.reset();
            
             out.writeByte(TcpCacheOperations.GET);
             out.writeObject(name);
             out.flush();
             Object retval = in.readObject();
             if (retval instanceof Exception)
             {
             throw (Exception) retval;
             }
             return (Map) retval;
             }
             }
            


            However my theory on it is this, node 1 and node2 after a period of time after node3 became totally unresponsive, it couldnt take any more requests. Only after node3 recovered did the other 2 recover so it looked like the complete thread pool of node1 and node2 got exhausted, node1 and node2 were still receiving requests for data that is located on node3, after a while the thread pool gets exhausted because every thread is waiting for a lock on the outputstream, this will never happen because we have one thread who is asking node3 for data but is blocked forever because it doesnt timeout on the read. When node1 and node2 keep receiving requests for data on node3 then eventually every thread will end up waiting for a lock that will never come.

            Like I said, its a theory, very hard to say what happened for sure without a dump but my original question stands, why isnt there a timeout on the read??

            Thanks guys,
            LL


            • 3. Re: TcpDelegatingCacheLoader
              lovelyliatroim

               


              No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.


              Sorry didnt see this comment when I posted the last one. Where do i submit patches to?? Is there a link on it??

              Thanks,
              LL

              • 4. Re: TcpDelegatingCacheLoader
                manik

                Yeah, unfortunately without a proper thread dump it is hard to tell where the problem is. Can you not simulate this in a test or staging environment, by loading one server (which runs the TcpCacheServer) artificially (maybe even give it less memory so it has to perform GC more often) and try and lock up the other servers to get a thread dump?

                In general though, the TcpCacheServer/TcpDelegatingCacheLoader code is due for an overhaul:

                https://jira.jboss.org/jira/browse/JBCACHE-1262
                https://jira.jboss.org/jira/browse/JBCACHE-789
                https://jira.jboss.org/jira/browse/JBCACHE-1259

                • 5. Re: TcpDelegatingCacheLoader
                  manik

                   

                  "lovelyliatroim" wrote:


                  Sorry didnt see this comment when I posted the last one. Where do i submit patches to?? Is there a link on it??



                  There isn't one - you'll need to create a JIRA and attach it there.

                  • 6. Re: TcpDelegatingCacheLoader
                    lovelyliatroim

                     


                    No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.


                    Created one for the 2.2.1.GA version
                    https://jira.jboss.org/jira/browse/JBCACHE-1451





                    • 7. Re: TcpDelegatingCacheLoader
                      lovelyliatroim

                       


                      Can you not simulate this in a test or staging environment, by loading one server (which runs the TcpCacheServer) artificially (maybe even give it less memory so it has to perform GC more often) and try and lock up the other servers to get a thread dump?

                      Will try but it wont happen this side of xmas. It will be the new year when i get a chance to look at this again.

                      • 8. Re: TcpDelegatingCacheLoader
                        lovelyliatroim

                        Just a comment on the TcpCacheServer as well.

                        I dont like this

                        Thread serverThread = new Thread("TcpCacheServer")
                         {
                         @Override
                         public void run()
                         {
                         try
                         {
                         while (running)
                         {
                         Socket client_sock = srv_sock.accept();
                         Connection conn = new Connection(client_sock, cache);
                         conns.add(conn);
                         conn.start();
                         }
                         }
                         catch (SocketException se)
                         {
                         if (!running)
                         {
                         // this is because of the stop() lifecycle method being called.
                         // ignore.
                         log.info("Shutting down TcpCacheServer");
                         }
                         else
                         {
                         log.error("Caught exception! Shutting down server thread.", se);
                         }
                         }
                         catch (IOException e)
                         {
                         log.error("Caught exception! Shutting down server thread.", e);
                         }
                         }
                         };
                        


                        The reason i dont like it, is that as soon as there is a problem or an exception thrown the TcpCacheServer listening thread shutsdown, so that means no other nodes are no longer able to connect and thus a restart of this node is needed in order to recover.

                        Any reason why this is this way?? Should be catching inside the loop and trying to recover, not just stop listening completely.

                        Cheers,
                        LL


                        • 9. Re: TcpDelegatingCacheLoader
                          manik

                          Yes, this is all a part of the planned overhaul to this code.

                          I suppose a simplistic retry could make it into the next minor release though. Feel like creating a JIRA?

                          • 10. Re: TcpDelegatingCacheLoader
                            lovelyliatroim

                             


                            Yes, this is all a part of the planned overhaul to this code.

                            I suppose a simplistic retry could make it into the next minor release though. Feel like creating a JIRA?


                            Done.

                            https://jira.jboss.org/jira/browse/JBCACHE-1457



                            • 11. Re: TcpDelegatingCacheLoader
                              lovelyliatroim

                               

                              "lovelyliatroim" wrote:

                              No reason, really. Happy to accept a patch, to make this configurable using a cache loader property.


                              Created one for the 2.2.1.GA version
                              https://jira.jboss.org/jira/browse/JBCACHE-1451


                              Hi Manik,
                              You might have to revert that patch or take a new one.

                              Here is the problem.


                               protected transient Object invokeWithRetries(Method m, Object params[])
                               {
                               long endTime = System.currentTimeMillis() + (long)config.getTimeout();
                              _L2:
                               return m.invoke(this, params);
                               IllegalAccessException e;
                               e;
                               log.error("Should never get here!", e);
                               continue; /* Loop/switch isn't completed */
                               e;
                               if(e.getCause() instanceof IOException)
                               {
                               try
                               {
                               if(log.isDebugEnabled())
                               {
                               log.debug("Caught IOException. Retrying.", e);
                               }
                               Thread.sleep(config.getReconnectWaitTime());
                               restart();
                               }
                               catch(IOException e1) { }
                               catch(InterruptedException e1) { }
                               } else
                               {
                               throw new CacheException("Problems invoking method call!", e);
                               }
                               if(System.currentTimeMillis() < endTime) goto _L2; else goto _L1
                              _L1:
                               throw new CacheException((new StringBuilder()).append("Unable to communicate with TCPCacheServer(").append(config.getHost()).append(":").append(config.getPort()).append(") after ").append(config.getTimeout()).append(" millis, with reconnects every ").append(config.getReconnectWaitTime()).append(" millis.").toString());
                               }
                              


                              Problem is this, if we have a timeout,which the patch allows you to do, it throws an IOExcpeption. When we get an IOEXception we will call "restart" which tears down and up the socket even though its not needed. We shouldnt be tearing up and down the socket if we timeout.

                              2nd problem it is not thread safe. Multiple threads can call restart if we timeout.


                              So you either revert it or take a 2nd patch. I have already adopted my version so patch is available if needed.

                              Just so you know about it.

                              Cheers,
                              LL


                              • 12. Re: TcpDelegatingCacheLoader
                                lovelyliatroim

                                 


                                Problem is this, if we have a timeout,which the patch allows you to do, it throws an IOExcpeption. When we get an IOEXception we will call "restart" which tears down and up the socket even though its not needed. We shouldnt be tearing up and down the socket if we timeout.


                                Ignore that I am wrong, tried to reset the input stream on the socket and it aint supported.

                                You get

                                java.io.IOException: mark/reset not supported
                                at java.io.InputStream.reset(InputStream.java:334)
                                at org.jboss.cache.loader.TcpDelegatingCacheLoader.invokeWithRetries(TcpDelegatingCacheLoader.java:152)


                                But thread safety is still an issue

                                Cheers,
                                LL

                                • 13. Re: TcpDelegatingCacheLoader
                                  lovelyliatroim

                                  Hi Manik,

                                  Just seen your entry here to resolved


                                  [ https://jira.jboss.org/jira/browse/JBCACHE-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

                                  Had a look at your fix here

                                  http://fisheye.jboss.org/browse/JBossCache/core/trunk/src/main/java/org/jboss/cache/loader/TcpDelegatingCacheLoader.java?r=7378

                                  Is that the fixed version??

                                  If so you there is a bug.

                                  Let me explain


                                  protected Map<Object, Object> _get(Fqn name) throws Exception
                                   {
                                   synchronized (this)
                                   {
                                   out.reset();
                                  
                                   out.writeByte(TcpCacheOperations.GET);
                                   out.writeObject(name);
                                   out.flush();
                                   Object retval = in.readObject();
                                   if (retval instanceof Exception)
                                   {
                                   throw (Exception) retval;
                                   }
                                   return (Map) retval;
                                   }
                                   }
                                  
                                  


                                  Lets take this sample here, your locking is better but you make the mistake in throwing the IOException to the calling method. Once you throw it you loose your lock and when catching the IOException you dont synch on the restart, so picture this

                                  T1 gets lock,
                                  T2 waits for lock
                                  T1 Times out or server socket is down and throws exception
                                  T2 now has lock and times out
                                  T2 throws exception
                                  T1 calls restart from invokeWithRetries
                                  T3 acquires lock and writes but doesnt read
                                  T2 calls restart from invokeWithRetries
                                  T3 now attempts to read response

                                  You need to clean up in the _get() while you still have the lock

                                  Something like this



                                  protected Map<Object, Object> _get(Fqn name) throws Exception
                                   {
                                   synchronized (this){
                                   try {
                                  
                                   out.reset();
                                  
                                   out.writeByte(TcpCacheOperations.GET);
                                  
                                   out.writeObject(name);
                                   out.flush();
                                   Object retval = in.readObject();
                                   if(log.isDebugEnabled()){
                                   log.debug("Path=" +name +" Return Val"+name);
                                   }
                                  
                                   if (retval instanceof Exception)
                                   {
                                   throw (Exception) retval;
                                   }
                                   return (Map) retval;
                                   } catch (IOException e) {
                                   log.info("IOEXception occurred. Will attempt to re-create connection",e);
                                   try{
                                   restart();
                                   }catch(IOException ex){
                                   log.error("Failed to reconnect to remote server, it might be down??"+ex);
                                   }
                                   }
                                   }
                                  
                                   return null;
                                   }
                                  



                                  That way only one thread calls the restart and once reset it should be ok for the next thread.


                                  Next thing i see is the start method


                                  public void start() throws IOException
                                   {
                                   try
                                   {
                                   sock = new Socket(config.getHost(), config.getPort());
                                   sock.setSoTimeout(config.getReadTimeout());
                                   out = new ObjectOutputStream(new BufferedOutputStream(sock.getOutputStream()));
                                   out.flush();
                                   in = new ObjectInputStream(new BufferedInputStream(sock.getInputStream()));
                                   }
                                   catch (ConnectException ce)
                                   {
                                   log.info("Unable to connect to TCP socket on interface " + config.getHost() + " and port " + config.getPort());
                                   throw ce;
                                   }
                                   }
                                  



                                  I would change this also, I know i wrote it originally but I would change it to something like this


                                   public void start() throws IOException
                                   {
                                   log.info("Attempting to start a connection to server");
                                  
                                   InetSocketAddress address = new InetSocketAddress(config.getHost(),config.getPort());
                                   sock = new Socket();
                                   sock.setSoTimeout(config.getReadTimeout());
                                   sock.connect(address, config.getReadTimeout());
                                   out = new ObjectOutputStream(new BufferedOutputStream(sock.getOutputStream()));
                                   out.flush();
                                   in = new ObjectInputStream(new BufferedInputStream(sock.getInputStream()));
                                   }
                                  
                                  


                                  Reason being I dont like this


                                  sock = new Socket(config.getHost(), config.getPort());
                                  

                                  This constructor from the API says that



                                  Creates a stream socket and connects it to the specified port number at the specified IP address.


                                  Now at this stage we have no timeout set, so does that mean we dont timeout and can wait to establish a connection for no matter how long it takes??? Other way is safer, we set the timeout before we make the connection.


                                  Just my 2 cents on it, hope it helps. Ignore me if I am looking at the wrong file but would appreciate if you posted the right link for me so I can look at the fix for it.

                                  Regards,
                                  LL


                                  • 14. Re: TcpDelegatingCacheLoader
                                    manik

                                    Hi there

                                    Nope, you were looking at the wrong check-in. That doesn't contain the fix. Have a look at this commit log.

                                    Basically, I have delegated the restart process to a separate class, which ensures that concurrent restarts do not happen, nor do "back-to-back" restarts. It involves using a CAS to detect concurrent restarts, and then queueing waiters so that they wait for the concurrent restart before returning and being able to use a successfully restarted resource.

                                    Re: your changes to start(), I have not included these yet. These do make sense though, so I will make those changes accordingly.

                                    Cheers
                                    Manik

                                    1 2 Previous Next