6 Replies Latest reply on Aug 13, 2013 5:57 AM by gouldsj

    Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.

    gouldsj

      We are using Jboss (7.1.2) with mobicents sip container.

       

      We ran at 1 Call per second for 9h 12m.

       

      This was using 500 users, each using a single Web Sockets TCP connection each to the Gateway. The users also made one UDP connection each per call. The UDP connection is disconnected at the end of the call.

       

      With 1 CPS as a call rate, we had to use 6 concurrent calls. So, 1 CPS (with 2 users per call) means that we had at any one time:

       

      500 WS connections. (1 per user)

      12 UDP connections  (2 per call)

       

      CPU usage on the box was an average of 10%

       

      After 9h 12m of running we got a seg fault in the libtcnative library (full hs_err log attached ):

       

      Stack: [0x00007f5869adb000,0x00007f5869bdc000],  sp=0x00007f5869bd97f0,  free space=1017k

      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

      C  [libtcnative-1.so+0x18382]  Java_org_apache_tomcat_jni_Socket_sendibb+0x22

      j  org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer()V+186

      j  org.apache.coyote.http11.InternalAprOutputBuffer.flush()V+19

      j  org.apache.coyote.http11.Http11AprProcessor.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+104

      j  org.apache.coyote.Response.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+31

      j  org.apache.catalina.connector.OutputBuffer.doFlush(Z)V+94

       

       

      Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)

      j  org.apache.tomcat.jni.Socket.sendibb(JII)I+0

      j  org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer()V+186

      j  org.apache.coyote.http11.InternalAprOutputBuffer.flush()V+19

      j  org.apache.coyote.http11.Http11AprProcessor.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+104

      j  org.apache.coyote.Response.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+31

      j  org.apache.catalina.connector.OutputBuffer.doFlush(Z)V+94

      J  org.jboss.websockets.oio.internal.protocol.ietf13.Hybi13Socket._writeTextFrame(Ljava/lang/String;)V

      j  org.jboss.websockets.oio.internal.protocol.ietf13.Hybi13Socket.writeFrame(Lorg/jboss/websockets/Frame;)V+56

      j  org.jboss.as.websockets.servlet.WebSocketDelegate.writeFrame(Lorg/jboss/websockets/Frame;)V+5

      j  com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_c()V+230

      j  com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_a(Lcom/alicecallsbob/porth/gateway/swift/SwiftMessage;)V+201

      j  com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_a(Ljavax/servlet/sip/SipApplicationSession;Ljavax/servlet/sip/SipSession;Ljavax/servlet/sip/SipFactory;Ljava/lang/String;Lcom/alicecallsbob/sdk/acb/event_manager/EventManagerProvider;)Lcom/alicecallsbob/porth/gateway/call/FCSDKGW_c;+67

      j  com.alicecallsbob.porth.gateway.sip.GatewaySipServlet.newInboundConnection(Lcom/alicecallsbob/sdk/acb/connection_manager/ConnectionBuilder;Ljavax/servlet/sip/SipServletRequest;)V+272

      j  com.alicecallsbob.sdk.acb.connection_manager.ConnectionManagerSipServlet.doRequest(Ljavax/servlet/sip/SipServletRequest;)V+67

      j  javax.servlet.sip.SipServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+9

      j  org.mobicents.servlet.sip.core.dispatchers.MessageDispatcher.callServlet(Lorg/mobicents/servlet/sip/core/message/MobicentsSipServletRequest;)V+306

      j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher$InitialDispatchTask.dispatch()V+351

      j  org.mobicents.servlet.sip.core.dispatchers.DispatchTask.dispatchAndHandleExceptions()V+1

      j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchInsideContainer(Ljavax/sip/SipProvider;Ljavax/servlet/sip/ar/SipApplicationRouterInfo;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSession;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;)V+238

      j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchRequestToApplication(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;Ljavax/servlet/sip/ar/SipApplicationRouterInfo;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSession;)V+182

      j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.invokeAppRouterAndDispatchRequest(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;)V+456

      j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchMessage(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletMessageImpl;)V+65

      j  org.mobicents.servlet.sip.core.SipApplicationDispatcherImpl.processRequest(Ljavax/sip/RequestEvent;)V+1006

      j  gov.nist.javax.sip.EventScanner.deliverEvent(Lgov/nist/javax/sip/EventWrapper;)V+472

      j  gov.nist.javax.sip.SipProviderImpl.handleEvent(Ljava/util/EventObject;Lgov/nist/javax/sip/stack/SIPTransaction;)V+230

      j  gov.nist.javax.sip.DialogFilter.processRequest(Lgov/nist/javax/sip/message/SIPRequest;Lgov/nist/javax/sip/stack/MessageChannel;)V+4049

      j  gov.nist.javax.sip.stack.SIPServerTransaction.processRequest(Lgov/nist/javax/sip/message/SIPRequest;Lgov/nist/javax/sip/stack/MessageChannel;)V+460

      j  gov.nist.javax.sip.stack.UDPMessageChannel.processMessage(Lgov/nist/javax/sip/message/SIPMessage;)V+207

      j  gov.nist.javax.sip.stack.UDPMessageChannel.processIncomingDataPacket(Ljava/net/DatagramPacket;)V+1196

      j  gov.nist.javax.sip.stack.UDPMessageChannel.run()V+150

      j  java.lang.Thread.run()V+11

      v  ~StubRoutines::call_stub

       

       

      We tried patching libtcnative with a check on the s->net, however this was insufficient and only delayed the point wehere the issue arose.

       

      Our investigation suggests that jbossweb is using a socket handle (a bare pointer from libtcnative) which has been freed by another thread due to an error.

      We added logging into a patched version of libtcnative to detect this condition and used that to switch on additional logging in jbossweb.

       

      this suggested that an error occuring in AprEndpoint.java Worker:run() method:

       

      .... 

                    if ((status != null) && (handler.event(socket, status) == Handler.SocketState.CLOSED)) {

                              // Close socket and pool only if it wasn't closed

                              // already by the parent pool

                              if (serverSockPool != 0) {

                                  Socket.destroy(socket);

                              }

                          }

      ....

       

      I am not that familiar with the jbossweb code and so far I haven't been able to track down why the flush is occuring after this condition.

       

      Anyone have any ideas?

       

      We have considered adding a further level of abstraction to either the lintcnative library or the jbossweb Socket class using a key into a map rather than the bare pointer as the handle. However this could have a performance impact and it would be fixing the symptom rather than the (as yet undetermined) cause. The other issue is that we would want to submit any such update to libtcnative as a fix and we are not sure of the process and turn around on these (even if anyone else thought it was a good idea, which I realise may not be the case).

       

      Any input or insight into this issue would be much appreciated.

        • 1. Re: Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.
          jfclere

          Did you try something like what is in https://issues.apache.org/bugzilla/show_bug.cgi?id=51813 ?

          1 of 1 people found this helpful
          • 2. Re: Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.
            gouldsj

            Hi Jean-Frederic,

             

            Thanks very much for the response, yes we tried that patch in libtcnative. It definately improved things, However we still got a seg fault in the method, it was just a different one and took much longer to occur.

             

            Stack: [0x00007f89cc452000,0x00007f89cc553000],  sp=0x00007f89cc550708,  free space=1017k

            Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

            C  0x00007f89b69ed5f0

             

            [error occurred during error reporting (printing native stack), id 0xb]

             

            Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)

            j  org.apache.tomcat.jni.Socket.sendibb(JII)I+0

            J  org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer()V

            J  org.apache.coyote.http11.Http11AprProcessor.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V

            J  org.apache.catalina.connector.OutputBuffer.doFlush(Z)V

            j  org.apache.catalina.connector.OutputBuffer.flush()V+2

            j  org.apache.catalina.connector.CoyoteOutputStream.flush()V+4

            J  org.jboss.websockets.oio.internal.protocol.ietf13.Hybi13Socket._writeTextFrame(Ljava/lang/String;)V

            j  org.jboss.websockets.oio.internal.protocol.ietf13.Hybi13Socket.writeFrame(Lorg/jboss/websockets/Frame;)V+56

            j  org.jboss.as.websockets.servlet.WebSocketDelegate.writeFrame(Lorg/jboss/websockets/Frame;)V+5

            J  com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_c()V

            j  com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_a(Lcom/alicecallsbob/porth/gateway/swift/SwiftMessage;)V+199

            j  com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_a(Ljavax/servlet/sip/SipApplicationSession;Ljavax/servlet/sip/SipSession;Ljavax/servlet/sip/SipFactory;Ljava/lang/String;Lcom/alicecallsbob/sdk/acb/event_manager/EventManagerProvider;)Lcom/alicecallsbob/porth/gateway/call/FCSDKGW_c;+67

            j  com.alicecallsbob.porth.gateway.sip.GatewaySipServlet.newInboundConnection(Lcom/alicecallsbob/sdk/acb/connection_manager/ConnectionBuilder;Ljavax/servlet/sip/SipServletRequest;)V+272

            j  com.alicecallsbob.sdk.acb.connection_manager.ConnectionManagerSipServlet.doRequest(Ljavax/servlet/sip/SipServletRequest;)V+67

            j  javax.servlet.sip.SipServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+9

            j  org.mobicents.servlet.sip.core.dispatchers.MessageDispatcher.callServlet(Lorg/mobicents/servlet/sip/core/message/MobicentsSipServletRequest;Lorg/mobicents/servlet/sip/core/SipApplicationDispatcher;)V+324

            j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher$InitialDispatchTask.dispatch()V+355

            j  org.mobicents.servlet.sip.core.dispatchers.DispatchTask.dispatchAndHandleExceptions()V+1

            j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchInsideContainer(Ljavax/sip/SipProvider;Ljavax/servlet/sip/ar/SipApplicationRouterInfo;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSession;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSessionKey;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;)V+354

            j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchRequestToApplication(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;Ljavax/servlet/sip/ar/SipApplicationRouterInfo;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSession;)V+184

            j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.invokeAppRouterAndDispatchRequest(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;)V+477

            j  org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchMessage(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletMessageImpl;)V+94

            j  org.mobicents.servlet.sip.core.SipApplicationDispatcherImpl.processRequest(Ljavax/sip/RequestEvent;)V+1014

            j  gov.nist.javax.sip.EventScanner.deliverEvent(Lgov/nist/javax/sip/EventWrapper;)V+472

            j  gov.nist.javax.sip.SipProviderImpl.handleEvent(Ljava/util/EventObject;Lgov/nist/javax/sip/stack/SIPTransaction;)V+230

            j  gov.nist.javax.sip.DialogFilter.processRequest(Lgov/nist/javax/sip/message/SIPRequest;Lgov/nist/javax/sip/stack/MessageChannel;)V+4049

            j  gov.nist.javax.sip.stack.SIPServerTransaction.processRequest(Lgov/nist/javax/sip/message/SIPRequest;Lgov/nist/javax/sip/stack/MessageChannel;)V+460

            j  gov.nist.javax.sip.stack.UDPMessageChannel.processMessage(Lgov/nist/javax/sip/message/SIPMessage;Ljava/net/InetAddress;I)V+209

            j  gov.nist.javax.sip.stack.UDPMessageChannel.processIncomingDataPacket(Ljava/net/DatagramPacket;)V+1229

            j  gov.nist.javax.sip.stack.UDPMessageChannel.run()V+150

            j  java.lang.Thread.run()V+11

             

             

            I suspect that the s->net wasn't null because the memory had corrupted (or been reused). The patch appears insufficient for that reason. Ideally we would like to find and fix the cause rather than patch up libtcnative.

             

            A better fix (and something we are considering) would be to implement a hash map in libtcnative and return the key to that as the handle from the jni library, the key and the entry would be removed and freed when the handle was destroyed. You could then ceck the entry for the key exists before processing any request on that handle. This does have a potential performance hit and would add a dependency (to whichever library defiines the hashmap - https://code.google.com/p/sparsehash/?redir=1 or boost possibly) so isn't ideal either.

             

            thje seg faults do look similar to this issue: JBWEB-203

             

            and we are currently running the tests with an executor defined for the http and https connections as suggested in the bug. will update if we get any result from that.

            • 3. Re: Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.
              jfclere

              With the patch there should be an exception could you check that you get it? May be that  gives clues what is going wrong.

              1 of 1 people found this helpful
              • 4. Re: Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.
                gouldsj

                Hi Jean-Frederic,

                 

                Thanks for your help. You are correct, the patch we implemented was similar but did not throw an exception, instead we switched on additional logging in the libtcnative library and in jbossweb to help track down the cause. The exception will occur in the same place as the seg fault, looking at the call stack we know were it is failing, which is due to a flush call at the end of a write.

                 

                What our debug logging indicated was that the socket handle returned by libtcnative was being released in the worker thread prior to the flush call. I do not know why this would happen.

                 

                My current best guess is that the performance test is pushing the limits and a resource (threads/memory/???) is unavailable, this delays the socket accept, the write starts in a separate thread (?), the accept fails and the socket is freed, the write completes with a flush call, the memory the socket handle accesses in libtcnative (sendibb) has been freed and the seg fault occurs.

                 

                We tried the performance test with a thread pool executor specified (unbounded queue) and we had a different issue where it seg faulted (probably due to the queue blowing up). We are retrying it with a bounded queue.

                • 5. Re: Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.
                  jfclere

                  such problems are caused by double close or double destroy so that fits with our explications.

                  • 6. Re: Seg Fault in libtcnative used by jbossweb-7.0.16  with large number of concurrent users in sip container performance test.
                    gouldsj

                    We have modified out installation settings to use a thread pool for the web sockets. With a bounded queue, the issue does no longer occurs, this would appear to be similar to JBWEB-203 and to be a jbossweb issue with the default Worker threads. I am going to mark the bug I raised as a duplicate of 203.

                     

                    Thank you for your help and input.