Segfault in libtcnative used by jbossweb-7.0.16 with a large number of concurrent users in a SIP container performance test
gouldsj Aug 7, 2013 10:14 AM
We are using JBoss (7.1.2) with the Mobicents SIP container.
We ran a load test at 1 call per second (CPS) for 9h 12m.
This used 500 users, each with a single WebSocket TCP connection to the Gateway. Each user also opened one UDP connection per call, which is disconnected at the end of the call.
At a call rate of 1 CPS we had 6 concurrent calls. With 2 users per call, this means that at any one time we had:
500 WebSocket connections (1 per user)
12 UDP connections (2 per call)
CPU usage on the box averaged 10%.
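As a cross-check on these figures, the sketch below redoes the arithmetic. It assumes the 6 concurrent calls follow from the call rate multiplied by the mean call duration (Little's law), which would put each call at about 6 seconds; the call duration is not stated above, so treat that number as an inference. Over the 9h 12m run, 1 CPS also works out to roughly 33,120 calls before the crash.

```java
// Back-of-envelope load figures for the test described above.
// Assumption: concurrent calls = call rate x mean call duration (Little's law),
// so 6 concurrent calls at 1 CPS imply calls lasting about 6 seconds.
class LoadFigures {
    static final int USERS = 500;                // one WebSocket connection per user
    static final double CALLS_PER_SECOND = 1.0;
    static final int CONCURRENT_CALLS = 6;
    static final int USERS_PER_CALL = 2;         // each user opens one UDP connection per call

    static int webSocketConnections() {
        return USERS;
    }

    static int udpConnections() {
        return CONCURRENT_CALLS * USERS_PER_CALL;
    }

    static double impliedCallDurationSeconds() {
        return CONCURRENT_CALLS / CALLS_PER_SECOND;
    }

    // Calls started over a run of the given length at the configured rate.
    static long callsStarted(int hours, int minutes) {
        long seconds = hours * 3600L + minutes * 60L;
        return Math.round(seconds * CALLS_PER_SECOND);
    }

    public static void main(String[] args) {
        System.out.println("WebSocket connections: " + webSocketConnections());           // 500
        System.out.println("UDP connections: " + udpConnections());                       // 12
        System.out.println("Implied call duration (s): " + impliedCallDurationSeconds()); // 6.0
        System.out.println("Calls before the crash: " + callsStarted(9, 12));             // 33120
    }
}
```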
After 9h 12m of running we got a segfault in the libtcnative library (full hs_err log attached):
Stack: [0x00007f5869adb000,0x00007f5869bdc000], sp=0x00007f5869bd97f0, free space=1017k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libtcnative-1.so+0x18382] Java_org_apache_tomcat_jni_Socket_sendibb+0x22
j org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer()V+186
j org.apache.coyote.http11.InternalAprOutputBuffer.flush()V+19
j org.apache.coyote.http11.Http11AprProcessor.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+104
j org.apache.coyote.Response.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+31
j org.apache.catalina.connector.OutputBuffer.doFlush(Z)V+94
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.apache.tomcat.jni.Socket.sendibb(JII)I+0
j org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer()V+186
j org.apache.coyote.http11.InternalAprOutputBuffer.flush()V+19
j org.apache.coyote.http11.Http11AprProcessor.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+104
j org.apache.coyote.Response.action(Lorg/apache/coyote/ActionCode;Ljava/lang/Object;)V+31
j org.apache.catalina.connector.OutputBuffer.doFlush(Z)V+94
J org.jboss.websockets.oio.internal.protocol.ietf13.Hybi13Socket._writeTextFrame(Ljava/lang/String;)V
j org.jboss.websockets.oio.internal.protocol.ietf13.Hybi13Socket.writeFrame(Lorg/jboss/websockets/Frame;)V+56
j org.jboss.as.websockets.servlet.WebSocketDelegate.writeFrame(Lorg/jboss/websockets/Frame;)V+5
j com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_c()V+230
j com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_a(Lcom/alicecallsbob/porth/gateway/swift/SwiftMessage;)V+201
j com.alicecallsbob.porth.gateway.FCSDKGW_c.FCSDKGW_a(Ljavax/servlet/sip/SipApplicationSession;Ljavax/servlet/sip/SipSession;Ljavax/servlet/sip/SipFactory;Ljava/lang/String;Lcom/alicecallsbob/sdk/acb/event_manager/EventManagerProvider;)Lcom/alicecallsbob/porth/gateway/call/FCSDKGW_c;+67
j com.alicecallsbob.porth.gateway.sip.GatewaySipServlet.newInboundConnection(Lcom/alicecallsbob/sdk/acb/connection_manager/ConnectionBuilder;Ljavax/servlet/sip/SipServletRequest;)V+272
j com.alicecallsbob.sdk.acb.connection_manager.ConnectionManagerSipServlet.doRequest(Ljavax/servlet/sip/SipServletRequest;)V+67
j javax.servlet.sip.SipServlet.service(Ljavax/servlet/ServletRequest;Ljavax/servlet/ServletResponse;)V+9
j org.mobicents.servlet.sip.core.dispatchers.MessageDispatcher.callServlet(Lorg/mobicents/servlet/sip/core/message/MobicentsSipServletRequest;)V+306
j org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher$InitialDispatchTask.dispatch()V+351
j org.mobicents.servlet.sip.core.dispatchers.DispatchTask.dispatchAndHandleExceptions()V+1
j org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchInsideContainer(Ljavax/sip/SipProvider;Ljavax/servlet/sip/ar/SipApplicationRouterInfo;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSession;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;)V+238
j org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchRequestToApplication(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;Ljavax/servlet/sip/ar/SipApplicationRouterInfo;Lorg/mobicents/servlet/sip/core/session/MobicentsSipApplicationSession;)V+182
j org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.invokeAppRouterAndDispatchRequest(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletRequestImpl;)V+456
j org.mobicents.servlet.sip.core.dispatchers.InitialRequestDispatcher.dispatchMessage(Ljavax/sip/SipProvider;Lorg/mobicents/servlet/sip/message/SipServletMessageImpl;)V+65
j org.mobicents.servlet.sip.core.SipApplicationDispatcherImpl.processRequest(Ljavax/sip/RequestEvent;)V+1006
j gov.nist.javax.sip.EventScanner.deliverEvent(Lgov/nist/javax/sip/EventWrapper;)V+472
j gov.nist.javax.sip.SipProviderImpl.handleEvent(Ljava/util/EventObject;Lgov/nist/javax/sip/stack/SIPTransaction;)V+230
j gov.nist.javax.sip.DialogFilter.processRequest(Lgov/nist/javax/sip/message/SIPRequest;Lgov/nist/javax/sip/stack/MessageChannel;)V+4049
j gov.nist.javax.sip.stack.SIPServerTransaction.processRequest(Lgov/nist/javax/sip/message/SIPRequest;Lgov/nist/javax/sip/stack/MessageChannel;)V+460
j gov.nist.javax.sip.stack.UDPMessageChannel.processMessage(Lgov/nist/javax/sip/message/SIPMessage;)V+207
j gov.nist.javax.sip.stack.UDPMessageChannel.processIncomingDataPacket(Ljava/net/DatagramPacket;)V+1196
j gov.nist.javax.sip.stack.UDPMessageChannel.run()V+150
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
We tried patching libtcnative with a check on s->net; however, this was insufficient and only delayed the point where the issue arose.
Our investigation suggests that jbossweb is using a socket handle (a bare pointer from libtcnative) that has already been freed by another thread as the result of an error.
We added logging to a patched version of libtcnative to detect this condition and used it to switch on additional logging in jbossweb.
This suggested that the socket was being destroyed after an error in the Worker.run() method of AprEndpoint.java:
....
if ((status != null) && (handler.event(socket, status) == Handler.SocketState.CLOSED)) {
    // Close socket and pool only if it wasn't closed
    // already by the parent pool
    if (serverSockPool != 0) {
        Socket.destroy(socket);
    }
}
....
I am not that familiar with the jbossweb code, and so far I haven't been able to track down why the flush occurs after the socket has been destroyed here.
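One way to narrow this down might be to record, on the Java side, where each handle was destroyed and report that site when a later send uses the same handle. The class below is a hypothetical debugging aid: HandleTracer and its methods are made up for illustration and are not part of jbossweb or tomcat-native. It assumes you can call onCreate wherever a handle is obtained (for example after Socket.accept), onDestroy just before Socket.destroy, and checkBeforeSend just before Socket.sendibb.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical debugging aid, not part of jbossweb or tomcat-native.
// It remembers which thread destroyed each socket handle (and its call stack)
// so that a later send on the same handle can report who freed it.
class HandleTracer {
    private final Map<Long, Throwable> destroyedAt = new ConcurrentHashMap<>();

    // Call when a handle is obtained. The allocator can reuse a freed address,
    // so a reused value must not be reported as stale.
    void onCreate(long handle) {
        destroyedAt.remove(handle);
    }

    // Call just before Socket.destroy(handle); captures the destroying thread's stack.
    void onDestroy(long handle) {
        destroyedAt.put(handle, new Throwable("handle 0x" + Long.toHexString(handle)
                + " destroyed by " + Thread.currentThread().getName()));
    }

    // Call just before a send; returns the destroy site if the handle is stale, else null.
    Throwable checkBeforeSend(long handle) {
        Throwable site = destroyedAt.get(handle);
        if (site != null) {
            System.err.println("send on destroyed handle from " + Thread.currentThread().getName());
            site.printStackTrace();
        }
        return site;
    }

    public static void main(String[] args) throws InterruptedException {
        HandleTracer tracer = new HandleTracer();
        long handle = 0x7f5869bd0000L; // made-up handle value for the demo
        tracer.onCreate(handle);

        // Simulate the suspected interleaving: a worker thread destroys the socket,
        // then the SIP dispatch thread tries to flush a WebSocket frame on it.
        Thread worker = new Thread(() -> tracer.onDestroy(handle), "apr-worker");
        worker.start();
        worker.join();

        Throwable stale = tracer.checkBeforeSend(handle);
        System.out.println(stale != null ? "stale handle detected" : "handle live");
    }
}
```

The check is itself racy (a destroy can still slip in between the check and the send), so this only helps identify the destroying call path; it does not prevent the crash.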
Anyone have any ideas?
We have considered adding a further level of indirection to either the libtcnative library or the jbossweb Socket class, using a key into a map rather than the bare pointer as the handle. However, this could have a performance impact, and it would fix the symptom rather than the (as yet undetermined) cause. The other issue is that we would want to submit any such change to libtcnative as a fix, and we are not sure of the process or the turnaround for such submissions (even if others thought it was a good idea, which I realise may not be the case).
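A minimal sketch of the key-based indirection, assuming a per-entry lock is acceptable: callers hold an opaque key that is never reused, and each entry's monitor serializes send against destroy, so the native pointer cannot be freed while a send is using it. SocketRegistry and NativeOps are hypothetical names; NativeOps stands in for the real JNI calls so the example runs without libtcnative. A map lookup on its own would not be enough, because a destroy could still land between the lookup and the send.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names exist in jbossweb or tomcat-native.
// The raw native pointer never leaves the registry; callers only see a key.
class SocketRegistry {
    // Stand-in for the native send/destroy calls so the sketch is runnable.
    interface NativeOps {
        int send(long pointer, byte[] data);
        void destroy(long pointer);
    }

    private static final class Entry {
        final long pointer;
        boolean destroyed; // guarded by this entry's monitor
        Entry(long pointer) { this.pointer = pointer; }
    }

    private final Map<Long, Entry> entries = new ConcurrentHashMap<>();
    private final AtomicLong nextKey = new AtomicLong(1);
    private final NativeOps ops;

    SocketRegistry(NativeOps ops) { this.ops = ops; }

    long register(long pointer) {
        long key = nextKey.getAndIncrement(); // keys are never reused, unlike freed addresses
        entries.put(key, new Entry(pointer));
        return key;
    }

    int send(long key, byte[] data) throws IOException {
        Entry e = entries.get(key);
        if (e == null) throw new IOException("socket " + key + " already destroyed");
        synchronized (e) { // destroy cannot free the pointer while we hold this
            if (e.destroyed) throw new IOException("socket " + key + " already destroyed");
            return ops.send(e.pointer, data);
        }
    }

    // Idempotent: a second destroy (e.g. by the parent pool) is a no-op.
    void destroy(long key) {
        Entry e = entries.remove(key);
        if (e == null) return;
        synchronized (e) { // waits for any in-flight send on this entry
            if (!e.destroyed) {
                e.destroyed = true;
                ops.destroy(e.pointer);
            }
        }
    }

    public static void main(String[] args) {
        SocketRegistry registry = new SocketRegistry(new NativeOps() {
            public int send(long pointer, byte[] data) { return data.length; }
            public void destroy(long pointer) { System.out.println("freed 0x" + Long.toHexString(pointer)); }
        });
        long key = registry.register(0x7f5869bd0000L); // made-up pointer value
        try {
            System.out.println("sent " + registry.send(key, "frame".getBytes()) + " bytes");
            registry.destroy(key);
            registry.destroy(key);                       // ignored
            registry.send(key, "late frame".getBytes()); // rejected instead of crashing
        } catch (IOException ex) {
            System.out.println("late send rejected: " + ex.getMessage());
        }
    }
}
```

The cost is one map lookup and one (normally uncontended) monitor acquisition per operation, which is the kind of performance impact mentioned above. A blocking send also holds the entry lock, so destroy waits for it to complete.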
Any input or insight into this issue would be much appreciated.
Attachments:
console.log.zip (1.6 MB)
hs_err_pid19121.log.zip (44.2 KB)