2 Replies Latest reply on Apr 2, 2008 5:35 PM by smohan

    Silent application server crash with jboss-4.0.3SP1

    smohan

      On Solaris x86 5.10

      We have production JBoss deployment where we are running it in three different configurations -
      Cluster 1: 2 AS nodes - JMS hosting (just Queue and Topics registered here)
      Cluster 2: 2 AS nodes - Core Java service (Remote method invocations over JMS mostly)
      Cluster 3: 3 AS nodes - Enterprise java services hosting mostly entity beans, one stateful session bean and a few stateless session beans

      Cluster 2 is a purely symmetrical cluster. Cluster 3 on the other hand is not as some state, which is not transferable amongst servers, is kept. For some reason one or another of the Cluster 3 nodes would crash for no reason silently. Cluster 2 (or 1) never have such problems.

      There is no stack and no symptoms as to why this is happening. We only know when the users complain about losing connectivity. We tend to blame it on the garbage collection sometimes as this happens around the time Java concurrent GC's promotion fails and a stop the world collection happens. Without any manual intervention (like stop being called), jboss starts undeploying all packages.

      Is this a known issue? If not, would anyone suggest ideas to debug this weird occurrence. Every week we will have at least one of the three servers going down.

      I will add the log (gc,console and server) as soon as the next crash happens (hopefully it will happen again this week as usual.)

        • 1. Re: Silent application server crash with jboss-4.0.3SP1
          jaikiran

           

          Without any manual intervention (like stop being called), jboss starts undeploying all packages.


          How do you detect this undeployment? Do you see anything in the logs related to undeployment? And how do you start JBoss?


          • 2. Re: Silent application server crash with jboss-4.0.3SP1
            smohan

            I see it in the logs. There are no errors in the logs that would trigger this.

            The jboss command line we use to start is:

            /prod/cbtech/pkgs/sunos-5.10-i86pc/jdk1.5.0_06/bin/java -server -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/data/cbtech-local/135/jboss/server/ejb/log/gc.log -ms2800m -mx2800m -XX:+DisableExplicitGC -XX:NewSize=85m -XX:MaxNewSize=85m -XX:CMSInitiatingOccupancyFraction=85 -Djboss.partition.name=PartitionEJB-135 -Djboss.partition.udpGroup=228.1.3.135 -classpath /data/cbtech-local/135/jboss/bin/run.jar:/prod/cbtech/pkgs/sunos-5.10-i86pc/jdk1.5.0_06/lib/tools.jar:/usr/local/pkgs/tibrv-7.4.1/lib/tibrvj.jar org.jboss.Main -c ejb

            And it crashed again today without any undeploy messages. Following were the logs seen around the time it crashed.

            GC-LOG (last lines):

            428770.447: [GC 428770.447: [ParNew: 86912K->0K(86976K), 0.1210544 secs] 2165379K->2114565K(2867136K), 0.1212449 secs]
            428780.330: [GC 428780.331: [ParNew: 86912K->0K(86976K), 0.1337211 secs] 2201477K->2146088K(2867136K), 0.1339104 secs]
            428787.036: [GC 428787.036: [ParNew: 86912K->0K(86976K), 0.1621513 secs] 2233000K->2193530K(2867136K), 0.1623363 secs]
            428788.272: [GC 428788.272: [ParNew (promotion failed): 86873K->86873K(86976K), 21.4087057 secs]428809.681: [CMS (concurrent mode failure): 2206872K->1201836K(2780160K), 12.1240322 secs] 2280403K->1201836K(2867
            136K), 33.5330371 secs]
            428832.636: [GC 428832.636: [ParNew: 86912K->0K(86976K), 0.1898448 secs] 1288748K->1248557K(2867136K), 0.1900673 secs]

            Console Log (around the time it died):


            17:08:07,676 ERROR [com.kbcfp.insight.components.clientObjectCacheDelegateService.server.DefaultClientObjectCacheDelegate] Finalized DefaultClientObjectCacheDelegate
            17:08:07,681 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
            17:08:07,681 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
            17:08:07,683 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, cbcalcsx1:52572 (additional data: 15 bytes) is not a member of view [cbcalcsx2:40421 (additional data: 15 bytes)|22] [cbcalcsx2:40421 (additional data: 15 bytes), cbappsx22:63549 (additional data: 20 bytes), cbcalcsx3:32821 (additional data: 15 bytes), cbcalcsx4:43849 (additional data: 15 bytes), cbappsx25:35017 (additional data: 15 bytes)]; discarding view
            17:08:07,683 WARN [org.jgroups.protocols.pbcast.GMS] I (cbcalcsx1:52572 (additional data: 15 bytes)) am being shunned, will leave and rejoin group (prev_members are [cbcalcsx2:40421 (additional data: 15 bytes) cbappsx22:63549 (additional data: 20 bytes) cbcalcsx3:32821 (additional data: 15 bytes) cbcalcsx4:43849 (additional data: 15 bytes) cbappsx25:34450 (additional data: 15 bytes) cbcalcsx1:52572 (additional data: 15 bytes) cbappsx25:43841 (additional data: 15 bytes) cbappsx25:35017 (additional data: 15 bytes) ])

            Server Log (Around the time it died):

            17:08:07,681 WARN #UpHandler (FD) [protocols.FD ] I was suspected, but will not remove myself from membership (waiting for EXIT message)
            17:08:07,681 WARN #UpHandler (FD) [protocols.FD ] I was suspected, but will not remove myself from membership (waiting for EXIT message)
            17:08:07,683 WARN #UpHandler (GMS) [pbcast.GMS ] checkSelfInclusion() failed, cbcalcsx1:52572 (additional data: 15 bytes) is not a member of view [cbcalcsx2:40421 (additional data: 15 bytes) |22] [cbcalcsx2:40421 (additional data: 15 bytes), cbappsx22:63549 (additional data: 20 bytes), cbcalcsx3:32821 (additional data: 15 bytes), cbcalcsx4:43849 (additional data: 15 bytes), cbappsx25:35017 (additio nal data: 15 bytes)]; discarding view
            17:08:07,683 WARN #UpHandler (GMS) [pbcast.GMS ] I (cbcalcsx1:52572 (additional data: 15 bytes)) am being shunned, will leave and rejoin group (prev_members are [cbcalcsx2:40421 (additional data: 15 bytes) cbappsx22:63549 (additional data: 20 bytes) cbcalcsx3:32821 (additional data: 15 bytes) cbcalcsx4:43849 (additional data: 15 bytes) cbappsx25:34450 (additional data: 15 bytes) cbcalcsx1:52572 (a dditional data: 15 bytes) cbappsx25:43841 (additional data: 15 bytes) cbappsx25:35017 (additional data: 15 bytes) ])
            17:08:07,684 INFO #UpHandler (STATE_TRANSFER) [jgroups.JChannel ] received an EXIT event, will leave the channel
            17:08:07,686 INFO #CloserThread [jgroups.JChannel ] closing the channel
            17:08:07,689 DEBUG #JMS SessionPool Worker-0 [ejb2.UserBean ] Activating UserBean
            17:08:08,191 WARN #RMI TCP Connection(31656)-10.64.35.221 [server.TabletDelegate ] recalcTabletStructure took 34234ms
            17:08:08,194 WARN #RMI TCP Connection(31656)-10.64.35.221 [server.TabletDelegate ] getTabletStructure took 34237ms
            17:08:08,194 WARN #RMI TCP Connection(31656)-10.64.35.221 [server.TabletDelegate ] recalcUpdateRegion took 34237ms
            17:08:08,194 WARN #pool-1-thread-1 [server.TabletManager ] processDelegate spent 34231ms to executing TabletDelegate Search: TabletFilter:m_region = null
            m_currency = null
            m_country = null
            m_category = 0
            m_showIlliquid = true
            m_showDuplicates = true
            m_showRelevant = true
            m_activated = true
            PrefixFilter:
            m_searchType: Name+Quick+Bloomberg+Local+SPN+Cusip+All+Underlying SPN+Same Underlying+Org Id+Org Name+Org Shortcode+Org Bloombergcode+CParty Name+Text Filter+(847708415)
            m_searchText: null m_searchEtid: null m_isInteract: false m_isGoto: false
            17:08:08,195 WARN #pool-1-thread-5 [server.TabletManager ] trigger() took 34232ms to execute 33 tablet delegates
            17:08:08,502 INFO #CloserThread [jgroups.JChannel ] reconnecting to group PartitionEJB-135
            17:08:08,509 INFO #DownHandler (UDP) [protocols.UDP ] unicast sockets will use interface 10.64.1.13
            17:08:08,509 INFO #DownHandler (UDP) [protocols.UDP ] socket information:
            local_addr=cbcalcsx1:50030 (additional data: 15 bytes), mcast_addr=228.1.3.135:45566, bind_addr=/10.64.1.13, ttl=8
            sock: bound to 10.64.1.13:50030, receive buffer size=150000, send buffer size=800000
            mcast_recv_sock: bound to 10.64.1.13:45566, send buffer size=800000, receive buffer size=150000
            mcast_send_sock: bound to 10.64.1.13:50031, send buffer size=800000, receive buffer size=150000
            17:08:08,510 INFO #UpHandler (GMS) [STDOUT ]
            -------------------------------------------------------
            GMS: address is cbcalcsx1:50030 (additional data: 15 bytes)
            -------------------------------------------------------
            17:08:08,517 INFO #CloserThread [jgroups.JChannel ] fetching the state (auto_getstate=true)
            17:08:08,517 DEBUG #MessageDispatcher up processing thread [foAlertService.FOAlertServiceComponent] membership view changed recieved
            17:08:08,518 INFO #MessageDispatcher up processing thread [HAPartition.PartitionEJB-135 ] New cluster view for partition PartitionEJB-135: 23 ([10.64.1.14:1299, 204.253.249.133:1299, 10.64.1.15:1299, 10.64.1. 16:1299, 10.64.1.45:1299, 10.64.1.13:1299] delta: 0)
            17:08:08,518 INFO #MessageDispatcher up processing thread [tedReplicantManagerImpl.PartitionEJB-135] I am (10.64.1.13:1299) received membershipChanged event:
            17:08:08,518 INFO #MessageDispatcher up processing thread [tedReplicantManagerImpl.PartitionEJB-135] Dead members: 0 ([])
            17:08:08,518 INFO #MessageDispatcher up processing thread [tedReplicantManagerImpl.PartitionEJB-135] New Members : 0 ([])
            17:08:08,518 INFO #MessageDispatcher up processing thread [tedReplicantManagerImpl.PartitionEJB-135] All Members : 6 ([10.64.1.14:1299, 204.253.249.133:1299, 10.64.1.15:1299, 10.64.1.16:1299, 10.64.1.45:1299, 10.64.1.13:1299])
            17:08:08,524 INFO #CloserThread [jgroups.JChannel ] state was retrieved successfully
            17:08:08,586 INFO #DRM Async Publisher#1 [ejb.ProxyFactory ] Bound EJB Home 'IDT2.MiniTabState' to jndi 'IDT2.MiniTabState'
            17:08:08,587 INFO #DRM Async Publisher#1 [ejb.ProxyFactory ] Bound EJB Home 'IDT2.HouseKeeping' to jndi 'IDT2.HouseKeeping'
            17:08:09,842 INFO #pool-1-thread-3 [calc.PAndLCalculator ] Couldn't find holding for holding=(desk=3057, book=10073, spn=51001656, spnccy=1760000) period 7
            17:08:13,072 DEBUG #pool-1-thread-5 [calc.PAndLCalculator ] -3.4147bp move triggered FX recalculation - spn=1760115 was 0.1952, now 0.1951
            ---------------- no more logs after this --------