6 Replies. Latest reply on Dec 3, 2012 12:04 PM by dschlenk

    Session failover in clustered

    dschlenk

      Should a session that originated from one node in a cluster be replicated to the other so that the session continues if the original node shuts down? I'm attempting to test that and haven't been successful. This is the console log when I shut down the original node in the cluster:

       

      16:04:34,382 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index closed: /opt/gatein/standalone/data/gatein/jcr/lucene/portal-system_portal
      16:04:34,439 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index closed: /opt/gatein/standalone/data/gatein/jcr/lucene/portal-work_portal
      16:04:34,514 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index closed: /opt/gatein/standalone/data/gatein/jcr/lucene/wsrp-system_portal
      16:04:34,584 WARN  [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl] (Thread-3 (HornetQ-client-global-threads-1674379319)) Connection failure has been detected: The connection was disconnected because of server shutdown [code=4]
      16:04:34,602 WARN  [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl] (Thread-3 (HornetQ-client-global-threads-1674379319)) Connection failure has been detected: The connection was disconnected because of server shutdown [code=4]
      16:04:34,602 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index closed: /opt/gatein/standalone/data/gatein/jcr/lucene/pc-system_portal
      16:04:34,617 INFO  [org.hornetq.core.server.cluster.impl.BridgeImpl] (Thread-12 (HornetQ-server-HornetQServerImpl::serverUUID=42a015b1-303a-11e2-8cfe-000c29fbc252-1584009536)) stopped bridge sf.my-cluster.379a88eb-332f-11e2-8d4f-000c2912e688
      16:04:34,691 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-2,null) Index closed: /opt/gatein/standalone/data/gatein/jcr/lucene/system_portal_system
      16:04:34,746 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-2,null) Index closed: /opt/gatein/standalone/data/gatein/jcr/lucene/system_portal
      16:04:35,186 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-2,null) Received new cluster view: [appserver2-dev-55198|2] [appserver2-dev-55198]
      16:04:35,213 WARNING [org.jgroups.protocols.pbcast.NAKACK] (Incoming-2,null) appserver2-dev-56973: dropped message from appserver-dev-25458 (not in table [appserver2-dev-56973]), view=[appserver2-dev-56973|2] [appserver2-dev-56973]
      16:04:35,513 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-1,null) Received new cluster view: [appserver2-dev-56973|2] [appserver2-dev-56973]
      16:04:35,561 WARNING [org.jgroups.protocols.pbcast.NAKACK] (Incoming-1,null) appserver2-dev-20744: dropped message from appserver-dev-41880 (not in table [appserver2-dev-20744]), view=[appserver2-dev-20744|2] [appserver2-dev-20744]
      16:04:35,862 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-2,null) Received new cluster view: [appserver2-dev-20744|2] [appserver2-dev-20744]
      16:04:36,328 INFO  [exo.jcr.component.core.WorkspaceResumer] (Thread-121) Setting workspace repository_portal-system online
      16:04:36,329 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index created: /opt/gatein/standalone/data/gatein/jcr/lucene/portal-system_portal
      16:04:36,317 INFO  [exo.jcr.component.core.WorkspaceResumer] (Thread-122) Setting workspace repository_portal-work online
      16:04:36,332 INFO  [exo.jcr.component.core.WorkspaceResumer] (Thread-120) Setting workspace repository_system online
      16:04:36,333 INFO  [exo.jcr.component.core.WorkspaceResumer] (Thread-123) Setting workspace repository_wsrp-system online
      16:04:36,350 INFO  [exo.jcr.component.core.WorkspaceResumer] (Thread-124) Setting workspace repository_pc-system online
      16:04:36,531 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Initializing RecoveryFilters.
      16:04:36,533 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index initialized: /opt/gatein/standalone/data/gatein/jcr/lucene/portal-system_portal Version: 4
      16:04:36,536 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index created: /opt/gatein/standalone/data/gatein/jcr/lucene/portal-work_portal
      16:04:36,589 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Initializing RecoveryFilters.
      16:04:36,590 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index initialized: /opt/gatein/standalone/data/gatein/jcr/lucene/portal-work_portal Version: 4
      16:04:36,594 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index created: /opt/gatein/standalone/data/gatein/jcr/lucene/system_portal_system
      16:04:36,629 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Initializing RecoveryFilters.
      16:04:36,630 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index initialized: /opt/gatein/standalone/data/gatein/jcr/lucene/system_portal_system Version: 4
      16:04:36,633 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index created: /opt/gatein/standalone/data/gatein/jcr/lucene/wsrp-system_portal
      16:04:36,660 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Initializing RecoveryFilters.
      16:04:36,660 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index initialized: /opt/gatein/standalone/data/gatein/jcr/lucene/wsrp-system_portal Version: 4
      16:04:36,663 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index created: /opt/gatein/standalone/data/gatein/jcr/lucene/pc-system_portal
      16:04:36,672 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Initializing RecoveryFilters.
      16:04:36,672 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index initialized: /opt/gatein/standalone/data/gatein/jcr/lucene/pc-system_portal Version: 4
      16:04:36,682 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index created: /opt/gatein/standalone/data/gatein/jcr/lucene/system_portal
      16:04:36,692 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Initializing RecoveryFilters.
      16:04:36,693 INFO  [exo.jcr.component.core.SearchIndex] (Incoming-1,null) Index initialized: /opt/gatein/standalone/data/gatein/jcr/lucene/system_portal Version: 4
      16:04:37,087 INFO  [org.jboss.as.clustering.impl.CoreGroupCommunicationService.lifecycle.web] (Incoming-12,null) JBAS010247: New cluster view for partition web (id: 2, delta: -1, merge: false) : [appserver2-dev.example.com/web]
      16:04:37,090 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,null) ISPN000094: Received new cluster view: [appserver2-dev.example.com/web|2] [appserver2-dev.example.com/web]
      16:04:46,277 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-2,null) Received new cluster view: [appserver2-dev-65221|2] [appserver2-dev-65221]
      16:04:46,304 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-1,null) Received new cluster view: [appserver2-dev-59376|2] [appserver2-dev-59376]
      16:04:46,305 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-2,null) Received new cluster view: [appserver2-dev-22286|2] [appserver2-dev-22286]
      16:04:46,355 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-3,null) ISPN000094: Received new cluster view: [appserver2-dev.example.com-49199|2] [appserver2-dev.example.com-49199]
      16:04:46,365 INFO  [org.jboss.cache.RPCManagerImpl] (Incoming-1,null) Received new cluster view: [appserver2-dev-62749|2] [appserver2-dev-62749]

       

       

      It's a pretty stock 3.5.0.Beta02, other than that I copied the clustering configuration from the standalone-ha.xml profile into standalone-full-ha.xml so I could take advantage of HornetQ and some other things only available in the full profile. I'm using mod_cluster with Apache 2 to front the cluster, and I disabled the local firewall on the two GateIn nodes. I can still access GateIn after the original node shuts down, but I have to log in again. Is that the expected behavior, or should the session be preserved? I don't know whether those warnings about dropped messages mean anything. I originally had some read and write buffer warnings in the logs; I resolved those via sysctl, but the problem persists.
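      For context, the clustering pieces copied over from standalone-ha.xml include the JGroups subsystem and the Infinispan "web" cache container that backs HTTP session replication. Below is a trimmed sketch of what that cache container typically looks like in the stock HA profile; the exact defaults vary by AS version, so treat it as illustrative rather than the configuration actually in use:

      <cache-container name="web" aliases="standard-session-cache" default-cache="repl"
                       module="org.jboss.as.clustering.web.infinispan">
          <transport lock-timeout="60000"/>
          <!-- replicated cache holding HTTP session state -->
          <replicated-cache name="repl" mode="ASYNC" batching="true">
              <file-store/>
          </replicated-cache>
          <!-- cache backing the SSO valve referenced by the web subsystem -->
          <replicated-cache name="sso" mode="SYNC" batching="true"/>
      </cache-container>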

        • 1. Re: Session failover in clustered
          mvanco

          Hi David,

           

          The most important thing for session replication is the following configuration:

              <subsystem xmlns="urn:jboss:domain:web:1.2" default-virtual-server="default-host" native="false">
                  <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http"/>
                  <connector name="ajp" protocol="AJP/1.3" scheme="http" socket-binding="ajp"/>
                  <virtual-server name="default-host" enable-welcome-root="true">
                      <alias name="localhost"/>
                      <alias name="example.com"/>
                      <sso cache-container="web" cache-name="sso" reauthenticate="false"/>
                  </virtual-server>
              </subsystem>

           

          With this setup, session replication works by default and the session should survive failover.
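          One thing worth double-checking as well (an assumption on my part, it is not visible from the snippet above) is that the web application itself is marked distributable in its web.xml, since the container only replicates sessions for distributable web apps. A minimal example:

          <web-app xmlns="http://java.sun.com/xml/ns/javaee"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
                                       http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
                   version="3.0">
              <!-- tells the container to replicate this application's HTTP sessions across the cluster -->
              <distributable/>
          </web-app>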

           

          Regards,

          Michal Vančo

          • 2. Re: Session failover in clustered
            dschlenk

            I have that configuration present on both nodes. Do I need to change something, such as changing or adding an alias?

            • 3. Re: Session failover in clustered
              dschlenk

              Maybe a better question is: is the "classic" portal configured for session replication? I deployed a sample clustered app and that appears to replicate properly, but if I log into the classic portal and then shut down the node the session was created on, I have to log in again.

              • 4. Re: Session failover in clustered
                mvanco

                Hi David,

                Can you check with a browser developer tool what your JSESSIONID cookie looks like when going through the load balancer? It should be something like JSESSIONID=<jsessionid>.<jvmRoute/nodeName>.

                After failover, the jsessionid part should remain the same, while the jvmRoute should change to the new active node.

                And how is your mod_cluster set up (sticky sessions, etc.)?
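                For example, sticky sessions can be set explicitly on the mod-cluster-config element of the modcluster subsystem (they should already be on by default); the snippet below is only an illustrative sketch, and the proxy-list value is a placeholder:

                <!-- illustrative only: the proxy-list host and port are placeholders -->
                <mod-cluster-config advertise-socket="modcluster" proxy-list="your-httpd-host:8001"
                                    sticky-session="true" sticky-session-force="false"/>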

                 

                Please provide as many details as possible so that we can try to reproduce and, if needed, report a possible issue. How do you start the cluster nodes (which parameters)?

                 

                Regards,
                Michal

                • 5. Re: Session failover in clustered
                  mvanco

                  Yes, session replication should work on the "classic" portal site.

                  • 6. Re: Session failover in clustered
                    dschlenk

                    The jsessionid does not remain the same, but the jvmRoute does change as expected.

                     

                    I have mod_cluster set up and am accessing the portal through it.

                     

                    The output of the status page when both nodes are active is attached to the original post.

                     

                    As far as configuration, I have:

                     

                    <subsystem xmlns="urn:jboss:domain:modcluster:1.0">
                        <mod-cluster-config advertise-socket="modcluster" proxy-list="httpd-dev.pe.spanlink.com:8001">
                            <dynamic-load-provider>
                                <load-metric type="busyness"/>
                            </dynamic-load-provider>
                        </mod-cluster-config>
                    </subsystem>

                     

                    on each JBoss node. On the Apache side:

                     

                    Listen 172.17.13.155:8001
                    MemManagerFile /var/cache/httpd

                    <VirtualHost *:8001>

                      <Location />
                        Order allow,deny
                        Allow from all
                      </Location>

                      <Location /mod_cluster_manager>
                        SetHandler mod_cluster-manager
                        Order deny,allow
                        Deny from all
                        Allow from 192.168.
                        Allow from 10.
                        Allow from 172.17.
                      </Location>

                      KeepAliveTimeout 60
                      MaxKeepAliveRequests 0

                      ManagerBalancerName mycluster
                      AdvertiseFrequency 5
                      ServerAdvertise On

                    </VirtualHost>

                     

                    Also attached are the full standalone-full-ha.xml files for each node. Note that I define the node name in those files so I don't need to specify it on the command line. The nodes are separate machines, so I don't run with any port offsets. Both nodes and the Apache server are on the same physical VMware host, so there shouldn't be any problems with network equipment blocking multicast or anything.
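                    For reference, a minimal sketch of how the node name can be set on the root server element of standalone-full-ha.xml; the namespace version depends on the exact AS release, and "node1" is just a placeholder name:

                    <!-- namespace version varies by AS release; "node1" is a placeholder node name -->
                    <server xmlns="urn:jboss:domain:1.2" name="node1">
                        <!-- extensions, profile, interfaces and socket bindings as in the stock profile -->
                    </server>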

                     

                    The startup command I use is just standalone.sh -c standalone-full-ha.xml