JGroupsQualityRisksAS5

    JGroups Quality Risks for AS 5

     

    JGroups is moving from 2.4.1.SP4 in AS 4.3 to 2.6.2 in AS 5, and many new features have been introduced (multiplexing, FLUSH, partial/streaming state transfer, view bundling, concurrent stack, out-of-band messages, RpcDispatcher filtering, etc.). At least one new feature (multiplexing) is being used in AS 5, which would require tests over and above those present in AS 4.3. This page will serve as a shared whiteboard for recording ideas on how the new features of JGroups will affect the requirements for testing in AS 5.

     

     

    Key Changes by Release

     

    Version

    Changes To JGroups Features

    2.4

    multiplexing channels, FLUSH, partial and streaming state transfer, view bundling, FD_ALL, FD_ICMP

    2.5

    concurrent stack, OOB messages, concurrent multiplexer, SFC, FLUSH fully supports virtual synchrony

    2.6

    join and perform state transfer at same time, UNICAST bundling, RpcDispatcher filtering, adding data to a view, reincarnation prevention, shared transport, FLUSH on a subset of the cluster, eager lock release in NAKACK and UNICAST, thread factory hooks, PING and multiple Gossip Routers, TCPPING parallel discovery

     

    Potential Integration Problems with AS 5 due to JGroups changes

     

    Risks are classified as functional, performance or stress related. A priority could be added to each risk.

     

    Functional

     

    ID

    Feature

    Potential Problem

    Action

    Reporter

    1

    Shared channels

    Channel multiplexing is being used for sharing channels. Use of mux is no longer recommended. Shared transport is now the recommended approach.

    Check with developers (see Note1). Test that all apps are using shared transport in AS 5.

    Bela

    2

    Concurrent stack

    Application callbacks are not thread-safe. With the old stack, callbacks such as MembershipListener, MessageListener and ChannelListener were never called concurrently. With the new stack, these callbacks can be called concurrently for events (state transfers, message arrivals) from different senders (see Note2). Where the stack contains ordering protocols, the degree of concurrency may be restricted. For example, with FIFO ordering (e.g. pbcast.NAKACK for UDP mcast and UNICAST for UDP unicast), events from the same sender will not be concurrent, but processed in FIFO order. In general, though, the application should assume that callbacks will be called concurrently.

    Devise tests for JBC, JBM and Clustering that check correctness under high concurrency from multiple peers. For AS Clustering: JBAS-5432

    Brian

    3

    Shared transport

    AS behaves differently under shared transport than under non-shared transport. Seeing that the shared transport is so pervasive, this is a possibility.

    Run the AS testsuite in both configurations and compare the results

    Richard

    4

    Shared transport

    Conflicting configurations. When channels share a common transport, each still defines a full stack, including the transport layer and its properties. The properties in these transport definitions may differ. Depending on the order of instantiation, desired property settings can be discarded without warning.

    Possibly emit warning when conflicting configurations arise. Devise tests to check that warnings are emitted

    Brian
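
The check proposed for functional risk 4 can be sketched in a few lines. This is only an illustration, assuming the transport properties of each stack definition are available as plain string maps. The class `TransportConfigCheck`, the method `findConflicts` and the sample property values are hypothetical and are not part of the JGroups API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JGroups: compares the transport properties
// of two stack definitions that are meant to share one transport.
final class TransportConfigCheck {

    // Returns one warning per property that both definitions set to different values.
    static List<String> findConflicts(Map<String, String> first, Map<String, String> second) {
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, String> e : first.entrySet()) {
            String other = second.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                warnings.add("Conflicting transport property '" + e.getKey() + "': '"
                        + e.getValue() + "' is kept, '" + other + "' is discarded");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Illustrative values only; the first definition is assumed to be instantiated first.
        Map<String, String> first = new LinkedHashMap<>();
        first.put("mcast_port", "45566");
        first.put("ip_ttl", "2");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("mcast_port", "45566");
        second.put("ip_ttl", "8");
        for (String w : findConflicts(first, second)) {
            System.out.println("WARN: " + w);
        }
    }
}
```

A test for this risk could then assert that a warning is produced for every property whose value differs between the two definitions, and none when they agree.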

     

    Note1: JBM and JBC currently use createChannelFactory to create a multiplexed channel. Changing this would require (i) a change of interface for JGroups to expose the appropriate API, plus a new release, and (ii) a change to the JBM and JBC code, plus new releases. Brian will modify the creation of channels so that (i) a call to createChannelFactory returns a shared transport and not a mux channel, and (ii) a shared transport use_singleton property is injected into the transport if one does not exist. In this way, all apps will use a shared transport by default. Some additional work is needed for Hibernate standalone.

     

    Note2: Among callbacks, View changes do not occur concurrently. However, it may be good practice to ensure that all callbacks are thread safe.
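
A concurrency test for callbacks might take roughly the following shape. This is a minimal sketch under stated assumptions: `ThreadSafeListener` is a hypothetical stand-in for an application callback and deliberately does not implement the JGroups MessageListener interface, so that the example runs without JGroups on the classpath. One thread per simulated sender calls the callback at the same time, and the test checks that no update is lost.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an application callback; not a JGroups class.
final class ThreadSafeListener {

    private final ConcurrentMap<String, AtomicInteger> perSender = new ConcurrentHashMap<>();

    // May be invoked by several threads at once, one per sender.
    void receive(String sender) {
        perSender.computeIfAbsent(sender, k -> new AtomicInteger()).incrementAndGet();
    }

    int count(String sender) {
        AtomicInteger c = perSender.get(sender);
        return c == null ? 0 : c.get();
    }

    // Drives the listener from one thread per simulated sender and returns the total received.
    static int simulate(int senders, int messagesPerSender) {
        ThreadSafeListener listener = new ThreadSafeListener();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[senders];
        for (int s = 0; s < senders; s++) {
            String name = "sender-" + s;
            threads[s] = new Thread(() -> {
                try {
                    start.await(); // release all senders together to maximise overlap
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < messagesPerSender; i++) {
                    listener.receive(name);
                }
            });
            threads[s].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int total = 0;
        for (int s = 0; s < senders; s++) {
            total += listener.count("sender-" + s);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("received " + simulate(8, 10_000) + " messages");
    }
}
```

Replacing the ConcurrentHashMap and AtomicInteger with a plain HashMap and int counters would be expected to lose updates or fail under the same load, which is the kind of defect such a test is meant to expose.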

     

    Performance

     

    ID

    Feature

    Potential Problem

    Action

    Reporter

    1

    Shared transport

    AS does not perform significantly better with shared transport. The degree of improvement in performance is at present unknown.

    Comparative test of performance of AS with and without shared transport

    Richard

    2

    Shared transport

    Transport properties not sized for multiple channels sharing the transport. Depending on the degree of sharing, UDP/TCP transports will now be receiving more messages. UDP and TCP buffers and thread pools will be more likely to overflow.

    Adjust UDP/TCP properties accordingly and find the threshold of thread pool exhaustion. Is a warning message generated when the thread pool is full so that users can know? (Brian, Bela) We need to come up with settings (thread pool sizes, timer pool etc.) for AS which take all 5 or 6 channels into account

    Richard

    3

    Shared transport

    AS startup time is increased. Moving from the multiplexer to shared transport increases the startup time for channels, and therefore for the AS. Currently, AS services are deployed by a single thread; concurrent deployment for independent services is being considered.

    Test the impact of using shared channels on overall AS startup time

    Bela
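
On the question in performance risk 2 of whether users are told when the thread pool is full: one possible shape for such a warning is sketched below with plain java.util.concurrent classes. `WarningRejectionHandler` is a hypothetical class, not something JGroups provides, and the pool sizes are chosen only to force rejections quickly.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical handler, not a JGroups class: counts and reports rejections,
// then falls back to running the task on the submitting thread.
final class WarningRejectionHandler implements RejectedExecutionHandler {

    private final AtomicInteger rejections = new AtomicInteger();
    private final RejectedExecutionHandler fallback = new ThreadPoolExecutor.CallerRunsPolicy();

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor pool) {
        int n = rejections.incrementAndGet();
        System.err.println("WARN: thread pool full (max=" + pool.getMaximumPoolSize()
                + ", queued=" + pool.getQueue().size() + "), rejection #" + n);
        fallback.rejectedExecution(task, pool);
    }

    int rejections() {
        return rejections.get();
    }

    // Saturates a one-thread pool with a one-slot queue and returns the number of rejections seen.
    static int demo(int tasks) {
        WarningRejectionHandler handler = new WarningRejectionHandler();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(1), handler);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // The first task parks the only worker until the latch is released.
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> { }); // the first is queued, the rest are rejected
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return handler.rejections();
    }

    public static void main(String[] args) {
        System.out.println("rejections: " + demo(5));
    }
}
```

A rejection counter of this kind would also give the threshold tests a concrete value to assert on when tuning pool and queue sizes.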

     

    Stress

     

    ID

    Feature

    Potential Problem

    Action

    Reporter

    1

    Shared transport

    Classloader leaks in thread pool. Classloaders may leak through the JGroups thread pool over time. What is the effect on JGroups and on applications?

    Devise a test to force a classloader leak in the JGroups thread pool

    Brian

    2

    Shared transport

    Starvation of one service by another. One greedy service causes starvation of thread pool resources for other services: for example, JBoss Web with many sessions under replication using up all threads in JG thread pool

    Set up AS "Seam-like" app that involves web and SFSB replication, entity clustering and JBM. Introduce faults into individual services to see how the overall system reacts

    Brian
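
The starvation scenario in stress risk 2 can be reproduced in miniature before building the full "Seam-like" application. The sketch below is a simulation with a plain Java executor, not AS or JGroups code: the class `SharedPoolStarvation` and both "services" are hypothetical. A greedy service occupies every thread of a shared pool, and the check confirms that a second service's task cannot start until the greedy one lets go.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation, not AS or JGroups code: two "services" share one fixed pool.
final class SharedPoolStarvation {

    // Returns true if the second service's task could not start while the
    // greedy service held every thread, and did start once they were freed.
    static boolean starvedWhileGreedyRuns(int poolSize) {
        ExecutorService shared = Executors.newFixedThreadPool(poolSize);
        CountDownLatch allBusy = new CountDownLatch(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch victimStarted = new CountDownLatch(1);
        try {
            // Greedy service: occupies every thread in the shared pool.
            for (int i = 0; i < poolSize; i++) {
                shared.execute(() -> {
                    allBusy.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allBusy.await();
            // Second service: its task can only wait in the queue.
            shared.execute(victimStarted::countDown);
            boolean startedEarly = victimStarted.await(200, TimeUnit.MILLISECONDS);
            release.countDown();
            boolean startedAfterRelease = victimStarted.await(5, TimeUnit.SECONDS);
            return !startedEarly && startedAfterRelease;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            shared.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("second service starved: " + starvedWhileGreedyRuns(2));
    }
}
```

In a real test the blocked tasks would be replaced by injected faults in one service, with the other services' response times as the measured outcome.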

     

    Related Tests in JG, JBC, JBM and AS test suites

     

    Identify here test suites which address the potential problems mentioned above. General indication is OK for now.

     

    Functional

     

    Component

    Tests

    Description

    JGroups

    SharedTransportTest

    Tests of key features of the shared transport

    JGroups

    ConcurrentStackTest

    Tests key features of the concurrent stack

     

    Performance

     

    Component

    Tests

    Description

    sample component

    name of suite

    description of suite

     

    Stress

     

    Component

    Tests

    Description

    sample component

    name of suite

    description of suite

     

    Comments

     

    Brian Stansberry (17 March 2008):

     

    General comments on new JGroups feature usage in AS 5 and testing thereof:

     

    In general order of "problem" significance:

     

    1) Concurrent stack.  Issue here is it is now possible for AS services and JBC to concurrently receive messages from JGroups.  AS codebase has zero tests targeted at checking handling of this.  Don't know about JBC.

     

    2) Shared transport. General issue that different AS services will be sharing a JGroups resource.  Couple subissues come to mind:

     

    a) Lifecycle of the shared transport protocol as services start and stop.  This is really something that's better tested directly in the JGroups testsuite; the AS isn't doing anything special here.

     

    b) Different services are sharing the shared transport's thread pool, thus there is possibility of conflict between services over that resource (i.e. one rogue service consumes all threads).  This is an area that needs testing.  I've briefly talked with Dominik about the need for tests of a "Seam-like" app that involves web and SFSB replication, entity clustering and JBM. With such an app we can introduce faults into individual services to see how the overall system reacts.

     

    3) OOB messages.  The AS code doesn't use these directly.  JBC might for  2PC COMMIT messages. Not sure what the risk is here; just something different.

     

    4) VIEW_SYNC. Clebert asked a question Friday about how this changes the timing of receipt of view changes.  Perhaps could be an issue if services have an implicit assumption about timing (which they shouldn't, since different nodes will always get views at different times.)

     

    Bela Ban (17 March 2008)

     

    +1 on Brian's comments, plus

     

        Shared transport: since AS 5 will use it by default, we need to (a) see whether the current tests in JGroups cover all possible uses in AS and (b) add shared transport tests directly in the testsuites of AS (and, to a lesser degree, in JBC)

        See where we're still using MUX and make sure nobody uses it anymore; switch all users to the shared transport

     

    The goal here is to have a valid replacement for the MUX, not to replace something flawed with something equally flawed

     

    Richard Achmatowicz (2 April 2008)

     

    1) Concurrent stack and callbacks: Applications interact with JGroups via callbacks such as MembershipListener, MessageListener and ChannelListener. One significant change between the old stack and the concurrent stack is that with the old stack, these callbacks were never called concurrently. With the new stack, these callbacks can be called concurrently for events (view changes, state transfers, message arrivals)  from different senders.

     

    In the case where the stack contains ordering protocols, this may restrict the degree of concurrent callbacks. For example, with FIFO ordering (e.g. pbcast.NAKACK for UDP mcast and UNICAST for UDP unicast), now events from the same sender will not be concurrent, but processed in FIFO order. But in general, the application should assume that callbacks will be called concurrently.

     

    As a consequence, at a minimum, all application level callbacks should be thread-safe.

     

    2) Shared transport and concurrent thread pool: When using the shared transport, the number of incoming messages to the transport may double, triple, etc. depending on the number of channels sharing that transport. If thread pool sizes and thread pool queue sizes are not adjusted, this will probably result in events being rejected due to thread pool being overloaded. In the case that the thread pool is overloaded, the default rejection policy is to have the caller (multicast port, unicast port, TCP port) carry the event up the stack. This will effectively result in sequential processing of events (need to double check this) without any exception being raised. Thus, poor performance may be seen.
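
The "caller carries the event" behaviour described in point 2 can be observed with the standard java.util.concurrent CallerRunsPolicy. Whether the JGroups thread pools behave identically is exactly the point flagged above as needing a double check, so the following is an analogue under that assumption and not a statement about JGroups internals. The class name `CallerRunsDemo` and the pool sizes are illustrative.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustration with java.util.concurrent only; no JGroups classes are involved.
final class CallerRunsDemo {

    // Saturates a tiny pool and returns the names of the threads that ran the overflow tasks.
    static List<String> threadsThatRanOverflow(int overflowTasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<Runnable>(1), new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        List<String> ranOn = new CopyOnWriteArrayList<>();
        try {
            pool.execute(() -> {       // parks the only worker thread
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool.execute(() -> { });   // fills the one-slot queue
            for (int i = 0; i < overflowTasks; i++) {
                // Pool and queue are full, so each of these is rejected and runs on the caller.
                pool.execute(() -> ranOn.add(Thread.currentThread().getName()));
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return ranOn;
    }

    public static void main(String[] args) {
        System.out.println("caller thread:   " + Thread.currentThread().getName());
        System.out.println("overflow ran on: " + threadsThatRanOverflow(3));
    }
}
```

Every overflow task runs on the submitting thread, one after another and without any exception, which matches the silent fall back to sequential processing suspected above.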

     

    3) RpcDispatcher: Are all callbacks used internally in RpcDispatcher (and HAPartition, DistributedState) thread-safe?