JGroups Quality Risks for AS 5
JGroups is moving from 2.4.1.SP4 in AS 4.3 to 2.6.2 in AS 5, and many new features have been introduced (multiplexing, FLUSH, partial/streaming state transfer, view bundling, concurrent stack, out-of-band messages, RPCDispatcher filtering, etc.). At least one new feature (multiplexing) is being used in AS 5, which would require tests over and above those present in AS 4.3. This page serves as a shared whiteboard for recording ideas on how the new features of JGroups affect the testing requirements for AS 5.
Key Changes by Release
Version | Changes To JGroups Features |
---|---|
2.4 | multiplexing channels, FLUSH, partial and streaming state transfer, view bundling, FD_ALL, FD_ICMP |
2.5 | concurrent stack, OOB messages, concurrent multiplexer, SFC, FLUSH fully supports virtual synchrony |
2.6 | join and perform state transfer at same time, UNICAST bundling, RPCDispatcher filtering, adding data to a view, reincarnation prevention, shared transport, FLUSH subset of cluster, eager lock release in NAKACK and UNICAST, thread factory hooks, PING and multiple Gossip Routers, TCPPING parallel discovery |
Potential Integration Problems with AS 5 due to JGroups changes
Risks are classified as functional, performance-related or stress-related. Priority could be added.
Functional
ID | Feature | Potential Problem | Action | Reporter |
---|---|---|---|---|
1 | Shared channels | Channel multiplexing is being used for sharing channels. Use of mux is no longer recommended. Shared transport is now the recommended approach. | check with developers (see Note1). Test that all apps are using shared transport in AS 5. | Bela |
2 | Concurrent stack | Application callbacks are not thread-safe. With the old stack, callbacks such as MembershipListener, MessageListener and ChannelListener were never called concurrently. With the new stack, these callbacks can be called concurrently for events (state transfers, message arrivals) from different senders (see Note2). Where the stack contains ordering protocols, this may restrict the degree of concurrent callbacks. For example, with FIFO ordering (e.g. pbcast.NAKACK for UDP mcast and UNICAST for UDP unicast), events from the same sender will not be concurrent, but processed in FIFO order. In general, though, the application should assume that callbacks will be called concurrently. | Devise tests for JBC, JBM, Clustering which test correctness under high concurrency from multiple peers. For AS Clustering: JBAS-5432 | Brian |
3 | Shared transport | AS behaves differently under shared transport than under non-shared transport. Seeing that the shared transport is so pervasive, this is a possibility. | run AS testsuite in both configurations and compare | Richard |
4 | Shared transport | Conflicting configurations. When channels share a common transport, each still defines a full stack, including the transport layer and its properties. The properties in these transport definitions may differ. Depending on the order of instantiation, desired property settings can be discarded without warning. | Possibly emit a warning when conflicting configurations arise. Devise tests to check that warnings are emitted | Brian |
Note1: JBM and JBC currently use createChannelFactory to create a multiplexed channel. Changing this would require (i) a change of interface for JGroups to expose the appropriate API, plus a new release, and (ii) a change to the JBM and JBC code, plus new releases. Instead, Brian will (i) modify the creation of channels so that a call to createChannelFactory returns a shared transport and not a mux channel, and (ii) inject a shared transport use_singleton property into the transport if one does not exist. In this way, all apps will use a shared transport by default. Some additional work is needed for Hibernate standalone.
Note2: Among callbacks, View changes do not occur concurrently. However, it may be good practice to ensure that all callbacks are thread safe.
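As a rough illustration of what "thread-safe callbacks" can mean in practice, the sketch below simulates concurrent delivery from several senders. The `Receiver` interface and the `CountingReceiver` and `ConcurrentCallbackDemo` classes are hypothetical stand-ins invented for this example; they are not the JGroups listener interfaces. Within each simulated sender, messages arrive in order on one thread, but different senders deliver in parallel, so all shared state in the callback is kept in concurrent data structures.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a message callback; NOT the JGroups MessageListener interface.
interface Receiver {
    void receive(String sender, byte[] payload);
}

// Callback whose state may be updated safely from several delivery threads at once.
class CountingReceiver implements Receiver {
    private final ConcurrentHashMap<String, AtomicLong> perSender = new ConcurrentHashMap<>();
    private final AtomicLong total = new AtomicLong();

    @Override
    public void receive(String sender, byte[] payload) {
        // computeIfAbsent is atomic, so two threads never install different counters for one sender.
        perSender.computeIfAbsent(sender, s -> new AtomicLong()).incrementAndGet();
        total.incrementAndGet();
    }

    long count(String sender) {
        AtomicLong c = perSender.get(sender);
        return c == null ? 0 : c.get();
    }

    long total() {
        return total.get();
    }
}

class ConcurrentCallbackDemo {
    // Simulates one delivery thread per sender, each invoking the callback messagesPerSender times.
    static CountingReceiver deliver(int senders, int messagesPerSender) {
        CountingReceiver receiver = new CountingReceiver();
        ExecutorService pool = Executors.newFixedThreadPool(senders);
        CountDownLatch start = new CountDownLatch(1);
        for (int s = 0; s < senders; s++) {
            final String sender = "node-" + s;
            pool.execute(() -> {
                try {
                    start.await(); // release all threads together to maximise overlap
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < messagesPerSender; i++) {
                    receiver.receive(sender, new byte[0]);
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("delivery threads did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return receiver;
    }

    public static void main(String[] args) {
        CountingReceiver r = deliver(8, 10_000);
        System.out.println("total=" + r.total() + " node-0=" + r.count("node-0"));
    }
}
```

A test along these lines only shows that no updates are lost; replacing the concurrent map with a plain `HashMap` is one way to check that the test can actually fail.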
Performance
ID | Feature | Potential Problem | Action | Reporter |
---|---|---|---|---|
1 | Shared transport | AS does not perform significantly better with shared transport. The degree of improvement in performance is at present unknown. | Comparative test of performance of AS with and without shared transport | Richard |
2 | Shared transport | Transport properties out of alignment with the sharing of multiple channels. Depending on the degree of sharing, UDP/TCP transports will now be receiving more messages. UDP and TCP buffers and thread pools will be more likely to overflow. | Adjust UDP/TCP properties accordingly and reach threshold of thread pool exhaustion. Is a warning message generated when the thread pool is full so that users can know? (Birna, Bela) We need to come up with settings (thread pool sizes, timer pool etc.) for AS which take all 5 or 6 channels into account | Richard |
3 | Shared transport | AS startup time is increased. Moving from the multiplexer to shared transport increases the startup time for channels, and so the AS. Currently, AS services are deployed by a single thread; concurrent deployment for independent services is being considered. | test the impact of using shared channels on overall AS startup time | Bela |
Stress
ID | Feature | Potential Problem | Action | Reporter |
---|---|---|---|---|
1 | Shared transport | Classloader leaks in thread pool. Class loader leaks in JGroups thread pool over time. Effect on JGroups / applications? | Devise test to force a classloader leak in the JGroups threadpool | Brian |
2 | Shared transport | Starvation of one service by another. One greedy service causes starvation of thread pool resources for other services: for example, JBoss Web with many sessions under replication using up all threads in JG thread pool | Set up AS "Seam-like" app that involves web and SFSB replication, entity clustering and JBM. Introduce faults into individual services to see how the overall system reacts | Brian |
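The starvation risk in row 2 can be reproduced in miniature with a plain `java.util.concurrent` pool. The sketch below is only an analogy: `SharedPoolStarvationDemo` and its method are hypothetical names, and the fixed-size executor stands in for a thread pool shared between services rather than for the actual JGroups transport pool. A "greedy" service occupies every thread, and a task from a second service cannot start until the greedy tasks finish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class SharedPoolStarvationDemo {
    // Returns true if the second service's task was still waiting while the greedy
    // service held every thread of the shared pool, and ran once they were released.
    static boolean victimStarvedWhileGreedyHoldsPool(int poolSize) {
        ExecutorService sharedPool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch greedyRunning = new CountDownLatch(poolSize);
        CountDownLatch releaseGreedy = new CountDownLatch(1);
        AtomicBoolean victimRan = new AtomicBoolean(false);
        try {
            // "Greedy" service: occupies every pool thread with a blocking task.
            for (int i = 0; i < poolSize; i++) {
                sharedPool.execute(() -> {
                    greedyRunning.countDown();
                    try {
                        releaseGreedy.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            greedyRunning.await(); // every pool thread is now busy

            // "Victim" service: its task can only sit in the queue.
            Future<?> victim = sharedPool.submit(() -> victimRan.set(true));
            boolean starved = !victimRan.get();

            releaseGreedy.countDown();        // greedy tasks finish
            victim.get(10, TimeUnit.SECONDS); // the victim now gets a thread
            return starved && victimRan.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            sharedPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("victim starved: " + victimStarvedWhileGreedyHoldsPool(2));
    }
}
```

A real stress test would replace the latch-controlled tasks with actual service load (for example heavy session replication), but the pass/fail question is the same: does work from the other services still make progress?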
Related Tests in JG, JBC, JBM and AS test suites
Identify here test suites which address the potential problems mentioned above. General indication is OK for now.
Functional
Component | Tests | Description |
---|---|---|
JGroups | SharedTransportTest | Tests of key features of the shared transport |
JGroups | ConcurrentStackTest | Tests key features of the concurrent stack |
Performance
Component | Tests | Description |
---|---|---|
sample component | name of suite | description of suite |
Stress
Component | Tests | Description |
---|---|---|
sample component | name of suite | description of suite |
Comments
Brian Stansberry (17 March 2008):
General comments on new JGroups feature usage in AS 5 and testing thereof:
In general order of "problem" significance:
1) Concurrent stack. Issue here is it is now possible for AS services and JBC to concurrently receive messages from JGroups. AS codebase has zero tests targeted at checking handling of this. Don't know about JBC.
2) Shared transport. General issue that different AS services will be sharing a JGroups resource. Couple subissues come to mind:
a) Lifecycle of the shared transport protocol as services start and stop. This is really something that's better tested directly in the JGroups testsuite; the AS isn't doing anything special here.
b) Different services are sharing the shared transport's thread pool, thus there is possibility of conflict between services over that resource (i.e. one rogue service consumes all threads). This is an area that needs testing. I've briefly talked with Dominik about the need for tests of a "Seam-like" app that involves web and SFSB replication, entity clustering and JBM. With such an app we can introduce faults into individual services to see how the overall system reacts.
3) OOB messages. The AS code doesn't use these directly. JBC might for 2PC COMMIT messages. Not sure what the risk is here; just something different.
4) VIEW_SYNC. Clebert asked a question Friday about how this changes the timing of receipt of view changes. Perhaps could be an issue if services have an implicit assumption about timing (which they shouldn't, since different nodes will always get views at different times.)
Bela Ban (17 March 2008)
+1 on Brian's comments, plus
Shared transport: since AS 5 will use it by default, we need to (a) see whether the current tests in JGroups cover all possible uses in AS and (b) add shared transport tests directly in the testsuites of AS (and, to a lesser degree, in JBC).
See where we're still using MUX and make sure nobody uses it anymore; switch all users to the shared transport.
The goal here is to have a valid replacement for the MUX, not to replace something flawed with something equally flawed
Richard Achmatowicz (2 April 2008)
1) Concurrent stack and callbacks: Applications interact with JGroups via callbacks such as MembershipListener, MessageListener and ChannelListener. One significant change between the old stack and the concurrent stack is that with the old stack, these callbacks were never called concurrently. With the new stack, these callbacks can be called concurrently for events (view changes, state transfers, message arrivals) from different senders.
In the case where the stack contains ordering protocols, this may restrict the degree of concurrent callbacks. For example, with FIFO ordering (e.g. pbcast.NAKACK for UDP mcast and UNICAST for UDP unicast), now events from the same sender will not be concurrent, but processed in FIFO order. But in general, the application should assume that callbacks will be called concurrently.
As a consequence, at a minimum, all application level callbacks should be thread-safe.
2) Shared transport and concurrent thread pool: When using the shared transport, the number of incoming messages to the transport may double, triple, etc. depending on the number of channels sharing that transport. If thread pool sizes and thread pool queue sizes are not adjusted, this will probably result in events being rejected due to thread pool being overloaded. In the case that the thread pool is overloaded, the default rejection policy is to have the caller (multicast port, unicast port, TCP port) carry the event up the stack. This will effectively result in sequential processing of events (need to double check this) without any exception being raised. Thus, poor performance may be seen.
3) RpcDispatcher: Are all callbacks used internally in RpcDispatcher (and HAPartition, DistributedState) thread-safe?
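The "caller carries the event" behaviour described in point 2 corresponds to what the standard `ThreadPoolExecutor.CallerRunsPolicy` does in `java.util.concurrent`. The sketch below demonstrates that policy on a deliberately tiny pool (one thread, one queue slot). It makes no claim about how the JGroups transport pools are actually configured; `CallerRunsDemo` is a hypothetical name used only for this example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CallerRunsDemo {
    // Saturates a 1-thread, 1-slot pool and reports whether the overflow task
    // was executed by the submitting thread instead of a pool thread.
    static boolean overflowTaskRanOnCaller() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread[] ranOn = new Thread[1];
        try {
            // Task 1 occupies the only worker thread.
            pool.execute(() -> {
                workerBusy.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workerBusy.await();
            // Task 2 fills the single queue slot.
            pool.execute(() -> { });
            // Task 3 is rejected by the pool; CallerRunsPolicy runs it synchronously
            // on the submitting thread, with no exception raised.
            pool.execute(() -> ranOn[0] = Thread.currentThread());
            return ranOn[0] == Thread.currentThread();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("overflow task ran on caller: " + overflowTaskRanOnCaller());
    }
}
```

Because the submitting thread is busy running the overflow task, it cannot hand off further work in the meantime, which is consistent with the concern above that an overloaded pool degrades silently towards sequential processing.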