Version 11

    Transaction Profiling and Visualization

     

    Overview of the Tool

     

    The purpose of this project is to create a tool for profiling the status of in-flight and complete transactions. The main aims of the tool are to help debug transaction related issues and also to provide insight into the running of the system. Essentially, the tool profiles a running system (or a log from a running system) and produces a detailed list of all transactions. A user may then select a particular transaction to see in more detail, including the participants involved in the transaction, the outcome of the transaction and the participants that influenced the outcome. The tool could be visual, showing a diagrammatic view of a transaction or may be command line based, producing a textual output (or maybe both).

     

    What kind of problems could this solve?

     

    • Transaction Timeout detection. We frequently get support requests (via the forums or elsewhere) from users who are experiencing intermittent rollbacks of transactions due to timeout. It is not a simple matter for the user to figure out that this is what's happening.
    • Transaction Profiling. It may be relatively easy to detect transactions that are taking longer than desired to complete. However, diagnosing which party in the transaction is to blame is non-trivial.
    • Loops and Diamonds detection. In a distributed transaction it is possible to introduce a loop or a diamond without realizing. JTS may tolerate this, but distributed JTA and bridged WS-AT to JTA does not. Identifying this scenario is non-trivial and requires detailed internal knowledge of the TM.
    • Performance Profiling. Diagnosing which transactions are taking a long time to run and discovering where the time is being spent.
    • Reasoning about the System Architecture. Producing an architectural diagram of a distributed transaction with many participants and servers is not an easy task.  Furthermore it can be error prone and could change based on the application inputs. The difficulty of this task is further compounded if multiple transaction types are involved (WS-AT, WS-BA, REST-AT, etc). This diagram is essential for reasoning about the current architecture and for considering improvements or changes. Also this diagram alone may not be enough without detailed profiling overlayed.
    • Analyzing Disk Syncs. One approach to improving the performance of a transaction is to reduce the number of disk syncs. The problem is that it is often difficult to calculate exactly how many you are performing and what delay each adds.

     

     

    Who is the target audience?

    • End users, to diagnose their own issues
    • GSS, to diagnose customer issues
    • Architects, to analyze their system architecture
    • Those new to the field of transactions, learning what's going on.

     

    GSS Requirements

    We are currently targetting GSS as our early adoptors for this tool. We will also make early release available to the community and work with any early adopters when they apear.

     

    • JTA-only at the moment then JTS. XTS and REST not a requirement for the forseable future.
    • Display the following information
      • List of transactions with outcome (success or failure)
        • Detailed information on a per transaction basis
          • List of resources
            • JNDI name of resource
            • Indication as to the type and vendor
            • Outcome of its participation in the protocol
          • Event log of the activities that occured in the transaction (with timings).
          • If timeout, what caused it.
          • Recovery information [VERIFY]
          • List of subordinate transactions (in JTS case)
      • Query transactions
        • By outcome.
        • By failure reason
        • Inflight
          • Stuck (inflight for a long time)
        • Recovered transactions [VERIFY]
    • Parse existing logs submitted by customers
    • Target EAP 6.1 [VERIFY]
    • Web and command-line interfaces [VERIFY]
    • Single-server (JTA-only) requirement higher than distributed (JTS) requirement at the momment.
    • Need a non-byteman solution. However, Byteman could be used for optional features, where no other alternative is available.
      • Many users will be unwilling to enable Byteman in production.
    • The logs must not leave the business domain [VERIFY]
      • Many users will not be willing to have their logs submitted to some third-party service for processing.

     

     

    How does it solve these problems?

     

    The tool analyses the server log and gathers data on every transaction ran. The tool could also support live updates for a server that is still running. The tool assumes that it can gather all its information from log statements. If it can't, we take the view that some logging is missing and then add it. We may also want the tool to support different log levels. For example, it may need a log level of TRACE to acquire full knowledge of the system. This may introduce too much of a performance overhead, so we may want to allow the tool to tollerate the reduced information provided by the less verbose log levels. In this case only basic information would be provided.

     

    It is unlikely that this tool be used routinely in production. I think the sensible thing to do is to gather aggregate information in such a way that has minimal impact on the running system. This is mostly covered at the moment, as Narayana can obtain, rollback, commit and in-flight transaction counters. This information is great for identifying the presence of a problem, but in many cases the user would then need to turn up the logging to establish the root cause of the problem. This is similar to how developers improve the performance of a system. Aggregate data (i.e. average throughput, average response times) is used to identify that a problem exists. The developer then attaches a profiler to identify the root cause of the performance problem.

     

    I see this tool more as a profiling tool to be used when the aggregate data has identified an issue. In which case, the overhead of gathering this data is less of an issue, assuming it doesn't impact the system to the point where it skews the data being gathered.

     

    Querying feature

    The data can be queried to find out specific information. For example, show me all the transactions that rolled back. Given a rolledback transaction, the data should be available to diagnose exactly why it rolledback. Other things you may want to query by:

     

    • Outcome. Committed, Rolledback, heuristic, etc
    • Type. JTA, JTS, WS-AT, REST-AT, etc
    • Duration. Find all fast or slow transactions
    • Inflight. Show transactions currently inflight
    • Stuck. Inflight transactions that have been running for longer than Xms.
    • Recovery. All transactions that needed to be recovered.

     

    Diagnostics

    The tool could also allow 'profiles' to be plugged in. A 'profile' is responsible for identifying common problems that we see users having. Hopefully these would allow issues to be identified earlier before they are passed further up the support chain. For example, we may create a 'profile' that searches over the data looking for loops or diamonds. Another could be responsible for identifying timeouts and maybe hinting at their cause. These 'profiles' would be created based on community demand.

     

    Visualization

     

    In order to help architects reason about the system architecture, the tool could produce a visualization of a transaction of interest. A few diagrams should give you an idea of what the tooling could produce:

     

    TXVisualCommitted(2).png

     

    Here we can see that a transaction was begun on Server1. It enlisted a DB and JMS queue locally and invoked a remote service on Server2. A second resource (DB2) was enlisted on Server2.

     

    TXVisualRollback.png

     

    In this example we can see that the transaction rolled back because DB2 voted rollback.

     

    Other possible features:

     

    •     Participant types. Display if the participant is one or two phase aware for JTA/JTS. Display the completion type (Participant/Coordinator) for WS-BA.
    •     Resource types. Whether it's a database or message queue and maybe include the name and version.
    •     Multiple Transaction Types. Somehow depict the type of transaction (JTS, WS-AT, REST-AT, etc).
    •     Display Bridges. Display when a transaction is bridged from one type to another
    •     In flight Transactions. More useful for long running transactions; display the current status of each participant and update the display in real-time as the status changes.
    •     Recovery Status. Show which resources crashed and the status of those resources already committed or recovered.

     

     

    Reducing the amount of data

    It's likely that enabling all the required TRACE logging will generate too large a log, and slow the system down. Therefore to optimize this we could do a number of things:

    1. Use Byteman to do the TRACE logging anyway (despite being turned off), for just the items we require
    2. Use Byteman to somehow buffer up detailed logging in memory, for all transactions. This  is discarded for committed transactions and dumped to the log for failed transactions.
    3. Update Narayana to dump a lot of data when the transaction fails. Same as the Byteman approach, but will be useful for more users.

     

    We should also reduce the amount of data held for 'identical' transactions. For example, we may have 1,000's of transactions that follow the same paths, interacting with the same resources. We should simply store this data once along with aggregate data for all the associated transactions.

     

    Research

    Things to look into:

     

     

    Example - log grepping

    Edit the log levels in standalone.xml setting the "com.arjuna" level to TRACE and the CONSOLE logger to TRACE. You'll then see a lot of logging produced by the transaction manager. Here's some interesting messages for different scenarios:

     

    Commit
    09:20:38,554 TRACE [com.arjuna.ats.arjuna] (pool-1-thread-1) BasicAction::Begin() for action-id 0:ffffac118223:61d0e901:515016c7:13
    
     [0m09:20:38,558 TRACE [com.arjuna.ats.jta] (pool-1-thread-1) TransactionImple.enlistResource ( org.jboss.narayana.txvis.simple.DummyXAResource@47b24153 )
     [0m09:20:38,563 TRACE [com.arjuna.ats.jta] (pool-1-thread-1) TransactionImple.enlistResource ( org.jboss.narayana.txvis.simple.DummyXAResource@1bb557c8 )
    
    09:20:38,565 TRACE [com.arjuna.ats.arjuna] (pool-1-thread-1) BasicAction::prepare () for action-id 0:ffffac118223:61d0e901:515016c7:13
    09:20:38,566 TRACE [com.arjuna.ats.jta] (pool-1-thread-1) XAResourceRecord.topLevelPrepare for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@47b24153, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:61d0e901:515016c7:13, node_name=1, branch_uid=0:ffffac118223:61d0e901:515016c7:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6e09388c >
    09:20:38,567 TRACE [com.arjuna.ats.jta] (pool-1-thread-1) XAResourceRecord.topLevelPrepare for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@1bb557c8, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:61d0e901:515016c7:13, node_name=1, branch_uid=0:ffffac118223:61d0e901:515016c7:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@1f5fa7c4 >
    
    09:20:38,573 TRACE [com.arjuna.ats.arjuna] (pool-1-thread-1) BasicAction::doCommit (XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@47b24153, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:61d0e901:515016c7:13, node_name=1, branch_uid=0:ffffac118223:61d0e901:515016c7:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6e09388c >)
    09:20:38,574 TRACE [com.arjuna.ats.jta] (pool-1-thread-1) XAResourceRecord.topLevelCommit for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@47b24153, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:61d0e901:515016c7:13, node_name=1, branch_uid=0:ffffac118223:61d0e901:515016c7:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6e09388c >
    09:20:38,574 TRACE [com.arjuna.ats.arjuna] (pool-1-thread-1) BasicAction::doCommit (XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@1bb557c8, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:61d0e901:515016c7:13, node_name=1, branch_uid=0:ffffac118223:61d0e901:515016c7:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@1f5fa7c4 >)
    09:20:38,574 TRACE [com.arjuna.ats.jta] (pool-1-thread-1) XAResourceRecord.topLevelCommit for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@1bb557c8, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:61d0e901:515016c7:13, node_name=1, branch_uid=0:ffffac118223:61d0e901:515016c7:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@1f5fa7c4 >
    09:20:38,575 TRACE [com.arjuna.ats.arjuna] (pool-1-thread-1) FileSystemStore.remove_committed(0:ffffac118223:61d0e901:515016c7:13, /StateManager/BasicAction/TwoPhaseCoordinator/AtomicAction)
    

     

    Client Driven Rollback
     [0m10:03:41,737 TRACE [com.arjuna.ats.arjuna] (pool-2-thread-1) BasicAction::Begin() for action-id 0:ffffac118223:-507ee916:515020eb:13
    
     [0m10:03:41,741 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) TransactionImple.enlistResource ( org.jboss.narayana.txvis.simple.DummyXAResource@5cbf424a )
     [0m10:03:41,747 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) TransactionImple.enlistResource ( org.jboss.narayana.txvis.simple.DummyXAResource@53adb4e5 )
    
     [0m10:03:41,749 TRACE [com.arjuna.ats.arjuna] (pool-2-thread-1) BasicAction::Abort() for action-id 0:ffffac118223:-507ee916:515020eb:13
    
    [0m10:03:41,750 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) XAResourceRecord.topLevelAbort for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@5cbf424a, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-507ee916:515020eb:13, node_name=1, branch_uid=0:ffffac118223:-507ee916:515020eb:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@6c28e401 >
    [0m10:03:41,751 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) XAResourceRecord.topLevelAbort for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@53adb4e5, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-507ee916:515020eb:13, node_name=1, branch_uid=0:ffffac118223:-507ee916:515020eb:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@70913520 >
    
    
    

     

    Participant Driven Rollback
     [0m10:15:52,446 TRACE [com.arjuna.ats.arjuna] (pool-2-thread-1) BasicAction::Begin() for action-id 0:ffffac118223:-6fc6eada:515023cb:13
    
     [0m10:15:52,450 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) TransactionImple.enlistResource ( org.jboss.narayana.txvis.simple.DummyXAResource@72e0b50 )
     [0m10:15:52,455 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) TransactionImple.enlistResource ( org.jboss.narayana.txvis.simple.DummyXAResource@7aa354c9 )
    
     [0m10:15:52,458 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) XAResourceRecord.topLevelPrepare for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@72e0b50, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@50150c84 >
    [0m10:15:52,459 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) XAResourceRecord.topLevelPrepare for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@7aa354c9, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@15e8d72f >
    
     [0m [33m10:15:52,460 WARN  [com.arjuna.ats.jta] (pool-2-thread-1) ARJUNA016041: prepare on < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:18, subordinatenodename=null, eis_name=unknown eis name > (org.jboss.narayana.txvis.simple.DummyXAResource@7aa354c9) failed with exception XAException.XAER_RMERR: javax.transaction.xa.XAException
    
     [0m [33m10:15:52,467 WARN  [com.arjuna.ats.arjuna] (pool-2-thread-1) ARJUNA012073: BasicAction.End() - prepare phase of action-id 0:ffffac118223:-6fc6eada:515023cb:13 failed.
     [0m [33m10:15:52,467 WARN  [com.arjuna.ats.arjuna] (pool-2-thread-1) ARJUNA012075: Action Aborting
     [0m10:15:52,467 TRACE [com.arjuna.ats.arjuna] (pool-2-thread-1) BasicAction::phase2Abort() for action-id 0:ffffac118223:-6fc6eada:515023cb:13
     [0m10:15:52,468 TRACE [com.arjuna.ats.arjuna] (pool-2-thread-1) BasicAction::doAbort (XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@72e0b50, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@50150c84 >)
     [0m10:15:52,468 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) XAResourceRecord.topLevelAbort for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@72e0b50, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:14, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@50150c84 >
     [0m10:15:52,468 TRACE [com.arjuna.ats.arjuna] (pool-2-thread-1) BasicAction::doAbort (XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@7aa354c9, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@15e8d72f >)
     [0m10:15:52,469 TRACE [com.arjuna.ats.jta] (pool-2-thread-1) XAResourceRecord.topLevelAbort for XAResourceRecord < resource:org.jboss.narayana.txvis.simple.DummyXAResource@7aa354c9, txid:< formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffffac118223:-6fc6eada:515023cb:13, node_name=1, branch_uid=0:ffffac118223:-6fc6eada:515023cb:18, subordinatenodename=null, eis_name=unknown eis name >, heuristic: TwoPhaseOutcome.FINISH_OK com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord@15e8d72f >
    
    

    Data Model

    The basic data model currently used in early iterations of the tool is described in the following ER diagram

     

    txvis-entities.png