32 Replies. Latest reply on Nov 25, 2010 12:02 AM by jombo

Cluster messages not redistributed after node hard kill

parmstrong Nov 9, 2010 6:24 PM

We have a two-node cluster running HornetQ integrated with JBoss 5. We bring the two nodes up and they discover each other fine. I bring a client up and start sending messages to a queue that is clustered across the nodes. Everything works well for a while: the messages are load balanced and all is happy.

I can then get the cluster into a strange state in a couple of different ways, but the easiest is to go to one of the server nodes and kill the server process. The surviving server appears to detect that the other node went down and adjusts its cluster view, but from then on every other message I send to the server is not processed. It looks like the message is placed on the bridge queue that was connecting the two nodes. The consumer count on the bridge queue is one, but the other server is dead, so the message is never picked up by anything. If a node dies, I would like the node that is still running to take over its messages and process them.

In this case I can restart the other node and it starts picking the messages up when it comes back. There are other situations where the same sort of thing happens, but even after a restart of the one server it never starts taking messages off the bridge again. In that case I have to restart the entire cluster. So, two questions:

1. How do I get the cluster to handle a node being hard killed, so that the live node starts processing the messages meant for the dead node?

2. Have you seen one of the nodes in a cluster get into a state where, even after a restart, it cannot rejoin the cluster in a way that lets it start processing messages again? If so, what can be done about that?

I attached my hornetq-jms.xml for reference.

16:09:22,620 INFO [PerseusPartition] New cluster view for partition PerseusPartition (id: 1, delta: 1) : [10.4.16.63:1099, 10.4.16.64:1099]
16:09:22,625 INFO [PerseusPartition] I am (10.4.16.63:1099) received membershipChanged event:
16:09:22,625 INFO [PerseusPartition] Dead members: 0 ([])
16:09:22,625 INFO [PerseusPartition] New Members : 1 ([10.4.16.64:1099])
16:09:22,625 INFO [PerseusPartition] All Members : 2 ([10.4.16.63:1099, 10.4.16.64:1099])
16:09:22,912 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:37679|1] [10.4.16.63:37679, 10.4.16.64:39101]
16:09:34,348 INFO [BridgeImpl] Connecting bridge sf.my-cluster.52417568-e2df-11df-b017-000c29922be7 to its destination
16:09:34,563 INFO [BridgeImpl] Bridge sf.my-cluster.52417568-e2df-11df-b017-000c29922be7 is connected to its destination
16:10:10,307 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:49982|1] [10.4.16.63:49982, 10.4.16.64:57256]
16:10:31,535 INFO [PerseusPartition] Suspected member: 10.4.16.64:39101
16:10:31,586 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:37679|2] [10.4.16.63:37679]
16:10:31,589 INFO [PerseusPartition] New cluster view for partition PerseusPartition (id: 2, delta: -1) : [10.4.16.63:1099]
16:10:31,590 INFO [PerseusPartition] I am (10.4.16.63:1099) received membershipChanged event:
16:10:31,590 INFO [PerseusPartition] Dead members: 1 ([10.4.16.64:1099])
16:10:31,590 INFO [PerseusPartition] New Members : 0 ([])
16:10:31,590 INFO [PerseusPartition] All Members : 1 ([10.4.16.63:1099])
16:10:31,684 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:49982|2] [10.4.16.63:49982]
16:11:26,281 WARN [InterceptorsFactory] EJBTHREE-1246: Do not use InterceptorsFactory with a ManagedObjectAdvisor, InterceptorRegistry should be used via the bean container
16:11:26,281 WARN [InterceptorsFactory] EJBTHREE-1246: Do not use InterceptorsFactory with a ManagedObjectAdvisor, InterceptorRegistry should be used via the bean container
16:11:36,174 WARN [RemotingConnectionImpl] Connection failure has been detected: Did not receive ping from /10.4.16.64:60438. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]
16:11:36,175 WARN [ServerSessionImpl] Client connection failed, clearing up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:11:36,175 WARN [ServerSessionImpl] Cleared up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:11:36,176 WARN [ServerSessionPacketHandler] Client connection failed, clearing up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:11:36,176 WARN [ServerSessionPacketHandler] Cleared up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:12:05,769 ERROR [ServerThread] WorkerThread#0[10.4.11.211:64297] exception occurred during first invocation

hornetq-jms.xml 1.9 KB

1. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 9, 2010 8:04 PM (in response to parmstrong)

We're doing some upgrades to cluster topology and client notification.
Meanwhile, I believe you would need to recreate your connection factories. However, the messages should be redistributed fine.
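
One way to read the suggestion above is that the client should stop reusing a cached connection factory after a failure and look it up again before retrying. A minimal sketch of that pattern in plain Java follows. The names `FreshLookupRetry`, `withFreshResource`, and `Work` are hypothetical, and the `Callable` stands in for whatever JNDI lookup and JMS calls the real client makes; nothing here is HornetQ API.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries a piece of work, looking the resource up again each time.
class FreshLookupRetry {
    // Work that needs the looked-up resource, for example sending through a connection factory.
    interface Work<R, T> {
        T run(R resource) throws Exception;
    }

    static <R, T> T withFreshResource(Callable<R> lookup, Work<R, T> work, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Look the resource up on every attempt instead of reusing a cached instance.
                R resource = lookup.call();
                return work.run(resource);
            } catch (Exception e) {
                last = e; // remember the failure and try again with a new lookup
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

In a JMS client the `lookup` would typically wrap the `context.lookup("ConnectionFactory")` call and the `work` would create the connection, session, and sender and then send. Whether this is enough to work around the stuck bridge queue is not established in this thread.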
2. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 10, 2010 1:58 PM (in response to clebert.suconic)

Are you saying that the surviving node needs to recreate the connection factory when the other node dies? Maybe I am missing something. Is there more information I need to give to help diagnose this problem?

Basically, I can go to the JMX console on each of the machines, and under the HornetQ section there is a queue on each node that looks to be the queue for the bridge between the nodes. When everything is working, I send a message to one server and it processes it, then the next message seems to be put into the bridge queue and is processed by the other node. After I kill the one server the same thing happens, except that every other message is added to the bridge queue and there is no other server on the other end of the bridge. Every other message just stacks up on the queue waiting for the other node to pick it up, but that node is dead. Why is the live node still putting messages into the bridge queue when the other machine is dead? Am I making sense? If not, please let me know where I am lacking detail.
3. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 10, 2010 6:33 PM (in response to parmstrong)

Can you please add details on how to replicate this to this JIRA: https://jira.jboss.org/browse/HORNETQ-568
4. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 10, 2010 6:34 PM (in response to clebert.suconic)

Actually: Can you try this with trunk first? I have seen a few issues fixed with the bridge.

I will close the JIRA depending on how it goes.
5. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 10, 2010 7:10 PM (in response to clebert.suconic)

I tried to build trunk and got an error:

BUILD FAILED
C:\hornetq\trunk\build.xml:211: The following error occurred while executing this line:
C:\hornetq\trunk\build-hornetq.xml:1191: The following error occurred while executing this line:
C:\hornetq\trunk\build-maven.xml:152: Execute failed: java.io.IOException: Cannot run program "mvn": CreateProcess error=2, The system cannot find the file specified

Also, my HornetQ is integrated with JBoss 5. If I build trunk, what do I need to do to swap the trunk HornetQ code into my JBoss? I see that there are HornetQ jars in the all/lib and all/deploy/jms-ra.rar directories. Do I just swap the jars in those two directories, and is that it?
6. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 10, 2010 7:25 PM (in response to parmstrong)

This is something we will have to fix for Windows. For now, change build-maven.xml, replacing every mvn with mvn.bat.
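
For background on why the rename helps: the `CreateProcess error=2` in the build output is what Java reports on Windows when it is asked to start a bare `mvn`, because process creation there does not try the `.bat` extension the way an interactive shell does. A minimal sketch of picking the launcher name by operating system follows. The class and method names are hypothetical and this is not taken from the HornetQ build scripts.

```java
import java.util.Locale;

// Hypothetical helper; not part of the HornetQ build.
class MavenCommand {
    // Returns the launcher name to hand to a process API for the given os.name value.
    static String mavenExecutable(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).contains("windows");
        return windows ? "mvn.bat" : "mvn";
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println(osName + " -> " + mavenExecutable(osName));
    }
}
```

The same idea can be applied in an Ant build with an OS condition, which avoids editing the file by hand on each platform.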
7. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 10, 2010 7:26 PM (in response to clebert.suconic)

You will also need Maven on your path.
8. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 10, 2010 7:31 PM (in response to clebert.suconic)

I do have Maven on my path, but I still get the error. If I just type mvn and press Enter at the command line, I get a Maven message:

$ mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO]

You must specify at least one goal or lifecycle phase to perform build steps.
The following list illustrates some commonly used build commands:

mvn clean
    Deletes any build output (e.g. class files or JARs).
mvn test
    Runs the unit tests for the project.
mvn install
    Copies the project artifacts into your local repository.
mvn deploy
    Copies the project artifacts into the remote repository.
mvn site
    Creates project documentation (e.g. reports or Javadoc).

Please see
http://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html
for a complete description of available lifecycle phases.

Use "mvn --help" to show general usage information about Maven's command line.

[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: < 1 second
[INFO] Finished at: Wed Nov 10 17:30:34 MST 2010
[INFO] Final Memory: 3M/122M
[INFO] ------------------------------------------------------------------------
9. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 10, 2010 7:31 PM (in response to clebert.suconic)

In fact, many of the jars show up in the build jars dir, but it fails after a while with this error...
10. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 10, 2010 7:34 PM (in response to parmstrong)

The error you showed is just a regular maven message.

You just have to build with ./build.sh...

But you first need to edit build-maven.xml, replacing mvn with mvn.bat.
11. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 10, 2010 7:47 PM (in response to clebert.suconic)

Okay, it looks like it built after replacing mvn with mvn.bat. Now what do I need to do to replace the old HornetQ code in JBoss?
12. Re: Cluster messages not redistributed after node hard kill

clebert.suconic Nov 10, 2010 8:12 PM (in response to parmstrong)

Do an svn update (I just committed a fix for a minor XML typo that would have failed for you).

Execute ./build.sh distro

Look at ./build... You will see the same packages you would get from the download. You can use the same procedure you used to install your server.

You can also change the JARs manually, but you will also need to replace the config files.
13. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 12, 2010 12:15 PM (in response to clebert.suconic)

Okay, the build worked fine after changing mvn to mvn.bat. However, you said I can do the same thing I did with the distro version of HornetQ to get it integrated with JBoss. It looks like the bat file has changed location: I found it under <hornetqdir>\src\config\jboss-as-5 instead of <hornetqdir>\config\jboss-as-5. I then tried running it and immediately got errors:

ANT_HOME is ../../tools/ant
Found javac
chmod: cannot access `../../tools/ant/bin/ant': No such file or directory
Using the following ant version from ../../tools/ant:
../../bin/build.sh: line 73: ../../tools/ant/bin/ant: No such file or directory
../../bin/build.sh: line 75: ../../tools/ant/bin/ant: No such file or directory

I have tried moving folders around and have gotten a little further but there seem to be a variety of things out of place for this to run properly. What do I do to get this thing to run?
14. Re: Cluster messages not redistributed after node hard kill

parmstrong Nov 12, 2010 12:26 PM (in response to clebert.suconic)

I also did something really simple to reproduce the problem. I downloaded a fresh copy of JBoss 5 and ran the JBoss 5 config from HornetQ 2.1.1. I went to the deploy directory and renamed the hornetq folder to jms-ra.rar. I added a single queue to the hornetq-jms config:
    <queue name="testQueue">
        <entry name="/queue/testQueue"/>
    </queue>

I created a simple message driven bean and deployed it:

import java.util.logging.Logger; // assumption: the post does not show which Logger class was used

import javax.ejb.ActivationConfigProperty;
import javax.ejb.MessageDriven;
import javax.jms.Message;
import javax.jms.MessageListener;

@MessageDriven(mappedName = "simpleBean", activationConfig =
{
    @ActivationConfigProperty(propertyName = "acknowledgeMode", propertyValue = "Auto-acknowledge"),
    @ActivationConfigProperty(propertyName = "destinationType", propertyValue = "javax.jms.Queue"),
    @ActivationConfigProperty(propertyName = "consumerWindowSize", propertyValue = "0"),
    @ActivationConfigProperty(propertyName = "destination", propertyValue = "queue/testQueue")
}, messageListenerInterface = MessageListener.class)
public class SimpleMessageBean implements MessageListener
{
    private static Logger logger = Logger.getLogger(SimpleMessageBean.class.getName());

    // Logs each delivery so it is easy to see which node consumed the message.
    public void onMessage(Message inMessage) {
        logger.info("got message!");
    }
}

I ran two JBoss servers on different machines using the all config. They discovered each other fine. I wrote a very simple client to add messages to the queue:

public class MainFrame extends javax.swing.JFrame {

    /** Creates new form MainFrame */
    public MainFrame() {
        initComponents();
        this.setLocationRelativeTo(null);
        this.setTitle("Prometheus JMS Producer");
    }


    /** This method is called from within the constructor to
     * initialize the form.
     * WARNING: Do NOT modify this code. The content of this method is
     * always regenerated by the Form Editor.
     */

    // <editor-fold defaultstate="collapsed" desc="Generated Code">
    private void initComponents() {

        jLabel3 = new javax.swing.JLabel();
        jLabel1 = new javax.swing.JLabel();
        btnGo = new javax.swing.JButton();
        dpReconDate = new org.jdesktop.swingx.JXDatePicker();
        jLabel4 = new javax.swing.JLabel();
        cbServer = new javax.swing.JComboBox();
        jScrollPane1 = new javax.swing.JScrollPane();
        txtAccounts = new javax.swing.JTextArea();

        setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);

        jLabel3.setText("Recon Date");

        jLabel1.setText("Account IDs");

        btnGo.setFont(new java.awt.Font("Tahoma", 0, 10));
        btnGo.setText("Go");
        btnGo.addActionListener(new java.awt.event.ActionListener() {
            public void actionPerformed(java.awt.event.ActionEvent evt) {
                btnGoActionPerformed(evt);
            }
        });

        jLabel4.setText("JMS Server");

        cbServer.setEditable(true);
        cbServer.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "st-downloader1", "localhost" }));
        cbServer.setSelectedIndex(1);

        txtAccounts.setColumns(20);
        txtAccounts.setFont(new java.awt.Font("Monospaced", 0, 12)); // NOI18N
        txtAccounts.setRows(5);
        jScrollPane1.setViewportView(txtAccounts);

        javax.swing.GroupLayout layout = new javax.swing.GroupLayout(getContentPane());
        getContentPane().setLayout(layout);
        layout.setHorizontalGroup(
            layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
            .addGroup(layout.createSequentialGroup()
                .addContainerGap()
                .addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
                    .addComponent(btnGo, javax.swing.GroupLayout.Alignment.TRAILING)
                    .addGroup(layout.createSequentialGroup()
                        .addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
                            .addComponent(jLabel1)
                            .addComponent(jLabel3)
                            .addComponent(jLabel4))
                        .addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
                        .addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
                            .addComponent(jScrollPane1, javax.swing.GroupLayout.DEFAULT_SIZE, 235, Short.MAX_VALUE)
                            .addComponent(cbServer, javax.swing.GroupLayout.Alignment.TRAILING, 0, 235, Short.MAX_VALUE)
                            .addComponent(dpReconDate, javax.swing.GroupLayout.Alignment.TRAILING, javax.swing.GroupLayout.DEFAULT_SIZE, 235, Short.MAX_VALUE))))
                .addContainerGap())
        );
        layout.setVerticalGroup(
            layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
            .addGroup(layout.createSequentialGroup()
                .addContainerGap()
                .addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.TRAILING)
                    .addComponent(jLabel1, javax.swing.GroupLayout.Alignment.LEADING)
                    .addComponent(jScrollPane1, javax.swing.GroupLayout.DEFAULT_SIZE, 112, Short.MAX_VALUE))
                .addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
                .addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
                    .addComponent(dpReconDate, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
                    .addComponent(jLabel3))
                .addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
                .addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
                    .addComponent(jLabel4)
                    .addComponent(cbServer, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE))
                .addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
                .addComponent(btnGo)
                .addContainerGap())
        );

        pack();
    }// </editor-fold>

    private void btnGoActionPerformed(java.awt.event.ActionEvent evt)
    {
        try
        {
            String server = cbServer.getSelectedItem().toString();
            Properties jndiProps = new Properties();
            jndiProps.setProperty(javax.naming.Context.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
            jndiProps.setProperty(javax.naming.Context.PROVIDER_URL, "jnp://" + server + ":1099");
            javax.naming.Context context = (javax.naming.Context) new InitialContext(jndiProps);
            Queue queue = (Queue) context.lookup( "queue/testQueue" );
            QueueConnectionFactory factory = (QueueConnectionFactory) context.lookup("ConnectionFactory");

            QueueConnection conn = factory.createQueueConnection();
            QueueSession session = conn.createQueueSession( false, Session.AUTO_ACKNOWLEDGE );
            QueueSender sender = session.createSender( queue );
            ObjectMessage message = session.createObjectMessage();

            String text = txtAccounts.getText();
            String[] accts = text.split("\n");
            try
            {
                for(String acct : accts)
                {
                    Integer acctId = Integer.parseInt(acct.trim());
                    message.setObject(acctId);
                    sender.send( message, DeliveryMode.PERSISTENT, 5, Message.DEFAULT_TIME_TO_LIVE);
                }
            }
            catch(NumberFormatException ex)
            {
                JOptionPane.showMessageDialog(this, "You need to specify integers");
            }

            sender.close();
            session.close();
            conn.close();

        }
        catch(NumberFormatException ex)
        {
            JOptionPane.showMessageDialog(this, "You need to specify an integer");
        }
        catch(Exception ex)
        {
            JOptionPane.showMessageDialog(this, "Error while trying to connect to queue: " + ex.getMessage());
            throw new RuntimeException(ex);
        }
    }

    /**
    * @param args the command line arguments
    */
    public static void main(String args[]) throws Exception
    {
        // Use the native look and feel of the current platform
        UIManager.setLookAndFeel(UIManager.getSystemLookAndFeelClassName());

        java.awt.EventQueue.invokeLater(new Runnable() {
            public void run() {
                new MainFrame().setVisible(true);
            }
        });
    }

    // Variables declaration - do not modify
    private javax.swing.JButton btnGo;
    private javax.swing.JComboBox cbServer;
    private org.jdesktop.swingx.JXDatePicker dpReconDate;
    private javax.swing.JLabel jLabel1;
    private javax.swing.JLabel jLabel3;
    private javax.swing.JLabel jLabel4;
    private javax.swing.JScrollPane jScrollPane1;
    private javax.swing.JTextArea txtAccounts;
    // End of variables declaration

}

I ran the simple client, sent several messages, and watched the servers. The messages were load balanced as I would expect. Then I kill -9'ed one of the servers and sent some more messages from the client. Every other message showed up in the log of the surviving server and every other one just disappeared. Looking at the JMX console, I see that every other message makes it to the queue that the MDB is listening on, and every other one is added to another queue, which I assume is the queue created for the bridge between the servers. Since the other server is dead, I would expect the surviving server to start processing those messages itself, or at least to stop putting new messages into the bridge queue.
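
The "every other message" pattern is what strict round-robin routing produces when a dead target stays in the rotation. The toy model below illustrates that arithmetic only. It is an assumption about the symptom, not HornetQ's routing code, and every name in it (`RoundRobinModel`, `Target`, `route`, `stuck`) is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model with hypothetical names; it does not use or mirror HornetQ classes.
class RoundRobinModel {
    // A routing target: the local consumer, or the bridge queue towards a remote node.
    static final class Target {
        final String name;
        final boolean drained; // true when something actually consumes from this target
        final List<Integer> held = new ArrayList<>();

        Target(String name, boolean drained) {
            this.name = name;
            this.drained = drained;
        }
    }

    private final List<Target> targets = new ArrayList<>();
    private int next = 0;

    void add(Target target) {
        targets.add(target);
    }

    // Hands the message to the next target in strict rotation, drained or not.
    Target route(int messageId) {
        Target target = targets.get(next);
        next = (next + 1) % targets.size();
        target.held.add(messageId);
        return target;
    }

    // Counts messages sitting on targets that nothing is draining.
    int stuck() {
        int count = 0;
        for (Target target : targets) {
            if (!target.drained) {
                count += target.held.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        RoundRobinModel model = new RoundRobinModel();
        model.add(new Target("local MDB", true));
        model.add(new Target("bridge to killed node", false));
        for (int id = 1; id <= 6; id++) {
            System.out.println("message " + id + " -> " + model.route(id).name);
        }
        System.out.println("stuck: " + model.stuck());
    }
}
```

With two targets and one of them undrained, half of the messages end up parked, which matches what the JMX console shows for the bridge queue. The fix would be for the dead target to leave the rotation or for its messages to be moved back, but which of those HornetQ is meant to do here is the open question of this thread.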