68 Replies. Latest reply on Sep 13, 2010 9:09 AM by clebert.suconic
      • 30. Re: Journaling errors
        clebert.suconic

        I was doing the exact same type of test yesterday.

         

        Let me do more tests over the next few days.. and I will get back to you when I have it all cleared.

         

        Meanwhile, what changes have you made since your last test?

        • 31. Re: Journaling errors
          ronnys

          Hi Clebert,

          Meanwhile, what changes have you made since your last test?

           

          the changes in the test tool should not make any difference for HornetQ. The load test has just been enhanced - besides some minor other changes - to support multiple receivers to test diverts in the future and to track lost/duplicate messages and sequence errors in a better way. Attached for reference. It requires one or more property files to define the connection settings now. Sample attached as well. Hope that helps.
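
For readers without the attachment, here is a minimal sketch of how a receiver could tell lost, duplicate, and out-of-sequence messages apart. It assumes each producer stamps its messages with a consecutive integer sequence number starting at 0. The class and field names are hypothetical and are not taken from the attached load test.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceTracker:
    """Classifies sequence numbers received from one producer (hypothetical helper)."""
    expected: int = 0                            # next sequence number we expect
    missing: set = field(default_factory=set)    # skipped numbers not seen so far
    duplicates: int = 0
    out_of_order: int = 0

    def receive(self, seq: int) -> str:
        if seq == self.expected:
            self.expected += 1
            return "ok"
        if seq > self.expected:
            # Gap: remember the skipped numbers, they may still arrive late.
            self.missing.update(range(self.expected, seq))
            self.expected = seq + 1
            return "gap"
        if seq in self.missing:
            # Arrived after a higher number: a sequence error, not a loss.
            self.missing.discard(seq)
            self.out_of_order += 1
            return "out_of_order"
        # Below the expected number and not missing: already seen once.
        self.duplicates += 1
        return "duplicate"

    def lost(self) -> list:
        """Numbers never seen; only meaningful once the run has finished."""
        return sorted(self.missing)
```

One tracker per producer would be needed, since sequence numbers from different producers interleave on the same queue.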

          But I'm still accepting help if you have any other tests that will replicate it easier :-)

           

          I don't have any other means to replicate this issue, sorry. It might help to run the load test multiple times in parallel on different queues, but this is just speculation.

           

          Best regards,
          Ronny

          • 32. Re: Journaling errors
            clebert.suconic

            Thanks Ronny,

             

             

            I've been able to work well with the tests now.. I just wanted to check whether you had found a different test scenario.

             

             

            Let me handle this for some time.. and I will get back to you when I think everything is fixed.

             

             

             

             

            Thanks again

            • 33. Re: Journaling errors
              clebert.suconic

              Making this post just to report progress...

               

              I have fixed these errors now:

               

              [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312022 during compacting
              [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312023 during compacting
              [pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312024 during compacting
              [pool-4-thread-1] 19:10:43,031 WARNING [org.hornetq.core.journal.impl.JournalCompactor]  Couldn't find addRecord information for record 64312025 during compacting

               

              @Ronny, you don't need to run any tests yet. Let me run more tests and make sure everything is good first.

              • 34. Re: Journaling errors
                clebert.suconic
                @Ronny, you don't need to run any tests yet. Let me run more tests and make sure everything is good first.

                 

                 

                Actually.. it's better to refrain from doing any tests on trunk. I'm changing and testing Journal cleanup now after the changes I made.

                • 35. Re: Journaling errors
                  ronnys

                  Hi Clebert,

                   

                  ok, fine for me. Thanks a lot.

                   

                  Best regards,

                  Ronny

                  • 36. Re: Journaling errors
                    clebert.suconic

                    I started your test yesterday at 13:00, and also wrote another test that exercises cleanup as well.. both are still running.

                     

                    I'll make a few changes to cleanup tomorrow to make sure everything is rock solid. But I already have 30 hours of your test running, plus 30 hours of another test I wrote running on another computer. I guess we will be pretty good now.

                     

                    I didn't give up on testing it yet. As I said I want it bullet proof.

                    • 37. Re: Journaling errors
                      clebert.suconic

                      I really think this will be fixed now. I have a bunch of tests running for over 4 days. I have finished the changes I wanted to do on clean up.

                       

                      @Ronny I have executed your tests as part of my work for a long time.. but feel free to run them yourself as well, as there may be other aspects you want to test.

                      • 38. Re: Journaling errors
                        ronnys

                        Hi Clebert,

                         

                        I really think this will be fixed now. I have a bunch of tests running for over 4 days. I have finished the changes I wanted to do on clean up.

                         

                        @Ronny I have executed your tests as part of my work for a long time.. but feel free to run them yourself as well, as there may be other aspects you want to test.

                         

                        Thanks a lot for all your help! Despite your mail earlier last week, I built the latest version as of Friday afternoon my time (r9489, which appears to include all the needed fixes for the HornetQ server) in order to use the weekend for another load test. The test finished after 64h and 2.08 billion processed messages without any errors or lost or duplicate messages. I'm going to load test the diverts now, but I believe there will be no surprises.

                         

                        Excellent job!

                         

                        Thanks again, Best regards,

                        Ronny

                        • 39. Re: Journaling errors
                          ronnys

                          Hi Clebert,

                           

                          I'm going to load test the diverts now, but I believe there will be no surprises.

                           

                          Unfortunately still errors with diverts ... See here: http://community.jboss.org/message/555362#555362

                           

                          Best regards,

                          Ronny

                          • 40. Re: Journaling errors
                            azserve.luca

                            Hi,

                            in the last 10 days my HornetQ froze with 1 subscriber (there are 16 subscribers). I wasn't able to analyze the queue with JConsole, because when I clicked the queue JConsole froze too. I then restarted HornetQ and it ran correctly for some hours/days, and then it froze again with another subscriber... This happened 4 times. Today it froze again, but when I restarted HornetQ there was this error:

                             

                            [main] 14:02:51,429 SEVERE [org.hornetq.integration.bootstrap.HornetQBootstrapServer]  Failed to start server
                            java.lang.IllegalStateException: Incompletely deployed:

                             

                            DEPLOYMENTS IN ERROR:
                              Deployment "JMSServerManager" is in error due to: java.lang.IllegalStateException: Cannot find message 199768

                             

                                at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.internalValidate(AbstractKernelDeployer.java:278)
                                at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.validate(AbstractKernelDeployer.java:174)
                                at org.hornetq.integration.bootstrap.HornetQBootstrapServer.bootstrap(HornetQBootstrapServer.java:158)
                                at org.jboss.kernel.plugins.bootstrap.AbstractBootstrap.run(AbstractBootstrap.java:83)
                                at org.hornetq.integration.bootstrap.HornetQBootstrapServer.run(HornetQBootstrapServer.java:116)
                                at org.hornetq.integration.bootstrap.HornetQBootstrapServer.main(HornetQBootstrapServer.java:73)
                            Exception in thread "main" java.lang.IllegalStateException: Incompletely deployed:

                             

                            DEPLOYMENTS IN ERROR:
                              Deployment "JMSServerManager" is in error due to: java.lang.IllegalStateException: Cannot find message 199768

                             

                                at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.internalValidate(AbstractKernelDeployer.java:278)
                                at org.jboss.kernel.plugins.deployment.AbstractKernelDeployer.validate(AbstractKernelDeployer.java:174)
                                at org.hornetq.integration.bootstrap.HornetQBootstrapServer.bootstrap(HornetQBootstrapServer.java:158)
                                at org.jboss.kernel.plugins.bootstrap.AbstractBootstrap.run(AbstractBootstrap.java:83)
                                at org.hornetq.integration.bootstrap.HornetQBootstrapServer.run(HornetQBootstrapServer.java:116)
                                at org.hornetq.integration.bootstrap.HornetQBootstrapServer.main(HornetQBootstrapServer.java:73)

                             

                            I noticed a difference between the last case and the others: for the first time the journal directory was smaller than the previous time. Maybe the error occurs while shrinking the journal or deleting messages.

                            The journal was 1.3GB and is now about 400MB. The system only starts paging after 1.6GB, and in fact there is no page directory.

                             

                            I hope this information can help you.

                             

                            Thanks

                            • 41. Re: Journaling errors
                              clebert.suconic

                              This is what was fixed per https://jira.jboss.org/browse/HORNETQ-440

                               

                               

                              Perhaps you should build a distribution from trunk.  Trunk @ 9498 seems safe now.

                               

                               

                              I will be pushing a release in the next few days.

                               

                               

                               

                              I will also add a flag to ignore errors on startup, in case you already have a damaged journal. But that shouldn't happen again now, based on our extensive tests.

                              • 42. Re: Journaling errors
                                clebert.suconic
                                I will also add a flag to ignore errors on startup, in case you already have a damaged journal. But that shouldn't happen again now, based on our extensive tests.

                                 

                                 

                                Actually, I won't add such a flag.

                                 

                                 

                                If you need to repair the journal, you can now use the Export / Import tool. Export the journal as a text file, edit the file if you need to, and the import will rebuild the journal from the edited file, correcting any missing records.

                                 

                                More information here ATM: http://community.jboss.org/thread/154985

                                • 43. Re: Journaling errors
                                  clebert.suconic

                                  Luca, I evaluated the data you sent me, and I can assure you the bug was fixed on trunk.

                                   

                                  What happened was that compacting was feeding free files into the stream.. and it would eventually do so out of order.

                                   

                                   

                                  If you want to fix your data, you will need to use trunk, and export your data as:

                                   

                                  java -cp hornetq-core.jar:netty.jar org.hornetq.core.journal.impl.ExportJournal ./data/journal hornetq-data hq 2 10485760 /tmp/export.dmp

                                   

                                   

                                  If you look for the ID that was missing during load, you will see that the update is arriving before the append.

                                   

                                  The append will be on the next file.

                                   

                                   

                                  What you need to do is swap the positions of these two files in the export dmp,

                                   

                                   

                                  and then you can remove the journal data, and recreate it with this command (after a backup of course):

                                   

                                   

                                  java -cp hornetq-core.jar:netty.jar org.hornetq.core.journal.impl.ImportJournal ./data/journal hornetq-data hq 2 10485760 /tmp/export.dmp

                                   

                                   

                                  After you restart your system, the journal will be fixed and ready to be consumed.
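
To locate the two files that need swapping, one could scan the exported records in order and flag any ID whose update shows up before its add. The sketch below assumes the export has already been parsed into (file name, operation, record ID) tuples. That tuple layout and the operation names "add" and "update" are placeholders for illustration, not the actual ExportJournal output format.

```python
def find_updates_before_add(records):
    """Return {record_id: (file_with_update, file_with_add)} for every ID
    whose first update was seen before its add record.

    `records` is an iterable of (file_name, operation, record_id) tuples in
    the order they appear in the export (hypothetical, pre-parsed form).
    """
    added = set()
    early_updates = {}   # record_id -> file where the premature update was seen
    misplaced = {}
    for file_name, operation, record_id in records:
        if operation == "add":
            added.add(record_id)
            if record_id in early_updates:
                # The add arrived after an update: these two files are out of order.
                misplaced[record_id] = (early_updates.pop(record_id), file_name)
        elif operation == "update" and record_id not in added:
            early_updates.setdefault(record_id, file_name)
    return misplaced
```

IDs that have an update but no add anywhere in the export are not reported by this sketch; they would need separate handling.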

                                  • 44. Re: Journaling errors
                                    ronnys

                                    Hi Clebert,

                                     

                                    need to revive this thread ...

                                     

                                    I did some more tests today with a slightly modified load test client (attached), that should survive server outages in order to see what happens if the HornetQ server gets restarted while the client is running. Setup etc. remained the same, HornetQ version was r9522.

                                     

                                    In addition I wrote a small script (attached) that - in a loop - starts the HornetQ server, waits until it is up (by waiting for the "HornetQ server ... started" log message), stops it after 15s, waits until it is down (i.e. until the process vanished) and restarts it after another 15s. Starting/Stopping is done using the standard run.sh/stop.sh scripts. This is of course not a real use case, but it helps to identify potential problems in case HornetQ needs to be restarted as our client needs to rely on HornetQ.
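
The attached script is not reproduced here. As a rough sketch of the loop described above, the following assumes that the start command stays in the foreground until the server exits and that the server writes its log to a file. The log path, the marker text, and the helper names are all placeholders.

```python
import subprocess
import time


def wait_for_log_line(log_path, marker, start_offset=0, timeout=120.0, poll=0.5):
    """Poll log_path until `marker` appears at or after byte start_offset.

    Returns the end-of-file offset at the time of the match (usable as the
    next start_offset), or raises TimeoutError.
    """
    needle = marker.encode()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with open(log_path, "rb") as f:
                f.seek(0, 2)                  # jump to end to learn the size
                if f.tell() < start_offset:
                    start_offset = 0          # log was truncated or rotated
                f.seek(start_offset)
                if needle in f.read():
                    return f.tell()
        except FileNotFoundError:
            pass                              # log file not created yet
        time.sleep(poll)
    raise TimeoutError(f"{marker!r} not seen in {log_path}")


def restart_loop(start_cmd, stop_cmd, log_path, marker, cycles,
                 up_time=15.0, down_time=15.0):
    """Start, wait until up, stop after up_time, wait until gone, pause, repeat."""
    offset = 0
    for _ in range(cycles):
        server = subprocess.Popen(start_cmd)       # e.g. ["./run.sh"]
        offset = wait_for_log_line(log_path, marker, offset)
        time.sleep(up_time)
        subprocess.run(stop_cmd, check=True)       # e.g. ["./stop.sh"]
        server.wait()                              # returns once the process is gone
        time.sleep(down_time)
```

Here `marker` would be a fragment of the "HornetQ server ... started" log line. Tracking the byte offset keeps a start message left over from a previous cycle from being mistaken for the current one.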

                                     

                                    I assumed that the system should survive these controlled and clean start/stop actions without issues. However, I detected that the client received duplicate messages, received messages out of order, and even lost some messages. The server dumped several journaling errors as well; too many to report everything in detail. I also saw, for example, that an exception received while committing consumed messages (in a transacted session) does not mean that the messages that were ultimately (from the client's point of view) not committed are going to be re-received (this is something the client relies on); I had a case where these messages had evidently been committed and were not redelivered despite the exception. Maybe the same applies to the producer side. [EDIT: just saw that the latter is OK according to the JMS spec; will build a special test client to handle this]

                                     

                                    As you already have the right setup on your side, could you please run the updated loadtest client (or a client better suited to detect such errors) and use the attached script to constantly start/stop HornetQ? The load test parameters have not been changed.

                                     

                                    I'm planning to do some tests with kill SIGINT/SIGKILL as well; looks like HornetQ detects SIGINT and shuts down; don't know what happens to the journal etc. on SIGKILL. Is it built to automatically recover from hard shutdowns?

                                     

                                    Best regards,
                                    Ronny