10 Replies Latest reply on Mar 5, 2013 9:19 PM by clebert.suconic

    A paging issue

    gaohoward

      The issue could happen in the following case:

       

      Say we have a live and a backup in replication mode. A client sends 20 messages (m0 to m19) to an address in two batches, each sending 10 messages. And the paging setup is such that it causes messages to be paged in the following detail:

       

       

      1) first two (m0 and m1) are in the memory (not paged).

      2) next 5 messages (m2 to m6) are paged into file 1.page.

      3) next 3 messages (m7 to m9) are paged into file 2.page.

      4) next 5 messages (m10 to m14) are paged into file 3.page.

      5) next 5 messages (m15 to m19) are paged into file 4.page.

       

       

      At backup server those page files (1-4.page) are replicated to the backup's paging dir. Now if we let the following happen:

       

       

      1) create a client to receive and commit first 10 messages.

      2) crash live server immediately after.

      3) the client receives the next 10 messages after failover.

       

       

      After above step 1), 1.page and 2.page will be cleaned up. Accordingly the 1.page and 2.page at backup's paging store should be cleaned up too. However if step 2) comes too fast, the backup won't get a chance to receive the signal from live server to clean them up.

       

       

      Now if 1.page and 2.page are left at backup and the backup starts failover, it will load the paging files (all 1, 2, 3, 4.page in order) and re-calculate a valid page file to set up proper paging store. Due to a miscalculation, only 1.page is dropped as used page but 2.page (whose messages are all consumed before failover) are taking as a valid page and loaded.

       

       

      The result is that after failover the client will receive duplicated message (m7 to m9).

       

       

      I'll submit a fix for review soon.

        • 1. Re: A paging issue
          borges

          Seems like a bug that affects any servers (not just replicating backup).

           

          Great catch BTW.

          • 2. Re: A paging issue
            gaohoward

            That's correct. Thanks!

             

            Howard

            • 3. Re: A paging issue
              clebert.suconic

              Actually this is not a bug on the regular cases... The system is guaranteeing a delete of the page file before deleting any page completion records.

               

               

              For some reason the delete page complete record is arriving after the page-delete on the target replication... I'm doing a lot of debug on that.. I will need another day to figure out what's going on here.

              • 4. Re: A paging issue
                clebert.suconic

                Howard: did you find this because of a user's forum, or you were just debugging the testsuite?

                • 5. Re: A paging issue
                  gaohoward

                  Hi Clebert,

                   

                  I found this issue when I was trying to debug a test often failed on my laptop, also occassionally seen on jenkins run. The test is

                   

                  BackupSyncPagingTest.testReplicationDuringSync()

                   

                  When it fails it gives some error info like:

                   

                   

                  Error Message

                   

                  expected=0. Got: property['counter']=97 sendNumber=97 expected:<0> but was:<97>

                   

                  Stacktrace

                   

                  junit.framework.AssertionFailedError: expected=0. Got: property['counter']=97 sendNumber=97 expected:<0> but was:<97>

                      at junit.framework.Assert.fail(Assert.java:47)

                      at junit.framework.Assert.failNotEquals(Assert.java:280)

                      at junit.framework.Assert.assertEquals(Assert.java:64)

                      at junit.framework.Assert.assertEquals(Assert.java:198)

                      at org.hornetq.tests.util.ServiceTestBase.receiveMessages(ServiceTestBase.java:794)

                      at org.hornetq.tests.integration.cluster.failover.BackupSyncJournalTest.receiveMsgsInRange(BackupSyncJournalTest.java:308)

                      at org.hornetq.tests.integration.cluster.failover.BackupSyncJournalTest.testReplicationDuringSync(BackupSyncJournalTest.java:154)

                      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

                      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

                      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

                      at java.lang.reflect.Method.invoke(Method.java:601)

                      at junit.framework.TestCase.runTest(TestCase.java:164)

                      at junit.framework.TestCase.runBare(TestCase.java:130)

                      at junit.framework.TestResult$1.protect(TestResult.java:106)

                      at junit.framework.TestResult.runProtected(TestResult.java:124)

                      at junit.framework.TestResult.run(TestResult.java:109)

                      at junit.framework.TestCase.run(TestCase.java:120)

                      at junit.framework.TestSuite.runTest(TestSuite.java:230)

                      at junit.framework.TestSuite.run(TestSuite.java:225)

                      at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)

                      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

                      at java.lang.reflect.Method.invoke(Method.java:601)

                      at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:95)

                      at org.apache.maven.surefire.junit.JUnit3Provider.executeTestSet(JUnit3Provider.java:132)

                      at org.apache.maven.surefire.junit.JUnit3Provider.invoke(JUnit3Provider.java:109)

                      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

                      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

                      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

                      at java.lang.reflect.Method.invoke(Method.java:601)

                      at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208)

                      at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158)

                      at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86)

                      at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)

                      at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95)

                   

                   

                  After several rounds of wild guess and a incorrect fix (the one you denied about two weeks ago) I finally found this.

                   

                  Howard

                  • 6. Re: A paging issue
                    clebert.suconic

                    Thanks Howard...

                     

                    You did a great job here... We wouldn't been this far without your help... It's just that I need to go to the root cause.. the page was not supposed to be deleted before the complete record, and that's happening during replication for some reason, which I'm investigating now.

                     

                    I was just wondering if that was also found by an user... and I wanted to follow the discussion if that was the case.

                    • 7. Re: A paging issue
                      clebert.suconic

                      The issue is on the BackupDelayed. If the delayed is opened while the cleanup is happening on the live, the backup will have the records deleted and not the original page.

                       

                       

                      it could be an issue on the interceptor... but so far I think the sync process is not working as expected. I don't see any calls during the replication start to the disable reclaiming, and I don't see any safeguard about deleting a page...

                       

                       

                      Say what if the pageDelete comes from live while the page is not synced yet. that clearly seems to be broken.

                       

                       

                      We should write a few unit tests (I'm talking about real unit tests here, like we have on the journal) simulating these situations. The backup sync should happen like the Compactor is happening. i.e. stuff that happened during the sync needs to be effective after the sync is done. The way I see the code now that won't happen.

                       

                       

                      We should have some talk tomorrow about this.. but if you Francsico could take a look already...

                      • 8. Re: A paging issue
                        gaohoward

                        Say what if the pageDelete comes from live while the page is not synced yet. that clearly seems to be broken.

                        I didn't find that in my tests. The issue seems always happen when all 'live' page files has no consumed messages, in other words, there isn't a page file that has mixed records. Either a page is suitable for cleanup or a page is full of un-delivered (acked) messages in paging store when live server crashed and backup takes over.

                         

                        Say 1.page and 2.page are to be deleted by a scheduler on live and backup, if live crashes before backup clean up them, the backup on activation should load all page files and drop those page files (1.page and 2.page).

                        However what happens is that 1.page and 2.page are loaded, only 1.page are dropped. 2.page somehow is deemed as the current page for paging.

                        • 9. Re: A paging issue
                          clebert.suconic

                          Cleanup is happening as it is supposed... the issue that due to the delayed backup no working properly the storage records are being removed while the page file surviving, and a result you get records being created.

                           

                           

                          There's nothing wrong with the cleanup per se.. The issue is with sync.

                          • 10. Re: A paging issue
                            clebert.suconic

                            I'm actually debugging your tests for this. I'm just getting down a level.

                             

                             

                            Say 1.page and 2.page are to be deleted by a scheduler on live and backup, if live crashes before backup clean up them, the backup on activation should load all page files and drop those page files (1.page and 2.page).

                            However what happens is that 1.page and 2.page are loaded, only 1.page are dropped. 2.page somehow is deemed as the current page for paging.

                             

                             

                            If you look closely, those pages had a complete record on the journal. The record is disapperaing during the cleanup, and the page is surviving. all because of the delayed sync.

                             

                             

                            I had changed the test locally to execute the sync earlier and it was working fine.