-
1. Re: A paging issue
borges Feb 25, 2013 9:51 AM (in response to gaohoward)Seems like a bug that affects any servers (not just replicating backup).
Great catch BTW.
-
2. Re: A paging issue
gaohoward Feb 25, 2013 9:02 PM (in response to borges)That's correct. Thanks!
Howard
-
3. Re: A paging issue
clebert.suconic Mar 4, 2013 11:23 PM (in response to borges)Actually this is not a bug on the regular cases... The system is guaranteeing a delete of the page file before deleting any page completion records.
For some reason the delete page complete record is arriving after the page-delete on the target replication... I'm doing a lot of debug on that.. I will need another day to figure out what's going on here.
-
4. Re: A paging issue
clebert.suconic Mar 4, 2013 11:31 PM (in response to clebert.suconic)Howard: did you find this because of a user's forum, or you were just debugging the testsuite?
-
5. Re: A paging issue
gaohoward Mar 5, 2013 2:33 AM (in response to clebert.suconic)Hi Clebert,
I found this issue when I was trying to debug a test often failed on my laptop, also occassionally seen on jenkins run. The test is
BackupSyncPagingTest.testReplicationDuringSync()
When it fails it gives some error info like:
Error Message
expected=0. Got: property['counter']=97 sendNumber=97 expected:<0> but was:<97>
Stacktrace
junit.framework.AssertionFailedError: expected=0. Got: property['counter']=97 sendNumber=97 expected:<0> but was:<97>
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.failNotEquals(Assert.java:280)
at junit.framework.Assert.assertEquals(Assert.java:64)
at junit.framework.Assert.assertEquals(Assert.java:198)
at org.hornetq.tests.util.ServiceTestBase.receiveMessages(ServiceTestBase.java:794)
at org.hornetq.tests.integration.cluster.failover.BackupSyncJournalTest.receiveMsgsInRange(BackupSyncJournalTest.java:308)
at org.hornetq.tests.integration.cluster.failover.BackupSyncJournalTest.testReplicationDuringSync(BackupSyncJournalTest.java:154)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at junit.framework.TestCase.runTest(TestCase.java:164)
at junit.framework.TestCase.runBare(TestCase.java:130)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:120)
at junit.framework.TestSuite.runTest(TestSuite.java:230)
at junit.framework.TestSuite.run(TestSuite.java:225)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.maven.surefire.junit.JUnitTestSet.execute(JUnitTestSet.java:95)
at org.apache.maven.surefire.junit.JUnit3Provider.executeTestSet(JUnit3Provider.java:132)
at org.apache.maven.surefire.junit.JUnit3Provider.invoke(JUnit3Provider.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:208)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:158)
at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:86)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:153)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:95)
After several rounds of wild guess and a incorrect fix (the one you denied about two weeks ago) I finally found this.
Howard
-
6. Re: A paging issue
clebert.suconic Mar 5, 2013 10:38 AM (in response to gaohoward)Thanks Howard...
You did a great job here... We wouldn't been this far without your help... It's just that I need to go to the root cause.. the page was not supposed to be deleted before the complete record, and that's happening during replication for some reason, which I'm investigating now.
I was just wondering if that was also found by an user... and I wanted to follow the discussion if that was the case.
-
7. Re: A paging issue
clebert.suconic Mar 5, 2013 8:45 PM (in response to clebert.suconic)The issue is on the BackupDelayed. If the delayed is opened while the cleanup is happening on the live, the backup will have the records deleted and not the original page.
it could be an issue on the interceptor... but so far I think the sync process is not working as expected. I don't see any calls during the replication start to the disable reclaiming, and I don't see any safeguard about deleting a page...
Say what if the pageDelete comes from live while the page is not synced yet. that clearly seems to be broken.
We should write a few unit tests (I'm talking about real unit tests here, like we have on the journal) simulating these situations. The backup sync should happen like the Compactor is happening. i.e. stuff that happened during the sync needs to be effective after the sync is done. The way I see the code now that won't happen.
We should have some talk tomorrow about this.. but if you Francsico could take a look already...
-
8. Re: A paging issue
gaohoward Mar 5, 2013 9:06 PM (in response to clebert.suconic)Say what if the pageDelete comes from live while the page is not synced yet. that clearly seems to be broken.
I didn't find that in my tests. The issue seems always happen when all 'live' page files has no consumed messages, in other words, there isn't a page file that has mixed records. Either a page is suitable for cleanup or a page is full of un-delivered (acked) messages in paging store when live server crashed and backup takes over.
Say 1.page and 2.page are to be deleted by a scheduler on live and backup, if live crashes before backup clean up them, the backup on activation should load all page files and drop those page files (1.page and 2.page).
However what happens is that 1.page and 2.page are loaded, only 1.page are dropped. 2.page somehow is deemed as the current page for paging.
-
9. Re: A paging issue
clebert.suconic Mar 5, 2013 9:16 PM (in response to gaohoward)Cleanup is happening as it is supposed... the issue that due to the delayed backup no working properly the storage records are being removed while the page file surviving, and a result you get records being created.
There's nothing wrong with the cleanup per se.. The issue is with sync.
-
10. Re: A paging issue
clebert.suconic Mar 5, 2013 9:19 PM (in response to gaohoward)I'm actually debugging your tests for this. I'm just getting down a level.
Say 1.page and 2.page are to be deleted by a scheduler on live and backup, if live crashes before backup clean up them, the backup on activation should load all page files and drop those page files (1.page and 2.page).
However what happens is that 1.page and 2.page are loaded, only 1.page are dropped. 2.page somehow is deemed as the current page for paging.
If you look closely, those pages had a complete record on the journal. The record is disapperaing during the cleanup, and the page is surviving. all because of the delayed sync.
I had changed the test locally to execute the sync earlier and it was working fine.