-
15. Re: Journaling errors
clebert.suconic Jul 19, 2010 3:44 PM (in response to clebert.suconic)>> Are you starting the test with a clean directory?
I could answer that question myself as your test is validating sequences.. so you can only be doing it with a cleaned journal / data directory.
I still have the other two questions:
- How Long
- Are you sure you are using the updated version
-
16. Re: Journaling errors
ronnys Jul 20, 2010 2:49 AM (in response to clebert.suconic)Hi Clebert,
>> Are you starting the test with a clean directory?
I could answer that question myself as your test is validating sequences.. so you can only be doing it with a cleaned journal / data directory.
Yes.
- How Long
How long did it take to fail? It stopped processing messages completely after 22h, but the first errors (the ones I posted Friday afternoon CET) occured with 1 or 2 hours. Or have you meant something else?
- Are you sure you are using the updated version
Yes. I explicitely built it from your SVN trunk and the errors are different now.
Best regards,
Ronny
-
17. Re: Journaling errors
clebert.suconic Jul 20, 2010 3:02 AM (in response to ronnys)I can't replicate the ones you posted Friday your afternoon.
As a matter of fact... I just woke up and came to look at the test... everything is fine after 10 Hours run. and that's about the third time I put the test to run.
If you could please double check the version you posted (configs.. etc) against your latest post
-
18. Re: Journaling errors
ronnys Jul 20, 2010 4:20 AM (in response to clebert.suconic)I can't replicate the ones you posted Friday your afternoon.
As a matter of fact... I just woke up and came to look at the test... everything is fine after 10 Hours run. and that's about the third time I put the test to run.
If you could please double check the version you posted (configs.. etc) against your latest post
The posted data looks fine. I just checked out the latest trunk revision (9426) and will run the test again over night my time.
-
19. Re: Journaling errors
clebert.suconic Jul 21, 2010 1:58 AM (in response to ronnys)One important information for me: What OS version are you using?
Can you also describe what kind of hardware you're testing on please? Maybe we have a failure that won't happen on a XYZ combination.
-
20. Re: Journaling errors
ronnys Jul 21, 2010 2:57 AM (in response to clebert.suconic)Tested the latest trunk revision tonight. The test is still running, but a lot of error messages have been created. Log file attached.
One important information for me: What OS version are you using?
SUSE Linux Enterprise Server 10 SP2 (x86_64); Kernel 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008
Can you also describe what kind of hardware you're testing on please? Maybe we have a failure that won't happen on a XYZ combination.
Dell PowerEdge R200 w/ 1 x Intel Xeon X3360 (Quad core), 8 Gb memory
-
hornetq.log.0.gz 84.8 KB
-
-
21. Re: Journaling errors
clebert.suconic Jul 21, 2010 3:11 AM (in response to ronnys)Thanks Ronny,
I did a lot of tests on trying to replicate it.. and I didn't have much success. I may try a different environment with more hardware power... allthough the one I have now is pretty much similar to yours, only difference is I'm using Fedora. But the kernel version and CPU power is similar to yours.. (that's why I asked what env you had).
I also added a stress test that will play all the possible combinations with adds, updates.. tx.. etc and starting compacting, to make sure there isn't any tricky situation I didn't handle. It's passing
I also, changed an internal writing operation on compacting, to avoid any possible races on the OS with writing. I don't have any indication the problem is here... but the change won't hurt anyway.
I will leave the test running here again.. and I will see how it behaves overnight. If you could do the same... (please make sure you do a ./build.sh clean after the update... just to be safe)
Thanks for the help
Clebert
-
22. Re: Journaling errors
ronnys Jul 21, 2010 5:22 AM (in response to clebert.suconic)Hi Clebert,
I did a lot of tests on trying to replicate it.. and I didn't have much success. I may try a different environment with more hardware power... allthough the one I have now is pretty much similar to yours, only difference is I'm using Fedora. But the kernel version and CPU power is similar to yours.. (that's why I asked what env you had).
I also added a stress test that will play all the possible combinations with adds, updates.. tx.. etc and starting compacting, to make sure there isn't any tricky situation I didn't handle. It's passing
I also, changed an internal writing operation on compacting, to avoid any possible races on the OS with writing. I don't have any indication the problem is here... but the change won't hurt anyway.
I will leave the test running here again.. and I will see how it behaves overnight. If you could do the same... (please make sure you do a ./build.sh clean after the update... just to be safe)
Ok, will keep the test running. Btw, the likelihood of this issue to appear might be related to the disk I/O subsystem performance. We ran the test on an SSD with less (but still some) errors. It might help if you can somehow limit the disk I/O bandwidth for the HornetQ process if that's possible.
Best regards,
Ronny
-
23. Re: Journaling errors
clebert.suconic Jul 21, 2010 7:16 PM (in response to ronnys)a good news is I could replicate the error.. and I found a bug on readJournal that is used on compactor. That could lead to errors on compacting.
I just committed this fix and I am putting your test to run again. Lets see.
-
24. Re: Journaling errors
clebert.suconic Jul 22, 2010 1:51 AM (in response to clebert.suconic)I'm 5 hours on this test.. so far so good.
I am leaving the test running overnight on three different computers.
- My Laptop (limited disk)
- I have a workstation for tests at home.. 4G, SCSI disk... 8 cores .. so a medium box.
- one instance of my performance lab (a powerful box with RAID disks... and a much faster CPU)
Lets see how it goes overnight.
-
25. Re: Journaling errors
ronnys Jul 22, 2010 5:00 AM (in response to clebert.suconic)Thanks a lot!
The test with revision 9426 looks good as well: There were some errors logged this night, but no new kinds of errors (I'm attaching the log file anyway). The test is still running after ~ 1.35 billion messages, no messages reported as lost or duplicate yet. I'm going to stop it now for some other tests and will start it again this afternoon with an updated load test tool and the latest HornetQ revision.
-
hornetq.log.0.gz 143.4 KB
-
-
26. Re: Journaling errors
clebert.suconic Jul 22, 2010 5:43 AM (in response to ronnys)Those are definitely gone:
java.nio.BufferUnderflowException
at java.nio.Buffer.nextGetIndex(Buffer.java:474)
at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:205)
at org.hornetq.core.journal.impl.JournalImpl.readJournalFile(JournalImpl.java:501)
at org.hornetq.core.journal.impl.JournalImpl.compact(JournalImpl.java:1536)
at org.hornetq.core.journal.impl.JournalImpl$5.run(JournalImpl.java:2334)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
[Thread-3 (group:HornetQ-server-threads1623247029-1609377764)] 19:35:00,768 WARNING [org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl] Error on deleting duplicate cache
java.lang.IllegalStateException: Cannot find add info 68065214
at org.hornetq.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:918)
at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteDuplicateID(JournalStorageManager.java:542)
at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.addToCacheInMemory(DuplicateIDCacheImpl.java:177)
at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl.access$000(DuplicateIDCacheImpl.java:42)
at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.process(DuplicateIDCacheImpl.java:218)
at org.hornetq.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.afterCommit(DuplicateIDCacheImpl.java:238)
at org.hornetq.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:436)
at org.hornetq.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:275)
at org.hornetq.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:207)
at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:96)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)I couldn't yet replicate those ones, but they might be related:
[pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312022 during compacting
[pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312023 during compacting
[pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312024 during compacting
[pool-4-thread-1] 19:10:43,031 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312025 during compactingAlso: you will need to update the client libraries on your next tests also.
-
27. Re: Journaling errors
clebert.suconic Jul 22, 2010 11:43 PM (in response to clebert.suconic)I'm still expecting those at your tests:
[pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312022 during compacting
[pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312023 during compacting
[pool-4-thread-1] 19:10:43,030 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Couldn't find addRecord information for record 64312024 during compactingAs I could replicate it. However it took me 10 hours last time.
I'm debugging it now.. have added some logging to my copy and lets see if I can get any info. I'm sure I will fix it.. but it could take me a another day or two as this takes time to be replicated. (I feel like a scientist putting his mice to run and take results.. it just takes time).
But I'm still accepting help if you have any other tests that will replicate it easier :-)
-
28. Re: Journaling errors
clebert.suconic Jul 22, 2010 11:43 PM (in response to clebert.suconic)BTW: those *seem* harmless so far. But I want to understand it and fix them.
-
29. Re: Journaling errors
ronnys Jul 23, 2010 8:48 AM (in response to clebert.suconic)Hi Clebert,
just ran my load test against r9453 (I used r9457 before, but this one introduces some
other issues; the client was not always able to connect to the server thru the looked
up connection factory and messages were disappearing). Unfortunately, r9453 has some
issues, too; especially a memory leak. Log file attached.r9453 fails for me with an OOM error after ~30min; I ran the test against r9426 without
OOM error to verify that the updated test isn't causing the issue. A leak report is
attached; I can send you the heap dump as well if you've a FTP/SFTP location I can
drop it to.Best regards & have a nice weekend,
Ronny-
hornetq.log.0.zip 1.7 KB
-
java_pid23636_Leak_Suspects.zip 73.1 KB
-