Journaling errors| JBoss.org Content Archive (Read Only)

45. Re: Journaling errors

clebert.suconic Aug 16, 2010 10:33 AM (in response to ronnys)

I fixed an error that would cause duplicates during compacting (When doing the investigations about this thread), but that was fixed after r9522.

I will run your tests also. Give me an hour or so and I will let you know.

"I'm planning to do some tests with kill SIGINT/SIGKILL as well; looks like HornetQ detects SIGINT and shuts down; don't know what happens to the journal etc. on SIGKILL. Is it built to automatically recover from hard shutdowns?"

Yes.. it's built to recover from hard shutdowns.

46. Re: Journaling errors

clebert.suconic Aug 16, 2010 7:38 PM (in response to ronnys)

Your test is currently invalid.

You would need to play with XA and things like that to make it valid.

When you call session.commit();

You could have a failure at the time where the server was supposed to answer the commit.

You would need to use XA and play with recover to make sure the commit was accepted by the server. (The same scenario woud apply to kill)

BTW: I'm not suggestion you to use XAResource directly. That would be job for a TransactionManager. But I will see if i can change your test.

The bug I fixed wouldn't show up at your test. (It would be a rare occasion where the ACK was rolled back, compacting cleared this record and you restarted the server right after compacting was finished).

47. Re: Journaling errors

clebert.suconic Aug 17, 2010 1:18 AM (in response to clebert.suconic)

That would be an interesting test to do, but if you want to validate message receiving over shutdowns / kills, the test has even more basic stuff broken.

For example, when you reconnect, you're recreating the consumer, hence the messages will be received again, and you're not clearing the batch messages on consumer.

I have tried to change a couple of things on the test.. but I couldn't do finish today.

I will attach it here to give you an idea of what I wanted to change.

LoadTest.java.zip 7.8 KB

48. Re: Journaling errors

ronnys Aug 17, 2010 10:21 AM (in response to clebert.suconic)

Hi Clebert,

thanks a lot for finding this out and for the changes.

I wasn't aware that I'd need to use XA transactions for this. This is somehow out of scope for the planned use case. I'll create another test program that will handle errors and duplicates on the client side; this might be sufficient.

Best regards,

Ronny

49. Re: Journaling errors

ronnys Aug 18, 2010 8:41 AM (in response to ronnys)

Hi Clebert,

just ran some tests with the new client. It is as said handling errors while sending/receiving/committing batches by its own by simply retrying and resending the failed batch on the sender side and handling duplicates on the receiving side. Nothing special. Hope you agree this is ok. Received duplicates are ok (and expected as generated by the client) and could be filtered, lost messages are not ok. Code is attached, tool can be started like this (with the same JNDI properties files the older tool is using):

MultiThreadedLoadTest sender.properties receiver.properties stopfile 1750000 2250000 10 2048 5000 true

When running it against HornetQ 2.1.2.final started/stopped constantly with the previously posted script (with just a minor modification of keeping the server up for 60s instead of 15s to give it more time to start up in case it is doing something else in the background) I observed the following:

The first few restarts usually happen without any issues. After a few restarts I get different journaling log entries in the HornetQ logs (INFO to SEVERE; some NPEs): attached. Don't know if they're expected despite of the clean shutdown. After a while (1-2 hours) and a bunch of restarts later the client starts to report missing messages ... It looks like the issue is building up over time; could it be that the server takes longer and longer to perform some journaling cleanups after it signalled its finished startup and gets finally interrupted by the stop request at an inconvenient time? I saw even cases, where the client did not receive any messages while the server as up and running; the next server restart fixed this.

Hope this helps,

Best regards,

Ronny

testclient.zip 8.1 KB
hornetq_logs.tar.gz 481.8 KB

50. Re: Journaling errors

clebert.suconic Aug 18, 2010 7:03 PM (in response to ronnys)

this seems a different issue than what we have been working before.

I couldn't replicate the issue with your test, as my disk was set to sync, what limited rate from the test...

I will work on it next week.

51. Re: Journaling errors

ronnys Aug 19, 2010 2:46 AM (in response to clebert.suconic)

Hi Clebert,

ok, thanks a lot. Btw, I see the same NPEs reported yesterday

[JournalImpl::CompactorExecutor] 15:05:45,964 WARNING [org.hornetq.core.journal.impl.JournalCompactor] Error replaying pending commands after compacting
java.lang.NullPointerException
    at org.hornetq.core.journal.impl.JournalCompactor$DeleteCompactCommand.execute(JournalCompactor.java:476)
    at org.hornetq.core.journal.impl.JournalCompactor.replayPendingCommands(JournalCompactor.java:234)
    at org.hornetq.core.journal.impl.JournalImpl.compact(JournalImpl.java:1630)
    at org.hornetq.core.journal.impl.JournalImpl$6.run(JournalImpl.java:2396)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

when running the multithreaded client against the divert. In order to run the test you'd need to run

MultiThreadedLoadTest sender.properties receiver1.properties,receiver2.properties stopfile 1750000 2250000 10 2048 5000 true

where the sender and receiver properties files are the ones attached here:

http://community.jboss.org/message/555362#555362

However, as far as I can see, there is no immediate impact on the client. It does not report any lost or duplicate messages.

Best regards,

Ronny

52. Re: Journaling errors

timfox Aug 19, 2010 4:48 AM (in response to clebert.suconic)

Clebert Suconic wrote:

I couldn't replicate the issue with your test, as my disk was set to sync,

If the user is running with disk write cache enabled, then testing for data loss is pointless.

Data may be lost in event of crash, even if HornetQ is doing everything properly! This is discussed in the user manual, persistence chapter.

53. Re: Journaling errors

timfox Aug 19, 2010 5:26 AM (in response to timfox)

I also noticed the user's test program is not doing any duplicate detection on either send or consume transactions.

If the server fails after a commit is sent from client but before it has finished processing on the server, the tx won't be committed and the client will have no way of determining whether the tx is committed or not. This is a standard (non HornetQ specific) issue with any RPC protocol.

HornetQ implements automatic duplicate detection for sends (see user manual). For consumes you will have to code it yourself.

54. Re: Journaling errors

ronnys Aug 19, 2010 5:37 AM (in response to timfox)

Tim Fox wrote:

If the user is running with disk write cache enabled, then testing for data loss is pointless.

Data may be lost in event of crash, even if HornetQ is doing everything properly! This is discussed in the user manual, persistence chapter.

I'm starting/stopping HornetQ using its run.sh/stop.sh scripts. HornetQ is neither being killed nor do I restart the machine, it is (hopefully) a clean shutdown from HornetQ's point of view. No data should be lost under these circumstances.

Tim Fox wrote:

I also noticed the user's test program is not doing any duplicate detection on either send or consume transactions.

If the server fails after a commit is sent from client but before it has finished processing on the server, the tx won't be committed and the client will have no way of determining whether the tx is committed or not. This is a standard (non HornetQ specific) issue with any RPC protocol.

HornetQ implements automatic duplicate detection for sends (see user manual). For consumes you will have to code it yourself.

The second client (attached to this posting: http://community.jboss.org/message/557900#557900) performs custom duplication checking to handle the failure-on-commit scenario you mentioned.

55. Re: Journaling errors

timfox Aug 19, 2010 5:59 AM (in response to ronnys)

Ronny Schuetz wrote:

I'm starting/stopping HornetQ using its run.sh/stop.sh scripts. HornetQ is neither being killed nor do I restart the machine, it is (hopefully) a clean shutdown from HornetQ's point of view. No data should be lost under these circumstances.

That's right. I assumed you were killing the server.

56. Re: Journaling errors

ronnys Aug 19, 2010 6:16 AM (in response to ronnys)

Hi again,

Ronny Schuetz wrote:

when running the multithreaded client against the divert. In order to run the test you'd need to run

MultiThreadedLoadTest sender.properties receiver1.properties,receiver2.properties stopfile 1750000 2250000 10 2048 5000 true

where the sender and receiver properties files are the ones attached here:

http://community.jboss.org/message/555362#555362

However, as far as I can see, there is no immediate impact on the client. It does not report any lost or duplicate messages.

The same test dumps warnings like that (no stacktrace) if HornetQ is being configured to use 100k journaling files (I actually wanted to test that ...)

[JournalImpl::CompactorExecutor] 11:44:44,041 WARNING [org.hornetq.core.journal.impl.JournalImpl] Couldn't find tx=9317246 to merge after compacting

Best regards,

Ronny

Message was edited by: Ronny Schuetz; note about the 100k journal files added ... forgot that ...

57. Re: Journaling errors

clebert.suconic Aug 19, 2010 11:51 AM (in response to ronnys)

I have debugged this warning and it is harmless. It only happened with the small file size, and it happened because a file was cleared up, but the transaction that's been complained would have been already cleared.

but I will eliminate the condition that let this message happen on the next version.

58. Re: Journaling errors

ronnys Aug 20, 2010 2:59 AM (in response to clebert.suconic)

Hi Clebert,

great! One more I got with the same test (i.e. without server restarts; just 100k journaling files and the multithreaded loadtest testing diverts) running overnight:

[Old I/O server worker (parentId: 1299509292, channelId: 653327464, null => ttwsrvdev02.dev.ttw/10.115.151.40:61445)] 22:42:28,177 SEVERE [org.hornetq.core.protocol.core.ServerSessionPacketHandler] Caught unexpected exception
java.lang.IllegalStateException: 1187155943 Could not find reference on consumerID=0, messageId = 42297410 queue = jms.queue.msgduplicate_divert_outbound_2 closed = true
        at org.hornetq.core.server.impl.ServerConsumerImpl.acknowledge(ServerConsumerImpl.java:512)
        at org.hornetq.core.server.impl.ServerSessionImpl.acknowledge(ServerSessionImpl.java:513)
        at org.hornetq.core.protocol.core.ServerSessionPacketHandler.handlePacket(ServerSessionPacketHandler.java:290)
        at org.hornetq.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:471)
        at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:451)
        at org.hornetq.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:412)
        at org.hornetq.core.remoting.server.impl.RemotingServiceImpl$DelegatingBufferHandler.bufferReceived(RemotingServiceImpl.java:459)
        at org.hornetq.core.remoting.impl.netty.HornetQChannelHandler.messageReceived(HornetQChannelHandler.java:67)
        at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:100)
        at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)
        at org.jboss.netty.channel.StaticChannelPipeline$StaticChannelHandlerContext.sendUpstream(StaticChannelPipeline.java:514)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:287)
        at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.decode(HornetQFrameDecoder2.java:169)
        at org.hornetq.core.remoting.impl.netty.HornetQFrameDecoder2.messageReceived(HornetQFrameDecoder2.java:134)
        at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
        at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:362)
        at org.jboss.netty.channel.StaticChannelPipeline.sendUpstream(StaticChannelPipeline.java:357)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
        at org.jboss.netty.channel.socket.oio.OioWorker.run(OioWorker.java:90)
        at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
        at org.jboss.netty.util.VirtualExecutorService$ChildExecutorRunnable.run(VirtualExecutorService.java:181)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
[Old I/O server worker (parentId: 1299509292, channelId: 653327464, null => ttwsrvdev02.dev.ttw/10.115.151.40:61445)] 22:42:28,194 SEVERE [org.hornetq.core.server.impl.ServerSessionImpl] There is no consumer with id 0
[hornetq-failure-check-thread] 22:43:30,724 WARNING [org.hornetq.core.protocol.core.impl.RemotingConnectionImpl] Connection failure has been detected: Did not receive ping from /10.115.151.40:64116. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]

The client recorded connectivity issues (... org.hornetq.core.logging.impl.JULLogDelegate ... WARNING: Connection failure has been detected: Did not receive data from server for org.hornetq.core.remoting.impl.netty.NettyConnection@70f12110[...] [code=3]) 15s before the exception above (client and server are running on the same box), the client thread connected to jms.queue.msgduplicate_divert_outbound_2 received duplicates at the time of the exception and failed to commit received messages a few milliseconds later. But besides of this, the whole client obviously recovered afterwards and is still running and working.

Best regards,

Ronny

59. Re: Journaling errors

clebert.suconic Aug 23, 2010 10:23 AM (in response to ronnys)

Ronny, I'm working full time on this as you know..

And I'm having a hard time to replicate the exact scenario you're having.

I know what could cause the issue, so I would be able to replicate it.. but since you already did it.. what property files were you using? still the ones from the divert case, and are you using divert configuration on the server to make this happen?

I ran your test already for a long period without getting issues.