Start by looking at the print-data and print-pages output... that file may be easier to share.
I'm wondering what caused that.
Where can I find those? Any pointers on how to analyze them?
Thanks a lot
So it may or may not be related, but something weird showed up in the logs:
15:29:38,814 ERROR QueueImpl:66 - Failed to deliver
Caused by: java.lang.NullPointerException
... 8 more
Does it ring a bell?
Also, I've tried PrintPages and PrintData, but I could not understand what was what.
Thanks in advance
There were a few fixes beyond 2.2.21...
https://issues.jboss.org/browse/JBPAPP-10338 paging may lose messages if acks are done out of order during restart
It seems it would be a good idea to move to the latest available version, even if you use git to get it (in case you don't have an EAP subscription).
It seems that the page file was deleted, maybe while the journal was still showing some acks in a previous page? I would need your data to know what's going on (maybe you could send the print-data and print-pages outputs to me). Take a look at the properties before you send them, to make sure you're not sending anything you shouldn't... although I never look at any of that data.
My email is fairly easy to find, for sending the print-data and print-pages outputs,
and you can even write to me in Portuguese if you like... from what I see you're Brazilian as well.
Hi Clebert, thanks for all the attention. The previous NPE was a result of my moving files around, I think, so I guess we can safely ignore it (although an NPE is, IMO, always a bug).
I'm back to square one: messages are stuck in the queue and are never delivered (I also cannot move/expire them using JMX, so something is definitely wrong). It happened again on Monday on a non-critical queue, so it was not a big deal.
I also get some instances of
12:35:50,790 WARN PageCursorProviderImpl:76 - Couldn't complete cleanup on paging
java.lang.IllegalStateException: Journal must be loaded first
while it is starting up, and it looks like a race condition (but I don't think it has anything to do with my problem anyway).
I'm checking out from git to try it out, and I will send the print-data and print-pages output if it doesn't work.
Thank you so much
Oh, and yes, I am Brazilian, and I think we went to the same college (IME-USP), but a couple of years apart.
You should probably upgrade to the latest version I told you about ASAP. There were some bug fixes on paging.
It's hard for me to pinpoint the cause without looking at your data. But the upgrade will definitely fix things.
It would also be great to look at the full logs... there was probably an error before that.
And regarding the NPE... the cached position was supposed to be there. I could of course handle the NPE and throw another exception, but there would be an error regardless if you move files.
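For finding an earlier error in the full logs, here is a minimal sketch in plain Java. It assumes the log uses the same `HH:mm:ss,SSS LEVEL Class:line - message` layout as the lines quoted in this thread and that everything falls within a single day (the timestamps carry no date). The class name `LogScan` and the method `problemsBefore` are made up for illustration; they are not part of HornetQ.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a HornetQ class.
public class LogScan {
    // Matches lines such as "15:29:38,814 ERROR QueueImpl:66 - Failed to deliver".
    private static final Pattern ENTRY =
        Pattern.compile("^(\\d{2}:\\d{2}:\\d{2},\\d{3})\\s+(WARN|ERROR)\\s+(.*)$");

    /** Returns the WARN/ERROR entries whose timestamp is strictly earlier than the cutoff. */
    public static List<String> problemsBefore(List<String> lines, String cutoff) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            // Fixed-width HH:mm:ss,SSS timestamps sort correctly as plain strings.
            if (m.matches() && m.group(1).compareTo(cutoff) < 0) {
                hits.add(line);
            }
        }
        return hits;
    }

    // Usage: java LogScan <log-file> <cutoff-timestamp>
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String hit : problemsBefore(lines, args[1])) {
            System.out.println(hit);
        }
    }
}
```

Running it with the timestamp of the "Failed to deliver" line as the cutoff lists every warning and error logged before it; stack trace lines are skipped because they do not start with a timestamp.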
@Marcelo: Why don't you come online on IRC so we can talk about this?
You should definitely move to the newer version. But I'm interested in learning what happened, to make sure it won't happen again. You are probably using some use case that wasn't planned. (Are you using filters on paging, for instance? If you are, you should move even sooner.)
OK, so I'm back. I could join IRC, but I don't want to bother you so much.
At any rate, I did some experiments and here are the results (still using 2.2.21.Final):
-the queue getting stuck certainly has something to do with paging, because if I clear the paging data, everything almost works again, except that messages that are journaled are not consumed by the message listeners
-the journal files are loaded into memory but are never collected (and I suspect it may cause OOMs down the line, but I am unsure) and, like I said above, the messages are never consumed and never leave the heap, even after a full GC
Also, I've tried the latest version from the 2.2.EAP branch (namely 2.2.24.EAP.snapshot) and the "queue stuck" issue is definitely gone, but the other issue (a queue with X messages in its metadata and using a lot of heap) is still there, and it may be related to the exception below:
16:43:24,439 WARN PageSubscriptionImpl:76 - Error while deleting page-complete-record
java.lang.IllegalStateException: Journal must be loaded first
Can I (from a license point of view) use this version even though it's marked EAP?
Also, do you need more information? Should I still send you the print-data and print-pages output (they're pretty huge, though)?
Thanks a lot
And I forgot to mention, but I don't use any fancy features and my use case should be pretty trivial:
I have a small cluster (2 nodes at the moment) with 6 queues (including DLQ) and 1 topic (just for statistics broadcasting) and all of them are clustered.
Each queue (except DLQ) has a number of consumers (64 currently). The producers usually put an average of 300 messages per second (and rarely over 600) on all the queues. The messages have almost identical bodies of at most 500 bytes each and are sent transacted (so it's all or nothing when putting messages on the queues). The consumers either process the message or put another message on the queue.
I don't use any filters or selectors, and messages are rarely paged (or at least the queue size doesn't go up so much that it has to be paged).
Oh, and HornetQ is running embedded in my application, so local message consumption (is this even a word?) uses in-vm connection factories, and the netty connectors are used only to bridge messages between the cluster nodes. (I also had some questions about how such routing is done, but I'll look at the code first.)
I think that's all. If you need any more info, I'll gladly provide it.
> Can I (from a license point of view) use this version even though it's marked EAP?
Of course... the source code is under the Apache License, with 1 class under LGPL, so you're free to use it.
EAP adds support and build distribution (including patches... etc).
So EAP is added value: you get some structure from Red Hat along with the software. If you get the software directly, you're still free to use it.
Ahhh... that queue getting stuck was indeed a bug that was fixed. I remember that now. I'm still not sure what caused the "Journal must be loaded first" error... maybe some of your embedded code is wrong?
What application server are you using, BTW?
If you are using Embedded, you should probably use the Branch_2_2_AS7. If you are using AS7, definitely get the AS7 branch. (The only differences are in class loading... nothing much beyond that.)
>> OK, so I'm back. I could join IRC, but I don't want to bother you so much.
It would be my pleasure... really!
I like to help users in trouble so we get more users using HornetQ, and also, mainly, for the sake of avoiding future errors...
I'm using Tomcat 7, and when I say embedded, it really is just invoking EmbeddedJMS's start method, sniffing it to get the queue connection factory, and registering a TopologyListener. I don't think it has anything to do with it, but I will try the standalone server to see what I get. I'm mimicking the startup script, but maybe my configuration is just funky. I will clean it up and attach it too.
I'm deploying the latest version now on our testing server and everything seems to be working great.