10 Replies Latest reply on Oct 16, 2009 11:15 AM by akarl

Farm deployment errors with large WAR files

pclark95 Jul 2, 2009 1:28 PM

Hi, I am having a problem using the farm deployment model with a large WAR file.

My setup:
Two Ubuntu 9.04 servers running Sun Java 1.6.0_14 and JBoss AS 5.1.0_GA
The two systems are connected through a 1 GB link and the default clustering settings are used (meaning UDP multicast, some settings are overridden on startup)

Here is the startup command for each server:
Server1
./run.sh -c all -b 10.200.90.105 -Djboss.messaging.ServerPeerID=10 -Djboss.messaging.groupname=TestPostOffice -g TestPartition1

./run.sh -c all -b 10.200.90.105 -Djboss.messaging.ServerPeerID=11 -Djboss.messaging.groupname=TestPostOffice -g TestPartition1

My first test worked fine. I was using a small war file (410K):
pclark@bks:~/development/servers/jboss-5.1.0.GA/server/all/farm$ ls -l
total 440
-rw-r--r-- 1 pclark pclark 6830 2009-06-30 11:12 cluster-examples-service.xml
-rw-r--r-- 1 pclark pclark 441162 2009-07-02 12:02 test.war

And the file deployed on both servers with:
12:03:50,614 INFO [TomcatDeployment] deploy, ctxPath=/test

However, the problem comes in with a larger war file. The second test was with a 120MB file. The file deploys on one server, but not on the other. It fails with the error:

2009-07-02 11:16:57,801 INFO [org.jboss.profileservice.cluster.repository.DefaultRepositoryClusteringHandler] (HDScanner) Unable to acquire local lock: Unable to acquire lock as it is held by 10.200.90.105:1099
2009-07-02 11:16:57,801 ERROR [org.jboss.system.server.profileservice.repository.clustered.ClusteredDeploymentRepository] (HDScanner) getModifiedDeployments(): Cannot acquire local lock

In addition, if I bring the master server up by itself and deploy through the farm directory, it deploys successfully. I will then bring up a second server and it will deploy successfully.

However, if I bring them all up at once it fails with the locking error. Or if I try to deploy once all servers in the cluster are up it fails with the locking error.

We were thinking that the size of the file was the problem. So we iteratively scaled down the war file until we could get it to successfully deploy. It appears that the limit is around 2MB.

Does anyone know what might be happening here? Are there any ideas on how to get the farm deployment to work reliably with large war files?

Thanks for your time,
Patrick

1. Re: Farm deployment errors with large WAR files

pclark95 Jul 2, 2009 1:30 PM (in response to pclark95)

I made a mistake in my post. The two servers are started up with the commands:

Server1
./run.sh -c all -b 10.200.90.103 -Djboss.messaging.ServerPeerID=10 -Djboss.messaging.groupname=TestPostOffice -g TestPartition1

Server2
./run.sh -c all -b 10.200.90.105 -Djboss.messaging.ServerPeerID=11 -Djboss.messaging.groupname=TestPostOffice -g TestPartition1

Thanks for your time,
Patrick
2. Re: Farm deployment errors with large WAR files

brian.stansberry Jul 2, 2009 3:06 PM (in response to pclark95)

How are you deploying the file? Placing it in farm on one node? Copying it to farm on both? Copying it to both and then starting the 2 nodes?

I'd think the first, but I want to be sure.

I'll assume it's the first and the node where it successfully deployed was the one where you copied the file and that was 10.200.90.105.

The logging you reported happens when the hot deploy scanner thread on 10.200.90.103 tries to acquire a lock to scan the farm dir, and can't after (configurable) 60 secs. The 10.200.90.105 node is holding the lock, which it will do while trying to push the war to the cluster. The ERROR msg should probably be a WARN, as this isn't necessarily a problem (e.g. if you are pushing huge content). But 60 seconds to push 2MB? Something's wrong there. Plus it sounds like the war *never* deploys.

Can you turn on TRACE logging in the server.log files for the following categories, test deploying the war and then zip and mail me the log files? Categories are:

org.jgroups.protocols.UDP
org.jgroups.protocols.pbcast.NAKACK
org.jboss.system.server.profileservice.clustered
org.jboss.ha.framework.server.lock
org.jboss.profileservice.cluster.repository

Thanks.
3. Re: Farm deployment errors with large WAR files

pclark95 Jul 6, 2009 11:23 AM (in response to pclark95)

Yep, you have the correct way that I am deploying it. I am starting both nodes, one at a time and I wait until they are both up. The 10.200.90.103 is started first and considered the master. Once both are up and clustered I place the war file on the 10.200.90.103 farm directory. I have enabled logging as you had asked and produced the files. Server time drift is minimal, but just so you can adjust between the two log files: 10.200.90.103 is 4 seconds faster than 10.200.90.105.

pclark@10.200.90.103:~/development/servers/jboss-5.1.0.GA/bin$ date
Mon Jul 6 10:08:48 CDT 2009

pclark@10.200.90.105:~/development/servers/jboss-5.1.0.GA/bin$ date
Mon Jul 6 10:08:44 CDT 2009

Here are the startup commands for each:
10.200.90.103:
./run.sh -c all -b 10.200.90.103 -Djboss.messaging.ServerPeerID=10 -Djboss.messaging.groupname=TestPostOffice -Djboss.server.log.threshold=WARN -g TestPartition1

10.200.90.105:
./run.sh -c all -b 10.200.90.105 -Djboss.messaging.ServerPeerID=11 -Djboss.messaging.groupname=TestPostOffice -Djboss.server.log.threshold=WARN -g TestPartition1

I have emailed you the log files and the jboss-log4j.xml file (to make sure I enabled logging correctly). It doesn't look like I am able to attach a file to this post.

Thanks for your assistance,
Patrick
4. Re: Farm deployment errors with large WAR files

brian.stansberry Jul 8, 2009 2:55 PM (in response to pclark95)

Patrick,

Looking at the log files you sent I'm seeing JGroups doing a lot of retransmission of messages. Basically for each ~60K of content JGroups sends a message; if the receiver detects some messages are missing it asks for retransmission of the missing message and won't deliver messages to the application until the missing message is received. If you get a lot of that it will certainly slow things down.

I'm going to assume you are using the standard JGroups configurations that ship with the AS and that the problem isn't there.

Much more likely it's an OS-level issue: the OS isn't providing large enough buffers for UDP datagrams. As a result, datagrams are lost at the network layer, so JGroups never sees the messages and needs to have them retransmitted. The default UDP buffer size on many OSs is very small.

http://www.29west.com/docs/THPM/udp-buffer-sizing.html has a nice section on how to configure UDP buffer sizes on various OSs. The stock JBoss AS config can utilize up to 25MB of UDP buffer, so if you have the memory resources to allow that much, go for it. If not, provide as much as you can, but I'd say at least 2MB.
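
A minimal sketch of what that could look like on Linux, assuming the net.core.rmem_max and net.core.wmem_max sysctl keys are the ones that cap the socket receive and send buffers on your system. The helper names are made up for illustration. The script only prints the commands and changes nothing, since sysctl -w needs root.

```shell
#!/bin/sh
# Sketch only: prints sysctl commands for a given buffer size; it does not run them.
# mb_to_bytes and print_udp_buffer_settings are hypothetical helper names.

# Convert a size in MB to bytes, the unit these sysctl keys expect.
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Print the commands that would raise the maximum receive and send buffers.
print_udp_buffer_settings() {
    bytes=$(mb_to_bytes "$1")
    echo "sysctl -w net.core.rmem_max=$bytes"
    echo "sysctl -w net.core.wmem_max=$bytes"
}

# 25MB is the upper figure mentioned above; use a smaller value if memory is tight.
print_udp_buffer_settings 25
```

Running "sysctl net.core.rmem_max" beforehand shows the current value, which makes it easy to tell whether the change took effect.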
5. Re: Farm deployment errors with large WAR files

akarl Jul 13, 2009 9:48 AM (in response to pclark95)

I'm seeing the same behavior with a 37MB ear, JBoss 5.1.0 GA, and the base setup for clustering/JGroups (the all configuration). I am deploying using the recommended method of dropping the file into the farm directory on the master node. When I do this I begin seeing the "Cannot acquire local lock" messages on the child node. ~20 minutes later the deployment proceeds on both machines at the same time.

I'm experimenting with changing the OS-level UDP settings as suggested but have not had dramatically better results yet. I am raising the net.core.rmem_max limit and gathering deployment times to see if things improve as I raise it. However, I stopped seeing UDP packet receive errors once I took the limit above 2MB, so I suspect something else may be limiting the file transfer now.
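
For anyone who wants to watch the same counter while testing: on Linux, the "packet receive errors" line from netstat -su should correspond to the Udp InErrors field in /proc/net/snmp (an assumption worth checking on your own kernel). Here is a small sketch that reads it. The function name udp_in_errors is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the UDP InErrors counter from text in /proc/net/snmp format.
# udp_in_errors is a hypothetical helper name; it reads from standard input.
udp_in_errors() {
    awk '
        $1 == "Udp:" && !header_seen {
            # First Udp: line holds the column names; find InErrors.
            for (i = 2; i <= NF; i++) if ($i == "InErrors") col = i
            header_seen = 1
            next
        }
        $1 == "Udp:" && col {
            # Second Udp: line holds the values in the same column order.
            print $col
            exit
        }
    '
}

# Only read the live counters where the file exists (Linux).
if [ -r /proc/net/snmp ]; then
    udp_in_errors < /proc/net/snmp
fi
```

Run it before and after a farm deployment. If the number grows, datagrams are still being dropped at the OS level.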

I also don't see this behavior if I deploy to the master node while it is the only node and then bring up other nodes in the cluster after the deployment, i.e. when I do that the deployment to the child nodes is almost instantaneous following the typical startup time. This makes me think the problem is not at the OS level.

I'll zip and email my server log as well after turning on the logging and will post some timings from my deployment testing for others to compare against.
6. Re: Farm deployment errors with large WAR files

brian.stansberry Jul 13, 2009 3:47 PM (in response to pclark95)

Thanks for the info; got your logs. The log shows a burst of messages being received, each of which is a piece of the file, and then a 60 second pause before the next burst.

Suspiciously, 60 secs is the default timeout for waiting for group RPC responses, hinting that something around that is what is going on.

Can you do a couple things for me?

1) Add TRACE logging for org.jboss.system.server.profileservice.repository.clustered
2) Turn on all the TRACE logging for the other node as well.
3) Send me the server.log from both nodes.

Chances are only so-so that I'd have a chance to set up a test to reproduce this this week, so logs from your testing are very helpful in determining the problem.
7. Re: Farm deployment errors with large WAR files

akarl Jul 14, 2009 9:11 AM (in response to pclark95)

Done. I included 20 minutes of logging from both the master and child nodes.
8. Re: Farm deployment errors with large WAR files

brian.stansberry Jul 15, 2009 4:50 PM (in response to pclark95)

I haven't figured this out yet, but for sure it's a bug, so here's the JIRA for it:

https://jira.jboss.org/jira/browse/JBAS-7102
9. Re: Farm deployment errors with large WAR files

brian.stansberry Oct 2, 2009 6:48 PM (in response to pclark95)

I've commented on the JIRA, but will go into a bit more detail here.

First, the logs you sent me, Adam, showed the file that was being farmed was actually "xxx.ear.filepart" not "xxx.ear". From that I believe you were using a tool (perhaps WinSCP?) to upload the file directly to the farm directory. That's a lengthy enough operation that the hot deployment scanner ran in the middle.

Hence my recommendation on the JIRA to stop the hot deployment scanner thread before doing lengthy I/O operations.

The scanner can be stopped by invoking the JMX stop() operation on the jboss.deployment:flavor=URL,type=DeploymentScanner MBean. This can be done via the jmx-console or via the twiddle utility in $JBOSS_HOME/bin.

I also recommend you do your upload to a temp folder and then do a local copy into farm/. Otherwise if there is a failure during the upload you'll be leaving a corrupt file in farm/.
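
The two recommendations above could be combined in a small script, sketched here under some assumptions: twiddle is invoked as "twiddle.sh -s <host> invoke <mbean> <operation>", the scanner MBean also exposes a start() operation, and the TWIDDLE and JBOSS_HOST variables and the function names are made up for illustration. When TWIDDLE is unset the scanner calls are skipped, which allows a dry run of the copy step.

```shell
#!/bin/sh
# Sketch: copy an already-uploaded archive into farm/ with the scanner stopped.
# TWIDDLE (path to twiddle.sh) and JBOSS_HOST are hypothetical variables.

SCANNER="jboss.deployment:flavor=URL,type=DeploymentScanner"

# Invoke stop or start on the scanner MBean; does nothing if TWIDDLE is unset.
scanner_op() {
    if [ -n "${TWIDDLE:-}" ]; then
        "$TWIDDLE" -s "${JBOSS_HOST:-localhost}" invoke "$SCANNER" "$1"
    fi
}

# Usage: deploy_to_farm /tmp/upload/app.war /path/to/server/all/farm
deploy_to_farm() {
    src=$1
    farm=$2
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    scanner_op stop || return 1
    cp "$src" "$farm/"
    status=$?
    # Restart the scanner even if the copy failed (assumes the MBean has start()).
    scanner_op start
    return $status
}
```

The upload itself (scp, WinSCP or similar) goes to the temp folder first. Only the fast local copy happens while the scanner is stopped.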

Second, looking closely at the logs tells me that a very high percentage of messages JGroups is sending from the master are not being received on the child. This makes things go very slowly. You need to determine why there is such a high rate of loss on your network. Earlier in the thread we've discussed the OS maximum read buffer setting. You can also adjust the maximum write buffer. Note also that changes made via sysctl -w are not persistent across restarts; for persistent changes you need to edit /etc/sysctl.conf.

If you can't eliminate the UDP packet losses, you might consider using a TCP stack, particularly if yours is a 2 node cluster, where there is no benefit from using IP multicast. If you can use IP multicast for the initial cluster discovery messages (which seems to be working fine) then it's very easy to configure the AS to use TCP for regular traffic. Just add this to your command line arguments:

-Djboss.default.jgroups.stack=tcp

(Switching to TCP is a bit more complicated if IP multicast for discovery isn't an option.)
10. Re: Farm deployment errors with large WAR files

akarl Oct 16, 2009 11:15 AM (in response to pclark95)

The rest of the story...

We created an application that uses JBossCache to share objects between nodes in the cluster. We configured JBossCache to utilize JGroups UDP communication which, from what I can tell, is configured nearly identically to the farm deployer. We noticed that if we ran two instances of JBoss within the same OS/virtual hardware we had excellent communications performance, but when we split to two OS/virtual hardware the performance was pitiful. So, why? We had been using CentOS on top of Xen virtual hardware, so we tried switching out one of those variables and running CentOS on top of VMware and voila, massive performance benefit. So, something in our Xen hardware setup or Xen itself was causing very poor performance of UDP communications. Our solution at this point is simply to use VMware. At some point I will get a chance to test the farm deployer against our VMware-based clusters and I'll try to post some results here, but I fully expect it to work well.