1 Reply Latest reply on Aug 14, 2009 3:54 AM by timfox

Testsuite updates...

clebert.suconic Aug 14, 2009 1:35 AM

I just wanted to give you guys an update on my recent work on the testsuite. You guys are a few hours ahead of me, so an update is probably a good thing now.

I - I profiled the testsuite and added some code locally (using JVMTIInterface) to produce a memory dump for every test (just instance counters, like the one we have with kill -3).

There was/is an issue with the Pinger. The Pinger instance stays held for a few seconds after close is called, so it was hard to double-check which tests are leaking PostOfficeImpl and QueueImpl. So I modified the Pinger (locally) to clear basically all of its fields. But I would still like to understand why the Pinger stays held for another 10-30 seconds on average after future.cancel is called.
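One possible explanation, assuming the Pinger is scheduled on a java.util.concurrent.ScheduledThreadPoolExecutor (I have not confirmed this is the cause here): by default a cancelled periodic task stays in the executor's work queue until its delay elapses, so the Runnable stays strongly reachable in the meantime. A minimal sketch of that behaviour, where the class name and the 30 second period are made up for illustration:

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of the project.
class CancelRetentionDemo
{
   /** Returns { queue size right after cancel, queue size after purge }. */
   static int[] queueSizesAfterCancel()
   {
      ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
      try
      {
         // Stand-in for the Pinger; the real one holds references to other objects.
         Runnable pinger = new Runnable()
         {
            public void run()
            {
            }
         };

         // Assumed period of 30 seconds, chosen only to mirror the delay observed above.
         ScheduledFuture<?> future = executor.scheduleWithFixedDelay(pinger, 30, 30, TimeUnit.SECONDS);

         // cancel() marks the task as cancelled but, by default, leaves it in the queue.
         future.cancel(false);
         int afterCancel = executor.getQueue().size();

         // purge() removes cancelled tasks from the queue immediately.
         executor.purge();
         int afterPurge = executor.getQueue().size();

         return new int[] { afterCancel, afterPurge };
      }
      finally
      {
         executor.shutdownNow();
      }
   }

   public static void main(String[] args)
   {
      int[] sizes = queueSizesAfterCancel();
      System.out.println("after cancel: " + sizes[0] + ", after purge: " + sizes[1]);
   }
}
```

If this is what is happening, calling purge() on the executor after cancelling would release the Pinger earlier than waiting for the delay to expire.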

With this I've found several tests not setting the server to null, and one that was probably not shutting down the server properly.

I still have a few leaks after RedistributionTest... Redistribution will be my next update.

II - NettyFileStorageSymmetricClusterWithBackupTest::testStartStop

For some reason the live node is creating a few IDs during shutdown. Some consumers are getting closed during shutdown on live, and a few IDs are getting created (but not replicated to the backup node). Because of that, the backup node's IDs get out of sync with live, so the backup won't be activated after the second restart due to this exception:

 if (liveUniqueID != backupID)
 {
    initialised = false;

    throw new IllegalStateException("Live and backup unique ids different (" + liveUniqueID + ":" + backupID + "). You're probably trying to restart a live backup pair after a crash");
 }

I'm looking at avoiding the creation of extra IDs during shutdown. I'm almost there. I didn't get there today only because it's too late now, but I'm confident about it.
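The drift described above can be sketched with a toy model. This is not the actual server code: the Node class, the generate helper and the counters are all hypothetical, and only the comparison mirrors the check quoted above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model only; every name here is hypothetical.
class IdDriftSketch
{
   /** Stand-in for a node's ID generator. */
   static final class Node
   {
      private final AtomicLong lastId = new AtomicLong();

      long generateId()
      {
         return lastId.incrementAndGet();
      }

      long currentId()
      {
         return lastId.get();
      }
   }

   /** Generates an ID on live and mirrors it on backup only when replicate is true. */
   static void generate(Node live, Node backup, boolean replicate)
   {
      live.generateId();
      if (replicate)
      {
         backup.generateId();
      }
   }

   /** Same comparison as the check quoted above. */
   static void checkInSync(Node live, Node backup)
   {
      long liveUniqueID = live.currentId();
      long backupID = backup.currentId();
      if (liveUniqueID != backupID)
      {
         throw new IllegalStateException("Live and backup unique ids different (" + liveUniqueID + ":" + backupID + ")");
      }
   }

   /** Returns true if an unreplicated ID created at shutdown makes the check fail. */
   static boolean driftsAfterUnreplicatedShutdownId()
   {
      Node live = new Node();
      Node backup = new Node();

      // Normal operation: every ID is replicated, so the check passes.
      generate(live, backup, true);
      generate(live, backup, true);
      checkInSync(live, backup);

      // Shutdown on live: a consumer closes and an ID is created but never replicated.
      generate(live, backup, false);

      try
      {
         checkInSync(live, backup);
         return false;
      }
      catch (IllegalStateException expected)
      {
         return true;
      }
   }
}
```

In this model the fix is either to replicate the shutdown IDs or to not create them at all; the second is what I'm going for.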

III - There is also a separate issue on NettyFileStorageSymmetricClusterWithBackupTest (which will probably show up in other tests).
When a bridge is deployed, it sends a message to a management queue. When several bridges are deployed very quickly, messages eventually arrive on the backup in a different order, and the message ACK replication won't get there. I will work on this as soon as I finish II.

1. Re: Testsuite updates...

timfox Aug 14, 2009 3:54 AM (in response to clebert.suconic)

Thanks for the update.

If the issues are due to replication per se, then I wouldn't spend too much time on them, since we're going to remove that after the next release anyway.