Buddyrep issue
fredrikj Dec 28, 2007 9:57 AM
Hi.
I am currently using JBoss Cache 2.1.0 GA and JGroups 2.6.1 with buddy replication. Buddy replication is configured to use one buddy only.
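For context, the buddy replication section of our cache configuration looks roughly like this. This is a sketch in the JBoss Cache 2.x XML config format; apart from numBuddies = 1, the property values shown are illustrative defaults rather than our exact deployment settings:

```xml
<attribute name="BuddyReplicationConfig">
   <config>
      <!-- one backup buddy per node, as described above -->
      <buddyReplicationEnabled>true</buddyReplicationEnabled>
      <buddyLocatorClass>org.jboss.cache.buddyreplication.NextMemberBuddyLocator</buddyLocatorClass>
      <buddyLocatorProperties>
         numBuddies = 1
      </buddyLocatorProperties>
      <!-- illustrative values below; not necessarily our settings -->
      <buddyPoolName>default</buddyPoolName>
      <buddyCommunicationTimeout>2000</buddyCommunicationTimeout>
      <autoDataGravitation>false</autoDataGravitation>
      <dataGravitationRemoveOnFind>true</dataGravitationRemoveOnFind>
      <dataGravitationSearchBackupTrees>true</dataGravitationSearchBackupTrees>
   </config>
</attribute>
```

With numBuddies = 1, each node should back up its data to exactly one other cluster member under the /_BUDDY_BACKUP_ region, which is why seeing two entries there looks wrong.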
The setup is four nodes with the following IP addresses:
172.16.0.5
172.16.0.6
172.16.0.7
172.16.0.8
They are all started in the order listed, so .5 is the coordinator. The first node up (.5) inserts data into the cache, and data then gravitates to the other nodes as needed; this happens mostly initially, when load is first applied to the system. Data affinity is handled by a layer above the cache.
Using this scenario with 2.0.0 GA presented no problems except when adding new nodes under load, so we are now investigating 2.1.0.
The issue I'm facing is that the coordinator seems to get two buddy backups, one of them being itself.
This is the contents on 172.16.0.5 (coordinator):
null {} /91 {91=com.cubeia.firebase.game.table.InternalMetaData@1134d9b com.cubeia.testgame.server.game.TestGame@1cdc8ce} /63 {63=com.cubeia.firebase.game.table.InternalMetaData@cc9ff5 com.cubeia.testgame.server.game.TestGame@951aeb} /92 {92=com.cubeia.firebase.game.table.InternalMetaData@e60a39 com.cubeia.testgame.server.game.TestGame@185c84e} ... /_BUDDY_BACKUP_ {} /172.16.0.8_8786 {} /15 {15=com.cubeia.firebase.game.table.InternalMetaData@d9e1c0 null} /16 {16=com.cubeia.firebase.game.table.InternalMetaData@742062 null} /172.16.0.5_8786 {} /31 {}
Notice that there are two members listed under /_BUDDY_BACKUP_: one is .8 and the other is .5, i.e. the node itself.
Now, on 172.16.0.8 we get a lot of lock timeouts like the one below:
Caused by: org.jboss.cache.lock.TimeoutException: read lock for /_BUDDY_BACKUP_/172.16.0.5_8786 could not be acquired by GlobalTransaction:<172.16.0.6:8786>:41 after 5000 ms. Locks: Read lock owners: [] Write lock owner: GlobalTransaction:<172.16.0.6:8786>:1 , lock info: write owner=GlobalTransaction:<172.16.0.6:8786>:1 (activeReaders=0, activeWriter=Thread[Incoming,TableSpace,172.16.0.8:8786,5,Thread Pools], waitingReaders=25, waitingWriters=0, waitingUpgrader=0)
172.16.0.8 also shows two members under the buddy backup:
null {} /28 {} /29 {} /92 {} ... /_BUDDY_BACKUP_ {} /172.16.0.7_8786 {} /91 {91=com.cubeia.firebase.game.table.InternalMetaData@1fbeed6 null} /41 {41=com.cubeia.firebase.game.table.InternalMetaData@fd3922 null} /115 {115=com.cubeia.firebase.game.table.InternalMetaData@b215d9 null} ... /172.16.0.5_8786 {} /31 {}
It seems that .8's backup buddy is in fact .7, but we still hold a buddy reference to the .5 member as well. In fact, all the lock timeouts on .8 are related to the .5 buddy backup FQN:
failure acquiring lock: fqn=/_BUDDY_BACKUP_/172.16.0.5_8786