14 Replies Latest reply on Nov 8, 2008 7:34 AM by belaban

    Cache corrupted by 64bit windows member

    wjm

      Hello. We've run into some trouble in attempting to bring a 64bit windows machine into an existing cluster. The cluster has (had) two 64bit linux members and was running fine when we attempted to merge a 64bit windows machine into the group. Any time the windows machine wrote to a node, that node was thereafter unreadable by either of the original members, which threw exceptions of the following form. All machines were using 32bit java. Has anyone seen anything similar? Have I missed something fundamental about 64bit windows?

      Thanks in advance for any help or insight.


      INFO | jvm 1 | 2008/10/10 10:05:13 | 0 [ERROR] AdjListJDBCCacheLoader.reportAndRethrowError(): - Failed to load node for fqn /IntegrationModules
      INFO | jvm 1 | 2008/10/10 10:05:13 | java.lang.Exception: Unable to load to deserialize result:
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.loader.AdjListJDBCCacheLoader.loadNode(AdjListJDBCCacheLoader.java:397)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.loader.AdjListJDBCCacheLoader.get(AdjListJDBCCacheLoader.java:97)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.interceptors.CacheLoaderInterceptor.loadData(CacheLoaderInterceptor.java:530)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.interceptors.CacheLoaderInterceptor.loadNode(CacheLoaderInterceptor.java:408)
      [...]
      INFO | jvm 1 | 2008/10/10 10:05:13 | Caused by:java.io.EOFException
      INFO | jvm 1 | 2008/10/10 10:05:13 | at java.io.DataInputStream.readInt(Unknown Source)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at java.io.ObjectInputStream.readInt(Unknown Source)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.marshall.CacheMarshaller200.populateFromStream(CacheMarshaller200.java:740)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.marshall.CacheMarshaller200.unmarshallHashMap(CacheMarshaller200.java:705)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.marshall.CacheMarshaller200.unmarshallObject(CacheMarshaller200.java:564)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.marshall.CacheMarshaller200.objectFromObjectStream(CacheMarshaller200.java:147)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.marshall.VersionAwareMarshaller.objectFromStream(VersionAwareMarshaller.java:176)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.loader.AdjListJDBCCacheLoader.unmarshall(AdjListJDBCCacheLoader.java:702)
      INFO | jvm 1 | 2008/10/10 10:05:13 | at org.jboss.cache.loader.AdjListJDBCCacheLoader.loadNode(AdjListJDBCCacheLoader.java:392)

        • 1. Re: Cache corrupted by 64bit windows member
          wjm

          Small addendum, we're running the latest 2.2.0GA release. Thanks again.

          • 2. Re: Cache corrupted by 64bit windows member
            manik

            I doubt this is an OS problem, but just to prove it, do you see this issue if you had 1 windows and 1 linux machine in the cluster, and the other linux machine were to join?

            • 3. Re: Cache corrupted by 64bit windows member
              wjm

               

              "manik.surtani@jboss.com" wrote:
              I doubt this is an OS problem, but just to prove it, do you see this issue if you had 1 windows and 1 linux machine in the cluster, and the other linux machine were to join?


              Thanks for the reply. We see this problem immediately after the 64bit machine put()s or replace()s any node in the cache. That same node(s) becomes unreadable for all other members, whether current or newly joined.

              Note, we have not tried clustering two 64bit windows machines, because we only have one, but presumably they would play well together.

              We're going to try using the 64bit java installation on the windows machine to see if that might "help". So far we've only used the 32bit java everywhere.

              Otherwise, I hope you're right, but it's certainly evident only when 64bit windows is a member.




              • 4. Re: Cache corrupted by 64bit windows member
                wjm

                Just a followup here, as mentioned. When using 64bit java for windows (1.6 latest), we do see the same result.

                • 5. Re: Cache corrupted by 64bit windows member
                  manik

                  Do you see any issues with the replication? E.g., if you turn off the cache loader - or even use a different cache loader - does this work?

                  Finally, given that you are using a JDBC cache loader, which database backend are you using, and is it configured to be shared?

                  • 6. Re: Cache corrupted by 64bit windows member
                    wjm

                     

                    "manik.surtani@jboss.com" wrote:
                    Do you see any issues with the replication? E.g., if you turn off the cache loader - or even use a different cache loader - does this work?


                    Hello again, and thanks for the followup. We tried the cache without a loader today, and got the same result.

                    • 7. Re: Cache corrupted by 64bit windows member
                      wjm

                      I've now logged this as a formal JIRA case:

                      https://jira.jboss.org/jira/browse/JBCACHE-1432

                      Thanks again!

                      • 8. Re: Cache corrupted by 64bit windows member
                        manik

                        So what does this stack trace look like when you don't use a cache loader? :-)

                        • 9. Re: Cache corrupted by 64bit windows member
                          wjm

                          Ah, yes. Very fair question. :-) When it was apparent to me that the error lies in readInt() and readShort(), I didn't think about the rest. Clearly that's where the problem is, but this is the form we see without a loader enabled:

                          30 [ERROR] RequestCorrelator.receiveMessage(): - failed unmarshalling buffer into return value
                          java.io.EOFException
                          at java.io.DataInputStream.readShort(Unknown Source)
                          at java.io.ObjectInputStream$BlockDataInputStream.readShort(Unknown Source)
                          at java.io.ObjectInputStream.readShort(Unknown Source)
                          at org.jboss.cache.marshall.CacheMarshaller200.unmarshallObject(CacheMarshaller200.java:536)
                          at org.jboss.cache.marshall.CacheMarshaller200.objectFromObjectStream(CacheMarshaller200.java:147)
                          at org.jboss.cache.marshall.VersionAwareMarshaller.objectFromByteBuffer(VersionAwareMarshaller.java:154)
                          at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:544)
                          at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:365)
                          at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:746)
                          at org.jgroups.JChannel.up(JChannel.java:1151)
                          at org.jgroups.mux.Multiplexer$Task.run(Multiplexer.java:1036)
                          at org.jgroups.mux.Multiplexer$ExecuteTask.run(Multiplexer.java:1060)
                          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
                          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
                          at java.lang.Thread.run(Unknown Source)

                          I'll attach this to the bug as well.

                          • 10. Re: Cache corrupted by 64bit windows member
                            genman

                            You're getting an EOF, meaning somebody in your cluster is cutting the connection to your system. It might just be a network connectivity issue. Look at the logs on the other machines (TRACE) and see what's triggering the disconnect, if it indeed is within the cache.
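
                            For readers following along: DataInputStream raises EOFException whenever the stream ends before the requested number of bytes is available. That can also happen on an in-memory buffer that is simply shorter than the reader expects, with no connection involved. A minimal sketch, assuming nothing about JBoss Cache internals (the class and method names are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class TruncatedBufferDemo {
    // Serializes one int with DataOutputStream, then keeps only the first `keep` bytes.
    static byte[] truncatedInt(int value, int keep) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(bytes.toByteArray(), keep);
    }

    // Returns true if readInt() reaches end-of-stream before it has read 4 bytes.
    static boolean readIntHitsEof(byte[] buffer) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer))) {
            in.readInt();
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readIntHitsEof(truncatedInt(42, 4))); // prints false
        System.out.println(readIntHitsEof(truncatedInt(42, 2))); // prints true
    }
}
```

                            So an EOFException on its own does not say whether the bytes were lost in transit or were never written in the form the reader expects.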

                            • 11. Re: Cache corrupted by 64bit windows member
                              wjm

                               

                              "genman" wrote:
                              You're getting an EOF, meaning somebody in your cluster is cutting the connection to your system. It might just be a network connectivity issue. Look at the logs on the other machines (TRACE) and see what's triggering the disconnect, if it indeed is within the cache.


                              Thanks for the feedback, but I don't think this is the case. The underlying problem seems related to the way the 64bit architecture writes Int and Short to the object stream, causing other cluster members to throw this sort of exception when reading anything that a 64bit windows member writes.

                              I've read that NIO handles this better than an ordinary DataInputStream, but I'm not well versed enough to say for sure.

                              • 12. Re: Cache corrupted by 64bit windows member
                                genman

                                I doubt that's the case. All the Java IO classes write the same number of bytes regardless of the underlying architecture. I've never heard otherwise since I started using JDK 1.0.

                                If you're patient, feel free to run Wireshark and see if the Windows 64 bit machine is sending anything weird.
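
                                The claim about fixed sizes is easy to check on each machine. The sketch below (class and method names are hypothetical) encodes an int and a short with DataOutputStream; the Java API specifies 4 and 2 bytes respectively, high byte first, so the output should be identical on every member regardless of OS or JVM bitness:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FixedWidthDemo {
    // Encodes an int exactly as DataOutputStream.writeInt() puts it on the stream.
    static byte[] encodeInt(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Encodes a short exactly as DataOutputStream.writeShort() puts it on the stream.
    static byte[] encodeShort(short value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // The architecture is printed only for comparison between machines.
        System.out.println("os.arch = " + System.getProperty("os.arch"));
        System.out.println("int bytes   = " + encodeInt(0x01020304).length);        // prints 4
        System.out.println("short bytes = " + encodeShort((short) 0x0102).length);  // prints 2
    }
}
```

                                If the two lengths differ between the Windows and Linux members, that would point at the JVM; if they match, the marshalling layer above is the more likely place to look.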

                                • 13. Re: Cache corrupted by 64bit windows member
                                  wjm

                                  I'm still hoping to get a confirmation. We feel certain that anyone else attempting to blend 64bit windows in with other architectures in a cluster will experience the same result, though.

                                  • 14. Re: Cache corrupted by 64bit windows member
                                    belaban

                                    I created a JIRA issue to verify this in JGroups (https://jira.jboss.org/jira/browse/JGRP-856). That said, the specific issue you're seeing might be caused by marshalling code relying on explicit assumptions about the size of certain types, or by big/little endian issues.
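
                                    To illustrate what a big/little endian mismatch would look like in general, here is a minimal sketch using java.nio.ByteBuffer, which lets the byte order be set explicitly. It is not taken from JBoss Cache or JGroups, and the class and method names are hypothetical. A short written in one order and read in the other comes back as a different number, so a length field misread this way could send a reader past the end of its buffer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderDemo {
    // Writes a short into a 2-byte array using the given byte order.
    static byte[] encodeShort(short value, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putShort(value).array();
    }

    // Reads a short back from a 2-byte array using the given byte order.
    static short decodeShort(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getShort();
    }

    public static void main(String[] args) {
        byte[] wire = encodeShort((short) 1, ByteOrder.LITTLE_ENDIAN);

        // Same order on both sides: the value round-trips.
        System.out.println(decodeShort(wire, ByteOrder.LITTLE_ENDIAN)); // prints 1

        // Mismatched order: the two bytes are swapped, so 0x0001 is read as 0x0100.
        System.out.println(decodeShort(wire, ByteOrder.BIG_ENDIAN));    // prints 256

        // Code that depends on the platform's native order is where machines can diverge.
        System.out.println("native order = " + ByteOrder.nativeOrder());
    }
}
```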