2 Replies Latest reply on Jul 25, 2008 4:28 AM by Mathias Chouet

    Strange behavior when attach()ing large POJOs

    Mathias Chouet Newbie

      Hello everyone.

      After having successfully overcome the "Raster serialization" problem, thanks to your advice, I'm now facing a few new difficulties.

      First, replication is "slow".
      In my current program, I'm trying to replicate a homemade POJO called RasterAdapter synchronously between two instances of the cache. This object (which is contained in another homemade, @Replicable POJO) holds a transient Raster and has writeObject()/readObject() methods that implement a custom serialization process. writeObject() itself executes quite fast: the 40MB Raster is serialized and transferred very quickly, and is successfully read by the second (remote) instance of the cache. But after writeObject() and the remote readObject() complete (confirmed by debug messages), the program inexplicably waits for approximately 5 seconds. No exception, no message; it just waits. Then execution continues normally.
      I'm using REPL_SYNC mode and REPEATABLE_READ isolation level, and my POJOs are attached to a path like: /common_part/of_the_path/unique_pojo_number
      Maybe it has something to do with locks or timeouts?
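
      For reference, the serialization pattern described above can be sketched as follows. This is only a minimal sketch, not the actual RasterAdapter: the class name RasterAdapterSketch, the int[] buffer standing in for the non-serializable Raster, and the roundTrip() helper are all assumptions made for illustration. Timing roundTrip() on its own is one way to confirm that serialization is not where the 5 seconds go.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the RasterAdapter described in the post.
// An int[] plays the role of the non-serializable Raster.
class RasterAdapterSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int width;
    private final int height;
    private transient int[] pixels; // skipped by default serialization

    RasterAdapterSketch(int width, int height, int[] pixels) {
        this.width = width;
        this.height = height;
        this.pixels = pixels;
    }

    int getWidth() { return width; }
    int getHeight() { return height; }
    int[] getPixels() { return pixels; }

    // Custom serialization: write the non-transient fields first,
    // then the transient payload by hand.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(pixels.length);
        for (int p : pixels) {
            out.writeInt(p);
        }
    }

    // Mirror image of writeObject(): restore the fields, then rebuild the payload.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int n = in.readInt();
        pixels = new int[n];
        for (int i = 0; i < n; i++) {
            pixels[i] = in.readInt();
        }
    }

    // Serializes and deserializes in memory, with no cache or network involved.
    static RasterAdapterSketch roundTrip(RasterAdapterSketch src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(src);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (RasterAdapterSketch) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```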

      The other problem is even stranger.
      When performing the operations described above, i.e. successively attach()ing several instrumented POJOs which contain my custom-serializable RasterAdapter, the result is unpredictable. The replication between the 2 instances of the cache always seems to be performed, because the "replicate then wait" phenomenon is visible, and no exception is thrown. However, when doing a find() later in the program, it returns null approximately 25% of the time.
      For a given Fqn, it makes no difference whether the find() is executed on one instance of the cache or the other, and repeating the find() does not change the result, so the behavior is at least consistent. In addition, when find() returns null, exists() returns false.
      So I guess the attach() operation simply does not work in certain (random?) cases, and the cache does not consider the attached POJOs as present. I'm very surprised; maybe it has something to do with the paths again?

      I hope these questions have not already been answered elsewhere. Thank you in advance; any advice would be greatly appreciated.

      Mathias Chouet

        • 1. Re: Strange behavior when attach()ing large POJOs
          Jason Greene Master

          Do you see anything in the log? You could try cranking it up to debug. Also, a thread dump would help us figure out where you are blocking. A kill -QUIT <pid> on Unix, or a Ctrl-Break in the console on Windows, will give you that.
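
          If sending a signal to the process is inconvenient, roughly the same information can be collected from inside the JVM. The following is a minimal sketch using only the standard Thread API; the class name ThreadDumpSketch and the output format are assumptions, and the result is less detailed than a real kill -QUIT dump (it does not list held locks, for example).

```java
import java.util.Map;

// Hypothetical helper: builds a plain-text dump of all live threads.
class ThreadDumpSketch {
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            // Thread name and state (e.g. BLOCKED, TIMED_WAITING) show where a thread is stuck.
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

          Calling ThreadDumpSketch.dump() from a second thread while the main thread sits in its 5-second wait should show which call the main thread is blocked in.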

          It could very well be that you are experiencing GC pauses, which somehow lead to unreported out-of-memory conditions.

          The following arguments will tell you what's going on with GC:

          -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
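
          As a complement to these flags, the time spent in garbage collection around a single operation can be measured from code. This is a minimal sketch based on the standard java.lang.management API; the class name GcSnapshotSketch and the method name totalGcMillis() are assumptions made for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: sums the accumulated GC time reported by all collectors.
class GcSnapshotSketch {
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }
}
```

          Taking one reading just before the attach() call and another just after it, then subtracting, gives an estimate of how much of the pause is GC time.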
          


          The find should definitely not return null if you had a successful attach operation. If there was somehow an unreported out-of-memory error, that might cause it, so let's rule that possibility out.


          • 2. Re: Strange behavior when attach()ing large POJOs
            Mathias Chouet Newbie

            Thank you very much for your answer!

            You're absolutely right, the log clearly exposed the problem; sorry I didn't think of this earlier... I was experiencing a cache.lock.TimeoutException caused by a ReplicationException: ... retval=null, received=false. Searching the mailing lists, I found users reporting that this was due to UDP (probably because of its unreliability) and REPL_SYNC.
            Switching to a TCP-based configuration and/or REPL_ASYNC solved the problem. I still encounter lock timeout issues when using REPL_SYNC along with TCP, but I think this is because I try to make "concurrent" updates on the same node. I'll read more about the locking policies...

            However, using

            -XX:+PrintGCTimeStamps -XX:+PrintGCDetails
            
            allowed me to spot some other subtleties, which helped me with memory management.

            So thank you again, your help was very useful!
            Best regards.

            Mathias Chouet