1 2 Previous Next 20 Replies Latest reply on Sep 24, 2008 4:14 AM by adamw

    Storing/restoring diffs; GDIFF

    genman

      Storing just the fields that were changed is not that hard to implement and I'd like to have a shot at creating such a patch.

      A few thoughts:

      For every revision made, store the changes required to make the latest version the same as the previous version. (This is what's done in RCS.) E.g. for versions:

      (a, b, c) - 3
      (a, a, a) - 2
      (x, y, a) - 1
      

      Store:
      (a, b, c) - 3
      (-, a, a) - 2
      (x, y, -) - 1
      


      The store procedure is basically:
      1. Store all fields in the version being created. (insert)
      2. Null out fields in the previous revision that weren't changed. (update)

      And thus restoring old versions requires loading a bit of data as you have to recover the most current and "apply" the old changes going backwards. One way to optimize would be to do a multi-value select using this sort of query:

      (Assume an @Entity Person with versioned "name" and "address")
      SELECT COALELSCE( select name from person_versions order by revision_number where revision_number >= N) as NAME,
       COALELSCE( select address from person_versions order by revision_number >= N) as ADDRESS
      FROM DUAL
      


      Coalesce finds the first non-null value in the revision table. I'm not sure how well most DBs would handle this. If they don't support subqueries I guess a full load would be required. Using sub-queries would minimize wire transfer of data.

      Related to saving space by nulling out columns would be to calculate text/binary differences occurring in String or byte[] instances. This would be a big win for @Lob objects.

      For binary changes, Envers could use GDIFF: http://www.w3.org/TR/NOTE-gdiff-19970901

      Maybe the storage format would be called GDIFF ?

      There's a library implemented in Java called "xdelta", available here that does GDIFF encode and patching: http://sourceforge.net/projects/javaxdelta/

      For text changes, line-based change tracking seems appropriate, though I wonder if that should be the default. Since text changes aren't always line-based, a text-based GDIFF ("TEXT_GDIFF"?) format over a "diff (1)" seems to make sense. GDIFF would have to support character streams as well, but it would be easy to enhance I think.

        • 1. Re: Storing/restoring diffs; GDIFF
          adamw

          Hello,

          good to see you taking the initiative :)

          The first thing that cames to my mind, is how would you handle NULL fields?

          Anyway, I think that the way things are versioned should be configurable per field. Also, there are in fact three ways of storing fields, as you write it:
          * always store the full content of the filed
          * store the full content only if the field changed
          * always store the DIFF

          If you store the diff, you'll be unable to query against that field (as the results could get weird). If you store diffs, there's no problem without storing NULLs, as you can just store a NULL instead of a diff against an empty string.

          About subselects: I require support for them now anyway, to read an entity at a revision - if you want an entity at revision N, you have to find an entry in the versions table with the smallest revision number, greater than N, or if it doesn't exist, read it from the "live" table. That is because the latest version is only in the "live" table, so the entry in the versions table when an entity is created consists only of nulls and effectively serves as a marker. Later, when an entity is modified, I just store the old content.

          Finally, more technically, I suppose reading a DIFF/changed-versioned field should be lazy (as it requires a select), right?

          I hope you'll understand some of my chaotic writing ;) Waiting to hear your opinions on this :)

          --
          Adam

          • 2. Re: Storing/restoring diffs; GDIFF
            genman

            Null fields... Good point. I should have suggested to add extra columns (booleans or bits) to mark if a particular field was changed. You'd base the column name on a prefix/suffix such as "_c". (This would have to be configurable.) One advantage a separate column is that it could be later indexed using bitmap indexing.

            I spent some time looking at how to store DIFF values. There's a ton of algorithms out there. I spent hours yesterday looking around for possible choices, but I think it'd be best if the DIFF algorithm selection was user-selectable. Such a provider might have this kind of interface:

            import java.sql.*;
            package org.jboss.envers.diff;
            public interface DiffProvider {
             Blob diff(Blob old, Blob new);
             Clob diff(Clob old, Clob new);
            }
            


            Maybe you could extend the @Versioned annotation so you can indicate the provider like so:
            @Versioned(DiffProviderClass=EnversDiffProvider.class)


            I went shopping for Java Diff algorithms and did some trials.

            The "Longest common subsequence problem" is NP-hard. Usually, this isn't a problem for programs like "diff" that segment input into lines, but for fine-grained input (byte/char) it is a lot more work to calculate. For small input sizes, like under 200K characters/bytes, I found this program to work well and it works for all kinds of inputs (even lines or integers): http://www.incava.org/projects/java/java-diff/ But for larger input sizes, it uses vast amounts of CPU and memory.

            GDIFF (javaxdelta impl.) works faster, though it's less precise, meaning it won't create the smallest possible diff. It uses more efficient hash matching by blocks which does use a lot of memory but is much faster. The class only works for byte streams and wouldn't for unicode characters. The code itself isn't in great shape: block sizes are hard-coded and it's not well maintained, it feels like. I'm trying to contact the GDIFF developer(s?) and see if I can't clean things up a bit and adopt it to work with characters and maybe help maintain the code a bit more.

            Another algorithm I spotted called VCDIFF, which is newer, is provided by the XDelta library (at XDelta.org). The XDelta library is written in C. Maybe it'd be worth writing a native wrapper? But it's also GPL not LGPL, which may be a problem for integration. XDelta is the algorithm used for Subversion storage and it's well maintained.


            Finally, more technically, I suppose reading a DIFF/changed-versioned field should be lazy (as it requires a select), right?


            It would require N selects and a lot of computation. So given all the work that goes into restoring a revision, it would be best to have lazy initialization. Is there some way to register field/method interception in Hibernate to call, say, a data restore algorithm?


            • 3. Re: Storing/restoring diffs; GDIFF
              adamw

              Hello,

              thanks for putting in all this effort and researching diff libraries :)

              I think we should avoid using any native libraries. The simpler for the user the better :). The idea about specifying a diff provider is very good - we could either put this as an argument to versioned, or create a separate annotation which would work in conjunction with @Versioned (for example @DiffProvider), and which would be taken into account only if the field is annotated with @Versioned(modStore=ModStore.DIFF).

              The extra column is also a good solution, but is really need only if you store only those fields that changed, but their full content. If you store a DIFF, there is a difference between "no changes" -> empty column (""), "changes to null" -> NULL column, "changes to empty" -> every line with a "-".

              How did it go contacting the GDiff authors?


              It would require N selects and a lot of computation. So given all the work that goes into restoring a revision, it would be best to have lazy initialization. Is there some way to register field/method interception in Hibernate to call, say, a data restore algorithm?


              Currently, I use custom proxies, because I haven't yet found a way to reuse Hibernate's - they seem to be tightly bound to the fact that a table corresponds to an entity (and in my case, a table doesn't correspond to an entity, as the table contains for example the revision column, and the entity does not). Maybe I'll have a chance to talk to somebody from hibernate and clear this up.

              Anyway, creating a proxy for this shouldn't be hard. It's like I do with collections - see the org.jboss.envers.reader.lazy.proxy. If I detect a collection, I just set (by reflection) a ListProxy/SetProxy. So here, we could just detect a "DIFF" column and insert a "DIFF".

              --
              Adam

              • 4. Re: Storing/restoring diffs; GDIFF
                genman

                With "DIFF" you don't need that extra column, agreed. And how "no changes" is represented would be left up to the implementation of the provider.

                Re: GDIFF. I posted a message to the development list of javaxdelta: http://sourceforge.net/mailarchive/forum.php?forum_name=javaxdelta-devel

                With only a couple of messages per month, the project is fairly inactive and probably not worth maintaining through a third-party developer. So unless you or me are granted access to work on the project itself, it'd be easy enough to "fork" it.

                By the way, some databases like Oracle treat empty (text) columns the same as null columns.

                For Diff, again it's the provider that determines what "no changes mean." In GDIFF, no changes is represented by a GDIFF "magic number" followed by a copy command (see the format). For text, I'll have to invent something equivalent. Maybe I can prefix a text diff with "gdiff" and follow with "c 0,X" (copy) where X is the size of the data in bytes.

                I'll see if I can't start on some preliminary coding today. I do have a day job, but it'd be fun to get something going.

                • 5. Re: Storing/restoring diffs; GDIFF
                  adamw

                  Hello,

                  you're right, it seems that gdiff isn't really maintained ... but the implementation would be pluggable, so we can at first use that. Do you know which diff libraries are used by Eclipse, Idea or SvnKit?

                  Btw., Envers is also not my main project at jboss :).

                  --
                  Adam

                  • 6. Re: Storing/restoring diffs; GDIFF
                    genman

                    I should be getting source control access to the GDIFF project soon. I came up with a text diff which works okay and would be a good first try.

                    Creating line-based diffs is fairly easy, which is what Eclipse probably does. I don't know about Idea. SvnKit uses its own implementation of XDelta:
                    http://svn.svnkit.com/repos/svnkit/trunk/svnkit/src/org/tmatesoft/svn/core/internal/delta/SVNXDeltaAlgorithm.java

                    The code is dual-licensed so I don't think JBoss could take it. It's also binary based. Algorithmically, it appears to be similar to GDIFF.

                    The general algorithm used by both is this: Divide the first file into N-sized chunks. Hash each chunk to its position. Then take the second file and match a moving window N-size long to the hash map. Then try to extend the match to the next X bytes. In the case of GDIFF, you use a "copy" command. For unmatched strings, use a "data" command. The tricky part is optimizing a match, since you want the longest match possible.

                    So what is your main project?

                    • 7. Re: Storing/restoring diffs; GDIFF
                      adamw

                      Hello,

                      gdiff seems ok of course, I just asked out of curiosity :). And the diff implementation would be changeable anyway. Is there a common format of the results of a diff or is it implementation-specific?

                      My main job is developing software for the JBoss.ORG portal, lately mainly jboss.org/feeds module.

                      --
                      Adam

                      • 8. Re: Storing/restoring diffs; GDIFF
                        genman

                        Okay, I got Admin rights for javaxdelta. I even created a new page for it:

                        http://javaxdelta.sourceforge.net/

                        Documentation on Text diff:
                        http://javaxdelta.sourceforge.net/javadoc/com/nothome/delta/text/package-summary.html
                        Binary diff:
                        http://javaxdelta.sourceforge.net/javadoc/com/nothome/delta/package-summary.html

                        Anyway, I'll see if I can't do some work specifically on Envers to get this going.


                        • 9. Re: Storing/restoring diffs; GDIFF
                          genman

                          Answering your question... There is no common format for diff. One standard is GDIFF, http://www.w3.org/TR/NOTE-gdiff-19970901.html

                          The other format is: http://www.faqs.org/rfcs/rfc3284.html which incorporates RLE encoding commands, providing smaller patch sizes.

                          • 10. Re: Storing/restoring diffs; GDIFF
                            adamw

                            Great :)

                            Maybe you'd like a branch in SVN?

                            --
                            Adam

                            • 11. Re: Storing/restoring diffs; GDIFF
                              genman

                              If you want ... I don't know if I will necessary mess up your work.

                              By the way, is it possible to set up a String or byte[] proxy in hibernate? I'm thinking that for now BLOB/CLOB will be the only datatypes supported.

                              • 12. Re: Storing/restoring diffs; GDIFF
                                adamw

                                Hello,

                                if you'd like, here it is, with commit rights:
                                https://svn.jboss.org/repos/envers/branches/users/genman
                                (assuming your username is genman). By the way, did you sign the contributors agreement?

                                As to the proxies - well, we control how the entities are instantiated, so you can create a proxy on anything :) Also on a String or byte[].

                                --
                                Adam

                                • 13. Re: Storing/restoring diffs; GDIFF
                                  genman

                                  I've been a JBoss contributor since 2002, so yes.

                                  I looked at your List/Set proxy impl. which doesn't even use reflection. For "final" classes like String, I was thinking that something like Javassist would be required. I searched and found this: "JavassistLazyInitializer" and wondered why you hadn't used it.

                                  Anyway, I'll go with using java.lang.reflect.Proxy and just Blob/Clob for now since all the real interesting code really will occur in initialization.

                                  • 14. Re: Storing/restoring diffs; GDIFF
                                    adamw

                                    Hello,

                                    well, I didn't need to proxy anything besides lists and sets, so why I think a "direct" list/set proxy is better then one which uses reflection.

                                    Hibernate has some utility classes for lazy initialization, but as far as I have seen they are strictly tied to entity initialization, so it's hard to recycle them for our needs.

                                    One thing important about hibernate and bytecode manipulation - it's pluggable, you can use either cglib or javassit. Javassist is the default, but there are systems with which it doesn't work (Alfresco for example). But I haven't checked why this is so :).

                                    Also, can't you java.lang.reflect.Proxy only an interface?

                                    Does restricting to Blob/Clob simplify things much? What is different about varchars?

                                    --
                                    Adam

                                    1 2 Previous Next