Storing/restoring diffs; GDIFF
genman May 29, 2008 6:01 PM
Storing just the fields that were changed is not that hard to implement, and I'd like to have a shot at creating such a patch.
A few thoughts:
For every revision made, store only the changes required to turn the latest version back into the previous version. (These are reverse deltas, which is what's done in RCS.) E.g. for versions (newest first, revision number on the right):
(a, b, c) - 3
(a, a, a) - 2
(x, y, a) - 1
Store:
(a, b, c) - 3
(-, a, a) - 2
(x, y, -) - 1
The store procedure is basically:
1. Store all fields in the version being created. (insert)
2. Null out fields in the previous revision that weren't changed. (update)
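The two steps above can be sketched in memory as follows. This is only an illustration under stated assumptions: ReverseDeltaStore, store and row are hypothetical names (not Envers API), a List of String[] stands in for the versions table, and null is used as the "unchanged" marker, which means a field whose real value is null could not be told apart from an unchanged one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical in-memory stand-in for a versions table; not Envers API. */
final class ReverseDeltaStore {
    // rows.get(i) holds revision i + 1. A null field means
    // "same value as in the next newer revision".
    private final List<String[]> rows = new ArrayList<>();

    /** Step 1: insert all fields. Step 2: null out unchanged fields in the previous row. */
    void store(String... fields) {
        if (!rows.isEmpty()) {
            // The previous row is still complete here, because a row is only
            // thinned out at the moment its successor is stored.
            String[] previous = rows.get(rows.size() - 1);
            for (int i = 0; i < fields.length; i++) {
                if (Objects.equals(previous[i], fields[i])) {
                    previous[i] = null; // the "update" on the previous revision
                }
            }
        }
        rows.add(fields.clone()); // the "insert" of the new revision
    }

    /** Returns a copy of the stored (possibly sparse) row for a 1-based revision. */
    String[] row(int revision) {
        return rows.get(revision - 1).clone();
    }

    public static void main(String[] args) {
        ReverseDeltaStore s = new ReverseDeltaStore();
        s.store("x", "y", "a");
        s.store("a", "a", "a");
        s.store("a", "b", "c");
        System.out.println(Arrays.toString(s.row(3))); // [a, b, c]
        System.out.println(Arrays.toString(s.row(2))); // [null, a, a]
        System.out.println(Arrays.toString(s.row(1))); // [x, y, null]
    }
}
```

In a real table the comparison would happen against the previous revision's row before the update is issued, so the cost per save stays at one insert plus one update.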
Restoring an old version therefore requires loading a bit of extra data: you have to start from the most current revision and "apply" the old changes going backwards. One way to optimize this would be to do a multi-value select using this sort of query:
(Assume an @Entity Person with versioned "name" and "address")
SELECT
  COALESCE( select name from person_versions where revision_number >= N order by revision_number ) as NAME,
  COALESCE( select address from person_versions where revision_number >= N order by revision_number ) as ADDRESS
FROM DUAL
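COALESCE is meant here to find the first non-null value in the revision table among revisions N and later. Treat the query as shorthand rather than literal SQL: standard COALESCE takes a list of scalar arguments, not a multi-row subquery, so a real query would filter each subquery with IS NOT NULL, take only its first row, and restrict by the entity's id. I'm not sure how well most DBs would handle this. If they don't support subqueries, I guess a full load would be required. Using subqueries would minimize wire transfer of data.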
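For comparison, here is the same "first non-null value at revision N or later" rule done on the client side, which is roughly what the full-load fallback would look like. It is a sketch only: ReverseDeltaRestore and restore are hypothetical names, the rows are assumed to be already loaded in revision order, and null is assumed to mean "unchanged in the next newer revision".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not Envers API: rebuilds a revision from sparse rows. */
final class ReverseDeltaRestore {
    /**
     * rows.get(i) is the stored row for revision i + 1; the last row is the
     * current version and is always complete.
     */
    static String[] restore(List<String[]> rows, int revision) {
        int width = rows.get(rows.size() - 1).length;
        String[] result = new String[width];
        for (int field = 0; field < width; field++) {
            // Walk from the requested revision towards the newest one and
            // keep the first non-null value, like COALESCE over the column.
            for (int r = revision; r <= rows.size(); r++) {
                String value = rows.get(r - 1)[field];
                if (value != null) {
                    result[field] = value;
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"x", "y", null},   // revision 1
                new String[] {null, "a", "a"},   // revision 2
                new String[] {"a", "b", "c"});   // revision 3 (current)
        System.out.println(Arrays.toString(restore(rows, 1))); // [x, y, a]
        System.out.println(Arrays.toString(restore(rows, 2))); // [a, a, a]
        System.out.println(Arrays.toString(restore(rows, 3))); // [a, b, c]
    }
}
```

The subquery version does the same scan inside the database, so only one value per column crosses the wire instead of every row from N onwards.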
Related to saving space by nulling out columns: Envers could also calculate text/binary differences for String or byte[] values and store those instead of full copies. This would be a big win for @Lob objects.
For binary changes, Envers could use GDIFF: http://www.w3.org/TR/NOTE-gdiff-19970901
Maybe the storage format would be called GDIFF?
There's a Java implementation of "xdelta" that does GDIFF encoding and patching, available here: http://sourceforge.net/projects/javaxdelta/
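To give a feel for how small the patching side is, here is a minimal sketch of applying a GDIFF patch to a byte[]. It does not use the javaxdelta API (GdiffPatcher and apply are hypothetical names). The opcode layout is how I read the W3C note: a 0xd1ffd1ff magic number and version byte 4, then big-endian commands where 0 ends the patch, 1 to 248 append literal data and 249 to 255 copy a range from the source. Please check it against the note before relying on it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a GDIFF patch applier; not the javaxdelta API. */
final class GdiffPatcher {
    static byte[] apply(byte[] source, byte[] patch) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(patch))) {
            if (in.readInt() != 0xd1ffd1ff || in.readUnsignedByte() != 4) {
                throw new IOException("not a GDIFF version 4 patch");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (true) {
                int op = in.readUnsignedByte();
                if (op == 0) {                      // END
                    return out.toByteArray();
                }
                if (op <= 248) {                    // DATA: literal bytes follow
                    int n = op <= 246 ? op
                            : op == 247 ? in.readUnsignedShort()
                            : in.readInt();
                    byte[] data = new byte[n];
                    in.readFully(data);
                    out.write(data, 0, n);
                } else {                            // COPY: range taken from the source
                    long offset;
                    int length;
                    switch (op) {
                        case 249: offset = in.readUnsignedShort(); length = in.readUnsignedByte(); break;
                        case 250: offset = in.readUnsignedShort(); length = in.readUnsignedShort(); break;
                        case 251: offset = in.readUnsignedShort(); length = in.readInt(); break;
                        case 252: offset = in.readInt(); length = in.readUnsignedByte(); break;
                        case 253: offset = in.readInt(); length = in.readUnsignedShort(); break;
                        case 254: offset = in.readInt(); length = in.readInt(); break;
                        default:  offset = in.readLong(); length = in.readInt(); break; // 255
                    }
                    out.write(source, Math.toIntExact(offset), length);
                }
            }
        } catch (IOException e) {
            // Covers truncated patches as well as a bad header.
            throw new UncheckedIOException(e);
        }
    }
}
```

With reverse deltas, source would be the newer @Lob value and the patch would be what gets stored in the older revision's row. Producing a good patch (finding the ranges worth copying) is the harder half and is what a library would be for.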
For text changes, line-based change tracking seems appropriate, though I wonder if that should be the default. Since text changes aren't always line-based, a text-based GDIFF ("TEXT_GDIFF"?) format seems to make more sense than diff(1) output. GDIFF would have to be extended to support character streams as well, but I think that would be easy to do.
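As a rough illustration of why character-level tracking can beat line-based tracking, here is a deliberately simple delta that keeps only the common prefix and suffix lengths plus the characters in between. It is not GDIFF and not a general diff algorithm, just a sketch of the copy/data idea on chars; TextDelta, diff and applyTo are hypothetical names.

```java
/** Hypothetical character-level delta: copy prefix, literal middle, copy suffix. */
final class TextDelta {
    final int prefix;      // chars copied from the start of the base text
    final int suffix;      // chars copied from the end of the base text
    final String inserted; // literal chars placed between the two copies

    private TextDelta(int prefix, int suffix, String inserted) {
        this.prefix = prefix;
        this.suffix = suffix;
        this.inserted = inserted;
    }

    /** Computes what is needed to rebuild target when base is available. */
    static TextDelta diff(String base, String target) {
        int max = Math.min(base.length(), target.length());
        int p = 0;
        while (p < max && base.charAt(p) == target.charAt(p)) {
            p++;
        }
        int s = 0;
        // The suffix must not overlap the prefix in either string.
        while (s < max - p
                && base.charAt(base.length() - 1 - s) == target.charAt(target.length() - 1 - s)) {
            s++;
        }
        return new TextDelta(p, s, target.substring(p, target.length() - s));
    }

    /** Rebuilds the target text from the base text. */
    String applyTo(String base) {
        return base.substring(0, prefix) + inserted + base.substring(base.length() - suffix);
    }

    public static void main(String[] args) {
        String newer = "The quick brown fox";
        String older = "The quick red fox";
        TextDelta d = diff(newer, older);
        System.out.println(d.prefix + " " + d.suffix + " " + d.inserted); // 10 4 red
        System.out.println(d.applyTo(newer)); // The quick red fox
    }
}
```

A one-word edit inside a long line costs a few characters here, whereas a line-based diff would store the whole line again. A real text format would need to handle several separate edits per value, which is where GDIFF-style copy and data commands over a character stream come in.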