1 2 Previous Next 16 Replies Latest reply on Apr 11, 2010 2:33 PM by adamw

    Dealing with "New" History - Strategy?

      Hello All,

       

      I am using Envers with a client as part of a legacy application re-development project. We are rebuilding the guts of the years old production systems while maintaining it in production. As such, we have a lot of legacy data that is getting cleaned/polished into a Hibernate (and Envers) model.

       

      This incremental data migration process has been a little challenging with Envers, but until now I'd been able to deal with the old data in a workable strategy.

       

      Now we have a bunch of history that broke my initial "extra space", over a million records that need to be put into Envers.

       

      The strategy I'd been using:

      * increment the MasterVersionHistory.id generator to 1,000,000 (leaving some room for history.

      * when adding @Audited to an entity also populate that entities audit table with current data and version_id=500,000

      * new audit data will get a > 1,000,000 value and a new Date stamp.

       

      Feature Request: If the audit table is empty, can Envers treat the current Hibernate entity as the history entity?

       

      This feature would reduce the amount of data migration required to start auditing an entity.

       

      My Big Problem: Lots of history to be added.

       

      Envers assumes that the ID (and of course the DATE) is ordered with newer audit versions having greater numbers. For example AuditReader.getRevisionNumberForDate() yields:

      "select max(mastervers0_.id) as col_0_0_ from MasterVersionHistory mastervers0_ where mastervers0_.date<=$1"

      Notice the max() will select a version_id from further into the past if not ordered.

       

      A possible strategy is to do a self join or subselect to compare the min/max dates against the effective (and then max(id) only to handle duplicate dates). This would be a good deal for complicated in envers, changing this function and AuditReader.find() at least.

       

       

      Any good ideas? I don't really want to re-generate all of my MasterVersionEntity id values in SQL everytime we migrate more data into the system, but that may be all I'm left with.

       

      Help,

      John

        • 1. Re: Dealing with "New" History - Strategy?
          adamw

          In the beginning of Envers, only the "really" historic data was stored in the _AUD tables (which were _ver tables then), with the current data stored in the "real" tables. This was a bit more space-effective, but it was way harder to do any queries. In fact, any history-query involved either a SQL UNION or two database queries and hand-manipulation. So to keep things simple and more effective, at the cost of some (not really very noticable) space penalty, the users voted to store the data as it is stored today.

           

          But it seems that what you want to do is different. You write that you'd like Envers to "treat the current Hibernate entity as the history entity" if ""the audit table is empty". But then, if you modify an entity, the new values get written to the _AUD tables, and the old ones are gone forever. That's not probably what you want?

           

          Also, I don't quite understand the second part of your post. You have history data in some format that you'd like to import into Envers, right? And it's ordered using dates only? Can't you re-set the ids basing on the date ordering? Or am I missing something?

           

          Adam

          • 2. Re: Dealing with "New" History - Strategy?

            Adam,

             

            I see the challenge with not having data in the audit tables, the queries need to be unioned or _always_ execute two. I was assuming the default case would be one query, and only when that failed would a second query be executed.

             

            (Also, I'd hoped that the behavior of editing an existing entity without history would be to create 2 audit rows: one to represent the previous version, and one to represent the modified version. I wouldn't have wanted the old one to be gone forever.)

             

             

            Regarding importing history data, I'd really like to avoid a full "re-set of the ids based on date ordering". We are migrating data in waves and rolling Hibernate and Envers out further and further into the system.

             

            Here's the simplest case I can describe. Assuming that a number of Entities have already been @Audited, and now we want to @Audit Customer:

            REVINFO table:

            REV    REVSTMP

            1      2008-01-01

            2      2008-01-02

            3      2010-03-01


            CUSTOMER table:

            ID     INIT_DATE      NAME

            1      2009-02-01     Bob

            2      2009-03-01     Fred

             

            I'd like to just add the data to a CUSTOMER_AUD table and create REVINFO data like this:

            REVINFO table:

            REV    REVSTMP

            1      2008-01-01

            2      2008-01-02

            4      2009-02-01 # from CUSTOMER 1

            5      2009-03-01 # from CUSTOMER 2

            3      2010-03-01


            But I'm sure that would break a bunch of queries, because the *new* REVINFO ID values are out of order.

             

            I could dump the whole thing out, and rebuild the REVINFO ID vaues, but we're going to be doing this every few weeks for a bunch of entities. This is what you mean by "re-set of the ids based on date ordering", right?

             

            Thanks,

            John


            • 3. Re: Dealing with "New" History - Strategy?
              adamw

              Ah, now I understand your problem .

               

              Unfortunately Envers doesn't have built-in support for such cases (as it's quite deployment-specific), but if you'd like e.g. to have two audit records created (old and new) in case there's no audit record yet, Envers could be quite easily modified to do so (after all, it's open source ). But bear in mind that then for every update an additional SELECT would be required to check if there's an audit record already.

               

              One solution that I can think of so far is to collect values of the INIT_DATE column (or any other columns that will be used as revision timestamps) before the migration, from all tables. Then you can migrate entities one by one, having the revision numbers already populated. There is of course some problems still with the new values of the timestamps that appeared between two migrations, but it should be easier than re-indexing everything.

               

              Does this help in any way?

               

              Adam

              • 4. Re: Dealing with "New" History - Strategy?

                Adam,

                 

                (I'm not interested in the auto-create old audit records idea, too much work for everyone involved.)

                 

                I am still interested in changing the date range handling to allow "new" history records to be inserted into the audit tables for old data.

                 

                This query from RevisionInfoQueryCreator does a simple max(rev.id) that would break if records with an old date where added to the table:

                        revisionNumberForDateQuery = new StringBuilder()
                                .append("select max(rev.").append(revisionInfoIdName)
                                .append(") from ").append(revisionInfoEntityName)
                                .append(" rev where ").append(revisionInfoTimestampName).append(" <= :_revision_date")
                                .toString();

                 

                The resulting SQL is:

                     select max(mastervers0_.id) as col_0_0_ from MasterVersionHistory mastervers0_ where mastervers0_.date<=$1

                 

                I would like this SQL to be generated instead:

                      select max(rev.id) from masterversionhistory as rev left outer join masterversionhistory as rev2 on rev.date <= rev2.date where rev2.date <= 1266427653976 group by rev.date, rev.id having rev.date = max(rev2.date);


                This second query would pick out the max(date), and also break any ties with the max(id). It doesn't depend on id incrementing forward.

                 

                What do you think? I'm trying to avoid having to rebuild history all the time. As Envers becomes more popuar the issue of migrating legacy data will come up more and more.

                 

                Cheers,

                John

                • 5. Re: Dealing with "New" History - Strategy?
                  adamw

                  So you'd like to select the maximum revision from the revisions with the date equal to the maxium? In other words, you'd like to have a (date, revision) composite primary key, with lexigraphic ordering?

                   

                  One thing that could be included in Envers, is a pluggable way to determine revision ordering - so that either the default could be used (just max(rev)), or your version. If you would develop such a patch, I would of course include it. One difficulty here is that I think there are a couple of places where the revision is selected using the max, so it may be a bit of work. But it will be better to centralize it anyway .

                   

                  Adam

                  • 6. Re: Dealing with "New" History - Strategy?

                    Hello Adam,

                     

                    I've attached a patch that starts this process, but I'm getting bogged down

                     

                    So far I've gotten the RevisionForDate to work the way I intend.

                     

                    I've changed it from:

                    select max(rev.id)) from org.hibernate.envers.DefaultRevisionEntity rev where rev.timestamp <= :_revision_date

                     

                    to this:

                    select max(rev.id) from org.hibernate.envers.DefaultRevisionEntity rev where rev.timestamp = (select max(rev2.timestamp) from org.hibernate.envers.DefaultRevisionEntity rev2 where rev2.timestamp <= :_revision_date)

                     

                    That change is working for the existing test cases and a few bits of mixed up IDs that I threw in there.

                     

                     

                    This, from EntitiesAtRevisionQuery, is more challenging:

                    select e from org.hibernate.envers.test.entities.StrIntTestEntity_AUD e
                    where e.REVTYPE <> :_p1
                    and e.originalId.REV.id =
                         (select max(e2.originalId.REV.id) from org.hibernate.envers.test.entities.StrIntTestEntity_AUD e2
                              where e2.originalId.REV.id <= :_p0 and e.originalId.id = e2.originalId.id
                         )
                    and e.originalId.id = :_p2

                    Are you able to quickly help with the HQL that would accomplish what I'm trying to do? I'm going to keep plugging away at this, but I've definitely slowed down at this point.

                     

                    The attached patch doesn't include a pluggable API. Is this really an extension point? The changes I'm suggesting will continue to work with the common case of an incrementing version id, but will also work with out of order version ids. Performance of queries is a concern, I don't yet know how much impact this will have.

                     

                    If you feel strongly this should be an extension point, with the current behavior as the default, I'll be happy to package it that way. I'm going to keep working inline by patching for the time being, after things are working entirely I'll be able to wrap it with an API is you want.

                     

                    Cheers,

                    John

                    • 7. Re: Dealing with "New" History - Strategy?
                      adamw

                      Heh right, it gets ugly quite quickly. Hmm. Implementing a lexigraphic order on (timestamp, id) in SQL seems at least non-trivial .

                      Especially the select max(...), here you would end up with multiple nested queries. Certainly not a way you would want to go and maintain later.

                       

                      This will be much easier once the end-revision is implemented. One more reason to do it .

                       

                      The best way would be to combine somehow the timestamp and id into a single number. But that's a non-trivial task .

                       

                      In fact, if two revisions have the same timestamp, does it matter which is considered by envers to be "first"? (as your main problem, as I understand, is that you cannot use the timestamp as a id because there could be several revisions with the same timestamp?) Maybe then the revision number can be a float or sth like that ... with subsequent revisions squeezed into the remainder part.

                       

                      Adam

                      1 of 1 people found this helpful
                      • 8. Re: Dealing with "New" History - Strategy?

                        Adam,

                         

                        Yes, non-trivial

                         

                        What does the end-revision plan look like? When might it be implemented?

                         

                        Two revisions with the same timestamp: In my opinion it doesn't matter, but must be stable. In the SQL I've been writing I've ordered first by timestamp, then by id. The only circumstance that I think would be a bug is if the order of revisions with the same timestamp changed in some way leading to non-deterministic results.

                         

                        Thanks,

                        John

                        • 9. Re: Dealing with "New" History - Strategy?
                          adamw

                          About the end-revision: I hope to know in a week or two; but I can't promise anything right now (open source ... )

                           

                          I'll be thinking about the composite-revision-id problem

                           

                          Adam

                          • 10. Re: Dealing with "New" History - Strategy?

                            Hello Adam,

                             

                            Any further thoughts on end-revision or composite-revision-id? I understand you can't make promises.

                             

                            We're going to be relying on the revision ID more and more soon by recording it directly in non-versioned entities, as well as expanding auditing into more domain entities, so I'm trying to plan for the impact of changing those IDs.

                             

                            Thank you,

                            John

                            • 11. Re: Dealing with "New" History - Strategy?
                              adamw

                              Well then I guess it would be good to have a pluggable way of determining the ordering of revisions. The default being sorting by ids. Maybe the plug-in could just take a list of columns. In your case, sorting by timestamp and then by id wouldn't work?

                               

                              Adam

                              • 12. Re: Dealing with "New" History - Strategy?
                                hernanbolido

                                Hello!

                                 

                                We have this problem also. We are thinking about taking this approach:

                                 

                                The revision number is going to has its own id generator in hibernate. This generator will take the timestamp for the revision and concat a 3 digit number at the end. This 3 digit number will be a sequence (oracle sequence) which goes from 0 to 999 and then is reseted to 0.

                                 

                                This solution is hibernate-clustered applications because if two nodes in the cluster need a revision number at the same time, the cicle sequence is going to determine a different revision.

                                Also having a timestamp as part of the revision number let us migrate data from our legacy system anytime because the revision numbers leave holes between them.

                                 

                                It´s basically the idea we´re trying to implement.

                                 

                                 

                                What do you think?

                                 

                                 

                                Hernán.

                                1 of 1 people found this helpful
                                • 13. Re: Dealing with "New" History - Strategy?

                                  Hernan,

                                   

                                  That would be great, and simpler than what I was trying to do.

                                   

                                  Have you already defined a @RevisionEntity with these properties? Would you mind sharing it?

                                   

                                  Thanks,

                                  John

                                  • 14. Re: Dealing with "New" History - Strategy?
                                    hernanbolido

                                    John,

                                     

                                    My revision entity doesn´t has anything strange. The only consideration is using the timestamp as long.

                                     

                                    The key for this idea is setting the revision number correctly.

                                    I've found two ways to do this:

                                     

                                    1) Implementing my own RevisionListener and in the newRevision() set the id manually.

                                    2) Implementing a custom revision number generator (see org.hibernate.id.IdentifierGenerator)

                                     

                                    I chose the second one doing this:

                                    • Create a subclass of SequenceGenerator (in order to reuse the sequence part, but you can simply implement IdentifierGenerator)
                                    • Inside generate() method, I obtain the revision's timestamp and the sequence number
                                    • Then I "contat" timestamp + seq doing -> timestamp + (1000 * seq) -- (this is because seq goes from 0 to 999)

                                     

                                    In revision.hbm (i don´t know how to do with annotations)

                                     

                                         <id name="number" column="ID_REVISION_INFO" type="long">
                                                <generator class="package.CustomRevisionNumberGenerator">
                                                    <param name="sequence">IH2_S_BREVISION_INFO</param>
                                                </generator>
                                            </id>

                                     

                                    And here is the generate() mehod from package.CustomRevisionNumberGenerator class

                                     

                                    /**
                                       *  Obtains both parts of revision's number and concat them

                                       *
                                       * @param session
                                       * @param object
                                       * @return
                                       * @throws HibernateException
                                       * @see org.hibernate.id.SequenceGenerator#generate(org.hibernate.engine.SessionImplementor,
                                       *      java.lang.Object)
                                       */
                                      public Serializable generate(SessionImplementor session, Object object) throws HibernateException {

                                     

                                        if (logger.isDebugEnabled()) {
                                          logger.debug("Obtaining revision number");
                                        }

                                     

                                        // get the timestamp, objetct is the revision
                                        long dateTime = (Long) this.obtainInitialValue(object);

                                     

                                        // get the sequence
                                        long seqNumber= (Long) super.generate(session, object);

                                     

                                        // concat both values
                                        long result = (dateTime * 1000) + seqNumber;

                                     

                                        if (logger.isDebugEnabled()) {
                                          logger.debug("Revision number obtained: " + result);
                                        }
                                        return result;
                                      }

                                     

                                     

                                    Remember that the sequence starts in 0, goes to 999 and then reset to 0 again.

                                     

                                    I hope it will be  useful. Hernán.

                                    1 2 Previous Next