The ONLY way an alert definition can have its recoveryIds messed up is if the template is edited/saved. Could there be other admins/operators using the system that you're unaware of? Maybe they are resetting things on you?
"Plus it still won't work because the cache is not restarted"
As I mentioned to you in the #jopr room, this is one reason why you should upgrade and use the 2.2 release. It includes a way to reload all of the caches for each of the agents in the system. This operation was written with the specific intention of being able to correct the cache data in situations like this one (even though I didn't know about RHQ-2150 when I wrote/exposed the cache reloading operation).
However, maybe I can convince our buildmeister that we should put out a new community version (because this issue is now fixed in trunk). There are a few items that are in progress right now, so it may have to wait a few weeks until things settle down and get stable.
I posted my "rhq_alert_definition" content on the jopr pastebin, as you requested on JIRA.