Sure, I can elaborate on the RHQ case. It was discussed on the rhq-devel mailing list a little while back. Here is my original post on the subject:
Currently there exists the possibility of numeric data loss when merging measurement reports. If there is an error storing raw data, we log the error but do nothing else. Suppose, for example, that the storage cluster goes down halfway through while the server is storing a set of raw data. In this scenario it is likely that the latter half of that data is lost. There has been some recent discussion about the potential for data loss, and I want to open it up to the list for additional thoughts, opinions, etc. I will briefly summarize a few options for dealing with it.
option 1 - Do nothing
The case can be made that losing metric data is not as significant as losing inventory or configuration data, for example. If the data loss is limited to a single measurement report or a subset thereof, it probably is not very significant, since we are only dealing with the loss of a single data point for some number of schedules. Of course, some dropped metrics here and some dropped metrics there can quickly add up to a substantial amount of data loss, and that would be bad.
option 2 - Rely on agent/server comm layer guaranteed delivery
MeasurementServerService.mergeMeasurementReport(MeasurementReport report) has guaranteed delivery semantics. If the call fails for whatever reason, the agent will retry it. The agent also spools the report to disk so that if it gets disconnected from the server, it can retry after reconnecting. The downside of the guaranteed delivery is that the agent retries continually. If storing raw data failed because the storage cluster is overloaded, this could exacerbate the problem. I have actually experienced this in test environments where I was putting a heavy write load on the server and storage cluster. My server would be down or in maintenance mode for a while, and then when it came back up, all my agents would hammer it with spooled measurement reports.
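To make the spool-and-retry pattern concrete, here is a rough sketch of what guaranteed delivery with disk spooling can look like on the agent side. This is not the actual RHQ agent comm layer code: the GuaranteedReportSender class and the ReportSpool interface are names I made up for illustration, and the import paths for the RHQ types are written from memory.

```java
import org.rhq.core.clientapi.server.measurement.MeasurementServerService;
import org.rhq.core.domain.measurement.MeasurementReport;

// Illustrative sketch only - not the actual RHQ agent comm layer code. It shows
// the basic guaranteed-delivery pattern: attempt the call, and on failure spool
// the report to disk so it can be replayed once the server is reachable again.
public class GuaranteedReportSender {

    /** Hypothetical disk-backed spool; not an RHQ API. */
    interface ReportSpool {
        void write(MeasurementReport report);
        MeasurementReport poll();   // returns null when the spool is empty
    }

    private final MeasurementServerService serverService;
    private final ReportSpool spool;

    public GuaranteedReportSender(MeasurementServerService serverService, ReportSpool spool) {
        this.serverService = serverService;
        this.spool = spool;
    }

    public void send(MeasurementReport report) {
        try {
            serverService.mergeMeasurementReport(report);
        } catch (Exception e) {
            // Guaranteed delivery: never drop the report, persist it for a later retry.
            spool.write(report);
        }
    }

    /** Called after the agent reconnects; replays everything that was spooled. */
    public void drainSpool() {
        MeasurementReport report;
        while ((report = spool.poll()) != null) {
            try {
                serverService.mergeMeasurementReport(report);
            } catch (Exception e) {
                spool.write(report);   // put it back and stop; the server may still be struggling
                break;
            }
        }
    }
}
```

The drainSpool loop is exactly where the "all my agents hammer the server" effect comes from: every agent replays its backlog as soon as the server is reachable again.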
There is another aspect to consider in terms of efficiency. Suppose an agent sends 10,000 raw data points to the server, and an error occurs after 9,995 of them have been stored. The agent will resend the report and the server will store all 10,000 again. This is less than optimal, which brings me to option 3.
option 3 - Do not overwhelm the server and only retry failed data
The server can report back to the agent the raw data that it failed to store. The agent can spool that data to disk and resend it at some point in the future. There are a few different approaches: the agent could retry on some fixed interval, or it could use an initial delay with an increasing backoff, e.g., 2 minutes, 4 minutes, 8 minutes, etc. This option requires the most work, but I think it is the most robust.
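A minimal sketch of the agent-side backoff retry might look like the following. Everything here is hypothetical: the FailedRawDataRetrier class and the RawDataResender hook are not RHQ APIs, the sketch keeps the failed data in memory rather than spooling it to disk as the proposal describes, and the import path for MeasurementDataNumeric is written from memory.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.rhq.core.domain.measurement.MeasurementDataNumeric;

// Illustrative sketch of option 3 on the agent side: retry only the raw data the
// server failed to store, with an increasing backoff (2, 4, 8 minutes, ...).
public class FailedRawDataRetrier {

    /** Hypothetical hook that resends only the failed raw data and returns whatever still failed. */
    interface RawDataResender {
        Set<MeasurementDataNumeric> resend(Set<MeasurementDataNumeric> failedData);
    }

    private static final long INITIAL_DELAY_MINUTES = 2;
    private static final long MAX_DELAY_MINUTES = 60;

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final RawDataResender resender;

    public FailedRawDataRetrier(RawDataResender resender) {
        this.resender = resender;
    }

    /** Called with the raw data the server reported it could not store. */
    public void scheduleRetry(Set<MeasurementDataNumeric> failedData) {
        scheduleRetry(failedData, INITIAL_DELAY_MINUTES);
    }

    private void scheduleRetry(Set<MeasurementDataNumeric> failedData, long delayMinutes) {
        scheduler.schedule(() -> {
            Set<MeasurementDataNumeric> stillFailing = resender.resend(failedData);
            if (stillFailing != null && !stillFailing.isEmpty()) {
                // Double the delay each attempt, capped at a maximum so the agent keeps
                // retrying periodically instead of backing off forever.
                long nextDelay = Math.min(delayMinutes * 2, MAX_DELAY_MINUTES);
                scheduleRetry(stillFailing, nextDelay);
            }
        }, delayMinutes, TimeUnit.MINUTES);
    }
}
```

The key point is that only the failed subset (the 5 data points in the earlier example, not all 10,000) ever crosses the wire again, and the growing delay keeps the agents from piling onto an already overloaded storage cluster.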
Later on in the rhq-devel mailing list thread, someone suggested handling it completely on the server side by spooling the data to disk and retrying it at some point in the future. Maybe that is what we need to do, but as of right now I prefer to let the client handle it, for a couple of reasons. First, even if we let the server handle the failures, I think we still want to provide some error reporting to the client, if for nothing else than logging/debugging. Second, I am concerned about the additional burden this could put on the server. If we are dealing with a small number of data points, it is not a big concern. But suppose we have a large, rapidly growing amount of data on disk that has to be retried. We would want an efficient solution for processing the data on disk, maybe a queue of some sort. Shouldn't that queue be distributed, though, in the event that we are running multiple servers? I just think it is easier and potentially more scalable to let the client retry if it wants.
Even if the server handles failures, I still think it would be nice to provide error reporting back to the client so that it can act accordingly, even if that means nothing more than logging the errors for debugging.