19 Replies - Latest reply on Apr 14, 2008 3:18 AM by tom.baeyens

    log infrastructure proposal

    tom.baeyens

      i just committed a pvm log infrastructure proposal.

      The overview goes like this:

      During process execution, instances of ProcessLog subclasses are created and sent to the LogSession. The LogSession is obtained from the current environment.

      Each ProcessLog is encouraged (by means of abstract methods getType and getProperties) to provide the details of the process log in a generic structure. The ProcessLogXmlSerializer will use those methods to serialize any ProcessLog object to XML.

      For easy configuration, a number of log session implementations will be provided:
      1) one that saves the process log in the db
      2) one that delegates to a chain of log sessions
      3) a filtering log session that only delegates to its target log session when the filter passes. the default filter implementation is based on log type.

      Other envisioned log session implementations will serialize the process log to xml and push it onto the ESB or append it to a file.

      The environment configuration xml will make it easy to configure the process logging, so that it should be easy to add listeners for certain events and have those injected into the esb.
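      to make this concrete, here's a minimal sketch of how these pieces could fit together. only LogSession, ProcessLog and addLog are fixed in this proposal; the class names and constructor shapes of the chain and filter implementations below are illustrative assumptions.

      import java.util.List;
      import java.util.Set;

      // the session that receives process logs during execution
      interface LogSession {
        void addLog(ProcessLog processLog);
      }

      // implementation 2 above: delegates to a chain of log sessions
      class ChainedLogSession implements LogSession {
        private final List<LogSession> chain;
        ChainedLogSession(List<LogSession> chain) {
          this.chain = chain;
        }
        public void addLog(ProcessLog processLog) {
          for (LogSession session : chain) {
            session.addLog(processLog);
          }
        }
      }

      // implementation 3 above: only delegates when the filter passes;
      // the default filter is based on the log type
      class FilteredLogSession implements LogSession {
        private final LogSession target;
        private final Set<String> acceptedTypes;
        FilteredLogSession(LogSession target, Set<String> acceptedTypes) {
          this.target = target;
          this.acceptedTypes = acceptedTypes;
        }
        public void addLog(ProcessLog processLog) {
          if (acceptedTypes.contains(processLog.getType())) {
            target.addLog(processLog);
          }
        }
      }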

        • 1. Re: log infrastructure proposal
          aguizar

          Assuming the ProcessLog subclasses follow the Java Beans pattern, we could use the java.beans.XMLEncoder instead of a custom XML serializer.
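          For illustration, a bean-style log could be written out along these lines (this assumes the ProcessLog subclasses have public no-arg constructors and getter/setter pairs, which XMLEncoder requires; the file name is arbitrary):

          import java.beans.XMLEncoder;
          import java.io.BufferedOutputStream;
          import java.io.FileOutputStream;

          // serialize any JavaBeans-compliant log object to XML
          XMLEncoder encoder = new XMLEncoder(
              new BufferedOutputStream(new FileOutputStream("process-log.xml")));
          encoder.writeObject(processLog);
          encoder.close();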

          Another possible log session implementation would save entries to a different database. This would address complaints about the size of the table. It is true that it could fall out of sync with the main database unless JTA were used. However, this is a tradeoff that would affect a file-based log session as well.

          • 2. Re: log infrastructure proposal
            tom.baeyens

             

            "alex.guizar@jboss.com" wrote:
            Assuming the ProcessLog subclasses follow the Java Beans pattern, we could use the java.beans.XMLEncoder instead of a custom XML serializer.


            good idea. i'll check if we can use that as the default impl of the serializer.

            "alex.guizar@jboss.com" wrote:
            Another possible log session implementation would save entries to a different database. This would address complaints about the size of the table. It is true that it could fall out of sync with the main database unless JTA were used. However, this is a tradeoff that would affect a file-based log session as well.


            in the new pvm approach, the log table will shrink. the log table will be used for passing information from the runtime data to the history data tables. the history data tables will form a complete data model on their own. the information in there will be created by parsing the logs. once that is done, the logs can be deleted. what remains is the process definition and the history data. the runtime execution state and the logs will be temporary.

            that should solve the problem you mention, i think.

            • 3. Re: log infrastructure proposal

              Pierre, could you please take a look at Tom's proposal and compare it to the current logs/history implementation we did in the XPDL extension?

              It would be great to work together and share components on that topic as well.

              Regarding the XML serialization, I don't remember why, but I think Pierre had some problems or ran into some restrictions when using the XMLEncoder...

              Pierre, could you please comment on that as well...

              thanks,
              Miguel Valdes

              • 4. Re: log infrastructure proposal
                csouillard

                I am currently looking at logs in the PVM and would like to know the status of the implementation of this feature...
                Can you tell me if you have something developed on your side but not committed? Do you need help on that point?
                I want to evaluate this feature to check whether we can use it in Orchestra and Bonita...

                Thanks for your help

                Charles

                • 5. Re: log infrastructure proposal
                  csouillard

                  Sorry one more question...

                  In ExecutionImpl, the method addProcessLog contains the following TODO:
                  "send log to the logging service if it is available in the environment"
                  What do you mean by logging service?
                  Is it something like LogSession.addLog(processLog);

                  Or is it something different?
                  Currently, there is no LogSession implementation. I have seen in the previous posts that DB and XML could be two proposals. Is that right?

                  Charles

                  • 6. Re: log infrastructure proposal
                    csouillard

                    Conf call between Tom and Charles
                    Date: 2008-04-01
                    Duration: 45'
                    Subject: PVM logging/History mechanism

                    Tom thinks that all logs (or at least the major part of them) can be done in the PVM. Extensions can also log, but the goal is to provide in the PVM the essential logs for all applications.
                    Currently, the biggest problem is to define what should be logged: which kind of information and when.
                    Tom has a contact at a CI company which may help us with this problem, as it has some experience on that issue.
                    A first set of logs could be (small set, to be refined):
                    - start/end a process
                    - enter/leave an activity
                    - variable update
                    - event execution: NO. Tom thinks this should not be logged by the PVM.

                    The global idea of the logging mechanism in the PVM is the following:

                    All logs are stored in a database. Storing logs in a file is not a good idea because we would have to manage concurrent access. In addition, transactions are not available in file mode: the java.io API can return and report a file as written while it is not yet physically on disk. This is file system behaviour.
                    By default this mechanism is not pluggable, as Tom can't see any other interesting implementations, but if we need another one we can make it pluggable.
                    We can imagine that when we want to remove a log from the history DB, we can flush it to an XML file. No effort will be done in the PVM for this point.
                    Logs are split across two different DBs:
                    - a runtime DB used to store logs not yet moved to history
                    -> this DB is hard to query (in fact not queryable by users); logs are flat
                    - a history DB (Business Intelligence DB)
                    -> easy to query
                    The history DB will contain a lot of rows. To improve performance, it is important to have a fixed DB schema. This schema can be a star-style schema: the main tables (Execution, Task...) are in the middle with the other tables around them. This is a common BI DB architecture. The runtime and history database schemas are not the same, as they don't have the same purpose: temporarily storing runtime data vs. storing a large amount of easily queryable data for a long time.

                    In the runtime database, we can imagine having a single table for all logs (ProcessLog). For that, we can create the table with a fixed, large number of columns to ensure all logs can be stored in it (30? 40? 50?). Nowadays, databases have no problem supporting tables with a large number of columns.
                    A log mode must be defined in the environment. A log mode defines when logs move from the runtime DB to the history DB. This mode must be configurable at the engine level and the process level (Tom says instance level is too much).

                    Currently, Tom has defined 4 modes (see LogMode enum type):
                    - at the end of each transaction synchronously
                    - at the end of each transaction asynchronously
                    - at the end of the last transaction (end of the instance) synchronously
                    - at the end of the last transaction (end of the instance) asynchronously

                    Charles proposed to add a new one: on demand (every night, every week...).
                    As the "runtime log DB" is not queryable, it is very important to choose the right LogMode.
                    When a log is moved from the runtime DB to the BI DB, the following mechanism is performed.
                    Example:
                    when an execution completes, you only want to insert an execution-complete log into the flat DB.
                    then, when processing that process log, you should look up the execution in the history DB and update the end and duration columns.
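                    As an illustration of that update step, an archiver could process an execution-complete log roughly like this (the HIST_EXECUTION table and its columns are invented for the example, not part of the proposal):

                    import java.sql.Connection;
                    import java.sql.PreparedStatement;
                    import java.sql.ResultSet;
                    import java.sql.SQLException;
                    import java.sql.Timestamp;

                    class ExecutionCompleteArchiver {
                      void archive(Connection historyDb, long executionId, Timestamp end)
                          throws SQLException {
                        // look up the start time recorded when the execution began
                        PreparedStatement select = historyDb.prepareStatement(
                            "select START_ from HIST_EXECUTION where ID_ = ?");
                        select.setLong(1, executionId);
                        ResultSet rs = select.executeQuery();
                        rs.next();
                        Timestamp start = rs.getTimestamp(1);
                        // update the end and duration columns of the history row
                        PreparedStatement update = historyDb.prepareStatement(
                            "update HIST_EXECUTION set END_ = ?, DURATION_ = ? where ID_ = ?");
                        update.setTimestamp(1, end);
                        update.setLong(2, end.getTime() - start.getTime());
                        update.setLong(3, executionId);
                        update.executeUpdate();
                      }
                    }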

                    Each log is a ProcessLog (type). The way to create a new specific log is to subclass ProcessLog.
                    ProcessLog properties (ProcessLogProperty) are used to translate the process logs into XML (ESB purpose, flush to a file...).
                    Each log is self-contained (contains values and not references).

                    A StringVariableUpdateLog example:

                    class StringVariableUpdateLog extends ProcessLog {
                      String variableName;
                      String oldValue;
                      String newValue;
                    }

                    might map to the db like this
                    variableName --> COL_ONE
                    oldValue --> COL_TWO
                    newValue --> COL_THREE

                    then the properties might be exposed like this

                    public String getType() {
                      return "string-var-update";
                    }

                    public List getProperties() {
                      List props = new ArrayList();
                      props.add(new ProcessLogProperty("variableName", variableName));
                      props.add(new ProcessLogProperty("oldValue", oldValue));
                      props.add(new ProcessLogProperty("newValue", newValue));
                      return props;
                    }

                    then a separate piece of the infrastructure might be able to serialize that log to

                    <process-log type="string-var-update"
                                 time="12/34/2009 11:23:45,457"
                                 execution="9233784"
                                 process-instance="208937402">
                      <!-- one child element per ProcessLogProperty:
                           variableName, oldValue and newValue -->
                    </process-log>


                    Archiving (move from runtime DB to BI DB)
                    This point is not yet clear. Two approaches are possible: an archive method on ProcessLog, or a global "Archiver" (RuntimeToHistoryDbArchiver, for instance).
                    Querying
                    The PVM will offer a set of predefined queries that cover the most common queries. The idea is to offer an extensible mechanism that allows users to write their own queries (Hibernate-dependent).
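                    As a rough sketch of that extension point, a user query could look like this with Hibernate (the HistoryExecution entity and its properties are invented for the example):

                    // find slow executions of one process in the history DB
                    List executions = session.createQuery(
                        "from HistoryExecution he " +
                        "where he.processName = :name and he.duration > :minDuration")
                        .setString("name", "order-process")
                        .setLong("minDuration", 60000L)
                        .list();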

                    • 7. Re: log infrastructure proposal

                      As a result of discussions with Tom at JBoss World 2008 and by email, we would like to put forward the following for the PVM logging.

                      Background.

                      SeeWhy (www.seewhy.com) have produced a BAM and Event Driven BI solution for jBPM which we're looking to release as SeeWhy for JBoss jBPM. The basis of the implementation is a set of ActionHandlers that generate XML-based messages sent to SeeWhy, where they are processed using a set of metrics we've developed and will make available. These address a standard set of BAM measurements and also support the ability to create further Business Intelligence (BI) metrics. Tom has identified that, going forward, it would be very beneficial for the PVM to be able to produce the information SeeWhy has identified, so that it can be used to feed a BAM solution such as the one SeeWhy have developed on top of our Event Driven BI platform.

                      Requirements as SeeWhy see it:


                      SeeWhy's analysis of what data needs to be captured, which led to the SeeWhy for JBoss jBPM demo at JBoss World 2008, applies equally to the PVM:

                      1. Provide information for generic business activity processing.

                      This requires the capture of relevant data for process and node start and end. The XML schemas for the SeeWhy Process & Node events detail the information required.

                      2. Provide information for business intelligence.

                      This requires the data in process-specific variables to be mapped out to business events. This can be information such as Product ID, Sale Value, etc.

                      It should be possible for the data to be delivered in a variety of forms and by a variety of mechanisms, e.g. XML on MQ, a row in a database, log file output, etc. These transformation and delivery requirements may be handled outside of the PVM logging mechanism, e.g. via the ESB.

                      Although the raw information in the PVM is at an abstract level, to enable easy reporting and monitoring the output should be in a form that is, as far as possible, understandable by a business user. For example, it may be necessary to have an optional mapping that allows Java handler class names to be substituted by meaningful Process, Node, Action, etc. names. Likewise it should be possible to map variable names to meaningful terms. This responsibility may be outside the PVM logging mechanism.

                      As some of the information provided may not be needed by an audit mechanism, an ability to provide some filtering would be highly beneficial.

                      Data needed to realise this:

                      To realise the BAM functionality we would suggest the following information is supplied from the PVM:

                      - originator of the process / node
                      - start date (Date)
                      - end date (Date)
                      - name of the process / node
                      - version
                      - unique id
                      - parentId
                      - parent process name
                      - parentVersion
                      - actor / executor
                      - swim lane



                      Note that when reporting the start of something the end date won't be available, but on the completion of an executing activity (node) or process both start and end should be supplied. Providing both times in the event means that typical metrics, such as execution times for nodes (activities) and processes, can be computed simply without the cost of having to correlate events. Avoiding the overhead of event correlation is very desirable, as it can prove very expensive (potentially more so than the actual PVM execution cost); since the PVM has the chance to capture both pieces of information within its context, that cost can be avoided.
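                      A sketch of a self-contained node-end event carrying the fields listed above (the class and field names are ours for illustration, not part of the PVM):

                      import java.util.Date;

                      // hypothetical BAM event bean; fields mirror the list above
                      class NodeEndEvent {
                        String originator;         // originator of the process / node
                        Date startDate;            // captured when the node was entered
                        Date endDate;              // captured when the node completed
                        String name;               // process / node name
                        String version;
                        String uniqueId;
                        String parentId;
                        String parentProcessName;
                        String parentVersion;
                        String actor;              // actor / executor
                        String swimLane;

                        // with both timestamps in one event, consumers compute
                        // execution time directly instead of correlating events
                        long getExecutionTimeMillis() {
                          return endDate.getTime() - startDate.getTime();
                        }
                      }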

                      The data wanted in a BI event would be largely the same as the BAM data, with the addition of selected variables (which variables depends upon the BI calculations); rather than the start and end times, a single capture time would be wanted. The last additional piece would be an identification of what type of BI event is being represented.

                      As Tom has provided us with a steer towards the PVM code to look at, we believe this capability could easily be bolted in as a LogSession, although some additional work may be needed to add the start time to the execution context so it can be included in the end events. The other element not seen in the PVM code is the idea of aliasing, such that the abstracted elements may be substituted with language that a business user could understand; presumably this sort of feature could be incorporated into a chained log session for specialised views of the logging?

                      To put some flesh on this data, the schemas for the BAM events SeeWhy takes can be found at http://seewhy.fileburst.com/jBPMEventSchemas/
                      As the BI events are somewhat more specific to the situation a jBPM user has, we provide a template which the user needs to complete by adding attributes; it appears at [url]http://seewhy.fileburst.com/jBPMEventSchemas/[/url].

                      The BI events also provide the challenge of how to denote where a BI event is wanted.



                      • 8. Re: log infrastructure proposal

                        Thanks for the overview Phil!

                        This kind of information indeed helps us know exactly the kind of data (technical and business), transformations, and underlying mechanisms required when working with BAM and BI tools.

                        The main problem I see with the current PVM log proposal, compared with the current Bonita XPDL implementation, is that both BAM and BI data will be stored in the same database. Let me explain:

                        - As logs are stored in a single table (so not queryable), the history DB will be the central point to query workflow data (both runtime and history data). IMO, this is not a good approach, as the history (BI) database could contain a lot of data, so queries on runtime data (BAM, e.g. give the name of the user who started the first activity) could take a long, long time... I'm really concerned about the performance impact this approach could have.


                        - Moreover, the only way to query runtime data with the current proposal (and match any use case) would be to use asynchronous execution -> I don't think this is good for performance. By default I would suggest always using synchronous execution.

                        In the Bonita XPDL implementation, we differentiate between journal data (BAM) and history data (BI). Journal data can be used for logging but can also be queried. Journal and history share the same data model. The journal and history repositories are both configurable (DB, XML...).

                        So BAM and BI modules/applications can leverage dedicated repositories rather than share the same one.

                        This architecture also allows us to minimize "move to history" operations, as by default history data is moved at the end of a workflow instance execution and not at each workflow transaction (in between, the data always remains in the journal)...

                        regards,
                        Miguel Valdes

                        • 9. Re: log infrastructure proposal

                          Miguel,

                          I'd fully agree with your observations about the database. I had visualised that different BAM/BI products would make use of a specialised LogSession (if I understand the logging correctly) such that they don't write to the standard database but to another resource - so they would need an alternative implementation of the serialisation interfaces that I believe have been suggested. For SeeWhy that other resource would be a JMS queue, so that the data goes directly to the SeeWhy realtime engine. If I've got this right, then you could reduce the DB concerns by having an alternate table for pure BI/BAM as a generic strategy.

                          • 10. Re: log infrastructure proposal

                            Phil, there is just one thing which is not yet clear to me: is SeeWhy using the same db for BAM and BI or you suggest to have two dbs, one for BI and another one for BAM (in addition to the internal workflow db) ?

                            Tom, any thought on my post on the current log proposal issues I see ?

                            thanks !

                            Miguel Valdes


                              • 12. Re: log infrastructure proposal

                                 

                                "mvaldes" wrote:
                                Phil, there is just one thing which is not yet clear to me: is SeeWhy using the same db for BAM and BI or you suggest to have two dbs, one for BI and another one for BAM (in addition to the internal workflow db) ?

                                Tom, any thought on my post on the current log proposal issues I see ?

                                thanks !

                                Miguel Valdes


                                Miguel,

                                For SeeWhy we wouldn't need additional tables, as we'd look to have a specialised LogSession that generated JMS messages. But for a generic solution I'd suggest a table separate from the standard logging one. That table could store data for both BAM and BI activities (with perhaps a column to indicate whether the record was for BAM or BI), with columns for all the BAM info plus perhaps an additional column holding the variables in a structured way (perhaps XML representing name-value pairs), as you wouldn't know until deploy time what variables are needed.

                                Perhaps the JMS and DB LogSession implementations could be derived from a common base implementation.
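                                Something along these lines, perhaps (all names here are hypothetical, and I'm assuming the ProcessLogXmlSerializer exposes a serialize method; the JMS and JDBC wiring is elided):

                                // common base: serialize the log, subclasses decide where it goes
                                abstract class SerializedLogSession implements LogSession {
                                  private final ProcessLogXmlSerializer serializer =
                                      new ProcessLogXmlSerializer();
                                  public void addLog(ProcessLog processLog) {
                                    dispatch(serializer.serialize(processLog));
                                  }
                                  protected abstract void dispatch(String xml);
                                }

                                class JmsLogSession extends SerializedLogSession {
                                  protected void dispatch(String xml) {
                                    // send xml as a JMS TextMessage to the BAM queue
                                  }
                                }

                                class DbLogSession extends SerializedLogSession {
                                  protected void dispatch(String xml) {
                                    // insert xml as a row in the generic logging table
                                  }
                                }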

                                • 13. Re: log infrastructure proposal
                                  tom.baeyens

                                  for the past two days i've been trying to find a gap big enough to read through this thread. each attempt got interrupted. hang tight. i'll keep trying till i succeed :-)

                                  • 14. Re: log infrastructure proposal
                                    tom.baeyens

                                    Miguel,

                                    I am willing to bend the logs till they satisfy your requirements. But I am a bit concerned about the complexity that we will run into.

                                    AFAICT, we both want a BI (star schema) database for the history queries. On top of that you want another schema or table (called the Journal) to perform queries on runtime executions that are not finished yet.

                                    First, I would like to ask you to clearly sketch your requirements in terms of which databases or schemas you see, what kind of data they contain, and what kind of queries will be issued against those DBs.

                                    I don't quite follow the motivation yet for having a separation between runtime history data (journal, unfinished executions) and the real history data (finished executions). You say that you think the performance of the queries on the runtime data will get too slow. I don't think the extra complexity of separating the journal from the history is worth it for that.

                                    On top of that, I'm not sure if you see the full potential of the configuration options for archiving the logs into the history db. I think asynchronous log archiving right after the runtime transaction should be the default. That means the history tables will always be up to date. And the work to archive needs to be done anyway, so postponing it till the end of execution doesn't really save any work, and it doesn't result in higher throughput.

                                    Another option you should envision in this picture is that we could consider the history database part of the solution. Then users could have another archive database, with the same schema as the history database, that contains the archived history. E.g. all processes that ended longer than 6 months ago would be archived to the archive db.

                                    Treat all this as a mixture of concerns and alternative pieces of the puzzle. Can you make it a bit more clear what exactly you would like to see realized in the pvm?

                                    As for naming, i propose the following terms:

                                    * Runtime DB (state of active executions, optimized for just state management. contains only active executions)
                                    * Log table (flat list of events that are recorded during execution)
                                    * History DB (execution information, optimised for querying. contains active and finished executions)

                                    The act of processing the logs into the history db is called archiving.

                                    This also indicates the part for which I don't yet understand the full details and use cases: the Journal.
