Resource Type matching
lkrejci Dec 12, 2014 5:09 PM
In a situation where new resource types can enter the system from feeds that generated those types dynamically through analysis of some resources (e.g. by generating a plugin for a certain subtree of the JMX management tree), one should have a way of determining whether two feeds found the same type. This is so that we can then treat the resources of those types as being of the same type.
At the same time, there is a requirement mentioned above to be able to compose new resources out of the existing data. Such resources would also need a type, composed of the individual data types of the involved measurements and other data. Such composed types would essentially be in the same situation as the feed-originating types - we need to know whether two types are equal or not so that we can group them together.
Because matching by some sort of ID is not possible (how would feed A and feed B know about each other’s work?), we need to use a structural match to find matching resource types. This structural match will have to determine the compatibility of two resource types (notice I avoided the word equivalence because that’s not exactly needed). Two resource types are compatible when all of their data has the same types (but potentially different names).
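To make the notion of compatibility concrete, here is a minimal sketch. It assumes, purely for illustration, that a resource type is a mapping from data names to (kind, unit) pairs; that representation and the name is_compatible are not part of any existing API. Two types count as compatible when they contain the same data types the same number of times, whatever those data are called.

```python
from collections import Counter

def is_compatible(type_a, type_b):
    """Compare two resource types structurally, ignoring data names.

    Each argument maps a data name to a (kind, unit) tuple. This layout
    is an assumption made for the example only.
    """
    # Counter gives a multiset comparison: the same data types must occur
    # the same number of times in both resource types.
    return Counter(type_a.values()) == Counter(type_b.values())

feed_a = {"heapUsed": ("numeric", "bytes"), "up": ("avail", None)}
feed_b = {"heap_used": ("numeric", "bytes"), "availability": ("avail", None)}
feed_c = {"requests": ("numeric", "count"), "up": ("avail", None)}

print(is_compatible(feed_a, feed_b))  # True: same types, different names
print(is_compatible(feed_a, feed_c))  # False: the numeric units differ
```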
We need to determine compatibility when a user is updating a resource of a certain resource type with data coming from elsewhere. In that case we probably don’t need the compatibility of the whole resource types, but we still need to determine whether a certain data type "fits into" a certain resource type.
Compatibility also comes in handy when we need to determine whether two resource types coming from two feeds (or from users) are the same. Judging this only by compatibility, disregarding name differences, provides a more robust way of matching the resource types that is immune to one-letter changes, typos or different naming conventions. On the other hand, it can also more readily produce "false positives" - i.e. cases where two distinct resource types are structurally identical but semantically different.
IMHO the best approach for this would be a form of hash or fingerprint of the resource type or the data type.
Something resembling a Merkle tree hash (i.e. the resource type hash being a hash of all the data type hashes) could be the solution to this.
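A rough sketch of what such a hash could look like follows. SHA-256, the (kind, unit) representation of a data type, and the function names are all assumptions made for the example. Sorting the child hashes before combining them is also my own choice, so that the order in which a feed reports its data does not change the result.

```python
import hashlib

def data_type_hash(kind, unit):
    """Hash one data type. Names are deliberately left out of the input."""
    return hashlib.sha256(f"{kind}|{unit}".encode("utf-8")).hexdigest()

def resource_type_hash(data_types):
    """Hash a resource type as the hash of its data type hashes."""
    # Sort the child hashes so the result is independent of declaration order.
    child_hashes = sorted(data_type_hash(kind, unit) for kind, unit in data_types)
    digest = hashlib.sha256()
    for child in child_hashes:
        digest.update(child.encode("ascii"))
    return digest.hexdigest()

from_feed_a = [("numeric", "bytes"), ("avail", None)]
from_feed_b = [("avail", None), ("numeric", "bytes")]
something_else = [("numeric", "count"), ("avail", None)]

print(resource_type_hash(from_feed_a) == resource_type_hash(from_feed_b))     # True
print(resource_type_hash(from_feed_a) == resource_type_hash(something_else))  # False
```

Two feeds could then compare a single hash string instead of walking the whole type definition.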
For compatibility, the following properties of the different data types could be used:
- Measurement Definitions
data type (numeric, avail, log event), unit
- Operation Definitions
return type, parameter types (see the configuration definitions below for how to represent the types)
- Configuration Definitions
for each property type, we’d record the name, the type and the sub-properties recursively. Configuration defs could be represented using a JSON schema (parsing of which could be done with something like https://github.com/fge/json-schema-validator). Note that for configurations I consider the names of their properties significant. I am actually not 100% sure about this, but I think not doing so could produce an uncomfortable number of false positives. E.g. any configuration consisting of 3 string properties would be considered compatible regardless of what those properties are used for: [ip, host, servername] would be compatible with [tenant, user, password].
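To illustrate the point about property names, here is a small sketch of a recursive fingerprint for configuration definitions. The dictionary layout ("name", "type", "children"), the artificial root node and the helper names are hypothetical and only serve the example. They are not the JSON schema representation mentioned above.

```python
import hashlib
import json

def property_fingerprint(prop):
    """Fingerprint one property from its name, type and sub-properties."""
    # Recurse first, then sort, so sibling order does not matter.
    children = sorted(property_fingerprint(c) for c in prop.get("children", []))
    payload = json.dumps([prop["name"], prop["type"], children])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def configuration_fingerprint(props):
    """Fingerprint a whole configuration definition (a list of properties)."""
    # Wrap the top-level properties in an artificial, unnamed root node.
    return property_fingerprint({"name": "", "type": "root", "children": props})

def strings(*names):
    """Build a flat list of string properties with the given names."""
    return [{"name": n, "type": "string"} for n in names]

network = configuration_fingerprint(strings("ip", "host", "servername"))
reordered = configuration_fingerprint(strings("host", "servername", "ip"))
login = configuration_fingerprint(strings("tenant", "user", "password"))

print(network == reordered)  # True: same names and types, different order
print(network == login)      # False: names are part of the fingerprint
```

If the name were dropped from the payload, the last comparison would also come out equal, which is exactly the false positive described above.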