1 Reply Latest reply on Mar 5, 2010 11:58 AM by rhauch

Searching and projecting content

meetoblivion Mar 3, 2010 9:01 PM

So I'm trying to figure out the best way to project this data so that it's searchable. I believe display will always work properly.

I have content that belongs to categories; it's basically a many-to-many relationship. Categories carry almost no information of their own, pretty much just the category name. The content, though, will have a lot of data about it, and that data is what needs to be searchable.

What I want to determine, as the best way to structure it, is the answer to two questions:

Should the content own the categories that it belongs to?

Should the categories own the content?

Since it's really a many-to-many relationship, my biggest concern is how to show the counts on both sides.

On one hand, I want to be able to show a count of content per category, but I also want to be able to show the total number of search results. My gut says that the content should own the categories it's in, but I believe that counting how many items there are per category is going to end up being slow if I have to compute it over all the pieces of content that a search query may return.

Unless there's something I'm missing in the query API that could do this for me, or a fast way to do it from the returned set of nodes.

Thoughts, anyone?

1. Re: Searching and projecting content

rhauch Mar 5, 2010 11:58 AM (in response to meetoblivion)

What is the primary organizational driver for the content? My guess is that the primary driver is not categories, but something a bit more natural to the content. As such, I'd think it makes more sense to have the content own the categories (or at least the association of which categories the content is in).

Have you thought about storing/caching on each category the number of associated content items, and periodically refreshing those values (via searches)? If you don't need perfectly accurate, up-to-date numbers, this could save a lot of repeated searches and would be very fast. Of course, any need to view the related content under a category (or categories) could be done via a query. (I'd suggest specifying a limit and offset if you can.)
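
A rough sketch of the cached-count idea, in plain Java with no repository dependency. The class name CategoryCountCache and the "counter" function are hypothetical; "counter" stands in for whatever search you would run against the repository to count the content in one category.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

// Hypothetical helper: keeps a possibly stale content count per category.
class CategoryCountCache {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();
    // Assumption: stand-in for the (expensive) repository search that counts one category.
    private final ToLongFunction<String> counter;

    CategoryCountCache(ToLongFunction<String> counter) {
        this.counter = counter;
    }

    // Re-run the count for each given category and store the result.
    void refresh(Collection<String> categoryNames) {
        for (String name : categoryNames) {
            counts.put(name, counter.applyAsLong(name));
        }
    }

    // Cheap read; returns 0 for a category that has never been refreshed.
    long count(String categoryName) {
        return counts.getOrDefault(categoryName, 0L);
    }
}
```

Calling refresh(...) from a scheduled task (for example, one submitted to a ScheduledExecutorService) would give the periodic refresh; how often depends on how stale the numbers are allowed to be.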

Another consideration is how the content will actually reference the category: a direct or indirect (weak) reference. The benefit of a direct reference is that the repository will maintain the association and dereferencing [1], but the repository will then enforce referential integrity (meaning you can't delete a category if it's being referenced by content). That may or may not be a good thing in your case. Plus, some JCR advocates recommend against using references.

The benefit of indirect references is that you are more in control. For example, using the category name may make altering existing categories more difficult (which may be acceptable if it is an infrequent activity), while querying and searching would be very fast and the queries more intuitive (e.g., "... WHERE [my:category] = 'Category1'..." or "... WHERE [my:category] IN ('Category1','Category2','Category3') ..."). Alternatively you could choose to use a numeric identifier, which would make renaming categories easier but would change the queries to use the identifiers in the criteria. As long as your application could cache the identifiers for the categories, you wouldn't need to first look them up.
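
To make the name-based queries concrete, here is a small sketch that assembles the query text for one or more category names. The class CategoryQueries and the node type [my:content] are made up for illustration; [my:category] is the property from the examples above. It also assumes that a single quote inside a string literal is escaped by doubling it, as in SQL.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that builds the query text for name-based category references.
class CategoryQueries {
    // Quote a value as a string literal, doubling any embedded single quotes
    // (assumption: the query language escapes quotes the way SQL does).
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Build "... WHERE [my:category] = 'X'" or "... WHERE [my:category] IN ('X','Y')".
    static String byCategories(List<String> categoryNames) {
        if (categoryNames.isEmpty()) {
            throw new IllegalArgumentException("at least one category name is required");
        }
        // [my:content] is an assumed node type name, not something from the thread.
        String base = "SELECT * FROM [my:content] WHERE [my:category] ";
        if (categoryNames.size() == 1) {
            return base + "= " + literal(categoryNames.get(0));
        }
        return base + categoryNames.stream()
                .map(CategoryQueries::literal)
                .collect(Collectors.joining(",", "IN (", ")"));
    }
}
```

With numeric identifiers the same shape would apply, only with the cached identifiers in place of the names.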

Does this help?

[1] If the content uses REFERENCE properties to store the association to the category nodes, retrieving the referencing content for a category may be as simple as calling "getReferences()" on the category node. That method actually returns the REFERENCE properties that are owned by the content nodes, as an iterator. And the iterator's size method should return an accurate count, as long as the user has access to all of the content.