29 Replies. Latest reply on Jan 5, 2010 12:52 PM by rhauch

Searching the created repositories

meetoblivion Dec 17, 2009 12:13 PM

Hey guys

So it looks like I've got a nod on the POC I created w/ DNA. In addition to the basic adding/updating/approving content I had, we want to add the ability to search. The searching gets a bit more complex in that some of our data has to come out of an external database (which we can pull info about via SQL).

So based on what I saw back in 0.4 you were planning to create something based around Lucene for searching. Is this still on the plate? What about the sequencing capabilities? I can't seem to find much documentation as far as how to search through the repo.

Thanks!

1. Re: Searching the created repositories

rhauch Dec 17, 2009 1:34 PM (in response to meetoblivion)

I'm actually working out some of the last kinks with final integration of the search functionality, and that's been the major driver for the 0.7 release (and the reason why the release has slipped).

The query mechanism works through the JCR query API, and in addition to the XPath query language (required by the JCR 1.0 spec), we also support an enhanced version of the JCR 2.0 SQL language as well as a simple full-text search language that is based upon the JCR 2.0 full-text search grammar. Basically, you'll obtain the session's query manager, create a query given a language and the query expression, execute the query, and process the results - all per the JCR 1.0 API.
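For anyone who hasn't used the JCR query API before, the steps above look roughly like the sketch below. It sticks to the standard JCR 1.0 interfaces (javax.jcr and javax.jcr.query), so it needs the JCR API jar on the classpath and an already-authenticated Session. The class name, the node type, and the search term are placeholders picked for illustration, and the XPath expression is plain JCR 1.0 XPath that hasn't been checked against a DNA build.

```java
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

class QueryExample {
    // Prints the path of every node matched by a full-text XPath query.
    static void printMatches(Session session) throws RepositoryException {
        // 1. Obtain the session's query manager.
        QueryManager queryManager = session.getWorkspace().getQueryManager();

        // 2. Create a query from an expression and a language.
        //    (Node type and search term are placeholders.)
        String expression = "//element(*, nt:unstructured)[jcr:contains(., 'lucene')]";
        Query query = queryManager.createQuery(expression, Query.XPATH);

        // 3. Execute the query.
        QueryResult result = query.execute();

        // 4. Process the results.
        NodeIterator nodes = result.getNodes();
        while (nodes.hasNext()) {
            Node node = nodes.nextNode();
            System.out.println(node.getPath());
        }
    }
}
```

The other query languages would presumably go through the same createQuery call with a different language string; the exact constants for them are not given in this thread.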

Internally, we parse and transform all of these query languages into a single (internal) abstract query model upon which our query engine is based. The query engine transforms the AQM into a relational query plan, validates the plan, optimizes the query plan, generates an execution plan, and finally executes the query.

At the moment, the repository is using Lucene to maintain a set of indexes for the repository content, which we use to help quickly find the data that satisfies the criteria. (Of course, Lucene doesn't support joins, so the query engine still does all of this higher-level processing). And we've designed the whole system so that if a connector natively supports searches and queries, the query engine will utilize the connector. Over time, we'll likely be enhancing the connectors to provide native search and queries.
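To illustrate why joins have to live above the indexes, here is a minimal sketch of an equi-join over two sets of index hits, done as a plain in-memory hash join. This is not DNA's query engine code; the Hit class, the method name, and the "left|right" output format are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // One hit from an index lookup: a node path plus the value being joined on.
    static final class Hit {
        final String path;
        final String joinKey;

        Hit(String path, String joinKey) {
            this.path = path;
            this.joinKey = joinKey;
        }
    }

    // Equi-joins two hit lists on joinKey and returns "leftPath|rightPath" pairs.
    static List<String> hashJoin(List<Hit> left, List<Hit> right) {
        // Build phase: bucket the left-hand hits by join key.
        Map<String, List<Hit>> buckets = new HashMap<>();
        for (Hit hit : left) {
            buckets.computeIfAbsent(hit.joinKey, key -> new ArrayList<>()).add(hit);
        }
        // Probe phase: look up each right-hand hit in the buckets.
        List<String> joined = new ArrayList<>();
        for (Hit hit : right) {
            List<Hit> matches = buckets.getOrDefault(hit.joinKey, Collections.emptyList());
            for (Hit match : matches) {
                joined.add(match.path + "|" + hit.path);
            }
        }
        return joined;
    }
}
```

Each side of the join can come from a separate index lookup; only the pairing step needs to happen in the engine.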

We were working hard to try to get 0.7 out the door before the holidays, but with some loose ends to tie up we've run out of time. Unfortunately, this means we won't release 0.7 until very early January.
2. Re: Searching the created repositories

meetoblivion Dec 17, 2009 1:44 PM (in response to rhauch)

well that's fine. we're not looking for this til next month really. if i check out the trunk, how accurate would you say it is?

what about my other question, regarding plugging in some data from an external source. based on the federated approach, i should be able to hook in a generic datasource, right, and construct something that turns that data into searchable nodes?

as far as lucene and indexing goes, will I have the ability to:

1. purge the indexes daily
2. limit what gets indexed based on an algorithm we have defined (essentially, since dna doesn't support versioning, i created a notion where the "node" contains all of its versions below it as items, and a separate tree exists that handles a rudimentary approval process)

Thanks!
3. Re: Searching the created repositories

rhauch Dec 17, 2009 2:00 PM (in response to meetoblivion)

I still haven't committed a number of the fixes to the JCR interfaces, so I don't think searching/querying works quite yet. Very very soon.

Yes, you can write a connector to your database and construct a graph-representation of your external data, and you should be able to federate this with your other content. Basically, you'd configure your normal sources and then a federated source that projected your other sources. The JCR repository would then use that federated source.
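As a rough picture of what "constructing a graph representation" of table data can mean, the sketch below projects one database row onto a node-like structure: the row's key becomes the last path segment, and a chosen subset of columns becomes properties. It deliberately avoids the DNA connector API, which isn't shown in this thread; NodeData, toNode, and the column names in the usage note are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class RowToNodeSketch {
    // Minimal stand-in for a graph node: a path plus its properties.
    static final class NodeData {
        final String path;
        final Map<String, Object> properties;

        NodeData(String path, Map<String, Object> properties) {
            this.path = path;
            this.properties = properties;
        }
    }

    // Projects one table row onto a node under parentPath. The node is named after
    // the value in keyColumn, and only the columns in includedColumns are copied.
    static NodeData toNode(String parentPath, String keyColumn,
                           Map<String, Object> row, Set<String> includedColumns) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, Object> column : row.entrySet()) {
            if (includedColumns.contains(column.getKey())) {
                properties.put(column.getKey(), column.getValue());
            }
        }
        return new NodeData(parentPath + "/" + row.get(keyColumn), properties);
    }
}
```

For a row {ID=42, TITLE="Widget", INTERNAL_CODE="x"} with "ID" as the key column and only "TITLE" included, this yields a node at /products/42 (given parent path /products) with a single TITLE property.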

The indexes should be kept in sync with the changes, if those changes are made through JCR. But if you need or want to, you will be able to purge and regenerate the indexes with a JcrRepository-specific method.

In terms of limiting what gets indexed, we haven't yet implemented a way to customize or limit this, but it shouldn't be too difficult (especially for paths that should or should not be indexed). Internally, the indexing is guided by a set of rules, but we just need to expose those in a way that makes the most sense. If you have specific ideas on how to configure this, please log a feature request in JIRA.

BTW, versioning is going to be very high on the priority list.
4. Re: Searching the created repositories

meetoblivion Dec 17, 2009 3:42 PM (in response to rhauch)
So in this case, at least with the federated concepts, the data is updated offline, and we know approximately what time during the day it'll be updated. I assume then it's not a big deal to just reindex after we know it's been updated. I also assume that if we're deploying two servers with discrete file systems (e.g. no shared/clustered/networked file systems), the indexing would have to be done on each instance, right?

If I understand your post correctly, then we'll end up indexing the entire repository (entire federated repository) rather than currently being able to index selected nodes, right?

If that's the case, it may make sense to use an interface, similar to how FilenameFilter works

public boolean index(Item i) {
    // put in the BL to index this Item
}

or maybe allow for a BPM module that controls what gets indexed.
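To make the idea a bit more concrete, here's a rough sketch of such a callback and how an indexer might consult it. Nothing here is an existing DNA interface: IndexFilter, selectForIndexing, and the cut-down Item class (standing in for javax.jcr.Item) are all hypothetical, and the "approved" status is just an example of an application-defined property.

```java
import java.util.ArrayList;
import java.util.List;

class IndexFilterSketch {
    // Stand-in for javax.jcr.Item, reduced to what the example filter needs.
    static final class Item {
        final String path;
        final String nodeType;
        final String status; // hypothetical application-defined approval property

        Item(String path, String nodeType, String status) {
            this.path = path;
            this.nodeType = nodeType;
            this.status = status;
        }
    }

    // The callback: return true if the item should be indexed.
    interface IndexFilter {
        boolean index(Item item);
    }

    // Keeps only the items the filter accepts, much as File.list(FilenameFilter)
    // keeps only the file names its filter accepts.
    static List<Item> selectForIndexing(List<Item> items, IndexFilter filter) {
        List<Item> selected = new ArrayList<>();
        for (Item item : items) {
            if (filter.index(item)) {
                selected.add(item);
            }
        }
        return selected;
    }
}
```

A filter that only lets approved articles through could then be written as a one-liner: item -> "acme:article".equals(item.nodeType) && "approved".equals(item.status).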

I guess I also need to read up on JCR 2's searching capabilities. In my use case at least, only full text searching really applies, not the sql grammar.
5. Re: Searching the created repositories

rhauch Dec 17, 2009 3:58 PM (in response to meetoblivion)
So in this case, at least with the federated concepts, the data is updated offline, and we know approximately what time during the day it'll be updated. I assume then it's not a big deal to just reindex after we know it's been updated.
That's correct. But if the connector generates events (passing them to the source's Observer, passed to the source via the RepositorySource.initialize(RepositoryContext) method), then the system will update the indexes automatically. Basically the events are just ChangeRequest objects (that are normally sent to a writable connector), but this is on the advanced side of things.
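If you go the scheduled route rather than the event route, one way to kick off a reindex shortly after the known update time is a plain ScheduledExecutorService. The sketch below uses only the JDK; the reindex call itself is left as a Runnable placeholder because the JcrRepository method for regenerating indexes isn't named in this thread, and the class and method names are made up.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ReindexScheduler {
    // Time to wait from 'now' until the next occurrence of 'target':
    // later today if it is still ahead, otherwise tomorrow.
    static Duration delayUntilNext(LocalDateTime now, LocalTime target) {
        LocalDateTime next = now.toLocalDate().atTime(target);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    // Runs reindexTask every 24 hours, starting at the next occurrence of 'target'.
    // reindexTask is a placeholder for whatever call regenerates the indexes.
    // The period is a fixed 24 hours, so it does not adjust for DST changes.
    static ScheduledExecutorService scheduleDaily(Runnable reindexTask, LocalTime target) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long initialDelaySeconds = delayUntilNext(LocalDateTime.now(), target).getSeconds();
        scheduler.scheduleAtFixedRate(reindexTask, initialDelaySeconds,
                                      TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
        return scheduler;
    }
}
```

The caller keeps the returned scheduler so it can call shutdown() when the application stops. With discrete servers, each instance would run its own schedule.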
I also assume that if we're deploying two servers with discrete file systems (e.g. no shared/clustered/networked file systems), the indexing would have to be done on each instance, right?
Yes, at the moment. At some point it will be possible to have the connector 'own' the query and search functionality, and DNA will delegate to them. The connectors could "share" their indexes, having one be responsible for updates and the other be read-only. (This is typically how Lucene is clustered.)
If I understand your post correctly, then we'll end up indexing the entire repository (entire federated repository) rather than currently being able to index selected nodes, right?
As it stands, 0.7 will allow you to re-index the whole thing, though I can add methods to allow you to index just below some point. Our SearchEngine component is already able to do that, so I just need to expose it.
If that's the case, it may make sense to use an interface, similar to how FilenameFilter works

public boolean index(Item i) {
    // put in the BL to index this Item
}

or maybe allow for a BPM module that controls what gets indexed.

That's a good suggestion. Thanks!
I guess I also need to read up on JCR 2's searching capabilities. In my use case at least, only full text searching really applies, not the sql grammar.
The JCR2-SQL grammar is pretty nice, actually, and you can include full-text search criteria in your queries (and get the scores back out). It is a bit hamstrung with the JCR 1.0 API (e.g., only one score is obtainable), but it does work.
6. Re: Searching the created repositories

meetoblivion Dec 17, 2009 4:48 PM (in response to rhauch)

So related to this, are there any considerations then when it comes to deploying DNA in a clustered JEE environment? There will be a front end for editing some of the sources, which we'll want to search, but not others. This means that one side will have the update but the other won't.
7. Re: Searching the created repositories

rhauch Dec 17, 2009 5:01 PM (in response to meetoblivion)

Clustering isn't implemented yet ... the main issue there is pushing the changes made on one process into the other processes in the cluster. We are going to address this soon, and the goal is that it won't matter in which process the changes are made because everything (storage and indexes) will be updated correctly/consistently.

BTW, the other thing that will control indexing are node types. We currently support the JCR 2.0 CND format, including stating whether properties are or are not full-text searchable. It's possible we could also use node types to help control indexing (or prevent indexing).
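As an illustration of the full-text flag, a CND file in the JCR 2.0 notation can mark a property with the nofulltext attribute to keep it out of full-text search. The fragment below is only a sketch: the acme namespace and the node type and property names are invented, and it hasn't been run through the DNA parser.

```
<acme = 'http://www.example.com/acme'>

[acme:article] > nt:base
  - acme:title (string) mandatory
  - acme:body (string)
  - acme:reviewNotes (string) nofulltext
```

Here acme:title and acme:body stay full-text searchable, while acme:reviewNotes is excluded.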
8. Re: Searching the created repositories

meetoblivion Dec 18, 2009 11:25 AM (in response to rhauch)

Restricting on NodeType would be good, but more likely than not in my case it's going to be based on node type + attribute value, but maybe mixin as well.

Is the CND 2.0 code checked in? From what I remember you had a proprietary api for adding node types. CND 2.0 was the programmatic approach to registering, rather than in the config file, right?
9. Re: Searching the created repositories

rhauch Dec 18, 2009 11:56 AM (in response to meetoblivion)

Restricting on NodeType would be good, but more likely than not in my case it's going to be based on node type + attribute value, but maybe mixin as well.
Yeah, we'll have to make it easy to turn indexing on in a variety of ways. Your earlier suggestion was a good one.
Is the CND 2.0 code checked in? From what I remember you had a proprietary api for adding node types. CND 2.0 was the programmatic approach to registering, rather than in the config file, right?

Yes, the CND stuff is committed. It was there in 0.6, but we rewrote the CND parser in trunk, and it's behaving much better.

We do have a proprietary API for programmatically creating new node types that is very similar to the JCR 2.0 API (should be an easy upgrade path when we support JCR 2.0). However, you can definitely create CND files and upload them in a couple of different ways; the easiest is just adding the CND files using the JcrConfiguration, prior to creating the JcrEngine. Look at the example in Section 8.2.3.2 in our reference guide.
10. Re: Searching the created repositories

meetoblivion Dec 18, 2009 2:23 PM (in response to rhauch)

do you mean this part?

http://www.jboss.org/file-access/default/members/dna/freezone/docs/0.6/manuals/reference/html/cnd-sequencer.html

so it's still loading CNDs from file? I'm a bit puzzled. so let's say we have 2 nodes running and database persistence. what happens?
11. Re: Searching the created repositories

rhauch Dec 18, 2009 3:15 PM (in response to meetoblivion)
do you mean this part?

http://www.jboss.org/file-access/default/members/dna/freezone/docs/0.6/manuals/reference/html/cnd-sequencer.html

so it's still loading CNDs from file? I'm a bit puzzled. so let's say we have 2 nodes running and database persistence. what happens?
No, that's the CND sequencer, which just extracts the structure of CND files if you load CND (as content) into a repository.

If you want to register with the repository the node types in CND files, the easiest way to do that is to specify it in the JcrConfiguration, which is shown in the example in Section 8.2.3.2 (http://docs.jboss.org/jbossdna/latest/manuals/reference/html/configuration.html#programmatic_configuration). So something like this:

JcrConfiguration config = ...
config.repository("repository A")
      .addNodeTypes("myCustomNodeTypes.cnd")
      .setSource("source 1")
      .registerNamespace("acme","http://www.example.com/acme")
      .setOption(JcrRepository.Option.JAAS_LOGIN_CONFIG_NAME, "dna-jcr");
...
JcrEngine engine = config.build();
Repository repository = engine.getRepository("repository A");

Note you can pass the (relative) path to the file, a File, an InputStream, or a URL. And you can call the method multiple times if you have multiple files.
12. Re: Searching the created repositories

meetoblivion Dec 20, 2009 4:24 PM (in response to rhauch)

alright, i think after reading the dna-cnd examples in the test/resources it's a bit clearer how my CNDs should look.

a question though. i have to somehow push the updated data through each instance of DNA running. will this be done as a read from the database or a push through to update the DNA database? i'm a bit lost how a sequencer fixes my problems...
13. Re: Searching the created repositories

bcarothers Dec 21, 2009 5:17 PM (in response to meetoblivion)

I agree that the sequencer doesn't do you any good. Until there's a true distributed deployment model for DNA, I'm pretty sure that you'll have to load the changed CND files into each server.

You can use code to this effect to do it:

Session session = ...;
String pathToResource = ...;
CndNodeTypeSource nodeTypeSource = new CndNodeTypeSource(pathToResource);

JcrNodeTypeManager nodeTypeManager = (JcrNodeTypeManager) session.getWorkspace().getNodeTypeManager();
nodeTypeManager.registerNodeTypes(nodeTypeSource);

The trick is that you would have to run that against each server (since they are essentially standalone), and you would have to make sure that the CND files get reloaded the next time you start up the server.
14. Re: Searching the created repositories

meetoblivion Dec 22, 2009 10:28 AM (in response to bcarothers)

Now I'm a bit confused.

I'm not looking to search CND files, and I am not looking to dynamically register CND files from the application. I have no clue what you're trying to tell me with your code. I have a CND in the file system and I can confirm that DNA is loading it correctly.

What I'm trying to figure out:

I have a schema with 2 tables (there are more tables, but I'm not interested in them). I want to be able to index these 2 tables as if they were part of the repository. Does this mean that I need to:

1. Add the contents of these tables to the repository directly. I assume that this would require pushing all of the contents through DNA at some point daily.

OR

2. Somehow using a sequencer, tell it to read the appropriate columns from the tables and create indexes based on the data it finds (in each case, the tables are about 60 columns each, but only a few fields are worth looking at for search). From the looks of it, this requires me to create a node definition that matches the table?

So I'm really not sure why I need to be parsing CNDs, as the CNDs do not contain the data I'm looking to search on. I'm not planning to change the CNDs.
