Searching the created repositories | JBoss.org Content Archive (Read Only)

15. Re: Searching the created repositories

bcarothers Dec 22, 2009 4:20 PM (in response to meetoblivion)

Sorry about that. I looked at this comment and thought that we were talking about CND. I think I'm back on track now.

I would normally defer to Randall on search questions, but since he's AFK this week, I'll throw in my $0.02.

If I'm understanding you correctly, you want to use DNA to expose data in a relational database through JCR and provide search capabilities on it through JCR's query capabilities. DNA doesn't support this yet as an OOTB connector (see: DNA-199). In the meantime, I think your option 1 is the right thing to do.

16. Re: Searching the created repositories

meetoblivion Dec 22, 2009 9:45 PM (in response to bcarothers)

and since we discussed clustering/multiple instances.

let's say that the data is stored via the jpa connector. what happens when i have two instances of DNA pointing at the same source (in this case, a schema in an oracle database somewhere).

17. Re: Searching the created repositories

bcarothers Dec 23, 2009 8:25 AM (in response to meetoblivion)

It should "just work", but I'm not aware of anyone who's tested it yet.

We've tried to be conscious of this deployment model for many months now since anything we built that would fail under this deployment model would also fail in a clustered deployment model.

18. Re: Searching the created repositories

meetoblivion Dec 23, 2009 8:40 AM (in response to bcarothers)

so then searching it should work as well? what about building the indexes, that would still need to be done on a per node basis, correct? (again, just thinking high level)

19. Re: Searching the created repositories

bcarothers Dec 23, 2009 9:13 AM (in response to meetoblivion)

That's a really good point. As I understand it, the search indexes listen for changes and update themselves accordingly. Right now, we have no way of propagating changes between different nodes, so each search index would only reflect writes from its own node.

You could work around this (to a degree) by refreshing the indexes more often, but the real solution is distributed change notification.

20. Re: Searching the created repositories

meetoblivion Dec 23, 2009 10:14 AM (in response to bcarothers)

I agree completely, built in notification would work well. I suppose as long as the notification api is exposed to the client, we can write an EJB Timer to just update it hourly or so.
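A minimal sketch of the timer idea, using a plain `ScheduledExecutorService` rather than an EJB Timer so that it runs anywhere. The `PeriodicReindexer` class and the `reindex` callback are hypothetical names; the callback stands in for whatever call actually rebuilds a node's search indexes, which this thread does not specify.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs a re-index callback at a fixed delay until closed. */
final class PeriodicReindexer implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger runs = new AtomicInteger();

    PeriodicReindexer(Runnable reindex, long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                reindex.run();          // assumption: rebuilds this node's indexes
                runs.incrementAndGet();
            } catch (RuntimeException e) {
                // Catch here so that one failed run does not cancel later runs.
                System.err.println("Re-index failed: " + e);
            }
        }, 0, period, unit);
    }

    /** Number of re-index runs that completed without an exception. */
    int completedRuns() {
        return runs.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

For the hourly refresh mentioned above, the constructor call would look like `new PeriodicReindexer(callback, 1, TimeUnit.HOURS)`.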

21. Re: Searching the created repositories

bcarothers Dec 23, 2009 11:11 AM (in response to meetoblivion)

You could definitely call jcrEngine.getRepositoryService().getRepositoryLibrary().register(yourObserver) to catch local changes.

22. Re: Searching the created repositories

rhauch Dec 27, 2009 10:50 PM (in response to bcarothers)

(Sorry for the delay... I've been on vacation.)

Brian correctly describes how the search indexes are updated. At the moment, each javax.jcr.Repository instance will utilize its own set of indexes, so multiple instances (i.e., multiple processes) will all need to be updated. And right now, the JcrEngine (that owns the javax.jcr.Repository instances) only processes events that originate within itself. Clustering will address this by setting up an event channel that multiple engines can share (each engine will publish events on the channel and will listen for events on the channel), so each JcrEngine instance will be able to receive events originating in other engines.

When clustering is added, each of the javax.jcr.Repository instances' indexes will be kept up to date. The benefit of this is that searching can be done locally, and each process is largely independent (apart from the event channel) from the other processes. But it also means the total footprint of the cluster is much larger than perhaps is really needed and the extra work involved in updating the indexes is duplicated on each process.
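To make the shared event channel concrete, here is a toy, in-process model of it. It is not DNA code: `EventChannel` and `ToyEngine` are hypothetical names, and a real cluster would carry the events over the network. Each engine publishes its writes on the channel and indexes every event it receives, so each keeps its own complete index and can search locally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for the shared event channel (hypothetical, in-process only). */
final class EventChannel {
    interface Listener {
        void onChange(String path, String text);
    }

    private final List<Listener> listeners = new ArrayList<>();

    void subscribe(Listener listener) {
        listeners.add(listener);
    }

    /** Delivers the change to every subscriber, including the publisher itself. */
    void publish(String path, String text) {
        for (Listener listener : listeners) {
            listener.onChange(path, text);
        }
    }
}

/** Toy engine: publishes its writes and maintains its own word -> paths index. */
final class ToyEngine {
    private final EventChannel channel;
    private final Map<String, Set<String>> index = new HashMap<>();

    ToyEngine(EventChannel channel) {
        this.channel = channel;
        channel.subscribe(this::indexChange);
    }

    void write(String path, String text) {
        channel.publish(path, text);
    }

    /** Searches this engine's local index only. */
    Set<String> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    private void indexChange(String path, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(path);
            }
        }
    }
}
```

The duplicated work is visible in the model: every engine runs `indexChange` for every write, which is the footprint cost described above.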

Obviously this may be considered non-ideal for some situations. In these cases, we hope that allowing connectors to own the search functionality gives more flexibility: a repository source (that is accessed by a connector in each process) could own the indexes and maintain a single shared set.

For even more scalability, we may be able to push the search engine functionality out to other processes altogether. Basically, we could have a black-box clusterable search engine that keeps up with the changes to the content in a JCR cluster (by monitoring the JCR event channel), and which can be (remotely) searched and queried via the connectors. Simply configure and start one or more processes. This is merely an idea at this point, but I think it demonstrates the kind of flexibility we have in the JcrEngine.

23. Re: Searching the created repositories

meetoblivion Dec 28, 2009 11:24 PM (in response to rhauch)

Hey, no need to apologize, everyone needs a vacation now and then.

So your ideas have me thinking. Are you planning to use infinispan/jgroups under the hood for keeping the repos in synch? i mention infinispan specifically as apparently they have lucene support built in, and i know you're using lucene... though, the pub/sub description sounds more like JMS.

i like the idea of the event channel, and being able to query against that directly. it does sound like a highly scalable configuration.

another idea just to throw your way, are you interested in Weld/CDI components for DNA?

24. Re: Searching the created repositories

rhauch Jan 1, 2010 7:51 PM (in response to meetoblivion)

We do plan to use JGroups for clustering, and we plan to utilize Infinispan even more. Infinispan is providing search functionality that uses Lucene, and I understand they're actually working on the ability to put the Lucene indexes in the data grid. That'd be fantastic. As far as querying, it is possible that when Infinispan does support querying, we'll start to leverage that. Though I've been talking with the Infinispan lead about potentially reusing our query engine (which can have any back-end, not just Lucene). We'll have to see whether that pans out. But they definitely want search.

We'd definitely be interested in Weld/CDI components! Want to start a thread and let us know what you're thinking?

25. Re: Searching the created repositories

meetoblivion Jan 5, 2010 10:39 AM (in response to rhauch)

So just wondering how search looks? I took a look last night and it looks like quite a bit has been committed for search. Is it comprehensive from an API standpoint now, even if not all of the search features are there?

26. Re: Searching the created repositories

rhauch Jan 5, 2010 11:09 AM (in response to meetoblivion)

Yes, most of the search features work and we're passing most of the query/search-related TCK unit tests. We do have a number of outstanding issues that we're working on now:

XPath position() and last() functions are not implemented (DNA-612)
XPath order-by is not yet implemented (DNA-613)
Updating the search indexes sometimes blocks a connector (DNA-616)
Documenting our search languages in the reference guide (DNA-621)

DNA-616 is a bugger that may affect everyone, and it's already marked as a "blocker". But the first two only show up if you're using those features. And my tests with the SQL and search languages work pretty well.

If you do start using it, please file issues for any problems you come across. I've tried to create a lot of unit and integration tests, but the languages are pretty complex and powerful, so it's very difficult to make sure every combination of query features works.

27. Re: Searching the created repositories

meetoblivion Jan 5, 2010 11:37 AM (in response to rhauch)

So I notice that there's a top level dna-search that looks empty and an extension dna-search-lucene. I assume I want the lucene one, right?

If I want full text searching capabilities, is that part of the built in JCR APIs or is that an extension that I'll need to work with? For example, I may want to search exact someAttr=someValue and i may want someAttr ~= some value

28. Re: Searching the created repositories

rhauch Jan 5, 2010 12:07 PM (in response to rhauch)

Updating the search indexes sometimes blocks a connector (DNA-616)

FYI: Our investigation so far implies that DNA-616 is limited to cases where the JCR repository is set up to use a federated source. That helps narrow the condition, but it still is a blocker.

29. Re: Searching the created repositories

rhauch Jan 5, 2010 12:52 PM (in response to meetoblivion)

So I notice that there's a top level dna-search that looks empty and an extension dna-search-lucene. I assume I want the lucene one, right?

Oops. Yes, 'dna-search' is indeed just empty directories and needs to be deleted. I'll file a JIRA for that.

Our JCR implementation is in 'dna-jcr', and has a Maven dependency on 'dna-search-lucene' (even though the latter is in the extension directory).

If I want full text searching capabilities, is that part of the built in JCR APIs or is that an extension that I'll need to work with?

Yes, the JCR API includes the ability to submit queries, but the API supports multiple query languages and the spec requires support for only JCR's subset of the XPath query language (section 6.6 of the spec).

DNA supports these languages:

the XPath language defined and required by JCR 1.0 (section 6.6 of the JCR 1.0 spec)
the JCR-SQL2 dialect of SQL (chapter 6 of the JCR 2.0 specification, or http://www.day.com/specs/jcr/2.0/6_Query.html), with some custom extensions
a "full-text search" language that is more search-engine-like and is actually defined by JCR 2 (section 6.7.19 of the JCR 2.0 specification, or http://www.day.com/specs/jcr/2.0/6_Query.html#FullTextSearch).

For more information on the query API, see sections 6.6.8-6.6.13. Basically, you obtain the session's query manager, define a query, execute the query, and then get the results. Here's a snippet showing how to issue a (very basic) SQL query:

String language = org.jboss.dna.jcr.JcrRepository.QueryLanguage.SQL;
javax.jcr.query.Query query = session.getWorkspace().getQueryManager()
        .createQuery("SELECT * FROM [nt:base]", language);
javax.jcr.query.QueryResult result = query.execute();

From the result, you can get a NodeIterator or a RowIterator (depending upon how you want to process the results; it's likely with SQL that you'll want a RowIterator).

For example, I may want to search exact someAttr=someValue and i may want someAttr ~= some value

Assuming the 'someAttr' is defined on a 'my:type' node type, a SQL query with exact-match criteria would be:

SELECT * FROM [my:type] AS type WHERE type.[my:someAttr] = 'someValue'

Or if you don't want exact-match, you could use LIKE criteria:

SELECT * FROM [my:type] AS type WHERE type.[my:someAttr] LIKE '%someValue%'
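If the value comes from user input, the query text has to be assembled with care. The following is a small sketch of that, not part of DNA: the `Sql2Queries` class and its method names are hypothetical. It assumes that a single quote inside a JCR-SQL2 string literal is escaped by doubling it, and it passes LIKE wildcards (`%`, `_`) in the value through unchanged.

```java
/** Hypothetical helper that builds the two queries shown above from a value. */
final class Sql2Queries {
    private Sql2Queries() {
    }

    /** Wraps a value in single quotes, doubling any embedded single quote (assumed escape rule). */
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Exact-match criteria on one property. */
    static String exactMatch(String nodeType, String property, String value) {
        return "SELECT * FROM [" + nodeType + "] AS type WHERE type.["
                + property + "] = " + literal(value);
    }

    /** Substring match using LIKE; wildcards already in the value are not escaped. */
    static String substringMatch(String nodeType, String property, String value) {
        return "SELECT * FROM [" + nodeType + "] AS type WHERE type.["
                + property + "] LIKE " + literal("%" + value + "%");
    }
}
```

Calling `Sql2Queries.exactMatch("my:type", "my:someAttr", "someValue")` reproduces the exact-match query above.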

You could use search-engine style full-text search for your criteria (so the results include similar values, with the ability to get the score in the results):

SELECT * FROM [my:type] AS type WHERE CONTAINS(type.[my:someAttr], 'someValue')

You can even search values for all of the type's properties (including inherited):

SELECT * FROM [my:type] AS type WHERE CONTAINS(type.*, 'someValue')

If you want to simply search values for all properties on all types, then use the "nt:base" type (from which all types extend):

SELECT * FROM [nt:base] AS type WHERE CONTAINS(type.*, 'someValue')

DNA's full-text search language makes the last query even easier. You could use the org.jboss.dna.jcr.JcrRepository.QueryLanguage.SEARCH language with the following query:

someValue

This language is defined by JCR 2.0 as the full-text search expression grammar that is used in the JCR-SQL2 language. We just pulled it out and made it available as a first-class query language. Per the spec, use double quotes to surround multi-word phrases and use '-' to prefix terms that should not appear.
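As an illustration of those two rules, here is a sketch that assembles a search expression from required and excluded terms. The `FullTextExpression` class is a hypothetical helper, not a DNA API. To stay on safe ground it rejects terms containing double quotes or backslashes instead of guessing at an escape rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder for full-text search expressions. */
final class FullTextExpression {
    private FullTextExpression() {
    }

    /** Joins required terms and '-'-prefixed excluded terms with single spaces. */
    static String of(List<String> required, List<String> excluded) {
        List<String> parts = new ArrayList<>();
        for (String term : required) {
            parts.add(quoteIfPhrase(term));
        }
        for (String term : excluded) {
            parts.add("-" + quoteIfPhrase(term));
        }
        return String.join(" ", parts);
    }

    /** Surrounds multi-word phrases with double quotes; single words pass through. */
    private static String quoteIfPhrase(String term) {
        String trimmed = term.trim();
        if (trimmed.isEmpty() || trimmed.contains("\"") || trimmed.contains("\\")) {
            throw new IllegalArgumentException("Unsupported term: " + term);
        }
        boolean hasWhitespace = trimmed.chars().anyMatch(Character::isWhitespace);
        return hasWhitespace ? "\"" + trimmed + "\"" : trimmed;
    }
}
```

For example, requiring `red` and `sports car` while excluding `convertible` yields `red "sports car" -convertible`.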

BTW, the SQL grammar supported by DNA is equivalent to the JCR-SQL2 grammar as defined by the JCR 2.0 specification, plus several very nice additions:

"... (UNION|INTERSECT|EXCEPT) [ALL] ..." to combine and merge results from multiple queries
"SELECT DISTINCT ..." to remove duplicates
"LIMIT count [OFFSET number]" clauses to control the number of results returned as well as the number of rows that should be skipped
Support for additional join types, including "FULL OUTER JOIN" and "CROSS JOIN"
Additional dynamic operands "DEPTH([<selectorName>])" and "PATH([<selectorName>])" that enable placing constraints on the node depth and path, respectively, and which can be used in a manner similar to "NAME([<selectorName>])" and "LOCALNAME([<selectorName>])". Note that in each of these cases, the selector name is optional if there is only one selector in the query.
Support for the IN clause and NOT IN clause to more easily supply a list of valid discrete static operands: " <dynamicOperand> [NOT] IN (<staticOperand> {, <staticOperand>})"
Support for the BETWEEN clause: "<dynamicOperand> [NOT] BETWEEN <lowerBoundStaticOperand> [EXCLUSIVE] AND <upperBoundStaticOperand> [EXCLUSIVE]"
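A sketch of how two of these additions might be put together in client code. The `ExtendedSql2` class is hypothetical, and the output follows only the grammar fragments quoted in the list above; it has not been run against a DNA repository. String literals are quoted on the assumption that an embedded single quote is doubled.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for the IN and LIMIT/OFFSET additions listed above. */
final class ExtendedSql2 {
    private ExtendedSql2() {
    }

    /** Wraps a value in single quotes, doubling any embedded single quote (assumed escape rule). */
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Builds "<dynamicOperand> IN (<staticOperand> {, <staticOperand>})". */
    static String in(String dynamicOperand, List<String> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("IN requires at least one value");
        }
        String list = values.stream()
                .map(ExtendedSql2::literal)
                .collect(Collectors.joining(", "));
        return dynamicOperand + " IN (" + list + ")";
    }

    /** Appends "LIMIT count [OFFSET number]"; OFFSET is omitted when it is zero. */
    static String limit(String query, int count, int offset) {
        if (count < 0 || offset < 0) {
            throw new IllegalArgumentException("count and offset must be non-negative");
        }
        return query + " LIMIT " + count + (offset > 0 ? " OFFSET " + offset : "");
    }
}
```

Combining the two gives, for instance, `SELECT * FROM [my:type] AS type WHERE type.[my:someAttr] IN ('a', 'b') LIMIT 10 OFFSET 20`.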

Plus, coming soon is the ability to support simple arithmetic in numeric-based criteria and order-by clauses. For example, "... WHERE SCORE(type1) + SCORE(type2) > 1.0" or "... ORDER BY (SCORE(type1) * SCORE(type2)) ASC".