I'm wondering about the relationship between sequencers and extractors.
It seems like extractors are for creating indexes for full-text searching, whereas sequencers are for manipulating the repository-- adding nodes and properties (which can also be indexed for full-text searches) and such.
You are correct.
Text extractors are used to get the searchable text from binary values. This information is then used when applying CONTAINS criteria during query execution. If you don't use any CONTAINS criteria in your queries, there is no point using a text extractor.
Sequencers are much more complex and more useful. They process a binary value and extract structured information in the form of new nodes and properties written back into the repository, where those new nodes and properties are accessible just like any other repository content. Automatic sequencers do this asynchronously when they detect content changes, while manual sequencers are invoked explicitly by an application and write the nodes to the current session, and the application dictates whether those changes are saved. See the sequencer documentation for more information. Note that if a sequencer produces any BINARY values as part of its output, those BINARY values might have their text extracted for indexing/search purposes.
Sequencers are never used to update indexes or during query execution. The only thing they do is produce additional repository content (nodes and properties) that, like any other content, can then be accessed and queried.
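To make the distinction concrete, here is a minimal sketch in plain Java. It does not use the ModeShape API at all: the class and method names are hypothetical, and the "document format" (lines of "key: value") is an assumption chosen only to keep the example short. The extractor-style method turns a binary value into one flat string suitable for full-text indexing, while the sequencer-style method turns the same binary value into named properties that could be stored on a node and read back individually.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; these are not ModeShape classes.
class ExtractVsSequenceSketch {

    // Extractor-style: binary in, one flat searchable string out.
    // The result would only feed a full-text index (CONTAINS criteria).
    static String extractText(byte[] binary) {
        String text = new String(binary, StandardCharsets.UTF_8);
        return text.replaceAll("\\s+", " ").trim();
    }

    // Sequencer-style: binary in, structured output.
    // Each "key: value" line becomes one property of a derived node.
    static Map<String, String> sequence(byte[] binary) {
        Map<String, String> properties = new LinkedHashMap<>();
        String text = new String(binary, StandardCharsets.UTF_8);
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                properties.put(name, value);
            }
        }
        return properties;
    }
}
```

In a real repository the sequencer output would be written as nodes and properties through the session rather than returned as a map, but the shape of the result is the point: the extractor gives you a blob of text, the sequencer gives you content you can navigate and display.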
Specifically, I was happy to see some stuff for PDF files, and merrily went about trying it out. Love the way everything is glued together BTW! It was real easy to test the various configurations. Everything worked like it was supposed to. After the successful text extraction, I wanted to pull up an excerpt of what had been extracted-- similar to MODE-1163 which talks about a jackrabbit-specific way of doing it.
I say "similar" in that I think I actually want that data sequenced instead of indexed (assuming I understand the relationship between extraction and sequencing), as I need to be able to display what was extracted, and probably put things in nodes/properties as opposed to a blob of text.
Basically I'm wondering if I have the relation right 'twixt the two. Judging by MODE-1163, getting at the data the extractor stores isn't trivial-- but I honestly haven't looked at it, I'm just going by the ticket being pushed out a few times.
ModeShape does not have a way to return in the query results an "excerpt" of the matched content with highlighted bits. MODE-1163 is scheduled for 4.1, but is currently not very high on our priority list.
I wrote a quick sequencer for PDF files, but it was so easy I fear I'm missing something. The ticket for a PDF sequencer was closed ages ago, but I didn't see anything besides the Tika extractor in the sources. (Seems like instead of a PDF sequencer, a Tika sequencer would be more useful, since it can introspect so many file types, but before I mess around more I wanted to do a quick sanity check.) Am I missing something obvious?
I don't think it'd be difficult or complex - it's just a matter of creating nodes that reflect the structural elements in the document.
I don't think a Tika sequencer would be useful. Tika just produces a series of tokens from a PDF (or other kind of file), and those tokens have no implied structure. I can't imagine how to convert that list of tokens into a structure of nodes and properties. Perhaps I'm missing something obvious.
I think we used to have a PDF sequencer, but it was removed because it didn't work well. Bottom line is that we don't have a PDF sequencer at the moment. If we did, presumably it would do something similar to our MS Office document sequencer: read the document and produce a node structure that represents the document's structure of sections, paragraphs, etc. The PDF sequencer would likely use a library (such as Apache PDFBox) that provides structured access to the PDF document. (This is exactly how our MS Office document sequencer works, except that it uses Apache POI and looks for Microsoft-specific elements.)
We'd gladly accept a PDF sequencer contribution, though.
If I'm not missing something, and the only reason there's no Tika sequencer is because nobody has needed one, then my next question will be about configuring sequencers: It doesn't seem like there's autowire magic for them like there is for extractors, configuration-wise. I think initialize(blah,blah) is called alone, vs looking for setters or whatnot, but I haven't really dug into it, maybe I'm just overlooking something that turns on the magic, so to speak.
It is possible to configure sequencers. See the sequencer documentation for more information. All of the fields on the Sequencer, TextExtractor, Connector, and IndexProvider subclasses are set reflectively based upon the corresponding fields in the repository's JSON configuration file.
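As a rough illustration of what "set reflectively" means, here is a small plain-Java sketch. It is not ModeShape's implementation: ExamplePdfSequencer, its fields, and ReflectiveConfigurer are all hypothetical names, and the Map stands in for the properties parsed from one sequencer entry in the JSON file. The idea is simply that each configuration key is matched to a same-named member field (including inherited ones) and assigned directly, with no setters involved.

```java
import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical sequencer-like class; the field names are made up for illustration.
class ExamplePdfSequencer {
    String name;
    int maxPages = 10;        // default used when the configuration omits the key
    boolean extractMetadata;
}

// Hypothetical helper; a sketch of reflective field assignment, not ModeShape code.
class ReflectiveConfigurer {

    // Copies each entry of 'properties' onto the same-named field of 'target'.
    // Keys with no matching field are ignored.
    static void apply(Object target, Map<String, Object> properties) {
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            Field field = findField(target.getClass(), entry.getKey());
            if (field == null) {
                continue;
            }
            try {
                field.setAccessible(true);
                field.set(target, entry.getValue());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot set field " + entry.getKey(), e);
            }
        }
    }

    // Looks for a declared field with the given name on the class or any superclass.
    static Field findField(Class<?> type, String name) {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(name);
            } catch (NoSuchFieldException ignored) {
                // keep walking up the hierarchy
            }
        }
        return null;
    }
}
```

So if your custom sequencer declares a field, giving the sequencer's entry in the configuration a property with the same name should be all the "autowiring" that is needed; check the sequencer documentation for the exact rules on supported field types.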