3 Replies Latest reply on Sep 5, 2014 2:43 PM by rhauch

sequencer versus extractor

denstar Sep 4, 2014 2:40 PM

Hello!

I'm wondering about the relationship between sequencers and extractors.

It seems like extractors are for creating indexes for full-text searching, whereas sequencers are for manipulating the repository-- adding nodes and properties (which can also be indexed for full-text searches) and such.

Specifically, I was happy to see some stuff for PDF files, and merrily went about trying it out. Love the way everything is glued together BTW! It was real easy to test the various configurations. Everything worked like it was supposed to. After the successful text extraction, I wanted to pull up an excerpt of what had been extracted-- similar to MODE-1163 which talks about a jackrabbit-specific way of doing it.

I say "similar" in that I think I actually want that data sequenced instead of indexed (assuming I understand the relationship between extraction and sequencing), as I need to be able to display what was extracted, and probably put things in nodes/properties as opposed to a blob of text.

Basically I'm wondering if I have the relation right 'twixt the two. Judging by MODE-1163, getting at the data the extractor stores isn't trivial-- but I honestly haven't looked at it, I'm just going by the ticket being pushed out a few times.

I wrote a quick sequencer for PDF files, but it was so easy I fear I'm missing something. The ticket for a PDF sequencer was closed ages ago, but I didn't see anything besides the Tika extractor in the sources. (Seems like instead of a PDF sequencer, a Tika sequencer would be more useful, since it can introspect so many file types, but before I mess around more I wanted to do a quick sanity check.) Am I missing something obvious?

If I'm not missing something, and the only reason there's no Tika sequencer is because nobody has needed one, then my next question will be about configuring sequencers: It doesn't seem like there's autowire magic for them like there is for extractors, configuration-wise. I think initialize(blah,blah) is called alone, vs looking for setters or whatnot, but I haven't really dug into it, maybe I'm just overlooking something that turns on the magic, so to speak.

Anyways, this project is swell-- I've had lots of fun and little frustration, so kudos!

1. Re: sequencer versus extractor

rhauch Sep 4, 2014 3:15 PM (in response to denstar)

I'm wondering about the relationship between sequencers and extractors.

It seems like extractors are for creating indexes for full-text searching, whereas sequencers are for manipulating the repository-- adding nodes and properties (which can also be indexed for full-text searches) and such.

You are correct.

Text extractors are used to get the searchable text from binary values. This information is then used when applying CONTAINS criteria during query execution. If you don't use any CONTAINS criteria in your queries, there is no point using a text extractor.
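For illustration, here is a minimal sketch of the kind of JCR-SQL2 statement that uses a CONTAINS criterion. The class and method names are made up for this example, and the node type and search term are placeholders; only the CONTAINS syntax itself comes from the JCR 2.0 query language.

```java
// Hypothetical helper, not part of ModeShape or the JCR API.
final class ContainsQuerySketch {

    /**
     * Builds a JCR-SQL2 statement that full-text searches all properties
     * of nodes of the given type. Single quotes in the search expression
     * are doubled, which is how JCR-SQL2 escapes them in string literals.
     */
    static String fullTextQuery(String nodeType, String searchExpression) {
        String escaped = searchExpression.replace("'", "''");
        return "SELECT * FROM [" + nodeType + "] AS n"
                + " WHERE CONTAINS(n.*, '" + escaped + "')";
    }

    public static void main(String[] args) {
        // In a real application this string would be handed to
        // QueryManager.createQuery(statement, Query.JCR_SQL2).
        System.out.println(fullTextQuery("nt:resource", "invoice"));
    }
}
```

Without a text extractor configured, a query like this has no extracted text from binary values to match against, which is why the extractor only matters when CONTAINS is in play.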

Sequencers are much more complex and more useful. They process a binary value and extract structured information in the form of new nodes and properties written back into the repository, where those new nodes and properties are accessible just like any other repository content. Automatic sequencers do this asynchronously when they detect content changes, while manual sequencers are invoked explicitly by an application and write the nodes to the current session, and the application dictates whether those changes are saved. See the sequencer documentation for more information. Note that if a sequencer produces any BINARY values as part of its output, those BINARY values might have their text extracted for indexing/search purposes.

Sequencers are never used to update indexes or during query execution. The only thing they do is produce additional repository content (nodes & properties) that, like any other content, can then be accessed and queried.

Specifically, I was happy to see some stuff for PDF files, and merrily went about trying it out. Love the way everything is glued together BTW! It was real easy to test the various configurations. Everything worked like it was supposed to. After the successful text extraction, I wanted to pull up an excerpt of what had been extracted-- similar to MODE-1163 which talks about a jackrabbit-specific way of doing it.

I say "similar" in that I think I actually want that data sequenced instead of indexed (assuming I understand the relationship between extraction and sequencing), as I need to be able to display what was extracted, and probably put things in nodes/properties as opposed to a blob of text.

Basically I'm wondering if I have the relation right 'twixt the two. Judging by MODE-1163, getting at the data the extractor stores isn't trivial-- but I honestly haven't looked at it, I'm just going by the ticket being pushed out a few times.

ModeShape has no way to return, in the query results, an "excerpt" of the matched content with the matching bits highlighted. MODE-1163 is scheduled for 4.1, but is currently not very high on our priority list.

I wrote a quick sequencer for PDF files, but it was so easy I fear I'm missing something. The ticket for a PDF sequencer was closed ages ago, but I didn't see anything besides the Tika extractor in the sources. (Seems like instead of a PDF sequencer, a Tika sequencer would be more useful, since it can introspect so many file types, but before I mess around more I wanted to do a quick sanity check.) Am I missing something obvious?

I don't think it'd be difficult or complex - it's just a matter of creating nodes that reflect the structural elements in the document.

I don't think a Tika sequencer would be useful. Tika just produces a series of tokens from a PDF (or other kind of file), and those tokens have no implied structure. I can't imagine how to convert that list of tokens into a structure of nodes and properties. Perhaps I'm missing something obvious.

I think we used to have a PDF sequencer, but it was removed because it didn't work well. Bottom line is that we don't have a PDF sequencer at the moment. If we did, presumably it would do something similar to our MS Office document sequencer: read the document and produce a node structure that represents the document's structure of sections, paragraphs, etc. The PDF sequencer would likely use a library that provides structured access to the PDF document, much as our MS Office document sequencer uses Apache POI to look for Microsoft-specific elements.

We'd gladly accept a PDF sequencer contribution, though.

If I'm not missing something, and the only reason there's no Tika sequencer is because nobody has needed one, then my next question will be about configuring sequencers: It doesn't seem like there's autowire magic for them like there is for extractors, configuration-wise. I think initialize(blah,blah) is called alone, vs looking for setters or whatnot, but I haven't really dug into it, maybe I'm just overlooking something that turns on the magic, so to speak.

It is possible to configure sequencers. See the sequencer documentation for more information. All of the fields on the Sequencer, TextExtractor, Connector, and IndexProvider subclasses are set reflectively from the fields of the same name in the JSON configuration file.
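To give a rough idea of what "set reflectively" means, here is a minimal sketch using only the JDK. It is not ModeShape's actual code: ExampleSequencer, its fields, and applyConfig are hypothetical, and a plain Map stands in for the parsed JSON block.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a sequencer subclass; not a ModeShape class.
class ExampleSequencer {
    private String name;
    private int maxPages = 10;
    private boolean includeText;

    String name() { return name; }
    int maxPages() { return maxPages; }
    boolean includeText() { return includeText; }
}

final class ReflectiveConfigSketch {

    /** Copies each config entry onto the field with the same name. */
    static void applyConfig(Object target, Map<String, Object> config) {
        for (Map.Entry<String, Object> entry : config.entrySet()) {
            Field field = findField(target.getClass(), entry.getKey());
            if (field == null) {
                continue; // this sketch skips keys that match no field
            }
            try {
                field.setAccessible(true);
                field.set(target, entry.getValue());
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot set " + entry.getKey(), e);
            }
        }
    }

    /** Looks for a declared field on the class or any of its superclasses. */
    private static Field findField(Class<?> type, String fieldName) {
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(fieldName);
            } catch (NoSuchFieldException e) {
                // not declared here; keep looking in the superclass
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> config = new LinkedHashMap<>();
        config.put("name", "PDF sequencer");
        config.put("maxPages", 25);
        config.put("unknownKey", "ignored");

        ExampleSequencer sequencer = new ExampleSequencer();
        applyConfig(sequencer, config);
        System.out.println(sequencer.name() + " / " + sequencer.maxPages());
    }
}
```

The practical upshot is that a custom sequencer should not need setters for its options: declaring a field whose name matches the key in the configuration should be enough, with initialize(...) left for anything that depends on those values.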
2. Re: sequencer versus extractor

denstar Sep 5, 2014 1:09 AM (in response to rhauch)

I don't think a Tika sequencer would be useful. Tika just produces a series of tokens from a PDF (or other kind of file), and those tokens have no implied structure. I can't imagine how to convert that list of tokens into a structure of nodes and properties. Perhaps I'm missing something obvious.

The tokens are (mostly) unstructured metadata, so it was the old "- * (STRING)" and away I went. Heh.

It's about as useful as the mp3 sequencer, or the image sequencer I reckon. Basically just properties on a node, right? Just not much in the way of structure.

And that's the rub, really. Some of the metadata Tika returns is namespaced (dublin core, etc.), but a lot of it isn't, and there's not much (that I saw, at least) in the way of saying what came from where, as it were. Kinda a big mush of meta.

For the PDF sequencer I pulled in the dublin (dc:) namespace, and then used "- * (STRING)" for the rest, replacing any colons. That's when I thought, "hrm, really this is a tika sequencer". (You don't get the same type of info with PDFs as you would with office docs anyhow, re:paragraphs and such. It's mostly just dc:title, author, a couple other random attributes, and then whatever text it can make sense of, as really pdf text is a bunch of positioned characters, vs. sentences and such.)
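The naming rule described above can be sketched roughly like this. It is not the actual sequencer code: the class and method names are hypothetical, the metadata keys are only illustrative, and the underscore is an assumed replacement character, since the post only says the colons were replaced.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not taken from the sequencer discussed in the thread.
final class MetadataNameSketch {

    private static final String DC_PREFIX = "dc:";

    /**
     * Keeps Dublin Core keys (a single "dc:" prefix) unchanged so they land
     * in the registered dc namespace; every other key has its colons replaced
     * so it can be stored under a residual "- * (STRING)" definition without
     * being mistaken for a namespaced name.
     */
    static String toPropertyName(String metadataKey) {
        boolean plainDublinCore = metadataKey.startsWith(DC_PREFIX)
                && metadataKey.indexOf(':', DC_PREFIX.length()) < 0;
        if (plainDublinCore) {
            return metadataKey;
        }
        return metadataKey.replace(':', '_'); // '_' is an assumption
    }

    /** Applies the naming rule to a whole metadata map, preserving order. */
    static Map<String, String> toProperties(Map<String, String> metadata) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            properties.put(toPropertyName(entry.getKey()), entry.getValue());
        }
        return properties;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("dc:title", "Quarterly report");
        metadata.put("xmpTPg:NPages", "12");
        metadata.put("Content-Type", "application/pdf");
        System.out.println(toProperties(metadata));
    }
}
```

This sketch only deals with colons; other characters that are not legal in JCR names would need similar treatment.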

I could cut out the *, and trace down/limit the metadata properties specific to PDFs, and then it'd be more of a PDF sequencer, but Tika does handle a huge variety of formats, and I'm thinking other people might want the same thing: content annotated with the metadata Tika can extract. I'll probably use it for some other file types later, actually. It *is* super unstructured though, which is kinda rough. I wrote an ics calendar file sequencer too, and that format is pretty solid, so it does have an actual node structure, with specific property names, etc.

Again, I'm mucho happy with how easy it has been to make my repository all automagical: upload a PDF, it's OCRed and searchable, upload a calendar file, bam!, manhandleable as well. Maybe I'll write a serializer for the calendar node, so the derived data is editable... hrm, that may just be silly tho... At any rate, thank you for the clarification Randall!

I'm happy to toss both these sequencers (or all three, if there's a need for a specific PDF sequencer vs. a general Tika one) back into the pot, so to speak-- I reckon opening tickets is the first step?
3. Re: sequencer versus extractor

rhauch Sep 5, 2014 2:43 PM (in response to denstar)

I'm glad you're finding ModeShape to be easy and useful. We would welcome you contributing the sequencers.

I don't know how useful the Tika sequencer might be, but it might be a bit more interesting if it worked two different ways:

- If the output path is the same as the input path (or is the 'nt:file' parent of the 'jcr:content' child node), it merely adds a property. This would allow one to use the sequencer to extract text from a BINARY and put the text as a large STRING (or BINARY) property onto the same node.
- If the output path is different from the input path, then it could create a node with that property.
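
The two modes described above could be told apart roughly as follows. This is only a sketch of the decision, not a ModeShape API: the class, enum, and method are hypothetical, paths are plain strings, and checking for a trailing "/jcr:content" segment stands in for actually checking that the parent node is an 'nt:file'.

```java
// Hypothetical decision helper; a real sequencer would work with JCR nodes.
final class OutputModeSketch {

    enum Mode { ADD_PROPERTY_TO_EXISTING_NODE, CREATE_NODE_WITH_PROPERTY }

    private static final String CONTENT_SUFFIX = "/jcr:content";

    /**
     * inputNodePath is assumed to be the path of the node holding the
     * BINARY property, for example "/files/report.pdf/jcr:content".
     */
    static Mode chooseMode(String inputNodePath, String outputNodePath) {
        if (outputNodePath.equals(inputNodePath)) {
            return Mode.ADD_PROPERTY_TO_EXISTING_NODE;
        }
        if (inputNodePath.endsWith(CONTENT_SUFFIX)) {
            // Path of the parent of the jcr:content child ("/" for a top-level child).
            String parent = inputNodePath.substring(
                    0, inputNodePath.length() - CONTENT_SUFFIX.length());
            if (parent.isEmpty()) {
                parent = "/";
            }
            if (outputNodePath.equals(parent)) {
                return Mode.ADD_PROPERTY_TO_EXISTING_NODE;
            }
        }
        return Mode.CREATE_NODE_WITH_PROPERTY;
    }

    public static void main(String[] args) {
        String input = "/files/report.pdf/jcr:content";
        System.out.println(chooseMode(input, input));
        System.out.println(chooseMode(input, "/files/report.pdf"));
        System.out.println(chooseMode(input, "/derived/report.pdf"));
    }
}
```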

As for the PDF sequencer, it would be great if this extracted the document structure. For example, Apache PDFBox provides access to the metadata as well as the pages, bookmarks, annotations and other higher-level structures of PDF documents. It'd be great if it could reconstruct the nodes to reflect the table of contents, but it doesn't look like PDFBox can provide that information.