1 Reply Latest reply on Jul 26, 2011 11:20 AM by rhauch

Text extraction and Content type detection with Aperture or Tika

lisak Jul 25, 2011 2:57 PM

Hey,

I've been working with Tika for a while, and it has been quite satisfying. It is simple enough when only the Tika facade is used, with either the default TikaConfig or a custom one that represents org/apache/tika/mime/tika-mimetypes.xml.

There are 1290 media types, and 118 of them are resolved from the first bytes by a magic marker. My repository should support only these: html, doc, docx, docm, odt, txt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm.

What is the origin of those mime types in modeshape-common/src/main/resources/org/modeshape/mime.types ?

At first I thought I would have to remove all other MIME type entries from the tika-mimetypes.xml config file to make it as efficient as possible, because the TikaConfig() constructor is a somewhat messy implementation, imho. It creates as many parserDecorators as there are MIME types in tika-mimetypes.xml, and they are all the same except for those that implement the Parser interface. They are all part of a CompositeParser, and each one supports only one MIME type... That's why it is so greedy for memory unless one keeps only the MIME types s/he really needs. Then I realized that it is better to have all MIME types (at least those with a magic marker declared) available for detection: since the principle consists in matching the first bytes, the sooner the MIME type is discovered, the better.
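To make the "matching the first bytes" idea concrete, here is a minimal sketch in plain JDK Java. It is not Tika's implementation: the class name MagicSniffer, the tiny signature table, and the "application/x-ole-storage" label are all assumptions made for illustration. Note that docx/xlsx/odt files share the ZIP signature and doc/xls/ppt share the OLE2 signature, so the leading bytes alone only identify the container.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or ModeShape.
final class MagicSniffer {

    // One entry of the illustrative signature table.
    private static final class Signature {
        final byte[] magic;
        final String mediaType;

        Signature(byte[] magic, String mediaType) {
            this.magic = magic;
            this.mediaType = mediaType;
        }
    }

    private static final List<Signature> SIGNATURES = new ArrayList<>();

    static {
        SIGNATURES.add(new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        SIGNATURES.add(new Signature("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf"));
        // OLE2 compound file: the container used by legacy doc, xls and ppt.
        // The label below is a placeholder chosen for this sketch.
        SIGNATURES.add(new Signature(
                new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},
                "application/x-ole-storage"));
        // ZIP local file header: the container used by docx, xlsx, odt and friends.
        SIGNATURES.add(new Signature(new byte[] {'P', 'K', 0x03, 0x04}, "application/zip"));
    }

    // Returns the media type of the first signature that prefixes 'head',
    // or application/octet-stream when nothing matches.
    static String detect(byte[] head) {
        for (Signature s : SIGNATURES) {
            if (startsWith(head, s.magic)) {
                return s.mediaType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head));
    }
}
```

Only a handful of leading bytes are needed per check, which is why keeping more signatures around for detection costs little compared with parsing.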

Why did you decide to use Tika for extraction and Aperture for content detection? I'm assuming it is not the center of attention now, because the mimetype-detector-aperture extension includes Aperture version 1.1.0.Beta1. Aperture is quite interesting software; it seems to be aimed at crawling documents in filesystem directory structures, websites and mailboxes, and it generates corresponding RDF. But I guess it doesn't have much use in ModeShape, where resources are not crawled and there is no need for RDF output.

I wonder how 5 requests / second would perform with a Tika singleton for detection followed by parsing; I'll have to write some load tests. Also, there is no way of checking the integrity of documents in Tika, because pdfbox or poi can throw exceptions that do not necessarily mean the user won't be able to open the document.
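A load test of that kind could be sketched with plain JDK scheduling. Everything here is an assumption for illustration: the FixedRateLoad class is hypothetical, and the Runnable passed in stands for the detect-then-parse call, which is not shown. A 200 ms period corresponds to 5 requests / second.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical load driver; 'operation' is a stand-in for the real work.
final class FixedRateLoad {

    // Outcome of one run: successful calls, failed calls, slowest call.
    static final class Result {
        final int ok;
        final int failed;
        final long maxNanos;

        Result(int ok, int failed, long maxNanos) {
            this.ok = ok;
            this.failed = failed;
            this.maxNanos = maxNanos;
        }
    }

    // Runs 'operation' totalCalls times, starting one call every periodMillis.
    // A single scheduler thread is used, so calls run one after another; if a
    // call takes longer than the period, the effective rate drops.
    static Result run(Runnable operation, int totalCalls, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(totalCalls);
        AtomicInteger ok = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        AtomicLong maxNanos = new AtomicLong();

        Runnable tick = () -> {
            if (done.getCount() == 0) {
                return; // all requested calls were already made
            }
            long start = System.nanoTime();
            try {
                operation.run();
                ok.incrementAndGet();
            } catch (RuntimeException e) {
                // Count the failure; an escaping exception would cancel the schedule.
                failed.incrementAndGet();
            } finally {
                maxNanos.accumulateAndGet(System.nanoTime() - start, Math::max);
                done.countDown();
            }
        };

        scheduler.scheduleAtFixedRate(tick, 0, periodMillis, TimeUnit.MILLISECONDS);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load run interrupted", e);
        } finally {
            scheduler.shutdownNow();
        }
        return new Result(ok.get(), failed.get(), maxNanos.get());
    }

    public static void main(String[] args) {
        // 10 calls at 5 per second; the empty lambda is a placeholder operation.
        Result r = run(() -> { }, 10, 200);
        System.out.println("ok=" + r.ok + " failed=" + r.failed + " maxNanos=" + r.maxNanos);
    }
}
```

Counting failures separately fits the integrity concern above: an exception from the parsing library would show up in the failed count without stopping the run.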

We'll see.

1. Re: Text extraction and Content type detection with Aperture or Tika

rhauch Jul 26, 2011 11:20 AM (in response to lisak)

What is the origin of those mime types in modeshape-common/src/main/resources/org/modeshape/mime.types ?
I actually don't recall, other than that it's an aggregation of multiple other files plus a number of custom entries. It's also used in our ExtensionBasedMimeTypeDetector class (one of several MimeTypeDetector implementations in ModeShape), which uses only the extension of the filename and none of the file content. This detector is also the only one enabled by default. You can always provide your own 'org/modeshape/mime.types' file on the classpath, too.
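As a rough illustration of extension-only detection, the following sketch parses text in the common mime.types layout (a media type followed by whitespace-separated extensions, with '#' comment lines) and looks up a filename by its extension. This is not the ExtensionBasedMimeTypeDetector source: the ExtensionMimeTable class is hypothetical, and the assumption that ModeShape's file follows this layout is not confirmed here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table; not ModeShape's actual detector.
final class ExtensionMimeTable {

    private final Map<String, String> typeByExtension = new HashMap<>();

    // Parses content in the assumed "type ext1 ext2 ..." layout.
    static ExtensionMimeTable parse(String content) {
        ExtensionMimeTable table = new ExtensionMimeTable();
        for (String rawLine : content.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blanks and comments
            }
            String[] tokens = line.split("\\s+");
            // tokens[0] is the media type; the rest are its extensions.
            for (int i = 1; i < tokens.length; i++) {
                table.typeByExtension.put(tokens[i].toLowerCase(Locale.ROOT), tokens[0]);
            }
        }
        return table;
    }

    // Returns the media type for the filename's extension, or null if unknown.
    // The file content is never inspected.
    String mimeTypeOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return typeByExtension.get(extension);
    }

    public static void main(String[] args) {
        ExtensionMimeTable table = parse("application/pdf pdf\ntext/html html htm\n");
        System.out.println(table.mimeTypeOf("report.pdf"));
    }
}
```

Because only the name is consulted, a renamed file is reported under its new extension, which is the trade-off against content-based detection.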

At first I thought I would have to remove all other MIME type entries from the tika-mimetypes.xml config file to make it as efficient as possible, because the TikaConfig() constructor is a somewhat messy implementation, imho. It creates as many parserDecorators as there are MIME types in tika-mimetypes.xml, and they are all the same except for those that implement the Parser interface. They are all part of a CompositeParser, and each one supports only one MIME type... That's why it is so greedy for memory unless one keeps only the MIME types s/he really needs. Then I realized that it is better to have all MIME types (at least those with a magic marker declared) available for detection: since the principle consists in matching the first bytes, the sooner the MIME type is discovered, the better.
I don't have a lot of experience using Tika, but I thought its focus was simply text extraction. If it can easily be used for content-based MIME type detection, then I'd love to see another ModeShape MIME type detector extension added to the 'modeshape-extractor-tika' module. Any interest in contributing this??

Why did you decide to use Tika for extraction and Aperture for content detection? I'm assuming it is not the center of attention now, because the mimetype-detector-aperture extension includes Aperture version 1.1.0.Beta1. Aperture is quite interesting software; it seems to be aimed at crawling documents in filesystem directory structures, websites and mailboxes, and it generates corresponding RDF. But I guess it doesn't have much use in ModeShape, where resources are not crawled and there is no need for RDF output.

Well, we started using Aperture quite some time ago - I believe right around the time that Tika made its first incubating (0.1) release. I guess we just didn't know about it. The initial goal was to provide some mechanism for content-based MIME type detection, and Aperture seemed to be more capable than other LGPL-compatible libraries we found. Sure, Aperture can do a lot more than what we're using it for, but it does work for content-based MIME type detection. Honestly, I don't know that it gets used much (hence the older version), since it's not configured for use in ModeShape out of the box.