Text extraction and Content type detection with Aperture or Tika
lisak Jul 25, 2011 2:57 PM
Hey,
I've been working with Tika for a while and it has been quite satisfying. It is simple enough when only the Tika facade is used, with either the default or a custom TikaConfig built from org/apache/tika/mime/tika-mimetypes.xml.
That file lists 1290 media types, and 118 of them are resolved from the first bytes of the content by a magic marker. My repository should support only these: html, doc, docx, docm, odt, txt, rtf, pdf, odf, odp, xls, xlsx, xlsm, ppt, pptm.
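To make the list above concrete, here is a minimal sketch of a whitelist check in plain Java. It is not ModeShape or Tika code; the class name SupportedTypes and the method isSupported are hypothetical, and matching on the file extension is only an assumption about how such a restriction could be enforced.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: accepts only the extensions listed in the post.
class SupportedTypes {
    static final Set<String> EXTENSIONS = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "html", "doc", "docx", "docm", "odt", "txt", "rtf", "pdf",
            "odf", "odp", "xls", "xlsx", "xlsm", "ppt", "pptm")));

    // Returns true when the file name ends with one of the supported extensions.
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all, or a trailing dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return EXTENSIONS.contains(ext);
    }
}
```

An extension check like this is cheap but easy to fool, so it would only be a first filter in front of real content detection.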
Where do the mime types in modeshape-common/src/main/resources/org/modeshape/mime.types come from?
At first I thought I would have to remove all the other mime type entries from the tika-mimetypes.xml config file to make it as efficient as possible, because the TikaConfig() constructor is a somewhat messy implementation, imho. It creates as many parserDecorators as there are mime types in tika-mimetypes.xml, and they are all the same apart from those that implement the Parser interface. They are all part of a CompositeParser, and each one supports only one mime type... That's why it is so greedy for memory unless one keeps only the mime types s/he really needs. Then I realized that it is better to have all mime types (at least those with a magic marker declared) available for detection: since the principle consists of matching the first bytes, the sooner the mime type is discovered, the better.
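To illustrate the "matching the first bytes" principle, here is a minimal sketch in plain Java. It is not how Tika implements detection; the class MagicSniffer and the labels it returns are hypothetical. The byte signatures themselves are the well-known ones for PDF, RTF, ZIP containers (which docx, xlsx and odt files are) and OLE2 containers (the legacy doc, xls and ppt formats).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sniffer: maps a leading byte signature to a coarse label.
class MagicSniffer {
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    static {
        MAGIC.put("pdf", "%PDF".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("rtf", "{\\rtf".getBytes(StandardCharsets.US_ASCII));
        // ZIP local file header, shared by docx, xlsx, odt and friends
        MAGIC.put("zip-container", new byte[] {0x50, 0x4B, 0x03, 0x04});
        // OLE2 compound document header, shared by doc, xls and ppt
        MAGIC.put("ole2-container", new byte[] {
                (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1});
    }

    // Returns the label of the first signature that prefixes 'head', or "unknown".
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A signature only identifies the container here, so telling a docx from an xlsx would still need a look inside the archive, and plain txt has no signature at all.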
Why did you decide to use Tika for extraction and Aperture for content detection? I'm assuming this is not the center of attention right now, because the mimetype-detector-aperture extension includes Aperture version 1.1.0.Beta1. Aperture is quite an interesting piece of software; it seems to be aimed at crawling documents in filesystem directory structures, websites and mailboxes, and it generates the corresponding RDF. But I guess it isn't of much use in ModeShape when resources are not being crawled and there is no need for RDF output.
I wonder how 5 requests per second would perform with a Tika singleton doing detection followed by parsing; I'll have to write some load tests. Also, there is no way of checking the integrity of documents in Tika, because pdfbox or poi can throw exceptions that do not necessarily mean the user won't be able to open the document.
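A load test along those lines could be sketched as follows, in plain Java. The harness FixedRateLoadTest and its Result type are hypothetical, and the Runnable passed in is only a stand-in for whatever detect-then-parse call is being measured. Exceptions are counted as failures instead of aborting the run, since a parser exception does not necessarily mean the document is unusable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness: fires 'work' at a fixed rate and records latencies.
class FixedRateLoadTest {
    static final class Result {
        final List<Long> latenciesNanos = Collections.synchronizedList(new ArrayList<Long>());
        final AtomicInteger failures = new AtomicInteger();
    }

    static Result run(Runnable work, int requestsPerSecond, int totalRequests) {
        Result result = new Result();
        CountDownLatch done = new CountDownLatch(totalRequests);
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        long periodMicros = 1_000_000L / requestsPerSecond;
        try {
            for (int i = 0; i < totalRequests; i++) {
                // Each request gets its own start offset, so a slow request
                // does not delay the ones scheduled after it.
                pool.schedule(() -> {
                    long start = System.nanoTime();
                    try {
                        work.run();
                    } catch (RuntimeException e) {
                        result.failures.incrementAndGet();
                    } finally {
                        result.latenciesNanos.add(System.nanoTime() - start);
                        done.countDown();
                    }
                }, i * periodMicros, TimeUnit.MICROSECONDS);
            }
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load test interrupted", e);
        } finally {
            pool.shutdown();
        }
        return result;
    }
}
```

Calling FixedRateLoadTest.run(task, 5, 300) would correspond to 5 requests per second for about a minute; the pool size of 4 is an arbitrary choice and would need tuning against real parse times.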
We'll see.