Hibernate Search and offline text extraction

Version 3

    Posted in November 2007 by Roberto Bicchierai and Pietro Polsinelli.


    Suppose that, using Hibernate Search, you want to index not only the standard persistent content of your objects (string properties such as name and description) but also the files they reference, such as PDF documents, HTML pages and so on.

     

    We are going to address the following problem: if you use Hibernate Search in the simplest way to index such properties, text extraction happens at the same time as the objects are stored, and hence inside the transaction. The thread cannot complete until text extraction has finished, even if indexing is done asynchronously, which is an option in Hibernate Search.
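
    The effect can be reproduced without Hibernate or Lucene. The following is only a simulation, using plain JDK classes: EagerExtractionDemo, the "indexer" thread name and the fake extract method are our own illustrative names, not part of any library. It shows that when the field value is computed while the document is being built, the calling thread pays for the extraction, even though the indexing step itself is handed to another thread.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simulation only: no Hibernate Search or Lucene classes are involved.
final class EagerExtractionDemo {

  // Stand-in for slow text extraction; records which thread paid for it.
  static String extract(String[] threadSink) {
    threadSink[0] = Thread.currentThread().getName();
    return "extracted text";
  }

  // Returns the name of the thread on which extraction ran.
  static String run() {
    String[] extractionThread = new String[1];
    Queue<String> fakeIndex = new ConcurrentLinkedQueue<>();
    ExecutorService indexer =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "indexer"));
    try {
      // The value is computed while the "document" is built, that is,
      // before the work ever reaches the asynchronous indexer.
      String fieldValue = extract(extractionThread);
      // Only this step is asynchronous.
      indexer.submit(() -> { fakeIndex.add(fieldValue); }).get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      indexer.shutdown();
    }
    return extractionThread[0];
  }
}
```

    Calling EagerExtractionDemo.run() returns the name of the calling thread, not "indexer": the expensive step never left the caller.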

     

    As a solution, we developed a custom Hibernate Search field bridge and a lazy field implementation of Lucene's Fieldable interface, which together gave us the tool we needed.

    The code presented here should be taken as a sample; it will need some changes before you can adopt it in your own code. We assume familiarity with basic Lucene concepts, such as can be gained from the excellent book "Lucene in Action", published by Manning.

    To enable asynchronous indexing, add this line to your Hibernate Search configuration:

    ...
    hibConfiguration.getProperties().put("org.hibernate.worker.execution", "async");
    ...
    

     

    In our sample code we have a Hibernate persistent class, "Issue", with two properties that we want to index with Lucene: description and attachment. The first is a String-valued field; the second is of type "PersistentFile". The PersistentFile class abstracts the notion of a file, which can be implemented by different means: a file on the file system (as in our example implementation), a db blob, a file reached through SVN or FTP, and so on.
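
    To make the abstraction concrete, here is a minimal sketch of what such a type could look like. It is not the actual PersistentFile class from our sources: only getOriginalFileName() appears in the code on this page, while the interface name SketchPersistentFile, the openStream() accessor and the two implementations are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for PersistentFile: callers see a name and a
// stream, never the storage mechanism behind them.
interface SketchPersistentFile {
  String getOriginalFileName();   // used by the extractor to pick a parser
  InputStream openStream();       // assumed accessor, not from the original class
}

// File-system backed variant.
final class FileSystemFile implements SketchPersistentFile {
  private final Path path;

  FileSystemFile(Path path) {
    this.path = path;
  }

  public String getOriginalFileName() {
    return path.getFileName().toString();
  }

  public InputStream openStream() {
    try {
      return Files.newInputStream(path);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

// In-memory variant, standing in for a db blob.
final class BlobFile implements SketchPersistentFile {
  private final String name;
  private final byte[] bytes;

  BlobFile(String name, byte[] bytes) {
    this.name = name;
    this.bytes = bytes.clone();
  }

  public String getOriginalFileName() {
    return name;
  }

  public InputStream openStream() {
    return new ByteArrayInputStream(bytes);
  }
}
```

    With a shape like this, the indexing code depends only on the interface, so moving attachments from the file system to a blob would not touch the field bridge.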

    Here is how to annotate the Issue fields for indexing:

    ...
    @Lob
    @Column(name = "descriptionx")
    @Field(name = "content", index = org.hibernate.search.annotations.Index.TOKENIZED, store = Store.NO)
    @Boost(3)
    public String getDescription() {
      return description;
    }
    ...
    
    @Type(type = "org.jblooming.ontology.PersistentFileType")
    @Column(name = "screenShot")
    @Field(name = "content", index = org.hibernate.search.annotations.Index.TOKENIZED, store = Store.NO)
    @FieldBridge(impl = PersistentFileBridge.class)
    public PersistentFile getAttachment() {
      return attachment;
    }
    ...
    

     

    Notice the crucial part here: the @FieldBridge(impl = PersistentFileBridge.class) annotation.

    For text extraction we use PDFBox, a handy PDF extraction utility which can be found at http://www.pdfbox.org. You can add your own extractors, say for HTML, .doc and so on.

    Here is our custom Hibernate Search field bridge class:

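    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.hibernate.search.bridge.FieldBridge;

    public class PersistentFileBridge implements FieldBridge {

      // Called by Hibernate Search when the entity is indexed: instead of
      // extracting the text now, add a field that extracts it on demand.
      public void set(String name, Object value, Document document, Field.Store store,
              Field.Index index, Float boost) {
        if (value != null) {
          PersistentFile pf = (PersistentFile) value;
          LazyField field = new LazyField(name, pf, store, index, boost);
          document.add(field);
        }
      }
    }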
    

     

    As you can see, it uses the LazyField class to add a Lucene field to the Lucene document. Here is the constructor of LazyField:

    public LazyField(String name, PersistentFile persistentFile,
        Field.Store store, Field.Index index, Float boost) {
      super(name, store, index, Field.TermVector.NO);
      // essential setting: this instructs Lucene not to call
      // stringValue() on field creation, but only when needed
      lazy = true;
      if (boost != null) {
        setBoost(boost);
      }
      this.persistentFile = persistentFile;
    }
    

     

    And here is the method that is called lazily, only when the value is needed:

    public String stringValue() {
      if (content == null) {
        try {
          content = TextExtractor.getContent(persistentFile);
        } catch (IOException e) {
          // you may implement something smarter
          throw new RuntimeException(e);
        }
      }
      return content;
    }
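
    The pattern can be tried in isolation. The sketch below is a plain JDK simulation, not Lucene code: LazyText plays the role of LazyField, a Supplier plays the role of the text extractor, and a single-thread executor named "indexer" plays the role of the asynchronous indexing thread. All of these names are our own assumptions. It checks that nothing is extracted when the field is created, and that the extraction then runs once, on the thread that first asks for the value.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for LazyField: holds a recipe for the text,
// not the text itself.
final class LazyText {
  private final Supplier<String> extractor;
  private String content;   // filled on the first stringValue() call

  LazyText(Supplier<String> extractor) {
    this.extractor = extractor;
  }

  synchronized String stringValue() {
    if (content == null) {
      content = extractor.get();   // the expensive step runs here, once
    }
    return content;
  }
}

final class LazyTextDemo {

  // Builds the field on the calling thread, reads it twice on a worker
  // thread, and reports where and how often extraction ran.
  static String run() {
    AtomicInteger extractions = new AtomicInteger();
    String[] extractionThread = new String[1];
    LazyText field = new LazyText(() -> {
      extractions.incrementAndGet();
      extractionThread[0] = Thread.currentThread().getName();
      return "extracted text";
    });
    // Creating the field must not trigger extraction.
    if (extractions.get() != 0) {
      throw new IllegalStateException("extraction ran eagerly");
    }
    ExecutorService indexer =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "indexer"));
    try {
      indexer.submit(() -> { field.stringValue(); field.stringValue(); }).get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      indexer.shutdown();
    }
    return "thread=" + extractionThread[0] + " extractions=" + extractions.get();
  }
}
```

    LazyTextDemo.run() returns "thread=indexer extractions=1": the caller only created the field, and the worker thread did the extraction, a single time despite two reads.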
    

     

    Notice the text extractor call, which goes like this:

    ...
    
    if (pf.getOriginalFileName().toLowerCase().endsWith(".pdf")) {
      PDFTextStripper stripper = new PDFTextStripper();
      PDDocument document = PDDocument.load(inputStream);
      try {
        stripper.writeText(document, sw);
        content = content + sw.getBuffer().toString();
      } finally {
        // release the PDF resources even if extraction fails
        document.close();
      }
    }
    
    ...
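
    The extension check above extends naturally to other formats. As a rough sketch of one way to organise that, the registry below maps a file extension to an extraction function. It uses only JDK classes; TextExtractors, register, extract and defaults are hypothetical names, and the regex-based HTML stripping is a crude placeholder for a real parser. A PDFBox-backed extractor could be registered under "pdf" in the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: maps a lower-case file extension to a function
// that turns raw bytes into indexable text.
final class TextExtractors {
  private final Map<String, Function<byte[], String>> byExtension =
      new LinkedHashMap<>();

  TextExtractors register(String extension, Function<byte[], String> extractor) {
    byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    return this;
  }

  // Returns the extracted text, or "" when no extractor matches the name.
  String extract(String fileName, byte[] bytes) {
    int dot = fileName.lastIndexOf('.');
    String ext = dot < 0
        ? ""
        : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    Function<byte[], String> extractor = byExtension.get(ext);
    return extractor == null ? "" : extractor.apply(bytes);
  }

  // Two sample extractors that need no external library.
  static TextExtractors defaults() {
    return new TextExtractors()
        .register("txt", b -> new String(b, StandardCharsets.UTF_8))
        // Crude tag stripping, a placeholder for a real HTML parser.
        .register("html", b -> new String(b, StandardCharsets.UTF_8)
            .replaceAll("<[^>]*>", " ")
            .replaceAll("\\s+", " ")
            .trim());
  }
}
```

    Returning an empty string for unknown extensions keeps the entity indexable on its other fields; whether to log or fail instead is a choice for your own code.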
    

     

     

    If you have updated information, please add it to this page (not to the remarks below) or send it to us (mailto:ppolsinelli (at) open-lab (dot) com) so that we can include it here.

    References

    The sources were compiled with Hibernate Search 3.0.

    Full download of sample code:

    http://sourceforge.net/project/showfiles.php?group_id=128221&package_id=251438

    Hibernate Search:

    http://search.hibernate.org

    PDFBox:

    http://www.pdfbox.org

    Lucene in Action:

    http://www.manning.com/hatcher2

    A blog entry about "Using Hibernate Search with complex requirements":

    http://twproject.blogspot.com/2007/11/using-hibernate-search-with-complex.html