W-JAX 2009: Hibernate Search – Full-text search for Hibernate applications

Disclaimer: This entry has been written while listening to the talk. Please forgive me any typographical or grammatical errors resulting from this approach.

Emmanuel Bernard from JBoss is next – he is going to cover basic topics like general search functionalities with Hibernate as well as advanced topics like how to handle typos and approximations in combination with Lucene.

Emmanuel starts by outlining the basic problems of relational approaches to full text queries: users think word-oriented, while relational databases store data column-wise, so full text searches are extremely costly. He then proceeds to outline how full text search techniques help to cope with this problem, explaining tokenizers, n-grams and so on.
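To make the word-oriented idea concrete, here is a minimal sketch of an inverted index in plain Java. It is not from the talk and it is not how Lucene is implemented; the class name TinyInvertedIndex and its methods are made up for illustration. It only shows why a lookup by word is cheap once the text has been tokenized up front.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): maps each lowercased word to the ids of the documents containing it. */
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Simplistic tokenizer: lowercases and splits on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Indexing time: every token of the text points back to the document id. */
    void add(int docId, String text) {
        for (String token : tokenize(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query time: one map lookup instead of scanning every stored text. */
    Set<Integer> search(String word) {
        Set<Integer> hits = postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
        return Collections.unmodifiableSet(hits);
    }
}
```

The cost moves from query time to indexing time, which is the trade-off a full text engine makes compared to a LIKE-style scan over a column.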

Then he proceeds to a demo. Quick observations from Hibernate Search:

  • @Indexed is used on the class to indicate that the class is to be indexed.
  • @DocumentId is used on the primary key to create a Lucene document ID connection.
  • @Field(index=…) can be used for indexing strategies.
  • @IndexedEmbedded can be used to index embedded (directly dependent) objects.
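Put together, a mapping along these lines might look as follows. This is a sketch I wrote after the talk, not the demo code: the Book and Author entities and their fields are hypothetical, and the annotation attributes assume the Hibernate Search 3.x API of that time (Index.TOKENIZED, Store.NO) together with JPA annotations from javax.persistence. It needs the Hibernate Search and JPA JARs on the classpath.

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;

import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.IndexedEmbedded;
import org.hibernate.search.annotations.Store;

// Hypothetical entity: @Indexed asks Hibernate Search to maintain a Lucene index for it.
@Entity
@Indexed
public class Book {

    // The primary key doubles as the Lucene document id.
    @Id
    @GeneratedValue
    @DocumentId
    private Long id;

    // Run the title through the analyzer (tokenized), but do not store the original value in the index.
    @Field(index = Index.TOKENIZED, store = Store.NO)
    private String title;

    // The author's indexed fields become part of the Book document.
    @ManyToOne
    @IndexedEmbedded
    private Author author;
}

// Hypothetical associated entity; only its @Field properties are embedded into Book's index.
@Entity
class Author {

    @Id
    @GeneratedValue
    private Long id;

    @Field(index = Index.TOKENIZED)
    private String name;
}
```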

Installing Hibernate Search is simple:

  • Add the JAR to the classpath.
  • Add a few configuration lines to either the hibernate.cfg or persistence.xml to configure the directory provider.
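As far as I know, the configuration lines in question boil down to two keys: the directory provider and the base directory of the index. The sketch below builds them as plain key/value pairs in Java; the class SearchSettings is made up for illustration, the provider class name is the Hibernate Search 3.x-era one (an assumption on my side about the version shown), and any path passed in is just an example. In practice the same two keys would simply be written as property entries in hibernate.cfg.xml or persistence.xml.

```java
import java.util.Properties;

/** Hypothetical helper that collects the Hibernate Search settings as key/value pairs. */
class SearchSettings {

    static Properties forFilesystemIndex(String indexBase) {
        Properties props = new Properties();
        // Directory provider that keeps the Lucene index on the local file system
        // (class name as used in Hibernate Search 3.x; assumed version).
        props.setProperty("hibernate.search.default.directory_provider",
                "org.hibernate.search.store.FSDirectoryProvider");
        // Root directory under which the index directories are created.
        props.setProperty("hibernate.search.default.indexBase", indexBase);
        return props;
    }
}
```

Since Properties is a Map, such an object could also be handed to Persistence.createEntityManagerFactory(unitName, properties) instead of editing the XML file.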

Searching also is pretty simple in the basic version:

  • The regular entity manager is wrapped in the full text entity manager.
  • The full text entity manager then provides the API for full text queries. All new interfaces use existing JPA / Hibernate interfaces.
  • A MultiFieldQueryParser allows weighting the importance of the fields to which the searched words are applied.
  • Then the query is transformed into a FullTextQuery and processing continues from there on along the well-known lines.
  • Analyzers are defined via annotations: @AnalyzerDef names the analyzer and configures its tokenizer and filters via @TokenizerDef and @TokenFilterDef. Then the @Field annotation is changed (or repeated) to use the new analyzer by name.
  • Phonetic search can be done with Soundex or Metaphone. Emmanuel reported that he usually does not find phonetic search all that useful; n-grams usually work just as well with less CPU overhead.
  • Synonyms can be handled by TokenFilter strategies. Emmanuel recommends indexing reference words instead of indexing every synonym.
  • For finding words from the same family, word stemming was recommended: Porter stemming for English, Snowball stemming for most Indo-European languages. Again, this can be done by using a TokenFilter.
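To illustrate why n-grams cope with typos, here is a small plain-Java sketch that splits words into character trigrams and compares the resulting sets. It is my own illustration, not code from the talk: the class TrigramSketch is hypothetical, and the Jaccard overlap used for scoring is a deliberately simple stand-in, not the way Lucene scores n-gram tokens.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper showing the idea behind n-gram matching with n = 3. */
class TrigramSketch {

    /** Splits a word into its overlapping, lowercased character trigrams. */
    static Set<String> trigrams(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= w.length(); i++) {
            grams.add(w.substring(i, i + 3));
        }
        return grams;
    }

    /** Jaccard overlap of the two trigram sets: shared trigrams divided by all distinct trigrams. */
    static double similarity(String a, String b) {
        Set<String> shared = trigrams(a);
        Set<String> union = new HashSet<>(shared);
        Set<String> other = trigrams(b);
        shared.retainAll(other);
        union.addAll(other);
        // Words shorter than three characters produce no trigrams at all.
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }
}
```

With this sketch, the misspelling "hibernte" still shares four of the nine distinct trigrams with "hibernate", while an unrelated word such as "lucene" shares none. A single typo only destroys the few trigrams that overlap it, so the rest of the word still matches.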

A fluent API seems to be in the works for the next version of Hibernate Search.

From there Emmanuel proceeded to distributed clustering, which was not of much interest to me, so I didn’t jot down the details. Rest assured, though, that Hibernate Search offers efficient mechanisms to handle full text searches in clusters (it actually uses JMS queues to keep distributed caches up to date).

In summary, Hibernate Search seems to significantly lower the entry barrier to using Lucene if you are already using Hibernate. Future versions will also bring more performance improvements and clustering enhancements.

Overall the presentation was very meaty and interesting. Emmanuel did a great job and provided quite a bit of inspiration for one of our internal products. Kudos for a very inspiring presentation!
