Mar 22

Hadoop and Mike Cannon-Brookes on using Lucene for Data rather than Text

Open Source

Mike kindly started the presentation with a consumer warning, letting us know in advance that he was going to be pimping JIRA (because this was going to be case study-esque).

These days JIRA uses Lucene for “Generic Data Indexing”: fast retrieval of complex data objects. This isn’t about text searching for “dog” sorted by relevance. The statistics pages all come back from a Lucene index, not from the DB.

Lucene has a way for you to write your own sort routines via Sort and SortField.

I have seen the “viral Lucene” pattern apply in a variety of projects. You start out using it for /search, and then you see that you can use it for other things. Slowly your DB is doing less, and your Lucene indexes are growing. This is a killer open source project, even if the API is a little weird.
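The pattern can be sketched without Lucene at all. The snippet below is a minimal, hypothetical illustration in plain Java, not Lucene’s actual API: an in-memory inverted index over structured objects, answering a field query with a custom sort order the way a Lucene index with Sort/SortField would, while the DB stays out of the read path. The Issue record and its fields are invented for the example.

```java
import java.util.*;
import java.util.stream.*;

public class IssueIndex {
    // Hypothetical structured object; in JIRA's case this would be an issue.
    record Issue(String key, String status, int priority) {}

    // Inverted index: field value -> matching objects, mirroring how an
    // index answers "all issues with status=open" without touching the DB.
    private final Map<String, List<Issue>> byStatus = new HashMap<>();

    void add(Issue issue) {
        byStatus.computeIfAbsent(issue.status(), s -> new ArrayList<>()).add(issue);
    }

    // Retrieval with a caller-supplied sort, analogous in spirit to
    // passing a Sort built from SortFields to a Lucene search.
    List<String> findByStatus(String status, Comparator<Issue> order) {
        return byStatus.getOrDefault(status, List.of()).stream()
                .sorted(order)
                .map(Issue::key)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        IssueIndex idx = new IssueIndex();
        idx.add(new Issue("JRA-1", "open", 2));
        idx.add(new Issue("JRA-2", "open", 1));
        idx.add(new Issue("JRA-3", "closed", 3));
        // Sort the open issues by priority, lowest first.
        System.out.println(idx.findByStatus("open",
                Comparator.comparingInt(Issue::priority)));
        // prints [JRA-2, JRA-1]
    }
}
```

The point is the shape of the query, not the data structure: once objects are denormalized into an index keyed by the fields you query on, retrieval and sorting no longer need a round trip to the database.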

Hadoop: Open Source MapReduce

I had a couple of people ask “why hasn’t Google open sourced their MapReduce?” They didn’t know about Hadoop:

Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.

The intent is to scale Hadoop up to handling thousands of computers. Hadoop has been tested on clusters of 600 nodes.

Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of map/reduce.
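The map/reduce paradigm described above is easy to see in miniature. The following is a toy, single-process word count in plain Java, not Hadoop’s API: a map step emits (word, 1) pairs from each input line, and a reduce step sums the counts per word. On a real cluster these two steps would run as many parallel fragments across nodes.

```java
import java.util.*;
import java.util.stream.*;

public class WordCount {
    // Map step: break one input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Reduce step: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the dog chased the cat", "the cat ran");
        Map<String, Integer> counts = reduce(lines.stream().flatMap(WordCount::map));
        System.out.println(counts.get("the")); // prints 3
    }
}
```

Because each map call only sees one line, and reduce only sees (word, count) pairs, neither step cares where its input came from — which is exactly what lets the framework re-execute failed fragments on any node.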

For more information about Hadoop, please see the Hadoop wiki.

Christophe Bisciglia of the open source group has been putting great effort into UW classes that use Hadoop in the curriculum.

6 Responses to “Hadoop and Mike Cannon-Brookes on using Lucene for Data rather than Text”

  1. Anthony Eden Says:

    Do you think Ferret provides an equivalent pattern for Ruby apps? My understanding is that Ferret started as a Lucene port but has moved further and further from Lucene in implementation. Do you think this is the case and if so is that a good thing, bad thing, or irrelevant thing?

    I’ve been thinking about testing out Ferret, but so far have not. Maybe it’s about time?

  2. Dion Almaer Says:

    And there is Solr:

    Worth checking out too. Ferret has moved away a bit, which is a shame from the standpoint of index compat, but good if it is more ruby-y (if that is all you care about)



  3. Erik Hatcher Says:

    Solr indeed! And now with more Ruby goodness with the solr-ruby library we’ve developed:

    We could use an acts_as_solr built in, though there is already an acts_as_solr at RubyForge which may do the trick until we roll it into solr-ruby proper.

    And don’t forget Flare: as demonstrated on several datasets here:

    p.s. Hey Dion!

  4. Erik Hatcher Says:

    Re: Ferret – it’s a good thing. There were very well considered decisions that forked it from the Java Lucene file format. The creator of Ferret is collaborating with the KinoSearch creator on the Lucene Lucy project in order to bring their goodies back to the Lucene community.

  5. replicahandbags Says:

    thaks for this helpful info.

  6. replica handbags Says:

    yes.good helpful to me.thank you!
