I have been really impressed with Apache Lucene. I am used to seeing open source projects that are nice frameworks and such… but this is a really hardcore set of code that they have developed.
I was finally able to get rid of the old search on TheServerSide (which used the web crawler ht://Dig) and reimplement it with Lucene. Now our actual content is indexed from the inside, and I can implement a simple interface, add a link to an XML file, and another source joins the index.
I wonder if I should make the pagination read:
TheSeeeeeeeeeeeerverSide
like Goooooooogle does :)
Since the first couple of pages of results are the most important, I was able to fix something that always bugs me about Google. If you go many pages into your results… they don't show a link to the first page. We use:
1 .. 15 16 17 18 19 20 .. 79
So you can always get to the ends cleanly.
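For the curious, here is a minimal sketch in Java of how a page list like that could be generated. It is not the actual TheServerSide code. The class and method names, the window parameters, and the guess that the example above is page 17 with two pages shown before it and three after are all illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the real TheServerSide implementation.
class Pagination {

    // Builds a page list that always includes the first and last page,
    // with a window of pages around the current one.
    // 'before' and 'after' are assumed window sizes on each side.
    public static String pageLinks(int current, int total, int before, int after) {
        int first = Math.max(1, current - before);
        int last = Math.min(total, current + after);

        List<String> parts = new ArrayList<>();

        // Leading "1 .." when the window does not start at page 1.
        if (first > 1) {
            parts.add("1");
            if (first > 2) {
                parts.add("..");
            }
        }

        // The window around the current page.
        for (int page = first; page <= last; page++) {
            parts.add(String.valueOf(page));
        }

        // Trailing ".. N" when the window does not reach the last page.
        if (last < total) {
            if (last < total - 1) {
                parts.add("..");
            }
            parts.add(String.valueOf(total));
        }

        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // prints "1 .. 15 16 17 18 19 20 .. 79"
        System.out.println(pageLinks(17, 79, 2, 3));
    }
}
```

The ".." is only added when pages are actually skipped, so near the start you get something like "1 2 3 4 .. 79" rather than a stray gap marker.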