Xapian

acme on 2006-05-02T13:47:09

As I mentioned recently on the mightyv blog, I've added full-text searching to mightyv. This enables you to find programmes containing pies. I've been toying with all these full-text search engines and recently decided upon the Xapian project. While I've toyed with doing this in Perl-space in the past, you really want to do this in C-space so that it is lightning-fast.

Rather kindly, the Xapian project supplies Debian and Ubuntu packages for the latest version and there is the rather under-documented Search::Xapian module as an interface to it.

Playing with Xapian, I've found that it creates small indexes and is really very fast indeed. It is best to use the Flint backend ($ENV{XAPIAN_PREFER_FLINT} = 1;) and I like the stemming code. For example, the xapian-compact-ed index for title and categories for 180k recipes is 37M and I can search for "killer salsa" in 7ms. Creating and updating the index is a little tricky (but you can update while reading from it, unlike Plucene), so after a little more experience I might well release a Search::Xapian::Simple which will just do the right thing for the common case.
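To illustrate why stemming helps with a query like "killer salsa", here is a minimal Python sketch of a stemmed inverted index. It is a toy, not Xapian's implementation: the `stem` function is a deliberately crude suffix stripper (Xapian's real stemmers are based on Snowball), and the titles are made-up sample data.

```python
from collections import defaultdict

SUFFIXES = ("ing", "es", "s")  # crude and illustrative only, not a real stemmer


def stem(word):
    """Lowercase a word and strip at most one common English suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(titles):
    """Map each stemmed term to the set of title ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in titles.items():
        for word in title.split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of titles containing every stemmed query term."""
    postings = [index.get(stem(word), set()) for word in query.split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample data.
titles = {1: "Killer Salsas", 2: "Mild Salsa", 3: "Apple Pies"}
index = build_index(titles)
print(sorted(search(index, "killer salsa")))
```

Because both the indexed words and the query words pass through the same `stem` function, the singular query still finds the plural title.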

Basically, it's fast and neat. What do you use for full-text searches?


Swish-e

kennyg on 2006-05-02T14:38:41

We've been using swish-e (http://www.swish-e.org/). It's easy to set up and install, and it's fast.

Re:Swish-e

acme on 2006-05-02T14:57:21

It's hard to tell, but it looks like swish-e is only set up to index files. I don't have files!

Re:Swish-e

domm on 2006-05-02T20:48:11

The last time I used swish-e, you could call an external program to 'fake' files. Something like swish-e -S prog.

Re:Swish-e

grantm on 2006-05-02T22:17:13

Last time I used Swish-E I was indexing files, but they included things like MS Word and PDF documents so we used an indexing script to filter the files through X_to_text programs and feed the results to the indexer. There's no reason why your indexing script couldn't get its data from DBI or similar rather than files.
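An indexing script of that kind might look roughly like the Python sketch below, which reads rows from a database and emits one record per row for an indexer to read. The header names (`Path-Name`, `Content-Length`, `Last-Mtime`) are what I recall the swish-e `-S prog` input format using, so check them against the documentation for your version. The `recipes` table, its columns, and the `recipe/<id>` paths are hypothetical, and `sqlite3` merely stands in for DBI.

```python
import io
import sqlite3
import time


def format_record(path, text, mtime):
    """Return one document as header lines, a blank line, then the body."""
    body = text.encode("utf-8")
    headers = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Last-Mtime: {int(mtime)}\n"
        "\n"
    )
    return headers.encode("utf-8") + body


def emit_records(conn, out):
    """Write one record per database row to a binary stream."""
    rows = conn.execute("SELECT id, title, body FROM recipes ORDER BY id")
    for row_id, title, body in rows:
        out.write(format_record(f"recipe/{row_id}", f"{title}\n{body}", time.time()))


# Hypothetical in-memory database standing in for the real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipes (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute(
    "INSERT INTO recipes (title, body) VALUES (?, ?)",
    ("Killer Salsa", "Tomatoes, chillies, lime."),
)

# A real script would write to sys.stdout.buffer for the indexer to consume.
out = io.BytesIO()
emit_records(conn, out)
print(out.getvalue().decode("utf-8"))
```

The point is that the indexer only sees a stream of records, so the data never has to exist as files on disk.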

The other thing I liked about the Swish-E indexing process was that you could feed arbitrary metadata fields to the indexer. This allowed you to get things like author name, publication date, title (and in our case all sorts of business-unit meta-fluff) directly in your search results so you didn't have to go back to the source documents when displaying a search results screen. Do you know if Xapian does that too?

The downside of the version of Swish-E that I was using is that it didn't support incremental indexing. You created an index by feeding in a bunch of documents. If you later wanted to add a document then you'd 'simply' recreate the index by feeding in all the original documents plus the new one. I know development versions of Swish-E claim to support incremental indexing but I don't know if that's in a stable release and I've never actually used it. Presumably Xapian supports this.

Re:Swish-e

acme on 2006-05-03T05:43:10

Right, Xapian allows you to store arbitrary metadata and works fine with incremental indexing, which I consider key.
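As a rough picture of what those two properties mean, here is a toy in-memory index in Python. It is not the Xapian or Swish-E API: the `TinyIndex` class, its method names, and the sample documents are all hypothetical. It only shows documents being added one at a time without a rebuild, and stored fields coming back directly with the search results.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory index: postings plus stored per-document fields."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.metadata = {}                # id -> fields stored for result display

    def add_document(self, doc_id, text, **fields):
        """Add one document without touching those already indexed."""
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.metadata[doc_id] = fields

    def search(self, term):
        """Return the stored fields of each match, ordered by document id."""
        matches = sorted(self.postings.get(term.lower(), ()))
        return [self.metadata[doc_id] for doc_id in matches]


index = TinyIndex()
index.add_document(1, "killer salsa", title="Killer Salsa", author="A. Cook")
print(index.search("salsa"))

# Incremental update: the first document is not re-fed to the index.
index.add_document(2, "mango salsa", title="Mango Salsa", author="B. Baker")
print(index.search("salsa"))
```

The results carry the title and author with them, so a results page can be rendered without going back to the source documents.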

Re:Swish-e

Smylers on 2006-05-02T16:16:54

Is Swish-E working for you with Unicode? We've found it unsatisfactory once non-Ascii characters are being used in the data being searched (and the search terms and results).

Smylers

lucene-ws.net and KinoSearch

LTjake on 2006-05-02T23:47:06

We liked what Lucene had to offer, but Plucene left much to be desired. So we ended up creating a Java servlet so that we could use Lucene proper as a web service (lucene-ws.net).

There's a Perl client in the SVN repository, though it requires an as-yet-unreleased version of WWW::OpenSearch. Indexing is a bit slow mostly due to the HTTP overhead, but searching is pretty slick and it now includes search suggestions.

We'd like to replace it, eventually, with something more native to Perl. KinoSearch is relatively new, but is coming along really well. It's a Lucene-alike, which would make the transition pretty painless for us.

HyperEstraier seems very interesting!

dpavlin on 2006-05-10T16:43:50

HyperEstraier with a little help from Search::Estraier fits my needs quite nicely.

I started using search engines with swish-e (which I still use quite a bit), but there is also another very interesting project: KinoSearch, which looks very promising if full control from Perl is required (it somewhat reminds me of WAIT, which powered CPAN).