Configuring Lucene

I've had MediaWiki installed for some time, and have always been a little bothered by its inability to search the bodies of wiki entries. After hopping on #mediawiki on irc.freenode.net and asking the experts, I was referred to the Lucene search engine.

The instructions for connecting MediaWiki to the search engine are remarkably simple. The instructions for getting Lucene running are absolutely horrid. So I thought I'd document my experiences here, and perhaps help out the next person trying to get Lucene running.

First, what is Lucene? From the Lucene page, "Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform." Well, for me, something being written in Java isn't much of a selling point, but what the heck... If it'll index my pages and it's good enough for Wikipedia, it's good enough for me.

The nice folks in #mediawiki pointed me to this page: http://www.mediawiki.org/wiki/Extension:MWSearch Connecting your MW installation to Lucene is remarkably trivial. Export the sources from the Subversion repository, add a couple lines to the LocalSettings.php file, and you're golden. Incidentally, don't use the Subversion 'co' (checkout) command, unless you want those .svn directories. There's no need for them.
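For reference, the 'couple lines' in LocalSettings.php look something like the following. Treat this as a sketch: the host and port are examples for my kind of setup, and you should check the MWSearch page above for the current variable names.

```php
# Example only - check the MWSearch extension page for the current settings.
require_once( "$IP/extensions/MWSearch/MWSearch.php" );
$wgSearchType = 'LuceneSearch';   # hand searches off to Lucene
$wgLuceneHost = '192.168.0.1';    # host where lsearchd will be running
$wgLucenePort = 8123;             # lsearchd's default port
```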

If you're not a Subversion user, use the 'Download snapshot' link on the right of the above page, and install it like you would any other extension. I'll assume that you probably know enough to install MW extensions. If not, you may want to back up a couple steps and learn how to install something a little simpler the first time around.

On to getting Lucene running... This page contains the instructions. They're wrong, they're confusing, and they were probably written by someone who has done this before, and makes the assumption that you have, too. Remember what they say about 'assumption': It makes an ass out of u and umption (The Long Kiss Goodnight).

First, if you're a Gentoo Linux distribution user, you might think to 'emerge lucene'. Don't bother; it doesn't appear to install anything you can actually use here. Instead, make sure you have a Java JDK installed on your system. I used the Sun JDK for no particular reason. As a Gentoo user, I merely had to 'emerge sun-jdk'. You'll need the full JDK; the JRE by itself isn't enough. You'll also need Ant, which can be installed with 'emerge ant'. "Ant is a Java-based build tool. In theory, it is kind of like Make, without Make's wrinkles and with the full portability of pure Java code." Wheee... More Java.
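Before moving on, it's worth a quick sanity check that the build tools actually landed on your PATH. A small sketch (the function name is just for illustration):

```shell
# Sketch: confirm the build tools are on the PATH before continuing.
# 'javac' comes with the JDK; a JRE alone won't have it.
check_tool() {
  if command -v "$1" > /dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING - install it before continuing"
  fi
}
check_tool javac
check_tool ant
```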

Once you have a JDK and Ant installed, you'll need to grab the MW Lucene snapshot. I used the devel version because I like the bleeding edge, and the guys in #mediawiki think there's no good reason NOT to use it. It would probably be easier to install the binary, but since there's no binary snapshot for the devel version, you'll have to use Subversion to pull the latest.

'svn export http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/'

This gets the latest version at the time of this writing. Check the box on the right of that page to make sure you have the latest link, however. This can be downloaded to any directory; I usually create something with a creative name like 'temp'. Change into the temp directory, then down into the directory that contains the file 'OVERVIEW.txt'. Edit the file 'hostname', and change 'oblack' to whatever name the command

hostname

returns. In my case, my server is called 'eta' on the private network, so I changed 'oblack' to 'eta'. Now run the command

./configure XXX

Change 'XXX' to the directory where your MediaWiki installation resides. In my case, that directory is '/var/www/localhost/htdocs/mediawiki', so my command was

./configure /var/www/localhost/htdocs/mediawiki

On my server, in the '/var/www' directory, I have several names of virtual hosts. 'localhost' happens to be the same as my external domain name, because 'tinymicros.com' is symlinked to it. You may want to use whatever your domain name is instead of 'localhost'. This worked for me, and I wasn't about to tamper with what was working.

At any rate, once you run the configure command, you'll see some output along these lines:

./configure /var/www/localhost/htdocs/mediawiki
Generating configuration files for wikidb ...
Making lsearch.conf
Making lsearch-global.conf
Making lsearch.log4j
Making config.inc

This will generate some configuration files. Curiously, a couple appear to have some garbage lines in them. You'll need to edit these by hand, but we'll do that in a little bit, after we build the LuceneSearch.jar file and copy the files to the working directory.

Next, build the Lucene system with

ant

This will invoke the 'ant' build system and create the files we need. Next, we'll copy these files over to where they'll run from. As per the MW Lucene docs, we're going to run this from the /usr/local/search directories. You'll want to do this as root; generally speaking, regular users are not allowed to create files in the '/usr/local' tree. The following several commands will do this:

mkdir -p /usr/local/search/ls2
mkdir -p /usr/local/search/indexes
mkdir -p /usr/local/search/wikixml
cp -a bin /usr/local/search/ls2/bin
cp -a lib /usr/local/search/ls2/lib
cp -a LuceneSearch.jar /usr/local/search/ls2/bin
cp -a LuceneSearch.jar /usr/local/search/ls2/lib # Yes, a 2nd time
cp -a lsearchd /usr/local/search/ls2
cp -a lsearch.conf /usr/local/search/ls2
cp -a lsearch-global.conf /usr/local/search/ls2
cp -a lsearch.log4j /etc

You'll notice there's a file we copied twice. There are two stages to running Lucene: the Lucene server, which listens on a port for a connection from MW with something to search, and the indexer process. It appears that these two functions aren't coordinated, and they look for files in different directories. I spent a brief time trying to get them running without the duplication, and had little success. I was also getting seriously annoyed with the existing instructions, and perhaps could have spent more time trying to resolve the issue. If you figure it out, feel free to let me know, and I'll update the instructions. We'll also make a duplicate of another file a little later.

Now change into the ls2 directory with the following command

cd /usr/local/search/ls2

If you examine the 'lsearchd' file, you'll see it (probably) has the following contents:

#!/bin/bash
jardir=`dirname $0` # put your jar dir here!
java -Djava.rmi.server.codebase=file://$jardir/LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME -jar $jardir/LuceneSearch.jar $*

Not a terribly helpful comment, is it? Seems that we need to edit it. Change it to match the following:

#!/bin/bash
jardir=`dirname $0`/lib # put your jar dir here!
java -Djava.rmi.server.codebase=file://$jardir/LuceneSearch.jar -Djava.rmi.server.hostname=$HOSTNAME -jar $jardir/LuceneSearch.jar $*

Note the addition of '/lib' after the closing backtick on the second line.

Now edit the lsearch.conf file, find the line that matches 'MWConfig.global' (usually around line 9), and replace it with the following:

MWConfig.global=file:///usr/local/search/ls2/lsearch-global.conf
MWConfig.lib=/usr/local/search/ls2/lib
Logging.logconfig=/etc/lsearch.log4j

Now find the line that matches 'Indexes.path' (usually around line 14) and replace it with the following:

Indexes.path=/usr/local/search/indexes

Lastly, find the line that matches 'Localization.url' (usually around line 82) and replace it with the following:

Localization.url=file:///var/www/localhost/htdocs/mediawiki/languages/messages

Remember to fix this line if you used a domain name instead of 'localhost'.

That completes the changes to 'lsearch.conf', so save it and exit the editor. Now we'll make a copy of that file in the bin directory.

cp lsearch.conf bin

Now we'll edit the 'lsearch-global.conf' file. You'll need to know the name of the database that contains the MediaWiki data. Normally this will be 'wikidb', although if you're supporting multiple wikis on the same database server, it might be named something else. On my machines, it's called 'wikidb'. You'll want to make yours match the following code; generally, only the first several options need editing. For some reason, the configure script writes some trash into it. You'll also want to change the 'eta' to the name of your machine, as determined from the earlier step where we ran 'hostname'.

################################################
# Global search cluster layout configuration
################################################

[Database]
wikidb : (single) (spell,4,2) (language,en) (warmup,10)

[Search-Group]
eta : wikidb

[Index]
eta : wikidb

[Index-Path]
<default> : /search

[OAI]
<default> : $wgServer$wgScriptPath/index.php

[Namespace-Boost]
<default> : (0,2) (1,0.5)

[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15

To reiterate: if necessary, change the 'wikidb' (there are 3 of them) to whatever your database is named. You can determine this from LocalSettings.php by finding the variable '$wgDBname'. You'll also need to change the 'eta' to the name of the host MediaWiki is running on. Everything else should be OK.
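If you're not sure of the database name, you can grep it straight out of LocalSettings.php. This sketch works on a stand-in file so you can see the output format; point the same grep at your real LocalSettings.php:

```shell
# Stand-in for your real LocalSettings.php (which defines $wgDBname):
printf '$wgDBname = "wikidb";\n' > LocalSettings.example.php
# Run this same grep against your actual LocalSettings.php:
grep 'wgDBname' LocalSettings.example.php
```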

We're getting close to being done, only a couple more steps. Now we need to create a shell script that will take the dump of the wiki database we'll be creating shortly, and process it into something that Lucene understands. Perform the following steps:

cd bin
echo '#!/bin/sh' > update_lucene.sh
echo 'java -cp LuceneSearch.jar org.wikimedia.lsearch.importer.Importer -s ../../wikixml/wikidb.xml wikidb' >> update_lucene.sh
chmod +x update_lucene.sh

Now change to the MediaWiki installation, with the following command. Again, change 'localhost' if you need to.

cd /var/www/localhost/htdocs/mediawiki

These next few commands will create a script that will dump the MediaWiki database to an XML file for the 'update_lucene.sh' to process.

echo '#!/bin/sh' > mwdumpxml
echo 'php maintenance/dumpBackup.php --current --quiet > /usr/local/search/wikixml/wikidb.xml' >> mwdumpxml
chmod +x mwdumpxml

You'll also need to make sure you have an 'AdminSettings.php' file present for this to run. If you do not already have one, please follow the instructions in the 'AdminSettings.sample' file. If the data in the file is not correct, the database cannot be dumped. Assuming that you have an 'AdminSettings.php', run the database dump script with the following command:

./mwdumpxml

If all goes well, you should see nothing (no errors, etc.). If you have a large MW database, this may take more than a couple of seconds. Now let's verify that the file was created.

cd /usr/local/search/wikixml
ls -l

You should see a file here named 'wikidb.xml' with the current date and time. If you don't, well... I dunno. You need to figure out why your database isn't dumping.
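If you'd like to check for the dump from a script (say, before the import step), a small helper along these lines works. The function name is just for illustration:

```shell
# Sketch: verify a dump file exists and is non-empty before importing it.
check_dump() {
  if [ -s "$1" ]; then
    echo "dump looks good: $1"
  else
    echo "dump missing or empty: $1 - check AdminSettings.php and mwdumpxml"
  fi
}
check_dump /usr/local/search/wikixml/wikidb.xml
```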

Now let's import the dumped data into Lucene.

cd ../ls2/bin
./update_lucene.sh

If all goes well, you should see output similar to this:

MediaWiki lucene-search indexer - index builder from xml database dumps.

Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/search/ls2/bin/lsearch.conf
0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
136  [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /usr/local/search/indexes/import/wikidb.links
239  [main] INFO  org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links...
630 pages (1,145.455/sec), 630 revs (1,145.455/sec)
933  [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb.links
958  [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Made snapshot /usr/local/search/indexes/snapshot/wikidb.links/20090126162751
971  [main] INFO  org.wikimedia.lsearch.search.UpdateThread  - Syncing wikidb.links
1009 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /usr/local/search/indexes/search/wikidb.links
1011 [main] INFO  org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links
1348 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb.related
1362 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Made snapshot /usr/local/search/indexes/snapshot/wikidb.related/20090126162752
1404 [main] INFO  org.wikimedia.lsearch.importer.Importer  - Indexing articles (index)...
1405 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /usr/local/search/indexes/search/wikidb.links
1449 [main] INFO  org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 29 ms
1449 [main] INFO  org.wikimedia.lsearch.importer.SimpleIndexWriter  - Making new index at /usr/local/search/indexes/import/wikidb
630 pages (204.612/sec), 630 revs (204.612/sec)
4533 [main] INFO  org.wikimedia.lsearch.importer.Importer  - Closing/optimizing index...
4533 [main] INFO  org.wikimedia.lsearch.importer.SimpleIndexWriter  - Optimizing wikidb
Finished indexing in 4s, with final index optimization in 0s
Total time: 4s
4689 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb
4712 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Made snapshot /usr/local/search/indexes/snapshot/wikidb/20090126162755

Note that on the initial import there may be some additional text (I really don't remember). If you don't see output like this... time to start debugging. Double-check your configuration files, directory names, etc. The error message may be of some help, so be sure to carefully read what it's telling you.

Assuming the import worked correctly, you should now be able to start Lucene.

cd ..
./lsearchd

Assuming no errors, you'll see a load of text spewed to the output. Most of it is Lucene 'warming up' the database and caches; you'll likely see it searching for things that you KNOW don't exist in your database. Don't worry about it. What you're really looking for is no errors, and no exit back to the command prompt. Lucene runs as a process, and it's now listening on a TCP port for connections from the MediaWiki server. You're now ready to test it, assuming you've already followed the instructions from the http://www.mediawiki.org/wiki/Extension:MWSearch page.

Be sure to set the correct IP address if your Lucene server is not running on 192.168.0.1. On my network, I use the 172.16.xxx.yyy address ranges. Now go visit your MediaWiki installation and type a word that you (hopefully) know exists in the body of a wiki entry, but not in a title. If all has gone as expected, the search should return the page with the text highlighted in red. Pretty cool, huh?

If it didn't, I'm not sure there's much I can tell you, other than to double-check everything. Also, please remember that these instructions were written AFTER I got it working. I've tried to be very careful to cover every step I took, and to reproduce and test as much as possible without damaging my running installation.

Now that you've (hopefully) gotten everything working, there are two last steps. The first is to create a cron job that will periodically update the Lucene database automatically.

crontab -e

In most cases, this will open an editor. If you have no other entries in the file, simply add the following lines. If you do, you'll need to insert them somewhere; the actual order isn't important, so if you have a sense of aesthetics about your cron jobs, put them where they look best.

#
#  Index the MediaWiki data every 30 minutes
#
3,33 * * * * (cd /var/www/localhost/htdocs/mediawiki; ./mwdumpxml; cd /usr/local/search/ls2/bin; ./update_lucene.sh)

If you're not familiar with cron, this will run the command in the parentheses every 30 minutes, at 3 and 33 minutes past the hour. You may wish to adjust the time or frequency at which the command is run. My wiki has very little activity other than me, and realistically, this is probably a lot more often than it needs to be. It's a trade-off between how fast you want new content to show up when it's searched for, and how busy your server is. I chose 3 and 33 minutes past the hour because I have other tasks that run at the top of the hour, and there's no need for them all to run at the same time.

Once you've made the addition to cron, exited the editor, and the new crontab is installed, you're pretty much good to go, with the exception of one thing to be aware of:

Lucene has no provision for starting automatically when the system boots. I haven't created any scripts to run it from /etc/init.d, so I have to start Lucene by hand. This isn't a big deal for me, since my system stays up for hundreds of days at a time between reboots. I start it on a console screen and just let it run. You may wish to create a script that runs it with 'nohup' (see 'man nohup' if you're not familiar with it) and pipes the output to /dev/null. Lucene is pretty yakky when it's indexing, and it spews out progress reports as it's searching. There's probably some easy way to turn that off, but I haven't really looked for it.
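If you want to go the 'nohup' route, something along these lines would do it. This is just a sketch, not a proper init script; the paths assume the layout we set up above:

```shell
# Hypothetical start script: run lsearchd under nohup and discard its
# chatty output. The path matches the layout from the steps above.
cat > start_lsearchd.sh <<'EOF'
#!/bin/sh
cd /usr/local/search/ls2 || exit 1
nohup ./lsearchd > /dev/null 2>&1 &
EOF
chmod +x start_lsearchd.sh
```

Copy the resulting start_lsearchd.sh somewhere on your PATH (or call it from /etc/init.d later if you write a proper service script).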

I hope this guide was somewhat useful in getting Lucene running. It's somewhat system-specific, and while there aren't a lot of configuration options, the lack of documentation makes it difficult to get going, at least for someone not familiar with it. If you find errors, please feel free to email me. I'm also open to suggestions to improve this documentation.

Happy searching!