Once again in a contract I find myself writing a small search engine, this time for an internal Intranet content management system. The client has several use-cases of which several contain words 3 characters long such as ‘cog’ and ‘net’. Additionally, they’ve got a couple of abbreviations with punctuation in (‘H&S’ being one). And once again I find myself explaining to the boss why the MySQL fulltext search doesn’t return a full set of results because (out-of-the-box) the fulltext defaults aren’t very useful.

So here’s why, and how you might be able to fix the issues (and ‘don’t use MySQL’ is not going to be an answer here, dolphinistas)…

For a start, it only indexes words of 4 characters or more, so if you’re trying to index ‘cog’ or ‘net’ then it won’t unless you explicitly configure MySQL to enable this behaviour using the ‘ft_min_word_len config parameter. Perish the thought if you’re search for your own name and you’re called Bob, because MATCH(keywords) AGAINST(“bob”) will always return nothing, as will searching for the classic album ‘Let It Be’.

Then let’s take the stopword list. Stopwords in MySQL are words which won’t be indexed – common words such as ‘the’, ‘and’, ‘or’, etc. (there’s a full list here if you’re curious). However, it also prevents indexing of words such as ‘will’ – a real bugger if you’re indexing a list of users: pity the poor guy in sales called Will who can’t find his own name. You can give MySQL an alternative stoplist in the config, or disable it completely using ‘ft_stopword_file‘.

(It’s not looking good for poor Will or Bob is it!)

Finally, we’ve got the punctuation. This one is a little bit more silly because of the MySQL boolean search which uses various forms of punctuation to specify which words are mandatory, optional, grouped, and so on: the default list is ‘+ -><()~*:””&|‘ (note that we’ve got & in there). Personally I change & to ? since I’m less likely to index with the latter, using ‘ft_boolean_syntax‘ in the config file, and things like ‘H&S’ will get indexed.

Therefore your my.cnf gets the following extra bits:

ft_boolean_syntax=+ -><()~*:""?|

So, mostly sortable right? Wrong: these are server-wide variables and if you alter them you must reindex all fulltext-indexed tables using REPAIR TABLE (or the dreaded myisamchk if you have a lot to do). Now read that again: it’s server-wide so will affect all databases on that server… As an example, the client I’m working with right now shares the CMS database server with a MediaWiki installation which has configured in some places therefore it comes under the banner of ‘service-impacting’ straight away. It also means if you’re on a shared hosted service, no matter how much you lick your ISP it’s unlikely that they’ll change the config on a shared box full of users (with all the reindexing that implies) just for you.

I understand that the InnoDB engine is getting fulltext capability in the next major version of MySQL – about time too: I hope that will herald fulltext configuration changes on a per-database (or even more usefully a per-table) basis.

(Postscript: Nowadays I tend to tackle most of this by creating a search index from data I need to make searchable, a simple script which assembles data and removes unneeded words in a self-configured stoplist table (that way I can also write something which warns the user that not all their search terms were used). I still need to do the changes above but it affords me a lot more control over index sizes, etc.)