I started using sphinxsearch for fulltext search a long time ago and I have been quite satisfied with it. A year ago I decided to migrate to manticore which is an up-to-date, fully open-source fork of Sphinx maintained by original Sphinx developers. In this article and following blog posts, I am going to talk about some tips, workarounds and how to use it efficiently in combination with mysql. For the rest of the article, I will use the name "manticore" only.
Using MySQL as the Data Source
Manticore is well integrated with MySQL, which means you can just set the configuration parameters and the indexer does all the job. Of course, this is true for only fully (re)indexing. However, a common use case is, running a cron job once every day/week to index all documents at once and inserting new documents one by one during each period. In this case, you must create a real-time index and modify it in your app using MySQL protocol. More on this will be explained in the next article.
If you use the MyISAM storage engine, you could easily use the built-in full-text capabilities of MySQL/MyISAM; however, it is highly inefficient because of its locking behavior as well and the engine is outdated.
Let's continue with a simplified real-world example. Suppose that we have a MySQL table in which we store documents for full-text search purposes. Here is a product
table that has an integer primary key, a variable length name and a product category list.
CREATE TABLE `product` (
`id` int(0) UNSIGNED NOT NULL AUTO_INCREMENT,
`name` VARCHAR(255) NOT NULL,
`categories` text NOT NULL,
PRIMARY KEY (`id`)
) ENGINE = InnoDB;
# a table to keep some indexing information
CREATE TABLE `product_search_status` (
`id` varchar(30) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`value` bigint(20) UNSIGNED NOT NULL,
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB;
We need to start configuring Manticore by defining the data source in the configuration file. Assuming that we want to search in both names and categories, we need 2 full-text indexes.
source product_data
{
type = mysql
# mysql connection params
sql_host = localhost
sql_user = root
sql_pass = 123456
sql_port = 3306
sql_db = testdb
# mysql query to create manticore documents
# we set unicode charset and wait_timeout to a high value to prevent connection timeout errors
sql_query_pre = SET NAMES utf8
sql_query_pre = SET SESSION wait_timeout=3600
# we store the index time for information
sql_query_pre = REPLACE INTO product_search_status (id, value) VALUES ('last_indexed_time', UNIX_TIMESTAMP())
# we set start-end document ids so that manticore will know where to start and stop indexing
sql_query_range = SELECT MIN(id), MAX(id) FROM product
sql_range_step = 10000
# this is the main query to create documents
sql_query = SELECT \
id, \
name AS name_ft, \
categories AS categories_ft, \
name \
FROM product \
WHERE id >= $start AND id <= $end
# we store the most recent document id for information
sql_query_post_index = REPLACE INTO product_search_status (id, value) VALUES ('last_indexed_id', $maxid)
# the first column in above query must be **unique** document id
# the others must be defined as follows unless they are fulltext columns, which means any column that is not defined here except id will be indexed for fulltext search.
sql_attr_string = name
}
In the configuration above, I added the product name twice to the query to get the product name from the result set, so that I can respond back to the client as fast as possible for a better search experience. The reason behind this is that Manticore does not store full-text indexes in plain text, but in a rather complex format, so it cannot return the text. This is also why it can search billions of documents very fast.
In most scenarios, there is a lot more to display in search results, in which case indexing a column twice is not a good idea. It would consume unnecessary disk space and memory. Instead, the document IDs should be retrieved from Manticore to query the MySQL database to fetch the rest of the data.
Indexing
Manticore has a few configuration blocks such as source
, index
, indexer
, searchd
. We already defined the data source using source block, and it's time to use index
to define a full-text index based on product data.
index product
{
path = /var/lib/manticore/data/product
source = product_data
morphology = none
min_word_len = 3
min_prefix_len = 3
}
The configuration above is pretty straightforward: Do not rely on any morphology (e.g. stemming), do not index words less than 3 chars to gain a bit of space and performance, enable prefix/wildcard search (e.g. lap* -> laptop) and restrict it to minimum 3 chars. Now, the following command will (re)build the index based on product data.
$ indexer --rotate product
using config file '/etc/sphinxsearch/sphinx.conf'...
indexing index 'product'...
collected 829 docs, 0.2 MB
creating lookup: 0.8 Kdocs, 100.0% done
creating histograms: 0.8 Kdocs, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 829 docs, 223480 bytes
total 0.416 sec, 536143 bytes/sec, 1988.82 docs/sec
total 4 reads, 0.000 sec, 53.1 kb/call avg, 0.0 msec/call avg
total 15 writes, 0.000 sec, 26.3 kb/call avg, 0.0 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=13968).
The Unicode Characters
Unicode characters on searches will be ignored, if not configured correctly. If you want to support searching some language-specific characters the right way, you must use a charset table, otherwise, they will be ignored.
An example would be "şçğüöıŞÇĞÜÖI" chars in the tr-TR (Turkish) language. When you store these chars in your index, by default, your search query including these chars will respond you an empty or wrong resultset, because the words containing these chars in your query will not be processed.
There was only one workaround for this problem until Manticore v3: creating a charset table having a Unicode list of special characters. This is quite useful for a few languages, but otherwise, it would be a mess. Since version 3.0, fortunately, there is now cjk
and non_cjk
values that handle almost all the charsets internally. An example of this can be seen below.
index product
{
morphology = none
min_word_len = 3
min_prefix_len = 3
# manticore v3.0+
charset_table = non_cjk
# before v3.0
regexp_filter = (ı|I) => İ
charset_table = 0..9, a..h, j..z, A..H->a..h, J..Z->j..z, \
U+131->U+049, U+049, \
U+069->U+130, U+130, \
U+0FC->U+0DC, U+0DC, \
U+15F->U+15E, U+15E, \
U+0F6->U+0D6, U+0D6, \
U+0E7->U+0C7, U+0C7, \
U+11F->U+11E, U+11E, \
U+40, U+2E, U+5F, U+2B, U+2D, U+25, U+23
}
U+131->U+049, U+049
means treat lowercase dotless "ı" as uppercase Latin "I", and ignore itself.
The use of regexp_filter
is like an edge case caused by the Turkish language. Unlike in other Latin languages, uppercase counterparts of "i" and "ı" in this language are "İ" and "I" respectively. Here we can use regexp_filter = (ı|I) => İ
to make all four of "iİıI" interchangeable similar to English.
Querying
The following command will start the MySQL console for Manticore with the default port (as long as the MySQL client is installed on the same machine).
mysql -h0 -P9306
- Find all products whose names contain words starting with 'laptop':
mysql> SELECT * FROM `product` WHERE MATCH('@name_ft laptop*') LIMIT 0, 10;
+------+----------------------------------------------------------------------------------------------------------------------------------------+
| id | name |
+------+----------------------------------------------------------------------------------------------------------------------------------------+
| 40 | Samsung - 850 PRO 512GB Internal SATA III Solid State Drive for Laptops |
| 153 | DENAQ - AC Power Adapter and Charger for Select HP Omnibook, Pavilion and Presario Laptops - Black |
| 207 | Corsair CMSA8GX3M2A1066C7 Apple 8 GB Dual Channel Kit DDR3 1066 (PC3 8500) 204-Pin DDR3 Laptop SO-DIMM Memory 1.5V |
| 228 | DENAQ - AC Power Adapter and Charger for Select Toshiba Satellite and Satellite Pro Laptops - Black |
| 238 | Lenovo - AC Adapter for Select Lenovo Yoga Laptops - Black |
+------+----------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
- Find all products whose names include "wireless" and, whose categories include "mobile":
mysql> SELECT * FROM `product` WHERE MATCH('@name_ft wireless @categories_ft mobile') Limit 0, 5;
+------+-------------------------------------------------------------------------------------------------+
| id | name |
+------+-------------------------------------------------------------------------------------------------+
| 3820 | h.ear go Wireless Speaker (Viridian Blue) |
| 6406 | Sony SRS-XB41 Portable Wireless Bluetooth Speaker - Black - SRSXB41/B (Certified Refurbished) |
| 5996 | h.ear go Wireless Speaker (Cinnabar Red) |
| 6977 | Boombox+ Bluetooth Wireless Speaker |
| 107 | Air-Fi Runaway AF32 Stereo Bluetooth Wireless Headphones with Hidden Microphone (Black and Red) |
+------+-------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)
Since manticore is using mysql protocol, you can easily use any mysql client to connect and query. There is a well writen php library that encapsulates mysqli and provides orm-like interface to query as easy as possible.