DB2 Net Search Extender with optional CONTAINS search parameters
I have a number of DB2 stored procedures that must use the NSE CONTAINS
scalar function with optional search
parameters. It turns out, that if you write such stored procedures in the most obvious way, then the performance is
terrible. However, with a little insight, and some resulting tweaks, such stored procedures can be sped up immensely -
eg more than 1,000 times!!
But first, let me explain what I mean by "optional search parameters". Basically, what I mean is stored procedures that return a result set based on a number of potential filter parameters, where the an NSE search string is just one of many possible filters.
For example, consider a stored procedure that returns a list of people. It might like something like this:
CREATE PROCEDURE test1 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
SET oList = (
SELECT ... FROM table
WHERE col1 = iDate AND col2 LIKE '%'||iSearchStr||'%'
);
END
Ok, so far so good, except we're not actually using db2's Net Search Extender (NSE). So, let's go ahead an replace the
LIKE
clause with NSE's CONTAINS
scalar function.
CREATE PROCEDURE test2 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
SET oList = (
SELECT ... FROM table
WHERE col1 = iDate AND CONTAINS(col2,iSearchStr) = 1
);
END
Ok, that's better. Note, however, that now we've changed the format of iSearchStr
- in the first example, it could be
any text string, but now it must a valid NSE search string. I won't go into the format of NSE search strings here... if
you're not already familiar with that, then this article probably won't be of much use to you anyway ;)
So, now we have a stored procedure that can search on both a date and a string, and requires both to be given. Now what if I want to be able to search on either the date or the string or both? Should be easy...
CREATE PROCEDURE test3 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
( iSearchStr IS NULL OR CONTAINS(col2,iSearchStr) = 1 )
);
END
The logic is correct, but the syntax is not... what you may or may not realise (depending on how much you've played with
NSE) is that the CONTAINS
scalar function will be evaluated regardless of the value of iSearchStr
, and so when
iSearchStr
is NULL
, an error will be returned because CONTAINS
cannot be called with a NULL
parameter. So, we
need to do one more quick bit of work before this stored procedure can be fully functional.
CREATE PROCEDURE test4 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
DECLARE searchStrLen INTEGER DEFAULT 0;
SET searchStrLen = LENGTH(iSearchStr);
IF searchStrLen < 0 THEN
SET iSearchStr = '"ignored"'; -- Must be a valid NSE search string.
END IF;
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
( searchStrLen = 0 OR CONTAINS(col2,iSearchStr) = 1 )
);
END
Ok, so this stored procedure will now work - and work under all circumstances. However, it will be slow - much slower than it should be!
Now, just as an aside, you might be thinking that the procedure should really use two separate SET
statements like
this:
CREATE PROCEDURE test5 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
IF LENGTH(iSearchStr) = 0 THEN
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate )
);
ELSE
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
CONTAINS(col2,iSearchStr) = 1
);
END IF;
END
Indeed, that would be acceptable for such a simple example, but what if the WHERE
clause contains several other
restrictions? Then management would suffer. But even more so, in this example, you may notice that the oList
output
parameter is of type XML
... although not shown here, the SELECT ...
part of the statement actually contains a very
large and complex set of SQLXML
clauses for generating an XML tree from the resultset
. So duplicating that large
set of XML clauses would be most impractical.
So, back to the non-branches example.
CREATE PROCEDURE test4 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
DECLARE searchStrLen INTEGER DEFAULT 0;
SET searchStrLen = LENGTH(iSearchStr);
IF searchStrLen < 0 THEN
SET iSearchStr = '"ignored"'; -- Must be a valid NSE search string.
END IF;
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
( searchStrLen = 0 OR CONTAINS(col2,iSearchStr) = 1 )
);
END
The problem, as mentioned above, is that this procedure is quite slow... but why? Well, if you dig into the NSE
documentation, you'll discover that the CONTAINS
scalar function is much more than just a scalar function. In fact,
the mere presence of the CONTAINS
function causes DB2 to perform a separate search of the source table, followed by a
join if the CONTAINS
resultset
table to the source table. This join should be fairly straight forward, however it
seems that DB2's query optimiser gets broken in some way when the CONTAINS
function is evaluated as part of an OR
operator. So, after some stuffing around, I have found that the easiest solution is to simply avoid the OR
/CONTAINS
combination by performing an explicit JOIN
(or rather a WHERE id IN subselect
clause).
CREATE PROCEDURE test6 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
DECLARE searchStrLen INTEGER DEFAULT 0;
SET searchStrLen = LENGTH(iSearchStr);
IF searchStrLen < 0 THEN
SET iSearchStr = '"ignored"'; -- Must be a valid NSE search string.
END IF;
SET oList = (
SELECT ... FROM table WHERE
( iDate IS NULL OR col1 = iDate ) AND
( iSearchStrLen = 0 OR id IN
( SELECT id FROM table WHERE CONTAINS(col2,iSearchStr) = 1 )
)
);
END
Now this looks like the database now has to do considerably more work, but as I mentioned above, the database already
had to do the additional JOIN
effort, so this version is not really more exhausting. However, it seems that
re-writing the query in this way allows the DB2 query optimiser to do a lot better job, and thus increase performance
considerably.
So, just how much performance is gained? Well, of course it depends on the nature of the data you are searching against, but here's a couple of quick examples from a 3GB test set:
[db2inst1@tst ~]$ time db2 "SELECT ... FROM table WHERE CONTAINS(col1, '\"test\"') = 1" > /dev/null
real 0m0.091s
user 0m0.020s
sys 0m0.020s
[db2inst1@tst ~]$ time db2 "CALL test4(NULL,'\"test\"',?)" > /dev/null
real 3m56.540s
user 0m0.020s
sys 0m0.028s
[db2inst1@tst ~]$ time db2 "CALL test5(NULL,'\"test\"',?)" > /dev/null
real 0m0.111s
user 0m0.024s
sys 0m0.020s
[db2inst1@tst ~]$ time db2 "CALL test6(NULL,'\"test\"',?)" > /dev/null
real 0m0.194s
user 0m0.020s
sys 0m0.024s
Notes:
- All four tests returned exactly the same result set (7,742 rows).
- Each call was executed multiple times, and the times quoted are indicative of "typical" durations - indeed, the variation between repeated calls was very minimal.
So you can see that the final version (test6
) is still not quite as fast as the branched version (test5
) would be,
but still, it's much faster (at least 1,000 times faster) than the more obvious approach (test4
).
As a final note, these examples all use IBM's DB2 Net Search Extender with DB2 9.5. However, IBM have begun importing the NSE functions directly into DB2 (they're calling it "DB2 Text Search"), so it will be very interesting to see how much of this remains relevant when that work is complete. In particular, IBM may be successful in making the DB2 query optimiser much smarter (or at least more knowledgeable) about the text search functions.