Skip to main content

DB2 Net Search Extender with optional CONTAINS search parameters

· 8 min read
IBM DB2

I have a number of DB2 stored procedures that must use the NSE CONTAINS scalar function with optional search parameters. It turns out, that if you write such stored procedures in the most obvious way, then the performance is terrible. However, with a little insight, and some resulting tweaks, such stored procedures can be sped up immensely - eg more than 1,000 times!!

But first, let me explain what I mean by "optional search parameters". Basically, what I mean is stored procedures that return a result set based on a number of potential filter parameters, where the an NSE search string is just one of many possible filters.

For example, consider a stored procedure that returns a list of people. It might like something like this:

CREATE PROCEDURE test1 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
SET oList = (
SELECT ... FROM table
WHERE col1 = iDate AND col2 LIKE '%'||iSearchStr||'%'
);
END

Ok, so far so good, except we're not actually using db2's Net Search Extender (NSE). So, let's go ahead an replace the LIKE clause with NSE's CONTAINS scalar function.

CREATE PROCEDURE test2 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
SET oList = (
SELECT ... FROM table
WHERE col1 = iDate AND CONTAINS(col2,iSearchStr) = 1
);
END

Ok, that's better. Note, however, that now we've changed the format of iSearchStr - in the first example, it could be any text string, but now it must a valid NSE search string. I won't go into the format of NSE search strings here... if you're not already familiar with that, then this article probably won't be of much use to you anyway ;)

So, now we have a stored procedure that can search on both a date and a string, and requires both to be given. Now what if I want to be able to search on either the date or the string or both? Should be easy...

CREATE PROCEDURE test3 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
( iSearchStr IS NULL OR CONTAINS(col2,iSearchStr) = 1 )
);
END

The logic is correct, but the syntax is not... what you may or may not realise (depending on how much you've played with NSE) is that the CONTAINS scalar function will be evaluated regardless of the value of iSearchStr, and so when iSearchStr is NULL, an error will be returned because CONTAINS cannot be called with a NULL parameter. So, we need to do one more quick bit of work before this stored procedure can be fully functional.

CREATE PROCEDURE test4 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
DECLARE searchStrLen INTEGER DEFAULT 0;
SET searchStrLen = LENGTH(iSearchStr);
IF searchStrLen < 0 THEN
SET iSearchStr = '"ignored"'; -- Must be a valid NSE search string.
END IF;

SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
( searchStrLen = 0 OR CONTAINS(col2,iSearchStr) = 1 )
);
END

Ok, so this stored procedure will now work - and work under all circumstances. However, it will be slow - much slower than it should be!

Now, just as an aside, you might be thinking that the procedure should really use two separate SET statements like this:

CREATE PROCEDURE test5 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
IF LENGTH(iSearchStr) = 0 THEN
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate )
);
ELSE
SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
CONTAINS(col2,iSearchStr) = 1
);
END IF;
END

Indeed, that would be acceptable for such a simple example, but what if the WHERE clause contains several other restrictions? Then management would suffer. But even more so, in this example, you may notice that the oList output parameter is of type XML... although not shown here, the SELECT ... part of the statement actually contains a very large and complex set of SQLXML clauses for generating an XML tree from the resultset. So duplicating that large set of XML clauses would be most impractical.

So, back to the non-branches example.

CREATE PROCEDURE test4 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
DECLARE searchStrLen INTEGER DEFAULT 0;
SET searchStrLen = LENGTH(iSearchStr);
IF searchStrLen < 0 THEN
SET iSearchStr = '"ignored"'; -- Must be a valid NSE search string.
END IF;

SET oList = (
SELECT ... FROM people WHERE
( iDate IS NULL OR col1 = iDate ) AND
( searchStrLen = 0 OR CONTAINS(col2,iSearchStr) = 1 )
);
END

The problem, as mentioned above, is that this procedure is quite slow... but why? Well, if you dig into the NSE documentation, you'll discover that the CONTAINS scalar function is much more than just a scalar function. In fact, the mere presence of the CONTAINS function causes DB2 to perform a separate search of the source table, followed by a join if the CONTAINS resultset table to the source table. This join should be fairly straight forward, however it seems that DB2's query optimiser gets broken in some way when the CONTAINS function is evaluated as part of an OR operator. So, after some stuffing around, I have found that the easiest solution is to simply avoid the OR/CONTAINS combination by performing an explicit JOIN (or rather a WHERE id IN subselect clause).

CREATE PROCEDURE test6 (
IN iDate DATE,
IN iSearchStr VARCHAR(30),
OUT oList XML
)
DYNAMIC RESULT SETS 0
READS SQL DATA
NOT DETERMINISTIC
LANGUAGE SQL
NO EXTERNAL ACTION
BEGIN
DECLARE searchStrLen INTEGER DEFAULT 0;
SET searchStrLen = LENGTH(iSearchStr);
IF searchStrLen < 0 THEN
SET iSearchStr = '"ignored"'; -- Must be a valid NSE search string.
END IF;

SET oList = (
SELECT ... FROM table WHERE
( iDate IS NULL OR col1 = iDate ) AND
( iSearchStrLen = 0 OR id IN
( SELECT id FROM table WHERE CONTAINS(col2,iSearchStr) = 1 )
)
);
END

Now this looks like the database now has to do considerably more work, but as I mentioned above, the database already had to do the additional JOIN effort, so this version is not really more exhausting. However, it seems that re-writing the query in this way allows the DB2 query optimiser to do a lot better job, and thus increase performance considerably.

So, just how much performance is gained? Well, of course it depends on the nature of the data you are searching against, but here's a couple of quick examples from a 3GB test set:

[db2inst1@tst ~]$ time db2 "SELECT ... FROM table WHERE CONTAINS(col1, '\"test\"') = 1" > /dev/null
real 0m0.091s
user 0m0.020s
sys 0m0.020s

[db2inst1@tst ~]$ time db2 "CALL test4(NULL,'\"test\"',?)" > /dev/null
real 3m56.540s
user 0m0.020s
sys 0m0.028s

[db2inst1@tst ~]$ time db2 "CALL test5(NULL,'\"test\"',?)" > /dev/null
real 0m0.111s
user 0m0.024s
sys 0m0.020s

[db2inst1@tst ~]$ time db2 "CALL test6(NULL,'\"test\"',?)" > /dev/null
real 0m0.194s
user 0m0.020s
sys 0m0.024s

Notes:

  • All four tests returned exactly the same result set (7,742 rows).
  • Each call was executed multiple times, and the times quoted are indicative of "typical" durations - indeed, the variation between repeated calls was very minimal.

So you can see that the final version (test6) is still not quite as fast as the branched version (test5) would be, but still, it's much faster (at least 1,000 times faster) than the more obvious approach (test4).

As a final note, these examples all use IBM's DB2 Net Search Extender with DB2 9.5. However, IBM have begun importing the NSE functions directly into DB2 (they're calling it "DB2 Text Search"), so it will be very interesting to see how much of this remains relevant when that work is complete. In particular, IBM may be successful in making the DB2 query optimiser much smarter (or at least more knowledgeable) about the text search functions.