Description
Below are two cases showing wrong results returned from RUM index with order_by_attach=TRUE
. In both test cases, if order_by_attach=FALSE
, correct results will be returned.
Tested with PostgreSQL 12.2 & newest commit e34375a.
CASE 1:
CREATE TABLE test (
id bigint NOT NULL,
folder bigint NOT NULL,
time bigint NOT NULL,
tsv tsvector NOT NULL
)
INSERT INTO test (id, folder, time, tsv) VALUES (1, 10, 100, to_tsvector('wordA'));
INSERT INTO test (id, folder, time, tsv) VALUES (2, 20, 200, to_tsvector('wordB'));
INSERT INTO test (id, folder, time, tsv) VALUES (3, 10, 300, to_tsvector('wordA'));
INSERT INTO test (id, folder, time, tsv) VALUES (4, 20, 400, to_tsvector('wordB'));
Below shows expected results when select without index:
SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint)
Returns rows with id 1 and 3.
SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint) ORDER BY time <=| 500::bigint
Returns rows with id 3 and 1.
After creating the following index, wrong results will be returned:
CREATE INDEX test_idx ON test USING rum(folder, tsv rum_tsvector_addon_ops, time) with (attach = 'time', to = 'tsv', order_by_attach=TRUE);
SET enable_seqscan = OFF;
SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint)
Wrong result: Returns only row with id 1. (Expects 1 and 3)
SET enable_seqscan = OFF;
SELECT * FROM test WHERE tsv @@ (to_tsquery('wordA')) AND (folder = 10::bigint) ORDER BY time <=| 500::bigint
Wrong result: Returns nothing. (Expects rows with id 3 and 1)
CASE 2:
CREATE TABLE test2 (
id bigint NOT NULL,
time bigint NOT NULL,
tsv tsvector NOT NULL
)
CREATE OR REPLACE PROCEDURE test2_init()
AS $$
DECLARE
counter INTEGER := 1;
str TEXT;
time BIGINT;
BEGIN
WHILE counter <= 1000 LOOP
IF counter % 10 = 0 THEN
str := 'wordA wordB';
ELSEIF counter % 11 = 0 THEN
str := 'wordA wordB wordC';
ELSE
str := 'wordA wordD';
END IF;
-- insert rows with alternating time
IF counter % 2 = 0 THEN
time := counter;
ELSE
time := -counter;
END IF;
INSERT INTO test2 (id, time, tsv) VALUES (counter, time, to_tsvector(str));
counter := counter + 1;
END LOOP;
END
$$
LANGUAGE plpgsql;
CALL test2_init();
Below shows expected results when select without index:
SELECT * from test2 where tsv @@ (to_tsquery('wordA & wordB')) ORDER BY time <=| 1001::bigint
Returned 181 rows.
After creating the following index, wrong results will be returned:
CREATE INDEX test2_idx ON test2 USING rum(tsv rum_tsvector_addon_ops, time) with (attach = 'time', to = 'tsv', order_by_attach=TRUE);
SET enable_seqscan = OFF;
SELECT * from test2 where tsv @@ (to_tsquery('wordA & wordB')) ORDER BY time <=| 1001::bigint
Wrong result: Returns only 153 rows. (Expects 181 rows)
If the above query changes from ORDER BY time <=| 1001::bigint
to ORDER BY <=> 1001::bigint
, then correct number of rows will be returned.