\$\begingroup\$

The Problem

I am developing a search feature in PostgreSQL that involves a collection of JSONB documents stored in a table, which have been standardised. The goal is to enable clients to perform keyword searches across all documents.

The Code

The function accepts a search query and returns a table with two columns: the document data in JSONB format, and a relevance score. It combines ts_rank_cd with plainto_tsquery for full-text search, because I thought this combination was suitable for ranking text search results against a natural-language query. Additionally, I have set different weights (A, B, C) for various parts of the JSONB data, prioritising certain text fields over others.

DROP FUNCTION IF EXISTS search_documents;
CREATE FUNCTION search_documents(search_query text)
    RETURNS TABLE(
        raw_data JSONB,
        relevance real
    )
AS
$$
BEGIN
    RETURN QUERY
        SELECT
            mpd.payload AS raw_data,
            ts_rank_cd(
                setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type1', 'about_me'), '')), 'A') ||
                setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type2', 'content'), '')), 'A') ||
                setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type3', 'title'), '')), 'A') ||
                setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'occupation'), '')), 'B') ||
                setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'summary'), '')), 'B') ||
                setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'experiences'), '')), 'C'),
            plainto_tsquery('english', search_query)
            ) AS relevance
        FROM
            document mpd
        WHERE
            to_tsvector('english',
                        coalesce(jsonb_extract_path_text(mpd.payload, 'type1', 'about_me'), '') || ' ' ||
                        coalesce(jsonb_extract_path_text(mpd.payload, 'type2', 'content'), '') || ' ' ||
                        coalesce(jsonb_extract_path_text(mpd.payload, 'type3', 'title'), '') || ' ' ||
                        coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'occupation'), '') || ' ' ||
                        coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'summary'), '') || ' ' ||
                        coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'experiences'), '')
                ) @@ plainto_tsquery('english', search_query)
        ORDER BY relevance DESC;
END;
$$
LANGUAGE plpgsql;

Self Analysis

  • The function uses multiple instances of coalesce and jsonb_extract_path_text which makes the WHERE clause and SELECT clause somewhat cluttered.
  • There is no explicit error handling for potential issues like malformed JSON data or SQL execution errors.
  • Extraction logic is duplicated.
  • May need more sophisticated text analysis, or a different full-text search configuration, to yield better results.
  • No indexing.

Here is a create statement for the document table. I have complete control over this and it can change.

DROP TABLE IF EXISTS public.document;
CREATE TABLE IF NOT EXISTS public.document
(
    document_id uuid NOT NULL,
    payload jsonb NOT NULL,
    created_on timestamp without time zone NOT NULL,
    CONSTRAINT document_pkey PRIMARY KEY (document_id)
);
\$\endgroup\$
  • Thanks for adding the data schema - I hope you get some good answers. Sorry I can't answer myself, but I've only ever done the simplest things in SQL (and some PostGIS, but that doesn't help here!) Commented Apr 26, 2024 at 6:25

1 Answer

\$\begingroup\$

multiple instances of coalesce and jsonb_extract_path_text ... somewhat cluttered

Meh, whatever. Doesn't trouble me. If you feel strongly about it, you could always fix it with a VIEW. Or consider creating a derived reporting table that turns all those NULLs into empty strings. Perhaps as a MATERIALIZED VIEW.
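
As a rough sketch of the VIEW idea, the extraction and coalesce calls could live in one place. The view name and column names below are hypothetical; the JSON paths are the ones used in the question.

```sql
-- Hypothetical view: one text column per searchable field, with NULLs
-- already turned into empty strings. The #>> operator is the operator
-- form of jsonb_extract_path_text.
CREATE VIEW document_search_text AS
SELECT
    d.document_id,
    d.payload,
    coalesce(d.payload #>> '{type1,about_me}', '')    AS about_me,
    coalesce(d.payload #>> '{type2,content}', '')     AS content,
    coalesce(d.payload #>> '{type3,title}', '')       AS title,
    coalesce(d.payload #>> '{type4,occupation}', '')  AS occupation,
    coalesce(d.payload #>> '{type4,summary}', '')     AS summary,
    coalesce(d.payload #>> '{type4,experiences}', '') AS experiences
FROM document d;
```

Swapping CREATE VIEW for CREATE MATERIALIZED VIEW would store the extracted text, at the cost of needing a REFRESH MATERIALIZED VIEW whenever the documents change.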

no explicit error handling for potential issues like ...

Meh, that's cool. Apparently such issues seldom arise. If they are troublesome, you know how to catch, and perhaps even recover from them. The more worrisome aspect would relate to the spec. Is it written down? Does the query code conform to it? Does the data conform to it? Any stack traces you see might be a gift, which helps you to better understand spec conformance and to repair the underlying Root Cause.

    RETURNS TABLE( ...

... accept a search query and return a TABLE

Yeah, I can certainly understand why you might describe those result rows as being a "table". I would tend to describe it as a "relation", given that it isn't a named table which we could, for example, DROP.

No indexing.

Ok, now I'm upset! That is, you know, at least half of why we pay the ACID tax, right? A core aspect of RDBMS queries is that we typically avoid table scans and manage to leave > 99% of rows on disk, where they belong, untouched. Most queries should drag fewer than 1% of rows into RAM. We exploit indexes to accomplish such goals.
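
One possible way to get an index involved, sketched under assumptions: PostgreSQL 12 or later (for stored generated columns), and the hypothetical names search_vector and document_search_vector_idx. The weighted tsvector from the question is stored once per row and covered by a GIN index.

```sql
-- Hypothetical column: the weighted tsvector from the question, computed
-- once per row and kept up to date by PostgreSQL. Adding a STORED
-- generated column rewrites the table.
ALTER TABLE document
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(payload, 'type1', 'about_me'), '')), 'A') ||
        setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(payload, 'type2', 'content'), '')), 'A') ||
        setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(payload, 'type3', 'title'), '')), 'A') ||
        setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(payload, 'type4', 'occupation'), '')), 'B') ||
        setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(payload, 'type4', 'summary'), '')), 'B') ||
        setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(payload, 'type4', 'experiences'), '')), 'C')
    ) STORED;

-- Hypothetical index name. GIN is the usual index type for tsvector @@ tsquery.
CREATE INDEX document_search_vector_idx
    ON document USING GIN (search_vector);

-- The search then matches against the stored column.
-- 'software engineer' is a placeholder search string.
SELECT mpd.payload AS raw_data,
       ts_rank_cd(mpd.search_vector,
                  plainto_tsquery('english', 'software engineer')) AS relevance
FROM document mpd
WHERE mpd.search_vector @@ plainto_tsquery('english', 'software engineer')
ORDER BY relevance DESC;
```

This also removes the duplicated extraction logic, since the same column serves both the match and the ranking. Whether the planner actually uses the index depends on table size and statistics.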

CREATE FUNCTION ...

Sorry, I don't get it. I mean, that is valid postgres syntax, and I suppose it's a nice enough function. But it's not clear to me why we need to define it. That same query could easily be embodied as a VIEW. Or as a {php, python, whatever} function that produces some DML query which we send each time to the backend. I feel that expressing it in this way makes it harder for folks to view and harder to make changes. In general, backend functions are harder to reason about than named queries (VIEWs), because of their ability to potentially have side effects. (This one has no side effects, but you have to carefully read through it to determine that.)

LIMIT

The whole business of "catenate half a dozen things with SPACE between them" seems pretty reasonable to me. And then setweight will rejigger the ordering.

Imagine that you had issued a CREATE VIEW, and the client was tacking on a LIMIT 20 clause. Would such truncation affect the rejiggering much? If all those rows were copied at midnight into a reporting table, would you maybe prefer to lump occupation + summary + experiences together?
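
To make that scenario concrete, here is a minimal sketch with a hypothetical view named document_search and a placeholder search string. Note that ORDER BY is applied before LIMIT, so the client receives the 20 highest-ranked matches and the truncation does not change their relative order.

```sql
-- Hypothetical view: exposes each payload alongside the weighted
-- tsvector built exactly as in the question.
CREATE VIEW document_search AS
SELECT
    mpd.payload AS raw_data,
    setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type1', 'about_me'), '')), 'A') ||
    setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type2', 'content'), '')), 'A') ||
    setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type3', 'title'), '')), 'A') ||
    setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'occupation'), '')), 'B') ||
    setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'summary'), '')), 'B') ||
    setweight(to_tsvector('english', coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'experiences'), '')), 'C')
        AS search_vector
FROM document mpd;

-- Client-side query: the client supplies the search string and the LIMIT.
SELECT s.raw_data,
       ts_rank_cd(s.search_vector, q.query) AS relevance
FROM document_search AS s
CROSS JOIN plainto_tsquery('english', 'software engineer') AS q(query)
WHERE s.search_vector @@ q.query
ORDER BY relevance DESC
LIMIT 20;
```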

query plan

The OP chose to omit EXPLAIN ANALYZE SELECT ... performance data, so it's hard to assess the efficiency of this query. It's unclear how much stopword filtering (e.g. "the") is warranted. Without knowing the {scoring, loss} function, it's going to be hard to tune things like indexes or a useful LIMIT parameter.
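
A sketch of how that data could be gathered, using a placeholder search string. EXPLAIN on a call to a PL/pgSQL function shows only a Function Scan node, so the inner query is explained directly here.

```sql
-- Plan and timing for the matching part of the question's query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT mpd.payload
FROM document mpd
WHERE to_tsvector('english',
          coalesce(jsonb_extract_path_text(mpd.payload, 'type1', 'about_me'), '') || ' ' ||
          coalesce(jsonb_extract_path_text(mpd.payload, 'type2', 'content'), '') || ' ' ||
          coalesce(jsonb_extract_path_text(mpd.payload, 'type3', 'title'), '') || ' ' ||
          coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'occupation'), '') || ' ' ||
          coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'summary'), '') || ' ' ||
          coalesce(jsonb_extract_path_text(mpd.payload, 'type4', 'experiences'), '')
      ) @@ plainto_tsquery('english', 'software engineer');

-- Stop-word check: the 'english' configuration drops words such as "the",
-- so they appear in neither the document vector nor the query.
SELECT to_tsvector('english', 'the senior engineer');
SELECT plainto_tsquery('english', 'the senior engineer');
```

Without a usable index, the first statement would be expected to report a sequential scan over document, which gives a baseline to compare against after any index is added.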

\$\endgroup\$
