5
\$\begingroup\$

I have a series of tables and queries I run a process against and at times this process can take hours - if not multiple days - depending on the date range I select. I am running PostgreSQL 9.4.

In reviewing the below queries, does it make sense to

  1. Have a temp table and/or
  2. Should I have a nested query?
  3. Is there a better design to improve the structure?

I realize there is some back-end tuning I can optimize, but in just reviewing the below queries is there a glaring improvement I am missing or should be addressing better?

The first query:

  • I first insert the below results into a temp table
  • Running the below query can take hours
  • Then the insert can sometimes take an hour or more depending on the amount of data
  • The average amount of rows the can be insert are around two million or more.
  • There is no way to avoid this given the data set and requested results

INSERT INTO tempresults(
            lmp_date, approved_lmp_name_a, approved_lmp_name_b, 
            path, approved_a_pnode, 
            approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source, 
            approved_b_source, 
            lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl, 
            submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled, 
            profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a, 
            lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)

SELECT
"Selected_Period".lmp_date as "Date",
"Selected_Period".lmp_name_a as "LMP Name A", 
"Selected_Period".lmp_name_b as "LMP Name B", 
CONCAT("Selected_Period".lmp_name_a,' - ',"Selected_Period".lmp_name_b) as "Path",
"Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode,
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source, 
"Selected_Period".approved_b_source,
   AVG("Selected_Period".lmp_da_a) as "LMP DA A Avg", 
   AVG("Selected_Period".lmp_da_b)  as "LMP DA B Avg", 
   (("Selected_Period".da_spread_diff)*-1) as "LMP DA A-B Spread", 
   (AVG(("Selected_Period".da_spread_diff)*-1)) as "LMP DA A-B Spread PL", 
     0.50 AS "Submitted Bid",
    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN TRUE ELSE FALSE END AS "Filled",
    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN 1 ELSE 0 END AS "Filled %",
    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN "Selected_Period".da_spread_diff*-1 END AS "Price Filled At",

    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)) END AS "Filled vs Settled",

        CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN TRUE ELSE FALSE END END AS "Profit Flag",

    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",

    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN FALSE ELSE TRUE END END AS "Loss Flag",

    CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN 0 ELSE SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))-("Selected_Period".da_spread_diff)*-1 END END AS "Filled Loss",

   AVG("Selected_Period".lmp_rt_a) as "LMP RT A Avg",
   AVG("Selected_Period".lmp_rt_b) as "LMP RT B Avg",
   AVG(("Selected_Period".rt_spread_diff)*-1) as "LMP RT A-B Spread Avg",
   AVG("Selected_Period".dart_a)  as "LMP DART A Avg", 
   AVG("Selected_Period".dart_b)   as "LMP DART B Avg", 
   AVG("Selected_Period".dart_spread_diff) as "LMP DART A-B Spread Avg",
   CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN 1 ELSE 0 END END AS "Win Trigger",
   CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN 0 ELSE 1 END END AS "Loss Trigger"
FROM 
  public.darts_calculated_partitioned  "Selected_Period"
WHERE 
"Selected_Period".lmp_date >= 'XYXYXY' AND  -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5'  --this is hour. user normally inputs this value
GROUP BY   
"Selected_Period".lmp_date, "Selected_Period".lmp_name_a, "Selected_Period".lmp_name_b, "Selected_Period".approved_lmp_name_a, 
"Selected_Period".approved_lmp_name_b, "Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode, 
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source, 
"Selected_Period".approved_b_source, "Selected_Period".he, "Selected_Period".lmp_da_a, "Selected_Period".lmp_da_b, 
       "Selected_Period".da_spread_diff, "Selected_Period".lmp_rt_a, "Selected_Period".lmp_rt_b, 
       "Selected_Period".rt_spread_diff, "Selected_Period".dart_a, "Selected_Period".dart_b, "Selected_Period".dart_spread_diff;

The tables above have lmp_date and he indexed.

The second part of the query then queries from the tempresults table. This can often be very slow.

-- stage 2
-- once the above data has been inserted into the temp table I then query the results. 
-- this can take anywhere from twenty minutes to hours

SELECT 
approved_lmp_name_a as "LMP A", 
approved_lmp_name_b as "LMP B", 
path as "Path", 
approved_a_pnode as "pNode A", 
approved_b_pnode as "pNode B",

  sum(tempresults.da_spread_diff) as "Total DA Spread",
  max(tempresults.da_spread_diff) as "Max DA Spread",
  min(tempresults.da_spread_diff) as "Min DA Spread",

sum(tempresults.rt_spread_diff) as "Total RT Spread",
  max(tempresults.rt_spread_diff) as "Max RT Spread",
  min(tempresults.rt_spread_diff) as "Min RT Spread",

sum(tempresults.dart_spread_diff) as "Total DART Spread",
  max(tempresults.dart_spread_diff) as "Max DART Spread",
  min(tempresults.dart_spread_diff) as "Min DART Spread",

  round(sum(filled_percentage)/count(*),2) as "Filled %",
  sum(tempresults.filled_profit) as "Total Gain",
  round(avg(tempresults.filled_profit),2) as "Average Gain",
  Max(tempresults.filled_profit) as "Max Gain",
  sum(tempresults.filled_loss) as "Total Loss",
  round(avg(tempresults.filled_loss),2) as "Average Loss",
  min(tempresults.filled_loss) as "Max Loss",
  count(*) as "Population",
  sum(tempresults.win_trigger) as "Total Winners",
  sum(tempresults.loss_trigger) as "Total Losers",
  round(sum(tempresults.win_trigger)/count(*),2) as "Total % Winners",
  round(sum(tempresults.loss_trigger)/count(*),2) as "Total % Lossers"
FROM 
  public.tempresults
 group by approved_lmp_name_a, 
approved_lmp_name_b, 
path, 
approved_a_pnode, 
approved_b_pnode;
\$\endgroup\$
1
  • \$\begingroup\$ Instead of just describing your performance woes, it would be helpful if you enhanced the title and the content of the question to include some information about the goal of the query, your schema, any indexes, and the output of EXPLAIN SELECT. \$\endgroup\$ Commented Mar 3, 2015 at 4:05

1 Answer 1

6
\$\begingroup\$

Style

Your style of writing SQL makes it very difficult to read... for example:

INSERT INTO tempresults(
            lmp_date, approved_lmp_name_a, approved_lmp_name_b, 
            path, approved_a_pnode, 
            approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source, 
            approved_b_source, 
            lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl, 
            submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled, 
            profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a, 
            lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)

Would read much more easily as:

INSERT INTO tempresults (

            lmp_date, 
            approved_lmp_name_a, 
            approved_lmp_name_b, 
            path, 
            approved_a_pnode, 
            approved_b_pnode, 
            approved_a_sink, 
            approved_b_sink, 
            approved_a_source, 
            approved_b_source, 
            lmp_da_a, 
            lmp_da_b, 
            da_spread_diff, 
            lmp_da_spread_pl, 
            submitted_bid, 
            filled, 
            filled_percentage,  
            priced_filled_at,  
            filled_vs_settled, 
            profit_flag,  
            filled_profit,  
            loss_flag,  
            filled_loss,  
            lmp_rt_a, 
            lmp_rt_b,  
            rt_spread_diff,  
            dart_a,  
            dart_b,  
            dart_spread_diff,  
            win_trigger,  
            loss_trigger
)

I realize this is a matter of style and doesn't affect code performance, but it's easier to maintain that way. Another example:

CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",

This is a bit of a nightmare to scroll horizontally to understand what is going on... How about something like this:

CASE 
  WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 
  THEN  
    CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 
    THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) 
    ELSE 0 
    END 
  END AS "Filled Profit",

Altogether I think it could use some documentation along with it, so the next person who has to maintain this does not go too crazy if it needs refactored. You could also do the same with your GROUP BY statement.


Stored functions

This part of the insert script caught my attention:

WHERE 
"Selected_Period".lmp_date >= 'XYXYXY' AND  -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5'  --this is hour. user normally inputs this value

So we have user-input values. Great! This would be a good candidate for a stored function!

Here is an example:

CREATE FUNCTION insert_into_your_table (
    start_date DATE,
    end_date DATE,
    selected_hours INT -- or perhaps TEXT in your case
) AS $$ 
BEGIN

/* 
 * massive INSERT INTO .. SELECT statement here
 */

WHERE "Selected_Period".lmp_date >= start_date
  AND  "Selected_Period".lmp_date <= end_date
  AND  "Selected_Period".he = selected_hours
/* GROUP BY stuff... */
END;
$$ LANGUAGE plpgsql;

Notice how the use of variables as input parameters also eliminates the need to explain what they are for. Then once it needs to be ran, just do:

SELECT insert_into_your_table (
    start_date := '2015-01-01',
    end_date := '2015-03-02',
    selected_hours := '5'
);

This will store the execution plan along with your query, making it faster to execute after you have called it once. You could do the same with your big SELECT query, without input parameters this time.


Calculations

I feel that your insert query has a rather... unhealthy amount of AVG calculations. AVG is expensive because it does both a SUM and COUNT together. The whole logic of your calculations seems terribly convoluted. It's hard to say how to improve it without having access to your data source, but I think it would be worth your while to take a piece of paper and try to draw out everything your query is doing (and then you'll have an idea just how convoluted it actually is).

A few things pop out immediately...

0.50 AS "Submitted Bid",

This 0.50 value is constantly used in your insert query. I'd suggest you declare it at the top and use it throughout...

submitted_bid DECIMAL(3,2) := 0.50;

Then, you write it more like:

submitted_bid AS "Submitted Bid",
CASE 
  WHEN ("Selected_Period".da_spread_diff*-1) <= submitted_bid 
  THEN TRUE 
  ELSE FALSE 
END AS "Filled",
-- etc.

This one comes up constantly in your insert query:

("Selected_Period".da_spread_diff*-1)

Put that one in a variable too... (sorry, data type is unclear, use whatever is appropriate)

period_da_spread_diff INTEGER := SELECT (da_spread_diff*-1) FROM public.darts_calculated_partitioned;

That immensely simplifies your code:

   period_da_spread_diff AS "LMP DA A-B Spread", 
   AVG(period_da_spread_diff) as "LMP DA A-B Spread PL", 
     0.50 AS "Submitted Bid",
    CASE WHEN period_da_spread_diff <=0.50 THEN TRUE ELSE FALSE END AS "Filled",
    CASE WHEN period_da_spread_diff <=0.50 THEN 1 ELSE 0 END AS "Filled %",
    CASE WHEN period_da_spread_diff <=0.50 THEN period_da_spread_diff END AS "Price Filled At",

In the same manner, I see a repetition of count(*) throughout your second query...

round(sum(filled_percentage)/count(*),2) as "Filled %",

So perhaps just declare it once at the beginning (it's not going to change while it's running, so just make it constant):

count_all CONSTANT INTEGER := SELECT COUNT(*) FROM public.tempresult;

Then that becomes:

round(sum(filled_percentage)/count_all,2) as "Filled %",

So that way you are only doing the count once and reusing the same value. You could use some of those other repetitive values and make them variables, for easier maintainability and less repetition.


Spread it out

I think it may be a good idea to split your temp table into several smaller tables, and join them back later. If you're performing calculations based on one big flat data set, rather than splitting them up into several tables, most likely the SQL engine is not taking advantage of calculations that it can pass to another thread if it were inserted elsewhere.

I have put together a while back a stored function used to insert data into multiple tables in one transaction. Perhaps it will inspire you to write something with that kind of logic, as a model.


Explicit create table in insert statement

This might be more of a micro-optimization, but you could try explicitly creating tempresults with the associated data types before inserting into it. May or may not help, but could be worth a try.

\$\endgroup\$
7
  • \$\begingroup\$ Phrancis: Great point on formatting - Had really been writing for myself and not thinking about the next person. Very good point. Let me read through your stored function page. \$\endgroup\$
    – JAS
    Commented Mar 3, 2015 at 2:26
  • \$\begingroup\$ Phrancis: To your point about AVG calculations. Because the dataset is so large and spans various time frames I need to get the averages within the selected variable (time). Im not sure how I could avoid using averages on this data set. \$\endgroup\$
    – JAS
    Commented Mar 3, 2015 at 2:34
  • \$\begingroup\$ I'm curious... What is the whole purpose of putting all this stuff in a temp table? As far as I can tell, it is all pulling from one table "Selected_Period" public.darts_calculated_partitioned so why put it in another table? \$\endgroup\$
    – Phrancis
    Commented Mar 3, 2015 at 2:51
  • \$\begingroup\$ Phrancis: Its stages. 1) A bunch of raw data over various time frames. 2) That data needs to be compiled into averages (but its the average of the specific timeframe). That data (aggregated averages) are put into the temp table. Finally (!) there are a series of calculations done on the aggregated data. \$\endgroup\$
    – JAS
    Commented Mar 3, 2015 at 3:00
  • \$\begingroup\$ Ouch... I wish I could be more help, but if this is an important calculation I would really be tempted to go back to the drawing board... Also you might consider what type of technology is actually accessing the data, perhaps there would be ways to improve its performance by using non-SQL code. \$\endgroup\$
    – Phrancis
    Commented Mar 3, 2015 at 3:09

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.