5
\$\begingroup\$

I have written the query below to identify how many events occur in each hour of the day, over the course of a week.

select Hour, count(Hour) from (
    select
        hour(max(events.`created_at`)) as 'Hour', 
        count(*) as 'Count' 
    from events
    where created_at >= '2025-03-31' and created_at < '2025-04-06'
    group by 
        hour(events.`created_at`), schedule_id
    order by Hour
) as temp group by Hour;

The outer query is required so that rows for the same hour are grouped together regardless of schedule_id. Multiple records in the same hour may share a schedule_id, but I only want to know how many unique schedule_id values occur in each hour. For example:

id  schedule_id  created_at
1   50           2025-04-01 09:05:05
2   50           2025-04-01 09:06:05
3   51           2025-04-01 09:07:05
4   52           2025-04-01 10:44:44

would then return

Hour  count(Hour)
9     2
10    1

because while there are 3 records that were created between 9am and 10am, there are only 2 unique schedules (50 and 51).
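For anyone who wants to reproduce the example, the sample rows can be loaded into a scratch table. This is only a sketch: the table name `events_sample` and the column types are assumptions, since the real schema is not shown here.

```sql
-- Assumed minimal schema; the real events table may differ.
CREATE TABLE events_sample (
    id          INT UNSIGNED NOT NULL PRIMARY KEY,
    schedule_id INT UNSIGNED NOT NULL,
    created_at  DATETIME     NOT NULL
);

-- The four sample rows from the example above.
INSERT INTO events_sample (id, schedule_id, created_at) VALUES
    (1, 50, '2025-04-01 09:05:05'),
    (2, 50, '2025-04-01 09:06:05'),
    (3, 51, '2025-04-01 09:07:05'),
    (4, 52, '2025-04-01 10:44:44');
```

Running the query against `events_sample` instead of `events` should give a count of 2 for hour 9 and 1 for hour 10.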

However, this query is very slow: on a table of 39 million rows it takes 15 seconds, and the actual table it needs to run on is much, much larger. Any ideas how I could improve this query?

\$\endgroup\$
5
  • Too many years since I dreamed in SQL... However, just to note that the "week" in this example is only six days long... It's important that all the facts are clearly stated, and an accounting given for any mismatch between what's expected and what's presented... (Since hour seems to be very important, perhaps an indexed field in each record would eliminate deriving critical information before sorting/selecting ops...)
    – Fe2O3
    Commented yesterday
  • 1
    What are you trying to do with 'max' here?
    Commented yesterday
  • I don't think there is enough info here to adequately answer the question. What is the expected output if you have two records for the same schedule_id with two different hours? How many events records are expected to match any provided condition? What is the structure of the relevant indexes on the events table?
    Commented yesterday
  • 1
    From the sql tag info: Read the tag wiki's guidelines for requesting SQL reviews: 1) Provide context, 2) Include the schema, 3) If asking about performance, include indexes and the output of EXPLAIN SELECT.
    – greybeard
    Commented 22 hours ago
  • The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.
    Commented 21 hours ago

3 Answers

5
\$\begingroup\$

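The inner select has one max too many. I assume it was meant to reduce the number of time conversions, but it is at odds with the group by, which already groups on hour(created_at).

Also note that hour() can return values above 23, though only for TIME values; for a date-and-time column such as created_at it stays within 0-23.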

select
    hour(events.`created_at`) as 'Hour', 
    count(*) as 'Count' 
from events
where created_at >= '2025-03-31' and created_at < '2025-04-06'
group by 
    hour(events.`created_at`), schedule_id
order by Hour

You might like to make a view.

For the outer:

select Hour, count(*) from (
    -- the inner query shown above goes here
) as temp
group by Hour;

I doubt this improves performance much, but it should be easier for the SQL optimizer to digest.
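On the remark about hour values outside 0-23, a short illustration may help. The literals below are chosen for demonstration only; the behaviour shown is that of MySQL's HOUR() function.

```sql
-- Time-of-day and date-and-time values: HOUR() stays within 0 to 23.
SELECT HOUR('10:05:03');             -- 10
SELECT HOUR('2025-04-01 09:05:05');  -- 9

-- TIME values can also represent elapsed time, so HOUR() can exceed 23.
SELECT HOUR('272:59:59');            -- 272
```

For a created_at column holding dates and times, only the first case applies.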

\$\endgroup\$
1
  • I believe this returns a different result as it appears that the count in the outer query from the question is giving the number of distinct schedule_id values for each hour. But removing schedule_id from the group by clause and switching to count(distinct schedule_id) should be equivalent.
    – Nelson O
    Commented 9 hours ago
4
\$\begingroup\$

I don't know much about MySQL and don't know if it's gonna be faster, but I would use COUNT(DISTINCT) for this:

select
    hour(events.`created_at`) as `Hour`, 
    count(distinct schedule_id) as `Count` 
from events
where created_at >= '2025-03-31' and created_at < '2025-04-06'
group by hour(events.`created_at`)
order by `Hour`;

Also, I've consistently used backticks (`) for column names and aliases, instead of mixing them with apostrophes. Some recommend double quotes (") instead, although MySQL requires ANSI_QUOTES mode for that.
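Whether this is actually faster depends on the indexes on the table, which the question does not show. One way to check, sketched here under the assumption that no suitable index exists yet, is to add a composite index and look at the plan with EXPLAIN. The index name `idx_events_created_schedule` is made up for illustration.

```sql
-- Hypothetical index: the date-range filter and schedule_id can both be
-- read from the index, without touching the full table rows.
CREATE INDEX idx_events_created_schedule
    ON events (created_at, schedule_id);

-- Inspect the plan. In the traditional output format, "Using index" in the
-- Extra column indicates that the query is served from the index alone.
EXPLAIN
SELECT
    HOUR(created_at)            AS `Hour`,
    COUNT(DISTINCT schedule_id) AS `Count`
FROM events
WHERE created_at >= '2025-03-31' AND created_at < '2025-04-06'
GROUP BY HOUR(created_at);
```

Building an index on a table of this size takes time and disk space, so it is probably worth trying on a copy first.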

\$\endgroup\$
3
\$\begingroup\$

The problem with your query is that the grouping is by a function. The database engine needs to compute all of the hour values for each row on the fly, then sort them into buckets. I don't believe that rewriting the query will be particularly helpful for performance tuning.

If you are planning on running this query fairly often, you will probably want to create a stored generated column in the events table (containing the hour of the event). Then, indexing this generated column will allow the database engine to create statistics on the hour of the event, which can be looked up a lot quicker than calculating the result on the fly.

I personally have no experience with MySQL, since I work with SQL Server, so unfortunately I can't provide any code to do the above tasks. But hopefully there's enough key words in there that you can research your own answer.
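For MySQL specifically, the generated-column idea could look roughly like the sketch below. It assumes MySQL 5.7 or later (where generated columns are available) and that created_at is a DATETIME column; the names `created_hour` and `idx_events_range_hour_schedule` are invented for illustration, and the index column order is only a starting point.

```sql
-- Store the hour alongside each row; MySQL keeps it in sync automatically.
ALTER TABLE events
    ADD COLUMN created_hour TINYINT UNSIGNED
        GENERATED ALWAYS AS (HOUR(created_at)) STORED;

-- Hypothetical covering index: date range first, then the columns the query reads.
CREATE INDEX idx_events_range_hour_schedule
    ON events (created_at, created_hour, schedule_id);

-- The query can then group on the stored column instead of calling HOUR() per row.
SELECT
    created_hour                AS `Hour`,
    COUNT(DISTINCT schedule_id) AS `Count`
FROM events
WHERE created_at >= '2025-03-31' AND created_at < '2025-04-06'
GROUP BY created_hour
ORDER BY created_hour;
```

Adding a STORED column typically rebuilds the table, which can take a long time on a table this large, so this is best tested on a copy and measured with EXPLAIN before committing to it.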

\$\endgroup\$
1
  • This answer addresses the real cause. As created_at is a write-once value, one could additionally store the hour as a field.
    – Joop Eggen
    Commented 18 hours ago
