>>12586👍 Thanks anon! There's more:
I want to see the board as it was a year ago. I need to get all of the threads that were on the board, at t = (now - 1 year).
Threads are ordered by bump time, so we need the first, say, 400 threads when all of them are sorted by their bump time at that point.
In lainchan each thread has a bump column (threads and posts share the same table called posts_a or whatever), which makes it simple to get the current top threads, but it doesn't help us calculate what the bump times were one year ago.
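For comparison, getting the current top of the catalog with that bump column is trivial; something like this (I'm guessing at the exact column names, the point is just that OP rows carry the bump time):
-- Current top 400 threads straight from the bump column.
-- Column names here are a guess (vichan-style schema where thread rows and
-- reply rows live in the same posts_a table, and OP rows have no parent thread).
SELECT id, bump
FROM posts_a
WHERE thread IS NULL   -- OP rows, i.e. the threads themselves
ORDER BY bump DESC
LIMIT 400;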
So first I recognize that a thread's bump time is just the time the most recent post was posted to that thread.
Maybe I can use an SQL aggregate function to get the MAX post creation time if we group the posts by their thread_id. Then, since I have indices on the creation_time and thread_id columns (and a composite index on both), if I limit the number of threads that I want then hopefully this won't be slow:
SELECT
    threads.thread_id,
    MAX(posts.creation_time) AS bump_time,
    COUNT(*) AS post_count
FROM
    threads
JOIN
    posts ON threads.thread_id = posts.thread_id
WHERE
    posts.creation_time < NOW() -- specify your desired time here
GROUP BY
    threads.thread_id
ORDER BY
    bump_time DESC
LIMIT
    400;
And that took 350ms, which is a little disappointing: it tells me that the database is not taking advantage of our indices. If I use EXPLAIN ANALYZE I can see that it is in fact doing a full sequential scan. Granted, scanning half a million rows in 350ms isn't terrible, and I could live with that, but it will only grow slower with more records/time.
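If you want to check it yourself, just prefix the query with EXPLAIN ANALYZE and look for a Seq Scan node on the posts table:
-- Same query as above, but showing the plan and the actual timings.
-- A "Seq Scan on posts" node in the output means the indices aren't being used.
EXPLAIN ANALYZE
SELECT threads.thread_id, MAX(posts.creation_time) AS bump_time, COUNT(*) AS post_count
FROM threads
JOIN posts ON threads.thread_id = posts.thread_id
WHERE posts.creation_time < NOW()
GROUP BY threads.thread_id
ORDER BY bump_time DESC
LIMIT 400;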
I was actually stuck here for a little bit. But then I noticed that if you take the posts ordered by creation time (descending) and group them by thread_id, grouping a smaller prefix of that list gives the same leading results as grouping a larger prefix, or even all the posts you have. The larger set just produces more threads further down the list, but we only need the first few hundred, not every thread ever.
Here's why it works:
You have all the posts created before a certain date, ordered by time. So the first one is going to be roughly from 1 year ago; let's say this post p1 is from the thread with id 5, or tid=5. The time p1 was created is going to be thread 5's bump time. Remember the posts are ordered by creation time, so the next post, p2, is necessarily going to be older than p1. And it will have:
Either tid = 5, meaning p2 is another post in the same thread; then that row gets grouped with p1's when we do GROUP BY, and its creation_time cannot be greater than p1's because we are ordering the posts by creation time.
Or p2 can have a thread id we haven't seen before, which would be the next thread in the overboard catalog, because again p2 was created before p1; so if p2 has tid ≠ 5, say tid 6, then the catalog is {tid 5, tid 6}.
The same logic applies to the next post, p3: either its tid is in the set {5, 6} and it gets grouped with one of those threads, or it's a thread we haven't seen before, and that becomes the next thread in the catalog.
The point is subsequent posts will never change our existing results. So looking through more posts is only valuable until we build up the ordered list of threads that we want. Looking further is wasteful.
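To make that concrete, here's a made-up run of five posts, newest first, and what each one contributes:
p1  tid=5   -- newest post before the cutoff: this is thread 5's bump time
p2  tid=5   -- same thread, older: folds into thread 5, changes nothing
p3  tid=6   -- new thread: thread 6's bump time; catalog is now [5, 6]
p4  tid=2   -- new thread: thread 2's bump time; catalog is now [5, 6, 2]
p5  tid=6   -- thread 6 again, older: changes nothing
Grouping just p1..p3 gives the same first two catalog entries as grouping all five; scanning further can only append new threads below or fold posts into threads we already have.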
Let's try limiting the number of posts in a separate subquery in SQL:
SELECT max(creation_time) AS bump_time, count(*) AS post_count, thread_id
FROM
(
    SELECT thread_id, creation_time, body FROM posts
    WHERE creation_time < NOW() - interval '365 day'
    ORDER BY creation_time DESC LIMIT 10000
) AS t
GROUP BY thread_id
ORDER BY bump_time DESC;
That gives us 591 results in 18ms. So that's, what, like a 20x improvement? And it shouldn't get significantly slower the more posts we add to the archive.
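The reason it's fast is that the inner ORDER BY creation_time DESC LIMIT 10000 lets Postgres walk the creation_time index backwards and stop after 10000 rows instead of touching the whole table (EXPLAIN should show an Index Scan Backward). I already have that index, but for reference it's just this (index name made up):
-- The inner query reads this index in descending order and stops after LIMIT rows.
CREATE INDEX IF NOT EXISTS posts_creation_time_idx ON posts (creation_time);
And since we only actually want the first 400 threads, a LIMIT 400 on the outer query caps the result without changing anything else.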
There's one more thing we can do to this query: to make sure it returns at least the number of threads we want, we can iterate, scanning some number of posts at a time and widening the window until we have enough. That way, even in the worst case, we never end up with fewer results than we asked for:
CREATE TEMP TABLE IF NOT EXISTS temp_results (bump_time TIMESTAMPTZ, post_count INT, thread_id INT, last_fetched_time TIMESTAMPTZ);
/*
 * This function scans backwards from p_start_time until we get the desired number of threads.
 * Since the posts are ordered by creation time, the first post encountered for a particular
 * thread will have that thread's bump time. Any posts after that will have a timestamp
 * further back in time, so the threads we've already collected will not change order
 * no matter how much further we scan.
 *
 * This is to avoid a sequence scan on the entire posts table...
 */
CREATE OR REPLACE FUNCTION fetch_top_threads(p_start_time TIMESTAMPTZ, p_desired_threads INT)
RETURNS TABLE(bump_time TIMESTAMPTZ, post_count INT, thread_id INT)
LANGUAGE plpgsql
AS $$
DECLARE
    limit_size INT := 2000;     -- Start with 2000 posts per batch
    max_iterations INT := 10;   -- Maximum iterations to avoid an endless loop in case of errors
    result_count INT := 0;
    last_min_time TIMESTAMPTZ;  -- Oldest creation_time seen so far; the next batch starts below it
BEGIN
    TRUNCATE temp_results; -- clear the table
    FOR i IN 1..max_iterations LOOP
        -- Fetch the next batch of posts, starting below the previous minimum time
        INSERT INTO temp_results(bump_time, post_count, thread_id, last_fetched_time)
        SELECT max(t.creation_time) AS bump_time,
               count(*) AS post_count,
               t.thread_id,
               min(t.creation_time) AS last_fetched_time
        FROM (
            SELECT p.thread_id, p.creation_time
            FROM posts p
            WHERE p.creation_time < COALESCE(last_min_time, p_start_time)
            ORDER BY p.creation_time DESC
            LIMIT limit_size
        ) AS t
        GROUP BY t.thread_id;

        -- Check if we have enough threads
        SELECT COUNT(DISTINCT temp_results.thread_id) INTO result_count FROM temp_results;
        IF result_count >= p_desired_threads THEN
            EXIT;
        END IF;

        -- Get the last min time from the temp table for the next iteration
        SELECT MIN(last_fetched_time) INTO last_min_time FROM temp_results;

        -- Double the limit for the next iteration
        limit_size := limit_size * 2;
    END LOOP;

    -- Return the results. A thread that shows up in more than one batch has several
    -- rows in temp_results, so group by thread_id again: the newest bump_time wins
    -- and the partial post counts are summed.
    RETURN QUERY
    SELECT max(temp_results.bump_time),
           sum(temp_results.post_count)::INT,
           temp_results.thread_id
    FROM temp_results
    GROUP BY temp_results.thread_id
    ORDER BY max(temp_results.bump_time) DESC;
END;
$$;
(Though the post_count returned here isn't really useful or accurate: it only counts the posts that happened to fall inside the scanned window.)
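And then getting the board as it looked a year ago is just:
-- Top 400 threads as the board looked one year ago.
SELECT thread_id, bump_time, post_count
FROM fetch_top_threads(NOW() - interval '365 day', 400)
LIMIT 400;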