[ overboard / sfw / alt / cytube] [ leftypol / b / WRK / hobby / tech / edu / ga / ent / music / 777 / posad / i / a / R9K / dead ] [ meta ]

/tech/ - Technology

"Technology reveals the active relation of man to nature"
Name
Email
Subject
Comment
Captcha
Tor Only

Flag
File
Embed
Password (For file deletion.)

Matrix   IRC Chat   Mumble   Discord


File: 1696368309285.jpg ( 1.56 MB , 3089x2234 , Constructivist.jpg )

 No.12527

I'm going to work on building an archive of all your shitposts.
>>

 No.12528

File: 1696369818384.webm ( 64.62 KB , 480x360 , cleopatra2525-thumbs-up.webm )

>>

 No.12529

File: 1696369879914.jpg ( 630.79 KB , 997x673 , African_Bush_Elephant_crop.jpg )

First let's start with some some table definitions, let's use PostgreSQL:

CREATE TABLE sites
    ( site_id serial primary key
    , name text NOT NULL
    , url text NOT NULL
    );

CREATE TABLE boards
    ( board_id serial primary key
    , name text NOT NULL
    , pathpart text NOT NULL -- if it's /a/ then the pathpart is a
    , site_id int NOT NULL
    , CONSTRAINT site_fk FOREIGN KEY (site_id) REFERENCES sites (site_id) ON DELETE CASCADE
    );

CREATE TABLE threads
    ( thread_id bigserial primary key
    , board_thread_id bigint NOT NULL -- this is the id of the thread in lainchan, mysql
    , creation_time timestamp with time zone NOT NULL
    , board_id int NOT NULL
    , CONSTRAINT board_fk FOREIGN KEY (board_id) REFERENCES boards (board_id) ON DELETE CASCADE
    );

CREATE TABLE posts
    ( post_id bigserial primary key
    , board_post_id bigint NOT NULL
    , creation_time timestamp with time zone NOT NULL
    , body text
    , thread_id bigint NOT NULL
    , CONSTRAINT thread_fk FOREIGN KEY (thread_id) REFERENCES threads (thread_id) ON DELETE CASCADE
    );

CREATE TABLE attachments
    ( attachment_id bigserial primary key
    , mimetype text NOT NULL
    , creation_time timestamp with time zone NOT NULL
    , md5_hash text NOT NULL
    , post_id bigint NOT NULL
    , CONSTRAINT post_fk FOREIGN KEY (post_id) REFERENCES posts (post_id) ON DELETE CASCADE
    );


So a web site has boards, which have threads, which have posts, which have attachments
>>

 No.12530

File: 1696371040119.png ( 23.78 KB , 791x197 , 1421896390159.png )

The thread table should have a constraint to disallow posting the same thread:

CONSTRAINT unique_board_board_thread_id_constraint UNIQUE (board_id, board_thread_id)


so the pair of values (board_id, board_thread_id) will never be the same for two rows in this table.

We can index some of the columns in the threads table:

CREATE TABLE IF NOT EXISTS threads
    ( thread_id bigserial primary key
    , board_thread_id bigint NOT NULL -- this is the id of the thread in lainchan, mysql
    , creation_time timestamp with time zone NOT NULL
    , board_id int NOT NULL
    , CONSTRAINT board_fk FOREIGN KEY (board_id) REFERENCES boards (board_id) ON DELETE CASCADE
    , CONSTRAINT unique_board_board_thread_id_constraint UNIQUE (board_id, board_thread_id)
    );
CREATE INDEX threads_creation_time_idx   ON threads (creation_time);
CREATE INDEX threads_board_id_idx        ON threads (board_id);
CREATE INDEX threads_board_thread_id_idx ON threads (board_thread_id);


A similar thing needs to be done with posts, if we scrape the website we shouldn't
accidentally archive the same post that has the same id in lainchan:

CREATE TABLE IF NOT EXISTS posts
    ( post_id bigserial primary key
    , board_post_id bigint NOT NULL
    , creation_time timestamp with time zone NOT NULL
    , body text
    , body_search_index tsvector
    , thread_id bigint NOT NULL
    , CONSTRAINT unique_thread_board_id_constraint UNIQUE (thread_id, board_post_id)
    , CONSTRAINT thread_fk FOREIGN KEY (thread_id) REFERENCES threads (thread_id) ON DELETE CASCADE
    );
CREATE INDEX posts_creation_time_idx ON posts (creation_time);
CREATE INDEX posts_body_search_idx   ON posts USING GIN (body_search_index);
CREATE INDEX posts_thread_id_idx     ON posts (thread_id);
CREATE INDEX posts_board_post_id_idx ON posts (board_post_id);


Here we can add a column to index the text to have full text search on the posts,
that will be useful later. To populate the `body_search_index` column we can use
a trigger that will update it when a row is changed:


CREATE OR REPLACE FUNCTION update_post_body_search_index() RETURNS trigger AS $$
BEGIN
    NEW.body_search_index := to_tsvector('english', NEW.body);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_update_post_body_search_index
BEFORE INSERT OR UPDATE
ON posts
FOR EACH ROW
EXECUTE FUNCTION update_post_body_search_index();
>>

 No.12531

Attachments can have another special column: a bigint representing a 64-bit unsigned integer, that we compute using a perceptual image hash algorithm (there's libraries to do this). We can add a special index on this column, using the bktree postgresql extension, that will let us compare the distance between these numbers efficiently, bit by bit (called the hamming distance).

CREATE INDEX attachments_phash_bktree_index ON attachments USING spgist (phash bktree_ops);


This column will let us do reverse-image search on similar images.


Attached is the full sql file, where I have added some users and granted them permissions
>>

 No.12532

File: 1696449844187-0.png ( 2.33 KB , 383x383 , Git-Icon-1788C.png )

Next I'm going to get postgREST serving locally on port 3000. Postgrest will automatically create a REST API for our database. (https://github.com/PostgREST/postgrest)

Then I can build a script that will read in the board's json api files and populate the database. I prefer to just hit this REST api than write database queries out as strings in some way, I've tried different ways of doing it, using and ORM or a domain specific language and using postgrest was the most pain-free.

I will also throw this repo up online here: http://git.leftychan.net/0/chan-delorean
(or here http://git.wz6bnwwtwckltvkvji6vvgmjrfspr3lstz66rusvtczhsgvwdcixgbyd.onion/0/chan-delorean for onionfans)

I've also written a postgrest service for nixos if you want it: https://github.com/towards-a-new-leftypol/devops/blob/spamnoticer_deploy/nix-support-production/postgrest.nix
>>

 No.12585

File: 1697644302692.jpg ( 43.48 KB , 600x347 , fence_posts.jpg )

I wrote some more glue code to shove all of the posts from the board into the database, by simply reading the json api files, transforming the data structure slightly and passing json to PostgREST.

To not take too long I shove all of the posts on one board into the database at once: the sql code takes an array of all the posts, does a SELECT to get the ones that exist already, inserts the ones that don't yet exist, and returns all of the internal post ids. This is quite boring, you can go read the source if you care, but it takes about 2 minutes to save leftychan.

I have monthly snapshots dating back to around the bunkerchan split, so I ran those through the script as well. Perhaps not every post ever posted is there, but I have around half a million posts. Now we can get to the fun part of attempting to query the database for what the board looked like at an arbitrary point in time.
>>

 No.12586

File: 1697645062824.jpg ( 44.36 KB , 736x736 , thumbsupcat.jpg )

>>12585
that is impressive work
>>

 No.12587

File: 1697650877009.jpeg ( 59.6 KB , 640x480 , 640px-Diamond_Core.jpeg )

>>12586
👍 Thanks anon! There's more:

I want to see the board as it was a year ago. I need to get all of the threads that were on the board, at t = (now - 1 year).

Threads are ordered by bump time. We need to get the first, say, 400 threads if you order all the threads by bump time.

In lainchan each thread has a bump column (threads and posts share the same table called posts_a or whatever), which makes it simple to get the current top threads, but it doesn't help us calculate what the bump times were one year ago.

So first I recognize that a thread's bump time is just the time the most recent post was posted to that thread.

Maybe I can use SQL's aggregate function to get the MAX post creation time, if we group the posts by their thread_id. Then, since I have indices on the creation_time and thread_id columns, (and another index on both), if I limit the number of threads that I want then hopefully this won't be slow:

SELECT
threads.thread_id,
MAX(posts.creation_time) AS bump_time,
COUNT(*) AS post_count
FROM
threads
JOIN
posts ON threads.thread_id = posts.thread_id
WHERE
posts.creation_time < NOW() – Specify your desired time here
GROUP BY
threads.thread_id
ORDER BY
bump_time DESC
LIMIT
400;


And that took 350ms. Which is a little disappointing - it tells me that the database is not taking advantage of our indices. And if I use EXPLAIN ANALYZE then I see that it is in fact doing a full sequence scan. Actually scanning half a million rows in 350ms isn't terrible. In fact I can live with that but it will grow slower and slower with more records/time.

I was actually stuck here for a little bit. But I thought about it and noticed that if you group a smaller set of posts by thread_id, and a larger set of posts or even all of the posts you have by thread_id, then the results are the same as long as you have the posts ordered by time (descending). The set using more posts will just have more results, but we only need the first few hundred, not every thread ever.

Here's why it works:

You have all the posts created before a certain date ordered by time. So The first one is going to be roughly from 1 year ago, and let's say this p1 is from thread with id 5, or tid=5. The time p1 was created is going to be thread 5's bump time. Remember the posts are ordered by creation time, so the next post, p2, is necessarily going to be holder than p1. And it will have:

Either tid = 5, meaning p2 is another post in the same thread, then that row will get grouped with the other one when we do GROUP BY, and it's creation_time cannot be greater than p1's because we are ordering the posts by creation time.

Or p1 can have another thread id we haven't seen before, which is would be the next thread in the overboard catalog, because again p2 was created before p1, so if p2 has tid ≠ 5, say p2 has tid 6, then the catalog is {tid 5, tid 6}.

The same logic applies to the next post, p3, which has a tid either in the set {5, 6} and gets grouped with one of those threads, or it's a thread we haven't seen before, and that's the next thread in the catalog.

The point is subsequent posts will never change our existing results. So looking through more posts is only valuable until we build up the ordered list of threads that we want. Looking further is wasteful.

Let's try limiting the number of posts in a separate subquery in SQL:

SELECT max(creation_time) as bump_time, count(*), thread_id
FROM
(
SELECT thread_id, creation_time, body FROM posts
WHERE creation_time < NOW() - interval '365 day'
ORDER BY creation_time DESC LIMIT 10000
) as t
GROUP BY thread_id
ORDER BY bump_time DESC;


That gives us 591 results in 18ms. So that's what like a 20x improvement? And that shouldn't get significantly slower the more posts we add to the archive.

There's one more thing we can do to this query: to make it return a variable number of results, we can iterate over some number of posts at a time. That means in a really worst case, we will never have fewer than desired results:


CREATE TEMP TABLE IF NOT EXISTS temp_results (bump_time TIMESTAMPTZ, post_count INT, thread_id INT, last_fetched_time TIMESTAMP);

/*
 * This function scans backwards from p_start_time until we get the desired number of threads.
 * Since the posts are ordered by creation time, the first post encountered for a particular
 * thread will have that thread's bump time. Any posts after that will have a timestamp
 * further back in time, so thread ids that we encounter will not change order the
 * more we scan.
 *
 * This is to avoid a sequence scan on the entire posts table...
 */
CREATE OR REPLACE FUNCTION fetch_top_threads(p_start_time TIMESTAMPTZ, p_desired_threads INT)
RETURNS TABLE(bump_time TIMESTAMPTZ, post_count INT, thread_id INT)
LANGUAGE plpgsql
AS $$
DECLARE 
    limit_size INT := 2000; -- Start with 2000 posts
    max_iterations INT := 10; -- Maximum iterations to avoid endless loop in case of errors
    result_count INT := 0;
    last_min_time TIMESTAMP;
BEGIN 
    TRUNCATE temp_results; -- clear the table

    FOR i IN 1..max_iterations LOOP 
        -- Fetch posts starting from the previous min time
        INSERT INTO temp_results(bump_time, post_count, thread_id, last_fetched_time)
        SELECT max(t.creation_time) as bump_time, count(*) as post_count, t.thread_id, min(t.creation_time) as last_fetched_time
        FROM (
            SELECT p.thread_id, p.creation_time 
            FROM posts p
            WHERE p.creation_time < COALESCE(last_min_time, p_start_time)
            ORDER BY p.creation_time DESC
            LIMIT limit_size
        ) as t
        GROUP BY t.thread_id;

        -- Check if we have enough threads
       	SELECT COUNT(DISTINCT temp_results.thread_id) INTO result_count FROM temp_results;

        IF result_count >= p_desired_threads THEN
            EXIT;
        END IF;

        -- Get the last min time from the temp table for the next iteration
        SELECT MIN(last_fetched_time) INTO last_min_time FROM temp_results;

        -- Double the limit for the next iteration
        limit_size := limit_size * 2;

    END LOOP;

    -- Return the results
    RETURN QUERY SELECT temp_results.bump_time, temp_results.post_count, temp_results.thread_id FROM temp_results ORDER BY temp_results.bump_time DESC;

END;
$$;


(Though the post_count result here isn't really useful or accurate)

Unique IPs: 3

[Return][Catalog][Top][Home][Post a Reply]
Delete Post [ ]
[ overboard / sfw / alt / cytube] [ leftypol / b / WRK / hobby / tech / edu / ga / ent / music / 777 / posad / i / a / R9K / dead ] [ meta ]
ReturnCatalogTopBottomHome