
I'm currently working on migrating millions of rows from 10 or so implicitly related tables to millions of rows in 10 or so explicitly related tables. It's really freaking slow.
My naive way of doing this (which is also the correct way when fewer rows are involved) was to loop over the old tables in plpgsql, inserting their rows into the new tables. Something like this (shield your eyes, nosql people):
CREATE OR REPLACE FUNCTION migrate_thing()
RETURNS VOID AS $$
DECLARE
    old_thing "OldThing";
    new_thing_id INTEGER;
BEGIN
    FOR old_thing IN
        SELECT "OldThing".*
        FROM "OldThing"
    LOOP
        -- capture the generated id so the dependent row can reference it
        INSERT INTO "NewThing" (a_column) VALUES (old_thing.a_column) RETURNING id INTO new_thing_id;

        INSERT INTO "NewThingSomething" (thing_id, b_column) VALUES (new_thing_id, old_thing.b_column);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
It's not obvious from this toy example why the loop is needed, but it's because when I'm mapping a particular old thing to many new things, I need to know which old thing and new thing are related as I create more new, related things. While capturing that mapping is required for what I'm doing, row by row is about the slowest way to do it.
A faster way is to create temporary tables with mappings from new to old as you go. I was waiting on a perilous run of my slow migration when I decided to complain to ChatGPT about it, and ChatGPT knew better. It suggested something like this:
-- temp tables aren't WAL-logged anyway, so plain TEMP is enough here
CREATE TEMP TABLE _old_thing AS
SELECT id, a_column, b_column
FROM "OldThing";

CREATE INDEX ON _old_thing (id);

CREATE TEMP TABLE _map_old_new_thing (old_thing_id bigint PRIMARY KEY, new_thing_id bigint NOT NULL);

-- pre-allocate ids for the new rows so we can record the old -> new mapping
WITH rows AS (
    SELECT _old_thing.id AS old_thing_id,
           _old_thing.a_column AS old_thing_a_column,
           nextval(pg_get_serial_sequence('"NewThing"', 'id'))::bigint AS new_thing_id
    FROM _old_thing
), ins AS (
    INSERT INTO "NewThing" (id, a_column)
    SELECT new_thing_id, old_thing_a_column
    FROM rows
    RETURNING id
)
INSERT INTO _map_old_new_thing (old_thing_id, new_thing_id)
SELECT old_thing_id, new_thing_id
FROM rows;

-- use the mapping to create the related rows in bulk
WITH rows AS (
    SELECT _old_thing.id AS old_thing_id,
           _old_thing.b_column AS old_thing_b_column,
           _map_old_new_thing.new_thing_id AS new_thing_id,
           nextval(pg_get_serial_sequence('"NewThingSomething"', 'id'))::bigint AS new_something_id
    FROM _map_old_new_thing
    JOIN _old_thing ON _old_thing.id = _map_old_new_thing.old_thing_id
)
INSERT INTO "NewThingSomething" (id, thing_id, b_column)
SELECT new_something_id, new_thing_id, old_thing_b_column
FROM rows;
It's maybe not super obvious why this is faster, but the main difference is that we're doing operations in bulk. We do all the Step A's at once, then the Step B's, then the Step C's, rather than ABC, then ABC, then ABC, and so on. Computers and programmers tend to optimize these kinds of bulk, similar operations better than lots of one-off, dissimilar ones, even when the one-offs are repetitive.
Before, it was kind of like I was running a hamburger stand and, when someone ordered 100 hamburgers, I made them one by one: I cut one onion slice, then one tomato slice, cooked one patty, toasted one bun, wrapped it, then cut one onion slice, one tomato slice, and so on. The temp table solution is like making 100 onion slices, 100 tomato slices, ... you get the idea.
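To make the contrast concrete, here's a stripped-down sketch with made-up src and dst tables (not my actual schema). The loop issues one INSERT per source row, while the set-based version hands the whole batch to the planner in a single statement:
-- hypothetical tables, just to illustrate row-by-row vs. set-based
CREATE TEMP TABLE src (id bigint, val text);
CREATE TEMP TABLE dst (id bigint, val text);

-- row-by-row: one INSERT per source row (the ABC, ABC, ABC shape)
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT * FROM src LOOP
        INSERT INTO dst (id, val) VALUES (r.id, r.val);
    END LOOP;
END;
$$;

-- set-based: one statement inserts everything at once (the AAA, BBB, CCC shape)
INSERT INTO dst (id, val)
SELECT id, val
FROM src;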
TIL.
50 sats \ 0 replies \ @DEADBEEF 2h
We’re doing this exact thing at work. We have to take everything from system A and map it to system B. We are writing stored procs to select everything from a table in the old system and insert the data into a temp table with the same schema as the new system. Once the data checks out we migrate it into our live database tables. Can confirm that doing this as bulk operations is much faster than looping through the rows.
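Roughly the shape of it, with placeholder table names rather than our real schema:
-- staging table matching the new system's schema (placeholder names)
CREATE TEMP TABLE staging_customer (LIKE customer INCLUDING DEFAULTS);

-- bulk-copy from the old system's table, remapping columns as needed
INSERT INTO staging_customer (id, name, email)
SELECT legacy_id, full_name, email_address
FROM old_customer;

-- sanity checks before touching live tables, e.g.
SELECT count(*) FROM staging_customer WHERE email IS NULL;

-- once the data checks out, migrate into the live table in one bulk insert
INSERT INTO customer (id, name, email)
SELECT id, name, email
FROM staging_customer;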
I need to read this on my laptop, not my phone, but I'm very intrigued by it.
0 sats \ 1 reply \ @k00b OP 5h
Also, what's massive to me is probably really small to others. Even so, while on an absolute scale the row count is small, a transformation like this is massive relative to most migrations people do (add/remove columns, etc.).
50 sats \ 0 replies \ @kepford 3h
I am very thankful to have DBAs that do this stuff. It's super interesting, though.