
1887 address table migration #1888

Draft

wants to merge 2 commits into master
Conversation

@Cmdv (Contributor) commented Oct 31, 2024

Description

This fixes #1887.

I still need to test whether this actually works. It doesn't deal with populating the cache, as that will fill up as syncing continues.

I've also not done anything page-related, but I can check whether that makes a difference.

Checklist

  • Commit sequence broadly makes sense
  • Commits have useful messages
  • New tests are added if needed and existing tests are updated
  • Any changes are noted in the changelog
  • Code is formatted with fourmolu on version 0.10.1.0 (which can be run with scripts/fourmolize.sh)
  • Self-reviewed the diff

Migrations

  • The PR causes a breaking change of type a, b, or c
  • If there is a breaking change, the PR includes a database migration and/or a fix process for old values, so that upgrade is possible
  • Resyncing and running the migrations provided will result in the same database semantically

If there is a breaking change, especially a big one, please add a justification here. Please elaborate
more what the migration achieves, what it cannot achieve or why a migration is not possible.

@infnada commented Nov 16, 2024

This is quite slow in my case:

    updateTxOutAddressIdQuery =
      Text.unlines
        [ "UPDATE tx_out"
        , "SET address_id = a.id"
        , "FROM address a"
        , "WHERE tx_out.address = a.address"
        , "  AND tx_out.address_has_script = a.has_script"
        , "  AND COALESCE(tx_out.payment_cred, '') = COALESCE(a.payment_cred, '')"
        , "  AND COALESCE(tx_out.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)"
        ]

For me, something like the following is a lot faster, as it performs an initial JOIN using only the HASH index on address (make sure it already exists) and then filters the result further:

WITH initial_match AS (
    -- Step 1: Perform a simple join on `address` column only
    SELECT tx_out.id AS tx_out_id, a.id AS address_id
    FROM tx_out
    JOIN address a ON tx_out.address = a.address
    WHERE tx_out.address_id IS NULL
),
filtered_match AS (
    -- Step 2: Apply additional filters
    SELECT im.tx_out_id, im.address_id
    FROM initial_match im
    JOIN tx_out t ON im.tx_out_id = t.id
    JOIN address a ON im.address_id = a.id
    WHERE t.address_has_script = a.has_script
      AND COALESCE(t.payment_cred, '') = COALESCE(a.payment_cred, '')
      AND COALESCE(t.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)
)
-- Step 3: Perform the update
UPDATE tx_out
SET address_id = filtered_match.address_id
FROM filtered_match
WHERE tx_out.id = filtered_match.tx_out_id;
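The CTE above relies on the first join being index-driven. A minimal sketch of the indexes that make it cheap; the index names here are hypothetical, adjust them to your schema:

```sql
-- Hypothetical names. Hash index serving the equality join on address:
CREATE INDEX IF NOT EXISTS idx_address_address_hash
    ON address USING hash (address);

-- Partial index so the "tx_out.address_id IS NULL" scan stays cheap
-- as the backfill shrinks the remaining rows:
CREATE INDEX IF NOT EXISTS idx_tx_out_address_id_null
    ON tx_out (id)
    WHERE address_id IS NULL;
```

Note that the COALESCE comparisons in the original query cannot use plain column indexes, which is another reason to defer them to the second, already-narrowed step.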

I'm testing it with a batching function so I can watch the progress and perform the updates in small batches:

CREATE OR REPLACE FUNCTION update_tx_out_address_id_in_batches(batch_size INTEGER DEFAULT 1000)
RETURNS VOID AS $$
DECLARE
    _row_count INTEGER := 0;
    _batch_number INTEGER := 0;
    _max_batches INTEGER := 10;  -- Maximum number of batches to process
BEGIN
    -- Loop to update in batches
    LOOP
        -- Increment the batch counter
        _batch_number := _batch_number + 1;

        -- Perform the update in the current batch
        WITH batch AS (
            -- Step 1: Perform a simple join using the `address` column (leveraging the HASH index)
            SELECT tx_out.id AS tx_out_id, a.id AS address_id
            FROM tx_out
            JOIN address a ON tx_out.address = a.address
            WHERE tx_out.address_id IS NULL
            LIMIT batch_size
        ),
        filtered_batch AS (
            -- Step 2: Further filter the results with additional conditions
            SELECT b.tx_out_id, b.address_id
            FROM batch b
            JOIN tx_out t ON b.tx_out_id = t.id
            JOIN address a ON b.address_id = a.id
            WHERE t.address_has_script = a.has_script
              AND COALESCE(t.payment_cred, '') = COALESCE(a.payment_cred, '')
              AND COALESCE(t.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)
        )
        -- Step 3: Perform the update
        UPDATE tx_out
        SET address_id = filtered_batch.address_id
        FROM filtered_batch
        WHERE tx_out.id = filtered_batch.tx_out_id;

        -- Get the number of rows updated in this batch
        GET DIAGNOSTICS _row_count = ROW_COUNT;

        -- Send a message after each successful batch
        RAISE NOTICE '% - Batch % completed, % rows updated.', clock_timestamp(), _batch_number, _row_count;

        -- Exit the loop if fewer than batch_size rows were processed
        IF _row_count < batch_size THEN
            EXIT;
        END IF;

        -- Exit the loop if the maximum number of batches has been processed
        IF _batch_number >= _max_batches THEN
            RAISE NOTICE 'Maximum number of batches (% batches) processed. Exiting...', _max_batches;
            EXIT;
        END IF;

    END LOOP;

    -- Final message after all batches are processed
    RAISE NOTICE 'Update process completed. Total batches processed: %', _batch_number;
END;
$$ LANGUAGE plpgsql;



SELECT update_tx_out_address_id_in_batches();  -- Default batch size of 1000

Here is an example using the full join query as in this PR (swap it into the function above to test it); it's quite slow:

WITH batch AS (
    SELECT tx_out.id AS tx_out_id, a.id AS address_id
    FROM tx_out
    JOIN address a ON tx_out.address = a.address
                   AND tx_out.address_has_script = a.has_script
                   AND COALESCE(tx_out.payment_cred, '') = COALESCE(a.payment_cred, '')
                   AND COALESCE(tx_out.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)
    WHERE tx_out.address_id IS NULL
    LIMIT batch_size
)
UPDATE tx_out
SET address_id = batch.address_id
FROM batch
WHERE tx_out.id = batch.tx_out_id;

I did not compare a full tx_out table update between the two approaches, as it takes more than a day in my case (I had to cancel it >.<).

Hope it helps.

@Cmdv (Contributor, Author) commented Nov 18, 2024

@infnada that's great, batching was the next step I wanted to try to see if it improves speeds. But I'm fairly skeptical; I have a feeling that a fresh sync might be just as quick. When I next look at this PR I'll check whether your approach beats my scrappy attempt 😅

Development

Successfully merging this pull request may close these issues.

Migration for address table of existing data