
1887 address table migration #1888

Draft

wants to merge 2 commits into master
Conversation

@Cmdv (Contributor) commented Oct 31, 2024

Description

This fixes #1887.

I still need to test whether this actually works. It doesn't deal with populating the cache, as that will fill up as syncing continues.

I've also not done anything page-related, but I can check whether that makes a difference.

Checklist

  • Commit sequence broadly makes sense
  • Commits have useful messages
  • New tests are added if needed and existing tests are updated
  • Any changes are noted in the changelog
  • Code is formatted with fourmolu on version 0.10.1.0 (which can be run with scripts/fourmolize.sh)
  • Self-reviewed the diff

Migrations

  • The PR causes a breaking change of type a, b, or c
  • If there is a breaking change, the PR includes a database migration and/or a fix process for old values, so that upgrade is possible
  • Resyncing and running the migrations provided will result in the same database semantically

If there is a breaking change, especially a big one, please add a justification here. Please elaborate
more what the migration achieves, what it cannot achieve or why a migration is not possible.

@infnada commented Nov 16, 2024

This is quite slow in my case:

    updateTxOutAddressIdQuery =
      Text.unlines
        [ "UPDATE tx_out"
        , "SET address_id = a.id"
        , "FROM address a"
        , "WHERE tx_out.address = a.address"
        , "  AND tx_out.address_has_script = a.has_script"
        , "  AND COALESCE(tx_out.payment_cred, '') = COALESCE(a.payment_cred, '')"
        , "  AND COALESCE(tx_out.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)"
        ]

For me, something like the following is a lot faster, as it performs an initial JOIN using only the HASH index on address (make sure it already exists) and then filters the result further:

WITH initial_match AS (
    -- Step 1: Perform a simple join on `address` column only
    SELECT tx_out.id AS tx_out_id, a.id AS address_id
    FROM tx_out
    JOIN address a ON tx_out.address = a.address
    WHERE tx_out.address_id IS NULL
),
filtered_match AS (
    -- Step 2: Apply additional filters
    SELECT im.tx_out_id, im.address_id
    FROM initial_match im
    JOIN tx_out t ON im.tx_out_id = t.id
    JOIN address a ON im.address_id = a.id
    WHERE t.address_has_script = a.has_script
      AND COALESCE(t.payment_cred, '') = COALESCE(a.payment_cred, '')
      AND COALESCE(t.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)
)
-- Step 3: Perform the update
UPDATE tx_out
SET address_id = filtered_match.address_id
FROM filtered_match
WHERE tx_out.id = filtered_match.tx_out_id;
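The CTE above relies on the first join being index-driven. A minimal sketch of the indexes that make it cheap; the index names here are hypothetical, adjust them to your schema:

```sql
-- Hypothetical names. Hash index serving the equality join on address:
CREATE INDEX IF NOT EXISTS idx_address_address_hash
    ON address USING hash (address);

-- Partial index so the "tx_out.address_id IS NULL" scan stays cheap
-- as the backfill shrinks the remaining rows:
CREATE INDEX IF NOT EXISTS idx_tx_out_address_id_null
    ON tx_out (id)
    WHERE address_id IS NULL;
```

Note that the COALESCE comparisons in the original query cannot use plain column indexes, which is another reason to defer them to the second, already-narrowed step.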

I'm testing it with a batching function so I can watch the progress and perform the updates in small batches:

CREATE OR REPLACE FUNCTION update_tx_out_address_id_in_batches(batch_size INTEGER DEFAULT 1000)
RETURNS VOID AS $$
DECLARE
    _row_count INTEGER := 0;
    _batch_number INTEGER := 0;
    _max_batches INTEGER := 10;  -- Maximum number of batches to process
BEGIN
    -- Loop to update in batches
    LOOP
        -- Increment the batch counter
        _batch_number := _batch_number + 1;

        -- Perform the update in the current batch
        WITH batch AS (
            -- Step 1: Perform a simple join using the `address` column (leveraging the HASH index)
            SELECT tx_out.id AS tx_out_id, a.id AS address_id
            FROM tx_out
            JOIN address a ON tx_out.address = a.address
            WHERE tx_out.address_id IS NULL
            LIMIT batch_size
        ),
        filtered_batch AS (
            -- Step 2: Further filter the results with additional conditions
            SELECT b.tx_out_id, b.address_id
            FROM batch b
            JOIN tx_out t ON b.tx_out_id = t.id
            JOIN address a ON b.address_id = a.id
            WHERE t.address_has_script = a.has_script
              AND COALESCE(t.payment_cred, '') = COALESCE(a.payment_cred, '')
              AND COALESCE(t.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)
        )
        -- Step 3: Perform the update
        UPDATE tx_out
        SET address_id = filtered_batch.address_id
        FROM filtered_batch
        WHERE tx_out.id = filtered_batch.tx_out_id;

        -- Get the number of rows updated in this batch
        GET DIAGNOSTICS _row_count = ROW_COUNT;

        -- Send a message after each successful batch
        RAISE NOTICE '% - Batch % completed, % rows updated.', clock_timestamp(), _batch_number, _row_count;

        -- Exit the loop if fewer than batch_size rows were processed
        IF _row_count < batch_size THEN
            EXIT;
        END IF;

        -- Exit the loop if the maximum number of batches has been processed
        IF _batch_number >= _max_batches THEN
            RAISE NOTICE 'Maximum number of batches (% batches) processed. Exiting...', _max_batches;
            EXIT;
        END IF;

    END LOOP;

    -- Final message after all batches are processed
    RAISE NOTICE 'Update process completed. Total batches processed: %', _batch_number;
END;
$$ LANGUAGE plpgsql;



SELECT update_tx_out_address_id_in_batches();  -- Default batch size of 1000

Here is an example using the full join query as in this PR (swap it into the function above to test it); it's quite slow:

WITH batch AS (
    SELECT tx_out.id AS tx_out_id, a.id AS address_id
    FROM tx_out
    JOIN address a ON tx_out.address = a.address
                   AND tx_out.address_has_script = a.has_script
                   AND COALESCE(tx_out.payment_cred, '') = COALESCE(a.payment_cred, '')
                   AND COALESCE(tx_out.stake_address_id, -1) = COALESCE(a.stake_address_id, -1)
    WHERE tx_out.address_id IS NULL
    LIMIT batch_size
)
UPDATE tx_out
SET address_id = batch.address_id
FROM batch
WHERE tx_out.id = batch.tx_out_id;

I did not compare a full tx_out table update between the two approaches, as it takes more than a day in my case (I had to cancel it >.<).

Hope it helps.

@Cmdv (Contributor, Author) commented Nov 18, 2024

@infnada that's great, batching was the next step I wanted to try to see if it improves speeds. But I'm fairly skeptical; I have a feeling that a fresh sync might be just as quick. When I next look at this PR I'll check whether your approach beats my scrappy attempt 😅

Development

Successfully merging this pull request may close these issues.

Migration for address table of existing data