Skip to content
This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

Batch inserts/updates #28

Merged
merged 1 commit into from
Jan 15, 2024
Merged

Batch inserts/updates #28

merged 1 commit into from
Jan 15, 2024

Conversation

zas
Copy link
Contributor

@zas zas commented Jan 14, 2024

Changes things quite a bit in order to support chunked inserts/updates:

  • chunk size defaults to 100 (configurable by scan --chunksize option)
  • make use of peewee select() multiple to get all existing recordings at once in the chunk
  • make use of peewee insert_many().on_conflict_replace() to insert or update
  • change the way errors and status are displayed

Quick benchmark on a set of 27280 files (mainly FLAC & MP3s):

With this patch, from scratch, added in 8:03 minutes, without added in 9:59 , time is mainly spent reading audio file metadata.
A second scan (no change), is done in 9s with this patch and in 19s without this patch.

A second benchmark, using a set of 4926 mp3 files:

First scan is done in 12 seconds with this patch, 33 seconds without. Gain with MP3 files is more obvious since reading metadata from those is faster (files are much smaller).
Second scan (no change), 2 seconds with this patch, 4 seconds without.

@zas zas requested a review from mayhem January 14, 2024 16:49
@zas zas force-pushed the batch branch 7 times, most recently from 12f910d to e200bdb Compare January 14, 2024 22:45
@zas zas changed the title [WIP] Batch inserts/updates Batch inserts/updates Jan 14, 2024
@zas zas marked this pull request as ready for review January 14, 2024 22:50
@zas zas marked this pull request as draft January 15, 2024 09:58
Add scan option --chunksize (default to 100)
@zas zas marked this pull request as ready for review January 15, 2024 11:06
@zas
Copy link
Contributor Author

zas commented Jan 15, 2024

@mayhem I think it's finally ready, please test and benchmark before merging

@mayhem
Copy link
Member

mayhem commented Jan 15, 2024

Will do that after the meeting today.

@zas
Copy link
Contributor Author

zas commented Jan 15, 2024

According to my tests, the performance gains is somewhere between 20% (worst case) and 300%. Initial scan (add) speed heavily depends on metadata reading time (which depends on mutagen, so this patch doesn't have much impact on this).

The default chunk size of 100 looks like a good compromise, not much gain increasing it. If set too high it may hit limits from underlying code related to database:

"SQLite users should be aware of some caveats when using bulk inserts. Specifically, your SQLite3 version must be 3.7.11.0 or newer to take advantage of the bulk insert API. Additionally, by default SQLite limits the number of bound variables in a SQL query to 999 for SQLite versions prior to 3.32.0 (2020-05-22) and 32766 for SQLite versions after 3.32.0."

https://docs.peewee-orm.com/en/latest/peewee/querying.html

I guess that's not much an issue because SQLite 3.7.11 was released more than 10 years ago (https://www.sqlite.org/releaselog/3_7_11.html).

It should be noted that those limits may impact the delete command (so we may need to rework it in order to ensure number of bound variables stays under 999 or 32766). But this is unrelated to this patch.

Copy link
Member

@mayhem mayhem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After much testing, this is great!

@mayhem mayhem merged commit 38bafae into metabrainz:main Jan 15, 2024
3 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants