Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: richer tablet/column type info in VStream copy phase output #17143

Open
maxenglander opened this issue Nov 5, 2024 · 6 comments · May be fixed by #17150
Open

Feature Request: richer tablet/column type info in VStream copy phase output #17143

maxenglander opened this issue Nov 5, 2024 · 6 comments · May be fixed by #17150
Assignees
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)

Comments

@maxenglander
Copy link
Collaborator

maxenglander commented Nov 5, 2024

Feature Description

The output of the VStream copy phase does not include certain table/column metadata. For example it does not include column DEFAULT information.

It would be great if there were a way to access that information in the VStream copy phase output, along with any other column/table metadata that could be useful.

From discussions with @mattlord he mentioned that it would make sense to have the VStream emit CREATE TABLE statements as part of the VStream copy phase, similar to how ALTER statements are included in the output of the streaming phase.

Use Case(s)

For example, people using the Debezium MySQL connector expect to see column default information in the connector output. The Vitess connector doesn't include that information, however, which can make it more difficult for people to migrate from MySQL to Vitess.

@maxenglander maxenglander added Needs Triage This issue needs to be correctly labelled and triaged Component: VReplication Component: VTGate Type: Feature Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed Component: VReplication Component: VTGate Type: Feature labels Nov 5, 2024
@mattlord mattlord added Component: VReplication and removed Needs Triage This issue needs to be correctly labelled and triaged labels Nov 5, 2024
@github-project-automation github-project-automation bot moved this to Backlog in VReplication Nov 5, 2024
@mattlord mattlord moved this from Backlog to Prioritized in VReplication Nov 5, 2024
@mattlord
Copy link
Contributor

mattlord commented Nov 5, 2024

@maxenglander do you yet have any clear idea on how this would be consumed and exposed in the connector? I'm wondering if this might be enough:

[type:DDL  statement:"CREATE TABLE `customer` (\n  `customer_id` bigint NOT NULL AUTO_INCREMENT,\n  `email` varbinary(128) DEFAULT NULL,\n  PRIMARY KEY (`customer_id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci" type:DDL  statement:"CREATE TABLE `corder` (\n  `order_id` bigint NOT NULL AUTO_INCREMENT,\n  `customer_id` bigint DEFAULT NULL,\n  `sku` varbinary(128) DEFAULT NULL,\n  `price` bigint DEFAULT NULL,\n  PRIMARY KEY (`order_id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci" type:DDL  statement:"CREATE TABLE `product` (\n  `sku` varbinary(128) NOT NULL,\n  `description` varbinary(128) DEFAULT NULL,\n  `price` bigint DEFAULT NULL,\n  PRIMARY KEY (`sku`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci"]

[type:BEGIN  keyspace:"commerce"  shard:"0" type:FIELD  field_event:{table_name:"commerce.customer"  fields:{name:"customer_id"  type:INT64  table:"customer"  org_table:"customer"  database:"vt_commerce"  org_name:"customer_id"  column_length:20  charset:63  flags:49667  column_type:"bigint"}  fields:{name:"email"  type:VARBINARY  table:"customer"  org_table:"customer"  database:"vt_commerce"  org_name:"email"  column_length:128  charset:63  flags:128  column_type:"varbinary(128)"}  keyspace:"commerce"  shard:"0"  enum_set_string_values:true}  keyspace:"commerce"  shard:"0"]
[type:VGTID  vgtid:{shard_gtids:{keyspace:"commerce"  shard:"0"  gtid:"MySQL56/2f92cad2-9b91-11ef-8d3d-334bde7bcc8c:1-45"}}  keyspace:"commerce"  shard:"0"]
[type:ROW  row_event:{table_name:"commerce.customer"  row_changes:{after:{lengths:1  lengths:16  values:"1alice@domain.com"}}  keyspace:"commerce"  shard:"0"}  keyspace:"commerce"  shard:"0" type:ROW  row_event:{table_name:"commerce.customer"  row_changes:{after:{lengths:1  lengths:14  values:"2bob@domain.com"}}  keyspace:"commerce"  shard:"0"}  keyspace:"commerce"  shard:"0" type:ROW  row_event:{table_name:"commerce.customer"  row_changes:{after:{lengths:1  lengths:18  values:"3charlie@domain.com"}}  keyspace:"commerce"  shard:"0"}  keyspace:"commerce"  shard:"0" type:ROW  row_event:{table_name:"commerce.customer"  row_changes:{after:{lengths:1  lengths:14  values:"4dan@domain.com"}}  keyspace:"commerce"  shard:"0"}  keyspace:"commerce"  shard:"0" type:ROW  row_event:{table_name:"commerce.customer"  row_changes:{after:{lengths:1  lengths:14  values:"5eve@domain.com"}}  keyspace:"commerce"  shard:"0"}  keyspace:"commerce"  shard:"0" type:VGTID  vgtid:{shard_gtids:{keyspace:"commerce"  shard:"0"  gtid:"MySQL56/2f92cad2-9b91-11ef-8d3d-334bde7bcc8c:1-45"  table_p_ks:{table_name:"customer"  lastpk:{fields:{name:"customer_id"  type:INT64  charset:63  flags:49667}  rows:{lengths:1  values:"5"}}}}}  keyspace:"commerce"  shard:"0" type:COMMIT  keyspace:"commerce"  shard:"0"]

[type:BEGIN  keyspace:"commerce"  shard:"0" type:VGTID  vgtid:{shard_gtids:{keyspace:"commerce"  shard:"0"  gtid:"MySQL56/2f92cad2-9b91-11ef-8d3d-334bde7bcc8c:1-45"}}  keyspace:"commerce"  shard:"0" type:COMMIT  keyspace:"commerce"  shard:"0"]

[type:COPY_COMPLETED  keyspace:"commerce"  shard:"0" type:COPY_COMPLETED]

That's the initial output when starting a vstream in copy mode for the commerce keyspace in the local examples — using the example vstream client, with streamCustomer := false.

@maxenglander
Copy link
Collaborator Author

@mattlord i wouldn't say i have a very clear idea, but i don't see why this wouldn't work just fine.

@deepthi
Copy link
Member

deepthi commented Nov 5, 2024

Is it up to the client to decide how to consume this event? If the tables were already created, applying the DDL event would cause an error. We are not sending IF NOT EXISTS in the statement.

@maxenglander
Copy link
Collaborator Author

I would agree that it's up to the client to decide how to use this event.

@mattlord mattlord self-assigned this Nov 6, 2024
@twthorn
Copy link
Contributor

twthorn commented Nov 6, 2024

This is related to a PR I am working on for debezium vitess connector that gets this info via issuing describe table queries at startup. If we can send create table statements over grpc by default that would simplify the code a decent bit, so I'm in favor of this. However, Debezium has different notions of snapshots eg sometimes it doesn't want all the data it just wants the schema (ie only the create table statements, followed by real-time data). Can we add a way for clients to initiate a "schema-only" snapshot? This would be a requirement from Debezium side to make this useful (otherwise we tightly couple data & schema info, and would need to fall back to issuing describe table statements directly to decouple them, in which case the code is not simplified).

The way Debezium-mysql-connector works is it issues these describe table statements and then parses the DDL statements to determine the schema of the tables (including default values). The PR I mentioned before would reuse that parser to get the same info. One of the key benefits of using the debezium-vitess-connector over the mysql-connector is that that it doesn't require a binlog parser and vitess parses the binlog before sending the info as grpc event. By sending create table statements we push complexity onto debezium to parse them. An alternative approach is we expand the FIELD event or add another event type that can describe the complete schema of a table (type, nullability, default value) in a structured grpc event format (not just the SQL string). Right now field events have the types, but lack this rich information. What are ya'll's thoughts on something like this? The FIELD event contains almost enough info to construct a complete picture of the schema, we just need this additional information (nullability, default value). This also would keep the API consistent and mean less parsing/complexity for clients (otherwise each may have to maintain its own DDL parser to use the statements).

One other thing: I noticed you all recently started this debezium-connector-planetscale fork. I saw some useful contribution in VitessType class (support BIT type). Do you all plan to contribute back to debezium-connector-vitess? We definitely welcome contributions and would be grateful for them.

@maxenglander
Copy link
Collaborator Author

hey @twthorn i'm the current maintainer of the debezium for planetscale fork. i'll reach out to you through another channel and we can chat about contributions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
Status: Prioritized
Development

Successfully merging a pull request may close this issue.

4 participants