Explore a Flink catalog based on Recap #407
Ok, so I sniffed around the Flink catalog stuff a bit. tl;dr This looks doable, but with some (potentially significant) caveats. A few notes/questions/thoughts:
Read-write would definitely be required: the idea is to use Recap as a persistent catalog when adding tables (e.g. representing a CDC source or an Elasticsearch sink connection) in an interactive Flink SQL session, making them available again after starting a new session.
Good question; how do you model this for Postgres today? There, you have a notion of "databases" (of which there could be multiple on one PG host), "schemas" (logical namespaces within a database), and "tables" (actual tables within a schema). What do Recap paths look like for that?
I don't think that's a problem whatsoever. It does expose a remote (REST) API to which Flink could talk, right?
The registry does not impose any structure. The API just takes a "path" (`foo/bar/baz`); the path structure is up to the writer. That said, for Recap tooling, the path for PG is always:

`postgresql://host:port/db/schema/table`

So, I think the mapping should be:

- catalog -> `recap`
- db -> `postgresql://host:port/db`
- table -> `schema.table` or `schema/table`

I'm also unsure why the API has `ObjectPath` for the write APIs and just a string for the read APIs. That seems odd to me. Using `ObjectPath` for the read APIs would work better for Recap, as you could do `ObjectPath("schema", "table")`.
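To make the proposed mapping concrete, here is a minimal sketch (in Python, for brevity) of converting between a Flink-style `(database, table)` pair and a Recap registry path following the `postgresql://host:port/db/schema/table` convention above. The helper names are illustrative assumptions, not actual Recap or Flink APIs:

```python
# Hypothetical mapping between Flink's (database, "schema.table") pair and a
# Recap registry path. Function names are illustrative, not real APIs.

def to_recap_path(flink_db: str, flink_table: str) -> str:
    """Flink database is the system URL; the table name is "schema.table"."""
    schema, table = flink_table.split(".", 1)
    return f"{flink_db}/{schema}/{table}"

def from_recap_path(path: str) -> tuple[str, str]:
    """Invert the mapping: the last two path segments become "schema.table"."""
    base, schema, table = path.rsplit("/", 2)
    return base, f"{schema}.{table}"

print(to_recap_path("postgresql://host:5432/mydb", "public.users"))
# -> postgresql://host:5432/mydb/public/users
```

The round trip is lossless as long as the schema and table segments themselves contain no `/`, which matches the slash-delimited path convention described above.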
Correct. Lastly, for (4), I want to make sure: this is still useful even if Recap doesn't store function/partition/stats, right?
In Apache Flink, catalogs "provide metadata, such as databases, tables, partitions, views, and functions and information needed to access data stored in a database or other external systems". There is an in-memory implementation which keeps any information only in the context of specific sessions. A persistent implementation is provided in the form of the `HiveCatalog`, using Apache Hive (more precisely, the Hive Metastore) as the underlying storage layer.

The purpose of this issue is to explore the feasibility of creating a persistent Flink catalog on top of Recap (or, specifically, its schema registry). This could be interesting to Flink users looking for an alternative to the Hive-based catalog implementation. Note that the `Catalog` contract has some facets which probably are not supported in Recap, e.g. the ability to store metadata about functions and statistics. I'm not sure whether Recap's core design is extensible enough to allow for storing this kind of additional information?