Add ARRAY type to SqlAlchemyReader #257
Comments
SQLAlchemy has some weird behavior when it comes to ARRAY types with more than one dimension. Technically SQLAlchemy supports multidimensional arrays (see the ARRAY docs); the way to use them is to pass dimensions=N to the ARRAY constructor. While adding support for arrays to Recap, I model them as a Recap List, so multidimensional arrays can be defined as lists of lists of lists ad infinitum (or almost, anyway). To do that, we need access to the dimensions attribute of the ARRAY object within the SqlAlchemyReader, which should be straightforward, but here's where things get weird with SQLA. It seems that when the table metadata is generated from the constructed engine, the dimensions attribute is gone. See the code below for a demonstration of this behavior.
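A minimal sketch of what I mean, assuming a local Postgres at the hypothetical DSN below; the table and column names are placeholders:

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, inspect
from sqlalchemy.dialects.postgresql import ARRAY

metadata = MetaData()
table = Table(
    "dim_test",
    metadata,
    Column("scores", ARRAY(Integer, dimensions=2)),
)

# The declared type keeps the dimension information.
print("declared:", table.c.scores.type.dimensions)

engine = create_engine("postgresql://user:pass@localhost/testdb")  # hypothetical DSN
metadata.create_all(engine)

# Reflecting the table back through the inspector...
inspector = inspect(engine)
for column in inspector.get_columns("dim_test"):
    print("inspector:", column["name"], getattr(column["type"], "dimensions", None))

# ...or reflecting it from the engine directly.
reflected = Table("dim_test", MetaData(), autoload_with=engine)
print("reflected:", reflected.c.scores.type.dimensions)
```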
And a small output sample:
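Roughly what that prints under the sketch above; per the behavior described here, the reflected type no longer carries the declared dimensions:

```text
declared: 2
inspector: scores None
reflected: None
```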
As we can see in the last two parts of the output, using the inspector, or even the engine directly, results in the loss of the dimension information. I'm not sure how to address this, though. One option is to only support 1D arrays, which could work, but it feels lazy to me. There isn't any straightforward way to keep the original metadata object in our case, which could have helped since the information is preserved there, but we can't expect Recap to be operating in environments that have access to it. Maybe reach out to SQLAlchemy about this? It doesn't look like desired behavior to me; what's the point of having less information after the engine has been created? @criccomini WDYT?
Haven't had a chance to read the full comment, but I notice you're linking to SA 2.0. Recap currently uses 1.4. I suspect there's a difference in array handling between the two. I opted for 1.4 because I assumed many SA drivers are still on 1.4, not 2.0. If the Snowflake, Hive Metastore, and BQ SA drivers all work with 2.0, then we should probably just upgrade. Not sure if they're compatible, though.
Hmm, so the docs look the same for 1.4 and 2.0. I poked around https://github.com/sqlalchemy/sqlalchemy/issues and StackOverflow, but didn't find much. I wonder if it's worth trying with SA 2.0 to see if this issue is fixed? If not, I agree with you, it seems like a bug (unless I'm misunderstanding something about the semantics of dimensions). Anyway, 2.0 is worth a shot, I suppose.
Another thought: perhaps it's the driver? Wonder if psycopg3 (with SA 2.0) or pg8000 yields similar behavior?
Yeah, I linked to the wrong documentation version, but I'm also developing locally with SA 1.4.15 (the one that Recap comes with). I'll try SA 2.0 to see if anything changes. The problem with SA is that the bug might be in many different places; it might be an Engine issue, for example, which then comes down to the Postgres-specific implementation, and so on.
I wonder if it's because of this: It's from 2012, but the fact that PG doesn't enforce dimensions might mean the driver isn't bothering.
PG 14 docs say the same.
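For what it's worth, a quick way to see that this is a Postgres-level limitation rather than an SA one is to ask information_schema directly: Postgres stores only the element type, so a 1D and a 2D declaration look identical. A rough sketch, assuming psycopg2 is available and with placeholder connection details:

```python
import psycopg2  # assumes the psycopg2 driver is installed

conn = psycopg2.connect("dbname=testdb user=user password=pass host=localhost")
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS pg_dim_test (a integer[], b integer[][])")
    cur.execute(
        """
        SELECT column_name, data_type, udt_name
        FROM information_schema.columns
        WHERE table_name = 'pg_dim_test'
        ORDER BY ordinal_position
        """
    )
    for row in cur.fetchall():
        print(row)  # both columns come back with data_type 'ARRAY' and udt_name '_int4'
conn.close()
```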
I tried with SA 2.0 and it still behaves the same way; the dimensions metadata is lost after the engine is created. But even if it worked, both the Snowflake and BQ dialects require SA <2.0 :( My limited experience so far with SA gives me mixed feelings about using it as an abstraction layer over the databases. Its type system is good, but the implementation is too tightly intertwined with the different dialects and engines, which is OK when you care about the ORM part, since you'll most probably be working with a specific database in your project anyway. But our scope feels different to me. For example, you can't really work with the ARRAY type without having a PostgreSQL around, and although they have a mock_engine, it's only good for generating SQL; inspection doesn't work there, so it's not that useful in the end. In the end, Recap only cares about the metadata, right? If that's the case, does it make sense to inherit the complexity of SA's tooling and functionality outside the type system?
Yea, I went back and forth on that. I think the alternative would be implementing individual converters for every single DB we care about, using their dbapi driver directly. It's definitely doable, but a bit annoying. WDYT?
K, I'm going to try a non-SQLAlchemy impl that goes straight from dbapi to Recap schema. Hoping ChatGPT can make the code less tedious to write.
@cpard for Hive Metastore, how do you want me to implement that? I can use PyHive, but that connects to Hive/Presto directly. It looks to me like Hive Metastore has a Thrift API. There's a https://github.com/quintoandar/hive-metastore-client client, but it doesn't seem to expose metadata info, just partitions. This is kinda messy.
If we're going to constrain our interactions to information_schema alone, and these are going to be primarily read operations (selects), then having such expressive middleware as SA is probably overkill and too heavy a dependency. How do you feel about using https://github.com/tobymao/sqlglot#sql-execution, which has no dependencies and supports ~20 dialects (DBs)? From there, we just restrict interactions to information_schema only as a start. I'll do some digging to see what extensions people have made on top of information_schema for the popular DBs and data warehouses, and we'll see if it's worth adding a lot more on top. WDYT?
I think the best way to do it is to use the Thrift API and generate the client from there. I went through the Thrift file and there's metadata info in there, from types to column statistics. I checked https://github.com/trinodb/hive-thrift/blob/master/src/main/thrift/hive_metastore.thrift, which is the version of the Thrift spec Trino is currently using, and I think it's also the latest. The Thrift interface will also give us out-of-the-box access to AWS Glue, Databricks Unity (they just announced an HMS API for it), and Google Dataproc. Apparently, Iceberg (Tabular) has a REST API for their catalog, and I need to check Hudi (Onehouse), but I'm pretty sure they support HMS/Glue/etc. and don't have their own catalog implementation. I'll give it a try today to generate a client using the Thrift definition that Trino is currently using; a rough sketch of what that looks like is below.
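Roughly what using a client generated from that Thrift file looks like (a sketch, not tested; the generated module name depends on how the bindings were produced, and host/port/table names are placeholders):

```python
# Bindings generated with e.g. `thrift --gen py hive_metastore.thrift`.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_metastore import ThriftHiveMetastore  # generated module name may differ

# Connect to the Hive Metastore Thrift service (default port is typically 9083).
transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9083))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = ThriftHiveMetastore.Client(protocol)

transport.open()
table = client.get_table("default", "my_table")  # database name, table name
for col in table.sd.cols:  # FieldSchema entries: name, type, comment
    print(col.name, col.type)
transport.close()
```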
@criccomini check this #260
Yea, I got this one in #258
Yea, I've played with it a bit. Recap was using it at one point. I don't think we need it, though. The columns that we need in information_schema are standard (I checked this already), so a simple dbapi query should work for a bunch of DBs. The type translation that sqlglot provides might be useful, but then I feel like we're just swapping SQLAlchemy for sqlglot, with the same set of problems. Notably, Oracle, Teradata, Vertica, and SQLite do not have information_schema. But again, I think that's fine. Since we're going down the path of implementing each individually (similar to what sqlglot did for SQL), we can handle these four (and others) as needed.
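To make the "simple dbapi query" concrete, something along these lines is what I have in mind (a rough sketch with a hypothetical connection string and type mapping, not the actual reader implementation):

```python
import psycopg2  # any dbapi driver works; psycopg2 is just the example here

# Hypothetical mapping from information_schema data_type strings to Recap-style types.
TYPE_MAP = {"integer": "int32", "bigint": "int64", "text": "string", "boolean": "bool"}

def read_columns(conn, schema: str, table: str) -> list[dict]:
    """Read column metadata for one table straight from information_schema."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_schema = %s AND table_name = %s
            ORDER BY ordinal_position
            """,
            (schema, table),
        )
        return [
            {
                "name": name,
                "type": TYPE_MAP.get(data_type, data_type),
                "nullable": nullable == "YES",
            }
            for name, data_type, nullable in cur.fetchall()
        ]

conn = psycopg2.connect("dbname=testdb user=user host=localhost")  # hypothetical DSN
print(read_columns(conn, "public", "some_table"))
```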
Ah yeah, I didn't consider sqlglot for the translation functionality. I just liked that it can connect to all these DBs and execute queries without any additional dependencies. A simple dbapi approach would be ideal!
After some discussion in #257, I decided to replace SQLAlchemy with `dbapi` readers. The SQLAlchemy types weren't mapping as nicely as I'd hoped to Recap's. Now, DB types are mapped straight to Recap types. This PR adds a `PostgresqlReader` and `SnowflakeReader`. The PG reader has a very basic test, but the Snowflake reader is untested. I plan to use `fakesnow` for Snowflake in the future. Closes #259
That's awesome! I'll give it a try today!
Great, thanks! I haven't tested the Snowflake one at all, so it's likely to fail. Just LMK if/how it does and we can spruce it up. I plan to write some tests for it, but I'm kinda hoping `fakesnow` will do most of the heavy lifting there.
Closing this issue as we're no longer using SQLAlchemy.
See https://docs.sqlalchemy.org/en/14/core/type_basics.html#sqlalchemy.types.ARRAY for details. This isn't supported in SqlAlchemyReader right now.