Silver Transformations - Silver Dataflowspec Table handling of similar table names #112

Open
kosch34 opened this issue Oct 29, 2024 · 6 comments

kosch34 commented Oct 29, 2024

I have three tables with the same name, each with a different schema and each pointing to a different target schema. I have built one silver_transformation file per table and configured the onboarding file so that each table points to the appropriate silver_transformation file. When I run the create_silver_dataflowspec_table() function, the silver_dataflowspec table ends up with duplicate records, and the selectExpr of the first instance of the table is used across all three tables. This results in schema mismatches for two of the tables. Seems like a potential bug to me.

For example,

schema_1.table_1 - silver_transformation_schema_1
schema_2.table_1 - silver_transformation_schema_2
schema_3.table_1 - silver_transformation_schema_3

The resulting silver_dataflowspec table has 3 entries per group (9 total records), with all records taking the selectExpr of only one of the silver_transformation files.
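
A minimal sketch of where the 9 records come from, using plain Python lists as a stand-in for the Spark DataFrames. The row shapes and column values here are illustrative assumptions, not the actual dlt-meta schema. An inner join keyed only on the table name matches every onboarding row against every transformation row with that name:

```python
# Plain-Python stand-in for an inner join keyed on table name only.
# Row shapes are illustrative assumptions, not the real dlt-meta schema.
onboarding_rows = [
    {"database": "schema_1", "table": "table_1"},
    {"database": "schema_2", "table": "table_1"},
    {"database": "schema_3", "table": "table_1"},
]
transformation_rows = [
    {"target_table": "table_1", "select_exp": ["col_a"]},  # meant for schema_1
    {"target_table": "table_1", "select_exp": ["col_b"]},  # meant for schema_2
    {"target_table": "table_1", "select_exp": ["col_c"]},  # meant for schema_3
]

# Every onboarding row pairs with every transformation row of the same name.
joined = [
    {**spec, "select_exp": t["select_exp"]}
    for spec in onboarding_rows
    for t in transformation_rows
    if t["target_table"] == spec["table"]
]

print(len(joined))  # 9
```

Each of the three target schemas ends up paired with all three select expressions, so whichever match is picked up first wins for every table.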

Let me know if I am being clear. I can put together a more thorough example if needed.

Thank you for the help

@ravi-databricks
Contributor

ravi-databricks commented Oct 29, 2024

Yeah, this can be an issue, since the join condition in this code is on the table name only. We can add a database attribute to the silver transformation files so that the join is on the combination of table name and database.

ravi-databricks added the bug label Oct 29, 2024
@mweirath

Thanks for the quick response; this has become a blocker for us on several tables. Do we need to create a custom version of this code to handle this use case? Or do you think you will be able to create a pre-release version?

@ravi-databricks
Contributor

ravi-databricks commented Oct 29, 2024

We can put this into the issue_112 branch, so you might need to work from that branch.

@ravi-databricks
Contributor

Add a database attribute to silver_transformations, e.g.:

[
 {
   "target_table": "customers",
   "database": "uc.schemaname",
   "select_exp": [
     "address",
     "email",
     "firstname",
     "id",
     "lastname",
     "operation_date",
     "operation",
     "_rescued_data"
   ],
   "where_clause": [
     "id IS NOT NULL",
     "email is not NULL"
   ]
 }
]
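
Before onboarding, it may help to check that every transformation entry carries the new attribute. The sketch below is a hypothetical helper (`check_transformations` is not part of dlt-meta) that flags entries missing `target_table` or `database` and reports any `(database, target_table)` pair defined more than once. The sample data is made up for illustration. Parsing with `json.loads` also surfaces syntax slips such as an unbalanced quote.

```python
import json
from collections import Counter


def check_transformations(entries):
    """Return a list of problems found in parsed silver transformation entries.

    Hypothetical helper for illustration; not part of dlt-meta.
    """
    problems = []
    for i, entry in enumerate(entries):
        for key in ("target_table", "database"):
            if not entry.get(key):
                problems.append(f"entry {i}: missing '{key}'")
    # Count complete (database, target_table) pairs to spot duplicates.
    counts = Counter(
        (e["database"], e["target_table"])
        for e in entries
        if e.get("database") and e.get("target_table")
    )
    for (database, table), n in counts.items():
        if n > 1:
            problems.append(f"{database}.{table}: defined {n} times")
    return problems


# Made-up sample: the third entry lacks the database attribute.
sample = json.loads("""
[
  {"target_table": "customers", "database": "uc.schema_1", "select_exp": ["id"]},
  {"target_table": "customers", "database": "uc.schema_2", "select_exp": ["id", "email"]},
  {"target_table": "customers", "select_exp": ["id"]}
]
""")

for problem in check_transformations(sample):
    print(problem)  # entry 2: missing 'database'
```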

Change the join condition in silver_onboard_dataflowspec as below:

    # Join on both table name and database so same-named tables stay distinct.
    silver_data_flow_spec_df = silver_transformation_json_df.join(
        silver_data_flow_spec_df,
        (silver_transformation_json_df.target_table == silver_data_flow_spec_df.targetDetails["table"])
        & (silver_transformation_json_df.database == silver_data_flow_spec_df.targetDetails["database"]),
    )
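
The effect of the composite key can be sketched in plain Python, again as a stand-in for the Spark join rather than the actual dlt-meta code. The row shapes and values are illustrative assumptions. Keying the lookup on `(database, target_table)` gives each onboarding row exactly one match:

```python
# Plain-Python stand-in for a join on (database, table) instead of table alone.
# Row shapes are illustrative assumptions, not the real dlt-meta schema.
onboarding_rows = [
    {"database": "schema_1", "table": "table_1"},
    {"database": "schema_2", "table": "table_1"},
    {"database": "schema_3", "table": "table_1"},
]
transformation_rows = [
    {"database": "schema_1", "target_table": "table_1", "select_exp": ["col_a"]},
    {"database": "schema_2", "target_table": "table_1", "select_exp": ["col_b"]},
    {"database": "schema_3", "target_table": "table_1", "select_exp": ["col_c"]},
]

# Index transformations by the composite key.
by_key = {(t["database"], t["target_table"]): t for t in transformation_rows}

# Inner-join semantics: keep only onboarding rows that have a matching key.
joined = [
    {**spec, "select_exp": by_key[(spec["database"], spec["table"])]["select_exp"]}
    for spec in onboarding_rows
    if (spec["database"], spec["table"]) in by_key
]

for row in joined:
    print(row["database"], row["table"], row["select_exp"])
```

With the database included in the key, the result has one row per target table, each carrying its own select expression.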

dattawalake pushed a commit to dattawalake/dlt-meta that referenced this issue Oct 30, 2024
…flowspec Table handling of similar table names
dattawalake pushed a commit to dattawalake/dlt-meta that referenced this issue Oct 31, 2024
…flowspec Table handling of similar table names
ravi-databricks added a commit that referenced this issue Oct 31, 2024
issue #112 fix for Silver Transformations - Silver Dataflowspec Table…
@ravi-databricks
Contributor

@kosch34 @mweirath I added a fix in the issue branch; please see if it works!

ravi-databricks self-assigned this Oct 31, 2024
ravi-databricks added this to the v0.0.9 milestone Oct 31, 2024
@mweirath

Thanks for the fix, @ravi-databricks. It might be early next week before we get a chance to test it. We are in the middle of a release and focused on that for a couple of days.
