This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Integers with NULLs can lead to rounding when using the Parquet file format #404

Open
s7clarke10 opened this issue Aug 28, 2023 · 0 comments
Labels
bug Something isn't working

Comments

@s7clarke10
Contributor

Describe the bug
In certain circumstances, where a column is an INT on the source and contains a NULL value in at least one row, there is an implicit conversion from INT to FLOAT, which leads to rounding.

This is specific to Target Snowflake using the parquet format to load data into Snowflake. The records are first placed in a Pandas DataFrame, and the default integer dtype (int64) cannot hold NULL values. When an integer column contains nulls, Pandas therefore automatically converts that column to the float64 dtype, which cannot represent every 64-bit integer exactly. The issue is resolved if the nullable Int64 dtype is used, but the default conversion to a Pandas DataFrame uses the basic int dtype (which doesn't support nulls).

A fix for this is to provide a dtype hint in the conversion so that every column is cast as object, which prevents any implicit conversion. This does not seem to affect target-snowflake's ability to land data correctly.

To Reproduce
Steps to reproduce the behavior:

  1. Load data from a source system with a single row that has a bigint column holding the integer value 9223372036854775807
  2. Ingest the data. You should have the correct result in Target Snowflake
  3. Add another row in the source with a value of NULL in the bigint column
  4. Ingest the data. You will see rounding in the bigint column: the last digit will be off by one
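
The off-by-one in step 4 can be reproduced without Pandas or Snowflake. The following is a minimal sketch using only plain Python floats (which are float64); the variable names are illustrative and not taken from target-snowflake.

```python
# Largest signed 64-bit integer, the value used in the steps above.
BIGINT_MAX = 9223372036854775807  # 2**63 - 1

# float64 has a 53-bit significand, so not every integer above 2**53
# can be represented exactly. Converting rounds to the nearest float.
as_float = float(BIGINT_MAX)
round_tripped = int(as_float)

print(round_tripped)               # 9223372036854775808
print(round_tripped - BIGINT_MAX)  # 1
```

Any integer column that passes through float64 is exposed to this once its values exceed 2**53, which is why the problem only shows up for large bigint values.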

Expected behavior
Rounding should not occur; the actual integer value should be replicated from the source unmodified.

Your environment

  • Version of target: latest
  • Version of Python: 3.8

Additional context

The issue is in this line:

return pandas.DataFrame(data=flattened_records)

Before Change:
return pandas.DataFrame(data=flattened_records)

After Change:
return pandas.DataFrame(data=flattened_records, dtype='object')
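
To illustrate the difference between the two calls, here is a small standalone sketch. The `flattened_records` list below is a made-up stand-in for what the target builds per batch (the real structure may differ), and the dtype inference shown is what current Pandas releases are expected to do for a list of dicts.

```python
import pandas as pd

# Hypothetical stand-in for the flattened records of one batch.
flattened_records = [
    {"id": 9223372036854775807},  # largest bigint value
    {"id": None},                 # NULL in the same column
]

# Before: the null makes pandas infer float64 for the whole column.
before = pd.DataFrame(data=flattened_records)

# After: dtype='object' keeps the original Python int untouched.
after = pd.DataFrame(data=flattened_records, dtype="object")

print(before["id"].dtype, int(before["id"].iloc[0]))
print(after["id"].dtype, after["id"].iloc[0])
```

With the object dtype the column holds plain Python objects, so no numeric conversion happens before the data is written out. The trade-off is that object columns are slower and use more memory than native numeric columns.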
