Integers with NULL's can lead to rounding when using the Parquet file format #404

s7clarke10 · 2023-08-28T21:08:59Z

Describe the bug
When a column is an INT on the source and contains a NULL value in at least one row, there is an implicit conversion from INT to FLOAT, which leads to rounding.

This is specific to Target Snowflake using the Parquet format to load data into Snowflake. The records pass through a Pandas DataFrame, and the default integer dtype (int64) cannot hold a NULL value. To work around this, Pandas automatically converts any integer column that contains nulls to the float64 dtype, which cannot represent every 64-bit integer exactly. The issue is resolved if the nullable Int64 dtype is used; however, the default conversion to a Pandas DataFrame uses the basic int64 dtype, which doesn't support nulls.
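The dtype promotion described above can be seen in isolation. This is a minimal sketch, assuming pandas is installed; the record lists and variable names are made up for illustration and are not taken from target-snowflake.

```python
import pandas as pd

# Hypothetical records: the same integer column, without and with a NULL.
records_no_null = [{"id": 1}, {"id": 2}]
records_with_null = [{"id": 1}, {"id": None}]

df_no_null = pd.DataFrame(data=records_no_null)
df_with_null = pd.DataFrame(data=records_with_null)

# Without a NULL the column keeps an integer dtype;
# with a NULL pandas promotes the whole column to float64.
print(df_no_null["id"].dtype)
print(df_with_null["id"].dtype)

# The nullable extension dtype has to be requested explicitly.
nullable = pd.Series([1, None], dtype="Int64")
print(nullable.dtype)
```

The point of the sketch is only that the promotion is decided per column by the presence of a single NULL, which is why the problem appears after an otherwise harmless row is added.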

A fix for this is to provide a hint in the conversion that casts every column as an object, thus preventing any implicit conversion. This does not seem to affect target-snowflake's ability to land data correctly.

To Reproduce
Steps to reproduce the behavior:

1. Load data from a source system with a single row that has a bigint column with an integer value of 9223372036854775807.
2. Ingest the data. You should have the correct result in Target Snowflake.
3. Add another row in the source with a value of NULL in the bigint column.
4. Ingest the data. You will see rounding in the bigint column: the last digit will be out by one.
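The off-by-one in the last digit follows from float64 arithmetic alone and can be checked without pandas or Snowflake. A small sketch using only the Python standard library; the variable names are illustrative.

```python
# Largest signed 64-bit integer, the value used in the reproduction steps.
BIGINT_MAX = 9223372036854775807  # 2**63 - 1

# A float64 has a 53-bit significand, so near 2**63 it can only
# represent multiples of 1024. The nearest representable value is 2**63.
as_float = float(BIGINT_MAX)
round_tripped = int(as_float)

print(round_tripped)               # 9223372036854775808
print(round_tripped - BIGINT_MAX)  # 1
```

Any integer whose magnitude exceeds 2**53 is at risk of the same silent change once its column has been promoted to float64.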

Expected behavior
Rounding should not occur. The actual integer value should be replicated from the source unmodified.

Your environment

Version of target: latest
Version of python: 3.8

Additional context

The issue is in this line:

pipelinewise-target-snowflake/target_snowflake/file_formats/parquet.py

Line 69 in c0806f0

return pandas.DataFrame(data=flattened_records)

Before Change:
return pandas.DataFrame(data=flattened_records)

After Change:
return pandas.DataFrame(data=flattened_records, dtype='object')
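The effect of the change can be compared side by side. This is a rough sketch, assuming pandas is installed; `flattened_records` here is a hand-written stand-in for the value built in parquet.py, not the real data structure from the target.

```python
import pandas as pd

# Hypothetical stand-in for the flattened records passed to the DataFrame.
flattened_records = [
    {"id": 9223372036854775807},
    {"id": None},
]

# Before the change: the column is promoted to float64 and the value is rounded.
before = pd.DataFrame(data=flattened_records)
before_value = int(before["id"].iloc[0])

# After the change: dtype='object' keeps the original Python int untouched.
after = pd.DataFrame(data=flattened_records, dtype="object")
after_value = after["id"].iloc[0]

print(before["id"].dtype, before_value)
print(after["id"].dtype, after_value)
```

Object columns trade away vectorised numeric operations, which seems acceptable here because the DataFrame is only an intermediate step before the Parquet file is written.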

s7clarke10 added the bug label Aug 28, 2023

mjsqu mentioned this issue Aug 30, 2023

Feat/set dtype object in pandas data frame mjsqu/pipelinewise-target-snowflake#32

Merged

s7clarke10 mentioned this issue Sep 10, 2024

feat: Support Singer Decimal as a config item meltano/sdk#1890

Open
