Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Snowflake and Postgres dialects normalize dates differently than other database types #877

Closed
RenanBasilio opened this issue Mar 28, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments

@RenanBasilio
Copy link

RenanBasilio commented Mar 28, 2024

Describe the bug
In the Snowflake and Postgres database dialects date columns are normalized differently than in other dialects. This causes a false positive diff when comparing those dates with date columns in other database types.

For example, consider the comparison between MySQL and Snowflake:

This is what the schemas look like (from logging with --verbose):

INFO     [MySQL] Schema = {
    'id': Integer(_notes=[], precision=0, python_type=<class 'int'>),
    'some_date': Date(_notes=[], precision=6, rounds=True)}
INFO     [Snowflake] Schema = {
    'ID': Decimal(_notes=[], precision=0),
    'SOME_DATE': Date(_notes=[], precision=6, rounds=False)}

This query is used when pulling a date column from the MySQL database:

SELECT
    cast(`id` as char),
    cast(cast( cast(`some_date` as datetime(6)) as datetime(6)) as char)
FROM `some_schema`.`some_table`

This returns a pair like

1, 2024-03-28 00:00:00.000000

In turn, in Snowflake the query is

SELECT
    cast(cast("ID" as decimal(38, 0)) as string),
    ("SOME_DATE"::varchar)
FROM "SOME_SCHEMA"."SOME_TABLE"

which returns a row like

1, 2024-03-28

This is then detected as a mismatch.

This seems to stem from the normalize_timestamp function in the database dialect. In the BaseDialect, it is documented that the function should return a date formatted as YYYY-MM-DD HH:mm:SS.FFFFFF, which is accurate to the MySQL result but not to the Snowflake or Postgres dialects which only cast it directly as a string.

def normalize_timestamp(self, value: str, coltype: TemporalType) -> str:
if coltype.rounds:
return self.to_string(f"cast( cast({value} as datetime({coltype.precision})) as datetime(6))")
s = self.to_string(f"cast({value} as datetime(6))")
return f"RPAD(RPAD({s}, {TIMESTAMP_PRECISION_POS+coltype.precision}, '.'), {TIMESTAMP_PRECISION_POS+6}, '0')"

def normalize_timestamp(self, value: str, coltype: TemporalType) -> str:
def _add_padding(coltype: TemporalType, timestamp6: str):
return f"RPAD(LEFT({timestamp6}, {TIMESTAMP_PRECISION_POS+coltype.precision}), {TIMESTAMP_PRECISION_POS+6}, '0')"
try:
is_date = coltype.is_date
is_time = coltype.is_time
except:
is_date = False
is_time = False
if isinstance(coltype, Date) or is_date:
return f"cast({value} as varchar)"

Describe the environment

Data Diff Open Source v0.11.1
Python 3.9.12
MacOSX Sonoma 14.2.1 (23C71)

@RenanBasilio RenanBasilio added the bug Something isn't working label Mar 28, 2024
@glebmezh
Copy link
Contributor

Hi @RenanBasilio,

Thank you for trying out data-diff and for taking the time to open this issue. We made a hard decision to sunset the data-diff package and won't provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. We have completely rewritten the diffing engine in the cloud over the past few months and have solved the fundamental issues with the original algorithm used in the data-diff package. Feel free to take it for a trial or contact us at [email protected] if you have any questions.

-Gleb

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

2 participants