Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

diff tables reports NotImplementedError: Cannot use a column of type if the table is empty #891

Closed
jsnb-devoted opened this issue Apr 30, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments

@jsnb-devoted
Copy link

Describe the bug
When performing a diff between a table in postres and snowflake -- the diff function will fail if Snowflake primary key field is a TEXT field and the table is empty.

I noticed in the logs for successful diffs that tables that were not empty were reporting that the id field was a String_UUID -- for example:

[2024-04-28, 04:20:44 UTC] {schema.py:44} INFO - [PostgreSQL] Schema = {'id': Native_UUID(_notes=[]), 'updated_at': Timestamp(_notes=[], precision=6, rounds=True)}
[2024-04-28, 04:22:38 UTC] {schema.py:44} INFO - [Snowflake] Schema = {'ID': String_UUID(_notes=[], collation=None, lowercase=True, uppercase=False), 'UPDATED_AT': Timestamp(_notes=[], precision=9, rounds=False)}

For the table that failed the log looked like this:

[2024-04-28, 08:06:27 UTC] {schema.py:44} INFO - [PostgreSQL] Schema = {'id': Native_UUID(_notes=[]), 'validation_logic_json': Text(_notes=[], collation=None), 'in_control_logic_json': Text(_notes=[], collation=None)}
[2024-04-28, 08:08:27 UTC] {schema.py:44} INFO - [Snowflake] Schema = {'ID': Text(_notes=[], collation=None), 'IN_CONTROL_LOGIC_JSON': Text(_notes=[], collation=None), 'VALIDATION_LOGIC_JSON': Text(_notes=[], collation=None)}

I dug in a little deeper and noticed that the types are determined by sample some of the data via that function _refine_coltypes. It feels a bit like without any values to sample it defaults the type to the type in the db and then eventually raises this NotImplementedError

Make sure to include the following (minus sensitive information):
We are importing the package and calling it from a python script in airflow -- the abreviated code looks like this:

table1: TableSegment = connect_to_table(**postgres_details)
table2: TableSegment = connect_to_table(**snowflake_details)
differ = HashDiffer()
diff_result_wrapper: DiffResultWrapper = differ.diff_tables(table1, table2)
...

Here is an excerpt of the traceback:

  File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/data_differ.py", line 60, in compare_dbs
    control_flow(
  File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/helpers.py", line 945, in control_flow
    diff_magic(
  File "/usr/local/airflow/dags/warehouse/data_engineering/avalanche_data_integrity/helpers.py", line 744, in diff_magic
    for sign, columns in diff_result_wrapper:
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 95, in __iter__
    for i in self.diff:
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 266, in _diff_tables_wrapper
    raise error
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 239, in _diff_tables_wrapper
    yield from self._diff_tables_root(table1, table2, info_tree)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 274, in _diff_tables_root
    return self._bisect_and_diff_tables(table1, table2, info_tree)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/data_diff/diff_tables.py", line 299, in _bisect_and_diff_tables
    raise NotImplementedError(f"Cannot use a column of type {kt} as a key")
NotImplementedError: Cannot use a column of type Text(_notes=[], collation=None) as a key

If possible, please paste these as text, and not a screenshot.

Describe the environment
Running in airflow with data-diff==0.11.1

@jsnb-devoted jsnb-devoted added the bug Something isn't working label Apr 30, 2024
@glebmezh
Copy link
Contributor

Hi @jsnb-devoted,

Thank you for trying out data-diff and for taking the time to open this issue. We made a hard decision to sunset the data-diff package and won't provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. We have completely rewritten the diffing engine in the cloud over the past few months and have solved the fundamental issues with the original algorithm used in the data-diff package. Feel free to take it for a trial or contact us at [email protected] if you have any questions.

-Gleb

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

2 participants