This repository has been archived by the owner on May 17, 2024. It is now read-only.

Version change from v0.8.5rc1 to v0.11.1 it slows down the data-diff process #882

Closed
saurasingh opened this issue Apr 4, 2024 · 1 comment
Labels
bug Something isn't working triage

Comments

@saurasingh

Describe the bug
We use data-diff to find differences between a source table in MySQL and a destination table in Snowflake.

Here is how we do it.

We create a source connector:

    source_con = data_diff.connect_to_table(
        source_connection,
        table_name=f"{table}",
        key_columns=f"{col}",
        thread_count=thread_count,
    )

We create a destination connector:

    target_con = data_diff.connect_to_table(
        target_connection,
        f"{target_db.upper()}.{target_schema.upper()}.{table.upper()}",
        f"{col.upper()}",
        thread_count=thread_count,
    )

Then we call data_diff.diff_tables like this:

    diff_table = data_diff.diff_tables(
        source_con,
        target_con,
        bisection_factor=bisection_factor,
        threaded=True,
        max_threadpool_size=max_threadpool_size,
        max_key=max_id,
    )

With version v0.8.5rc1 this works fine and goes into bisection only if there is a difference. When we upgrade to any other version (the last version we upgraded to is v0.11.1), it runs really slowly: it goes into bisection to check for differences but does not find any.
    [2024-04-04, 22:04:34 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 39/100, key-range: (34680062)..(34696282), size <= 1621789
    [2024-04-04, 22:04:35 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 40/100, key-range: (34696282)..(34712502), size <= 1621789
    [2024-04-04, 22:04:35 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 41/100, key-range: (34712502)..(34728722), size <= 1621789
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 42/100, key-range: (34728722)..(34744942), size <= 1621789
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 43/100, key-range: (34744942)..(34761162), size <= 1621789
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 44/100, key-range: (34761162)..(34777382), size <= 1621789
    [2024-04-04, 22:04:49 UTC] {hashdiff_tables.py:260} INFO - . . Diff found 0 different rows.
    [2024-04-04, 22:04:49 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 45/100, key-range: (34777382)..(34793602), size <= 1621789
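To see which segments account for the delay, the gap between consecutive "Diffing segment" lines can be measured straight from a log like the one above. The following is a minimal sketch, not part of data-diff: the helper name `segment_start_gaps` is hypothetical, and the regular expression assumes the Airflow-style `[date, time UTC] {file:line} LEVEL - message` prefix seen in this log.

```python
import re
from datetime import datetime

# Matches only the "Diffing segment" lines, e.g.
# [2024-04-04, 22:04:36 UTC] {hashdiff_tables.py:180} INFO - . . Diffing segment 44/100, ...
# The prefix is matched strictly so the pattern also works when
# several log entries have been pasted onto a single line.
SEGMENT_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}, \d{2}:\d{2}:\d{2}) UTC\] "
    r"\{[^}]+\} \w+ - [ .]*Diffing segment (?P<idx>\d+)/(?P<total>\d+)"
)


def segment_start_gaps(log_text):
    """Return [(segment_index, seconds_until_next_segment_started), ...]."""
    starts = [
        (int(m["idx"]), datetime.strptime(m["ts"], "%Y-%m-%d, %H:%M:%S"))
        for m in SEGMENT_RE.finditer(log_text)
    ]
    # Pair each segment start with the start of the segment that follows it.
    return [
        (idx, (next_ts - ts).total_seconds())
        for (idx, ts), (_, next_ts) in zip(starts, starts[1:])
    ]
```

Sorting the result by the second element puts the slowest segments first. The timestamps only have one-second resolution, so short segments will show up as 0 or 1 second.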

This behavior makes the data-diff task run for a couple of hours on the new version, whereas on the older version it finished in only a few minutes.
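For readers unfamiliar with the "bisect only when there is a difference" behavior described above, here is a toy, in-memory sketch of checksum-and-bisect diffing. It is a simplified illustration of the general idea and not data-diff's actual implementation: `checksum`, `diff_keys`, and `threshold` are hypothetical names, tables are modeled as plain dicts keyed by integer, and only `bisection_factor` mirrors a parameter from the call above.

```python
import hashlib


def checksum(rows, lo, hi):
    """Hash every (key, value) pair whose key falls in [lo, hi)."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}:{rows[key]}".encode())
    return digest.hexdigest()


def diff_keys(a, b, lo, hi, bisection_factor=4, threshold=8):
    """Return the sorted keys in [lo, hi) whose rows differ between a and b."""
    if checksum(a, lo, hi) == checksum(b, lo, hi):
        return []  # identical segment: no need to look inside it
    if hi - lo <= threshold:
        # Small enough: compare row by row (a missing row counts as a difference).
        return sorted(k for k in range(lo, hi) if a.get(k) != b.get(k))
    step = -(-(hi - lo) // bisection_factor)  # ceiling division
    found = []
    for start in range(lo, hi, step):
        # Recurse only into sub-segments of a segment whose checksums differed.
        found += diff_keys(a, b, start, min(start + step, hi),
                           bisection_factor, threshold)
    return found
```

In this sketch, two identical tables cost a single checksum comparison per side, which is the behavior we see on v0.8.5rc1. Splitting into segments when the tables match, as in the log above, would correspond to skipping that first early return.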


@saurasingh saurasingh added the bug Something isn't working label Apr 4, 2024
@github-actions github-actions bot added the triage label Apr 4, 2024
@glebmezh
Contributor

Hi @saurasingh,

Thank you for trying out data-diff and for taking the time to open this issue. We made a hard decision to sunset the data-diff package and won't provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. We have completely rewritten the diffing engine in the cloud over the past few months and have solved the fundamental issues with the original algorithm used in the data-diff package. Feel free to take it for a trial or contact us at [email protected] if you have any questions.

-Gleb
