Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Duckdb sqlalchemy data source #10738

Open
zschira opened this issue Dec 5, 2024 · 3 comments
Open

Enable Duckdb sqlalchemy data source #10738

zschira opened this issue Dec 5, 2024 · 3 comments
Labels
feature-request feature request

Comments

@zschira
Copy link

zschira commented Dec 5, 2024

Is your feature request related to a problem? Please describe.
duckdb-engine provides a sqlalchemy driver for duckdb, which I wanted to test with great expectations. I'm able to get a data source setup easily, but when I test an expectation I always run into the same error:

{
  "success": false,
  "expectation_config": {
    "type": "expect_column_max_to_be_between",
    "kwargs": {
      "column": "passenger_count",
      "min_value": 1.0,
      "max_value": 6.0,
      "batch_id": "duckdb-test"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "('table.row_count', '02f890ec1f22e1cb927a8ed609303838', ())": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"/path/to/great_expectations/execution_engine/execution_engine.py\", line 545, in _process_direct_and_bundled_metric_computation_configurations\n    self.resolve_metric_bundle(metric_fn_bundle=metric_fn_bundle_configurations)\n  File \"/path/to/great_expectations/execution_engine/sqlalchemy_execution_engine.py\", line 1042, in resolve_metric_bundle\n    f\"\"\"SqlAlchemyExecutionEngine computed {len(res[0])} metrics on domain_id \\\n                                                ~~~^^^\nIndexError: list index out of range\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/path/to/great_expectations/validator/validation_graph.py\", line 276, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/path/to/great_expectations/execution_engine/execution_engine.py\", line 279, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/path/to/great_expectations/execution_engine/execution_engine.py\", line 549, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: list index out of range\n",
      "exception_message": "list index out of range",
      "raised_exception": true
    }
  }
}

I did a little bit of debugging and I think what's happening is that the db connection is being closed before calling fetchall on a query, so it ends up returning an empty list. I think if duckdb was included in _PERSISTED_CONNECTION_DIALECTS in great_expectations/execution_engine/sqlalchemy_execution_engine.py that this problem would go away. There may be other issues with the duckdb dialect, but it would at least be a start to being able to test it out.

Describe the solution you'd like
Add a new duckdb dialect to great_expectations/execution_engine/sqlalchemy_dialect.py and include this in _PERSISTED_CONNECTION_DIALECTS. This seems pretty straightforward, and I'd be happy to put a PR together to do this, but I'm not sure if there are any other considerations needed to add a new dialect like this.

Describe alternatives you've considered
An alternative option could be to add just add the OTHER dialect to _PERSISTED_CONNECTION_DIALECTS. This seems reasonable to me and might be best practice to default to keeping the engine alive while still interacting with a cursor.

Additional context
Steps to recreate the issue:

  1. Create a duckdb file with a test table
import duckdb
import pandas as pd

test_df = pd.DataFrame({"test_col": [0, 1, 2]})

with duckdb.connect("test.db") as con:
    con.sql(
        "CREATE TABLE test_table AS "
        f"SELECT * FROM test_df;"
    )
    con.table("test_table").show()
  1. Add db as a data source
import great_expectations as gx

context = gx.get_context()
context.data_sources.add_sql(
    name="duckdb", connection_string="duckdb:///test.db"
)
  1. Create a table asset/batch
data_source = context.data_sources.get("duckdb")

table_data_asset = data_source.add_table_asset(
    table_name="test_table", name="test_asset"
)

full_table_batch_definition = table_data_asset.add_batch_definition_whole_table(
    name="test_table"
)
batch = full_table_batch_definition.get_batch()
batch.head()
  1. Test an expectation
expectation = gx.expectations.ExpectColumnMaxToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)
batch.validate(expectation)
@adeola-ak adeola-ak moved this from To Do to In progress in GX Core Issues Board Dec 9, 2024
@adeola-ak adeola-ak added the feature-request feature request label Dec 9, 2024
@adeola-ak
Copy link
Contributor

Hi @zschira, thank you for reaching out! I've forwarded this to the relevant team to confirm if we're currently accepting contributions for this type of work. I appreciate the effort you've already put into investigating this, and I'll get back to you with an update as soon as possible

@zschira
Copy link
Author

zschira commented Dec 13, 2024

Thanks @adeola-ak! I think duckdb support could open up a lot of possibilities for working with "medium" size data.

@adeola-ak
Copy link
Contributor

adeola-ak commented Dec 17, 2024

Hi @zschira,

Unfortunately, we’ve decided that we’re not yet ready to enable DuckDB as a data source. This decision is primarily due to the ongoing maintenance it would require on our end (e.g., documentation, PR reviews, testing).

We truly appreciate the effort you’ve put into exploring this, and I wish we could support your use case at this time. While it’s not possible right now, we do plan to accept contributions for data sources again in the future so I encourage you to check back in later if it's something you'd still want later next year. Thank you for your understanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request feature request
Projects
Status: In progress
Development

No branches or pull requests

2 participants