SNOW-948287: Best practice for doing cross database joins using CORE #456

Open
jasheldo opened this issue Oct 23, 2023 · 3 comments
Labels: bug (Something isn't working), needs triage

@jasheldo commented Oct 23, 2023

I'm copying this almost verbatim from the ORM question because it's virtually identical, but the ORM approach doesn't work for CORE.

Question

Hi there, in Snowflake we often have multiple databases, each with their own set of schemas and tables inside those schemas. Ideally data shouldn't have to move between these databases or schemas, but in practice it often does.

One use case is when valuable data is spread across two different databases. In my example, I'm going to reference databases A and B, which have schemas ant and baboons, containing tables apples and bananas, respectively.

Through a direct query using Snowflake SQL, you could implement a join like

select
    *
from A.ant.apples a
join B.baboons.bananas b
    on a.id = b.id

How can you implement a cross-database join like the example above using a single session through the CORE interface? My understanding is that a session can only point at one database. I'm not creating an app, and the tables already exist, so ideally this would all happen through SQLAlchemy MetaData object reflection. Below are some examples of what I've tried.

The ideal approach would look like the following:

from sqlalchemy import create_engine, MetaData, Table
from snowflake.sqlalchemy import URL


# Note: no database, schema, or warehouse is selected at this point. They're not needed yet.

engine = create_engine(
    URL(
        account="my_awesome_account",
        user="my_user",
        password="my_password",
        role="my_awesome_role",
    )
)
metadata = MetaData()

metadata.reflect(engine, schema='a.ant', only=['apples'])        # doesn't work
metadata.reflect(engine, schema='a.ant', only=['a.ant.apples'])  # doesn't work
metadata.reflect(engine, only=['a.ant.apples'])                  # doesn't work

Table("apples", metadata, schema='a.ant', autoload_with=engine)        # doesn't work
Table("a.ant.apples", metadata, schema='a.ant', autoload_with=engine)  # doesn't work

I've tried all kinds of case combinations for the database and schema names too. None of them work.
What I find silly about the above is that I often get stack traces saying I have to provide a fully qualified name since I'm not declaring a database. But I AM providing a fully qualified name. And not explicitly choosing a database is the point: I need to pull data from tables in multiple databases, so the reflected tables NEED TO BE FULLY QUALIFIED.

Can this be done through the CORE interface without dropping to SQL? If so, what's the best practice in doing this?
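
For what it's worth, the sketch below is the reflection-free approach I would hope works: declare the tables by hand and pass the dotted, fully qualified name as the schema. I haven't verified that the dialect renders a dotted schema correctly (it may quote "a.ant" as a single identifier), so treat this as an untested sketch; the column list is made up for illustration.

from sqlalchemy import MetaData, Table, Column, Integer, select

metadata = MetaData()

# Hand-declared tables -- no reflection. The id columns are hypothetical.
apples = Table(
    "apples", metadata,
    Column("id", Integer, primary_key=True),
    schema="a.ant",  # dotted database.schema; rendering is the open question
)
bananas = Table(
    "bananas", metadata,
    Column("id", Integer, primary_key=True),
    schema="b.baboons",
)

stmt = select(apples).select_from(
    apples.join(bananas, apples.c.id == bananas.c.id)
)
print(stmt)  # check whether the compiled SQL comes out fully qualified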

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.9.6

  2. What operating system and processor architecture are you using?

    macOS-14.0-arm64-arm-64bit

  3. What are the component versions in the environment (pip freeze)?

    appnope==0.1.3 asn1crypto==1.5.1 asttokens==2.4.0 backcall==0.2.0 certifi==2023.7.22 cffi==1.16.0 charset-normalizer==3.3.0 comm==0.1.4 cryptography==41.0.4 debugpy==1.8.0 decorator==5.1.1 exceptiongroup==1.1.3 executing==2.0.0 filelock==3.12.4 idna==3.4 importlib-metadata==6.8.0 ipykernel==6.25.2 ipython==8.16.1 jaraco.classes==3.3.0 jedi==0.19.1 jupyter_client==8.4.0 jupyter_core==5.4.0 keyring==24.2.0 matplotlib-inline==0.1.6 more-itertools==10.1.0 nest-asyncio==1.5.8 numpy==1.26.1 oscrypto==1.3.0 packaging==23.2 pandas==2.0.3 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 platformdirs==3.11.0 prompt-toolkit==3.0.39 psutil==5.9.6 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==10.0.1 pycparser==2.21 pycryptodomex==3.19.0 Pygments==2.16.1 PyJWT==2.8.0 pyOpenSSL==23.2.0 python-dateutil==2.8.2 pytz==2023.3.post1 pyzmq==25.1.1 requests==2.31.0 six==1.16.0 snowflake-connector-python==3.3.1 snowflake-sqlalchemy==1.5.0 sortedcontainers==2.4.0 SQLAlchemy==1.4.49 sqlparse==0.4.4 stack-data==0.6.3 tomlkit==0.12.1 tornado==6.3.3 traitlets==5.11.2 typing_extensions==4.8.0 tzdata==2023.3 urllib3==1.26.18 wcwidth==0.2.8 zipp==3.17.0

  4. What did you do?

    The above example should be sufficient.

  5. What did you expect to see?

    Ideally, the fully qualified table objects, including the database, would have been instantiated as SQLAlchemy Table objects. That way, I could then create a SQLAlchemy select statement such as select(a).select_from(a.join(b, a.c.col1 == b.c.col1)) and the corresponding SQL generated would be SELECT A.ant.apples.* FROM A.ant.apples JOIN B.baboons.bananas ON A.ant.apples.col1 = B.baboons.bananas.col1;.

    Instead, the reflection step fails, forcing me to choose a database, which then prevents me from joining tables in different databases.

  6. Can you set logging to DEBUG and collect the logs?

    import logging

    for logger_name in ['snowflake.sqlalchemy', 'snowflake.connector']:
        logger = logging.getLogger(logger_name)
        logger.setLevel(logging.DEBUG)
        ch = logging.StreamHandler()
        ch.setLevel(logging.DEBUG)
        ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
        logger.addHandler(ch)

I'm happy to provide this info but in private. It's on a company computer with company backend stuff that's not for the public eye.

jasheldo added the bug and needs triage labels on Oct 23, 2023
github-actions bot changed the title from "Best practice for doing cross database joins using CORE" to "SNOW-948287: Best practice for doing cross database joins using CORE" on Oct 23, 2023
@jasheldo (Author) commented Nov 3, 2023

Has anyone had a chance to review?

@sfc-gh-pgifford

Jim - Do you have access to this:
https://snowflakecomputing.atlassian.net/wiki/spaces/SKE/pages/1065292364/JOIN+Across+Databases+SYNONYM

This article provides alternative ways to get around this using session variables and, optionally, IDENTIFIER().

Details
The need is to join data from two databases (which are environment-specific), such as:

SELECT
cd.cust_id,
ff.measure_nbr,
...
FROM ent_dev_edw.bas.cust_d cd
JOIN lcl_dev_wrk.aml.fact_f ff ON ff.cust_sk = cd.cust_sk
WHERE
...
Use Variables

set dbEntNm = 'ent_dev_edw.';
set dbLclNm = 'lcl_dev_wrk.';
set cdNm = $dbEntNm || 'bas.cust_d';
set ffNm = $dbLclNm || 'aml.fact_f';

...
FROM $cdNm cd -- or use this syntax FROM TABLE($cdNm) or this syntax FROM IDENTIFIER($cdNm)
JOIN $ffNm ff ON ff.cust_sk = cd.cust_sk
...
Honestly, it's a pain when you want flexibility for a lot of tables.
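
If you want to drive that from SQLAlchemy rather than a worksheet, a rough adaptation (untested; it assumes the engine from the original post and reuses the example names above) would issue the SET statements and the query on the same connection, since session variables are per-session:

from sqlalchemy import text

with engine.connect() as conn:
    # Session variables only exist on this connection
    conn.execute(text("set dbEntNm = 'ent_dev_edw.'"))
    conn.execute(text("set dbLclNm = 'lcl_dev_wrk.'"))
    conn.execute(text("set cdNm = $dbEntNm || 'bas.cust_d'"))
    conn.execute(text("set ffNm = $dbLclNm || 'aml.fact_f'"))
    rows = conn.execute(text(
        "select cd.cust_id, ff.measure_nbr "
        "from identifier($cdNm) cd "
        "join identifier($ffNm) ff on ff.cust_sk = cd.cust_sk"
    )).fetchall()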

We do have a SYNONYM Jira ticket out there, and I will add your company to it.

-Pete

@jasheldo (Author)

I hacked my way around it, but because of how our org is set up, I'm unable to run the test suite to verify it doesn't break anything. If someone could help out, that'd be wonderful. All edits were made to snowdialect.py, attached here.
snowdialect.py.zip

Bottom line, there are two sections in this file around reflection. I added a bit of code that checks whether the schema has a . in it and, if so, splits it into database and schema. The information_schema query is also fully qualified by turning it into an f-string that includes the database. Just run git diff against the attached file to see all the changes.
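
For anyone skimming, the idea (not the literal patch; see the attached file for that) is roughly the following, with a hypothetical helper name:

# Hypothetical sketch of the schema-splitting logic described above.
def split_schema(schema):
    """Split a dotted 'database.schema' value into its two parts."""
    if schema and "." in schema:
        database, schema = schema.split(".", 1)
        return database, schema
    return None, schema

# The reflection queries can then target the right database's
# information_schema, e.g. via an f-string:
# f"select ... from {database}.information_schema.columns ..."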
