Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

Open
1 of 2 tasks
rad-pat opened this issue Sep 19, 2024 · 5 comments
Open
1 of 2 tasks
Labels
A-planner Area: planner/optimizer C-bug Category: something isn't working

Comments

@rad-pat
Copy link

rad-pat commented Sep 19, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.635-nightly

What's Wrong?

When trying to load data via COPY INTO from a parquet file with case sensitive columns, an ambiguous column error is reported. Looks to be from https://github.com/datafuselabs/databend/blob/2c41faf63b5136ab22a2ab3ac972c9752dc4a7bf/src/query/sql/src/planner/binder/bind_context.rs#L301

The parquet file contains columns such as c1 and C1 which exist within the target table.

How to Reproduce?

Sorry not easy for me to write a script to replicate, but easy to describe

  1. Create parquet file with columns that are case sensitive e.g. c1 & C1
  2. Create a table to receive these columns create table t_sensitive ("c1" int, "C1" int);
  3. Stage the parquet file
  4. Load parquet file via COPY INTO
COPY INTO t_sensitive
FROM 'gcs://bucket/folder/'
CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>')
FILE_FORMAT = (TYPE = PARQUET)
FORCE = TRUE;

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@rad-pat rad-pat added the C-bug Category: something isn't working label Sep 19, 2024
Copy link

dosubot bot commented Sep 19, 2024

The ambiguous column error you're encountering when using the COPY INTO command with case-sensitive columns (c1 and C1) is likely due to the logic at line 301 in bind_context.rs. This logic checks if there are multiple tables in the metadata that are not sources of an index, and if so, it raises a SemanticError indicating that the table of the internal column is ambiguous:

if tables.len() > 1 {
    return Err(ErrorCode::SemanticError(format!(
        "The table of the internal column `{}` is ambiguous",
        column_binding.internal_column.column_name()
    )));
}

This error can occur if the column names are not being correctly distinguished based on case sensitivity during the binding process [1].

To address this issue, you may need to ensure that the normalize_identifier function correctly handles case sensitivity for both quoted and unquoted identifiers. The normalize_identifier function is used to normalize identifiers based on the case sensitivity settings defined in the NameResolutionContext:

pub fn normalize_identifier(ident: &Identifier, context: &NameResolutionContext) -> Identifier {
    if (ident.is_quoted() && context.quoted_ident_case_sensitive)
        || (!ident.is_quoted() && context.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier::from_name(ident.span, ident.name.to_lowercase())
    }
}

You should verify that the NameResolutionContext is configured correctly for your use case, ensuring that quoted identifiers are treated as case-sensitive and unquoted identifiers are handled according to your requirements [2].

If the issue persists, you might need to review the logic in the COPY INTO command's processing of Parquet files to ensure that it correctly handles case-sensitive column names. The ParquetCopySource struct and its methods (event, process, and async_process) are responsible for processing Parquet files during the COPY INTO operation [3][4][5].

By ensuring that the case sensitivity settings and the identifier normalization logic are correctly implemented, you should be able to resolve the ambiguous column error.

To continue talking to Dosu, mention @dosu.

@dosubot dosubot bot added the A-planner Area: planner/optimizer label Sep 19, 2024
@databendlabs databendlabs deleted a comment Sep 19, 2024
@BohuTANG
Copy link
Member

@youngsofun might be able to assist with this.

@rad-pat
Copy link
Author

rad-pat commented Nov 20, 2024

Any ideas on this @youngsofun ? Is there any setting I could try to adjust?

@youngsofun
Copy link
Member

when copy from parquet fields are matched by name.
currently, when reading parquet file, the fields of file schema is modified that all field names are turned to lowercase.

the Available solution is to use transform

  create stage stage1  url='gcs://bucket/folder/'  CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>');
  COPY INTO t_sensitive from (select * FROM @stage1) force=true;

@youngsofun
Copy link
Member

and I am working on this #16897 for better support of this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-planner Area: planner/optimizer C-bug Category: something isn't working
Projects
Status: 📒Backlog
Development

No branches or pull requests

3 participants