bug: Ambiguous column error when loading different case columns via COPY INTO #16473

rad-pat · 2024-09-19T10:07:13Z

Search before asking

I had searched in the issues and found no similar issues.

Version

v1.2.635-nightly

What's Wrong?

When trying to load data via COPY INTO from a parquet file with case sensitive columns, an ambiguous column error is reported. Looks to be from https://github.com/datafuselabs/databend/blob/2c41faf63b5136ab22a2ab3ac972c9752dc4a7bf/src/query/sql/src/planner/binder/bind_context.rs#L301

The parquet file contains columns such as c1 and C1 which exist within the target table.

How to Reproduce?

Sorry not easy for me to write a script to replicate, but easy to describe

Create parquet file with columns that are case sensitive e.g. c1 & C1
Create a table to receive these columns create table t_sensitive ("c1" int, "C1" int);
Stage the parquet file
Load parquet file via COPY INTO

COPY INTO t_sensitive
FROM 'gcs://bucket/folder/'
CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>')
FILE_FORMAT = (TYPE = PARQUET)
FORCE = TRUE;

Are you willing to submit PR?

Yes I am willing to submit a PR!

The text was updated successfully, but these errors were encountered:

dosubot · 2024-09-19T10:09:08Z

The ambiguous column error you're encountering when using the COPY INTO command with case-sensitive columns (c1 and C1) is likely due to the logic at line 301 in bind_context.rs. This logic checks if there are multiple tables in the metadata that are not sources of an index, and if so, it raises a SemanticError indicating that the table of the internal column is ambiguous:

if tables.len() > 1 {
    return Err(ErrorCode::SemanticError(format!(
        "The table of the internal column `{}` is ambiguous",
        column_binding.internal_column.column_name()
    )));
}

This error can occur if the column names are not being correctly distinguished based on case sensitivity during the binding process ^[1].

To address this issue, you may need to ensure that the normalize_identifier function correctly handles case sensitivity for both quoted and unquoted identifiers. The normalize_identifier function is used to normalize identifiers based on the case sensitivity settings defined in the NameResolutionContext:

pub fn normalize_identifier(ident: &Identifier, context: &NameResolutionContext) -> Identifier {
    if (ident.is_quoted() && context.quoted_ident_case_sensitive)
        || (!ident.is_quoted() && context.unquoted_ident_case_sensitive)
    {
        ident.clone()
    } else {
        Identifier::from_name(ident.span, ident.name.to_lowercase())
    }
}

You should verify that the NameResolutionContext is configured correctly for your use case, ensuring that quoted identifiers are treated as case-sensitive and unquoted identifiers are handled according to your requirements ^[2].

If the issue persists, you might need to review the logic in the COPY INTO command's processing of Parquet files to ensure that it correctly handles case-sensitive column names. The ParquetCopySource struct and its methods (event, process, and async_process) are responsible for processing Parquet files during the COPY INTO operation ^[3]^[4]^[5].

By ensuring that the case sensitivity settings and the identifier normalization logic are correctly implemented, you should be able to resolve the ambiguous column error.

To continue talking to Dosu, mention @dosu.

BohuTANG · 2024-11-15T09:47:53Z

@youngsofun might be able to assist with this.

rad-pat · 2024-11-20T11:33:13Z

Any ideas on this @youngsofun ? Is there any setting I could try to adjust?

youngsofun · 2024-11-21T03:03:52Z

when copy from parquet fields are matched by name.
currently， when reading parquet file, the fields of file schema is modified that all field names are turned to lowercase.

the Available solution is to use transform

  create stage stage1  url='gcs://bucket/folder/'  CONNECTION = (ENDPOINT_URL = 'https://storage.googleapis.com', CREDENTIAL = '<snip>');
  COPY INTO t_sensitive from (select * FROM @stage1) force=true;

youngsofun · 2024-11-21T03:11:45Z

and I am working on this #16897 for better support of this issue

rad-pat added the C-bug Category: something isn't working label Sep 19, 2024

dosubot bot added the A-planner Area: planner/optimizer label Sep 19, 2024

github-project-automation bot moved this to 📒Backlog in Databend Query Planner Sep 19, 2024

github-project-automation bot added this to Databend Query Planner Sep 19, 2024

databendlabs deleted a comment Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

rad-pat commented Sep 19, 2024

dosubot bot commented Sep 19, 2024

BohuTANG commented Nov 15, 2024

rad-pat commented Nov 20, 2024

youngsofun commented Nov 21, 2024

youngsofun commented Nov 21, 2024

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

Comments

rad-pat commented Sep 19, 2024

Search before asking

Version

What's Wrong?

How to Reproduce?

Are you willing to submit PR?

dosubot bot commented Sep 19, 2024

BohuTANG commented Nov 15, 2024

rad-pat commented Nov 20, 2024

youngsofun commented Nov 21, 2024

youngsofun commented Nov 21, 2024