Feature: load parquet/ndjson support case sensitive #16897

youngsofun · 2024-11-21T03:05:11Z

Summary

Currently, when reading a Parquet file, the file schema is modified so that all field names are converted to lowercase.

Solution 1

Add a format option case_sensitive for parquet/ndjson.

cons:

1. cannot copy a file with fields ('a', 'b') into a table with columns ('a', 'B')
2. select and infer_schema do not show the original field names
3. need to create a file_format for this purpose

Solution 2

select, infer_schema: add a table function option case_sensitive, true by default
copy: add a copy option case_sensitive, false by default (for compatibility)
1. table fields check: allow ('a', 'B'), do not allow ('a', 'A')
2. parquet
  1. arrow_to_table_schema: to_lowercase and check for duplicate names
  2. impl 1 (translate to select): required fields to_lowercase
  3. impl 2 (match): table fields to_lowercase
3. ndjson
  1. to_lowercase both sides when matching names
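
The copy-side checks above can be sketched roughly as follows. This is only an illustration in Python, not Databend code: the helper names (find_case_clashes, check_table_fields, lowercase_file_schema, match_by_lowercase) are hypothetical, and raising an error for a table column with no matching file field is an assumption.

```python
from collections import Counter


def find_case_clashes(names):
    """Return the lowercased names that occur more than once after folding."""
    counts = Counter(name.lower() for name in names)
    return sorted(key for key, n in counts.items() if n > 1)


def check_table_fields(table_fields):
    """Table fields check: ('a', 'B') is allowed, ('a', 'A') is not."""
    clashes = find_case_clashes(table_fields)
    if clashes:
        raise ValueError(f"table columns differ only by case: {clashes}")


def lowercase_file_schema(file_fields):
    """Lowercase the file field names and reject duplicates after folding."""
    clashes = find_case_clashes(file_fields)
    if clashes:
        raise ValueError(f"file fields differ only by case: {clashes}")
    return [name.lower() for name in file_fields]


def match_by_lowercase(table_fields, file_fields):
    """Map each table column to the index of the file field it reads from."""
    check_table_fields(table_fields)
    position = {name: i for i, name in enumerate(lowercase_file_schema(file_fields))}
    mapping = {}
    for column in table_fields:
        key = column.lower()  # table side is folded as well
        if key not in position:
            # Assumption: a missing field is an error in this sketch.
            raise KeyError(f"no file field matches table column {column!r}")
        mapping[column] = position[key]
    return mapping
```

With this sketch, a table ('a', 'B') can be loaded from a file with fields ('A', 'b'), while a table ('a', 'A') is rejected before any matching happens.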

cons:

the default behaviors of select and copy are not consistent

pros:

select and infer_schema show the original names by default (maybe we can sacrifice this and make case_sensitive=false the default everywhere, for consistency)


youngsofun · 2024-11-21T03:14:13Z

cc @sundy-li @everpcpc @wubx @Xuanwo

youngsofun · 2024-11-21T06:25:28Z

Chose Solution 2 after discussing with @sundy-li.

To make it clearer, I propose:

Add a copy option COLUMN_MATCH_MODE

COLUMN_MATCH_MODE:
  CASE_SENSITIVE: Match columns by name, case-sensitive.
  CASE_INSENSITIVE: Match columns by name, case-insensitive.
  POSITION: Match columns by position instead of name.
  FORMAT_DEFAULT: Use the default matching behavior based on file format.

FILE_FORMAT:
  CSV: Default POSITION.
  Parquet/ORC/NDJson: Default CASE_INSENSITIVE.
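
As a rough sketch of how the proposed modes and per-format defaults could fit together (Python, for illustration only; not Databend code). The mode names and defaults come from the proposal above; ColumnMatchMode, FORMAT_DEFAULTS, resolve_mode and match_columns are hypothetical names, and requiring equal column counts for POSITION and failing on a missing field are assumptions.

```python
from enum import Enum


class ColumnMatchMode(Enum):
    CASE_SENSITIVE = "case_sensitive"
    CASE_INSENSITIVE = "case_insensitive"
    POSITION = "position"
    FORMAT_DEFAULT = "format_default"


# Defaults as listed in the proposal.
FORMAT_DEFAULTS = {
    "csv": ColumnMatchMode.POSITION,
    "parquet": ColumnMatchMode.CASE_INSENSITIVE,
    "orc": ColumnMatchMode.CASE_INSENSITIVE,
    "ndjson": ColumnMatchMode.CASE_INSENSITIVE,
}


def resolve_mode(mode, file_format):
    """Replace FORMAT_DEFAULT with the default of the given file format."""
    if mode is ColumnMatchMode.FORMAT_DEFAULT:
        return FORMAT_DEFAULTS[file_format.lower()]
    return mode


def match_columns(table_fields, file_fields, mode, file_format):
    """Return, for each table column in order, the index of its file field."""
    mode = resolve_mode(mode, file_format)
    if mode is ColumnMatchMode.POSITION:
        # Assumption: positional matching needs the same number of columns.
        if len(table_fields) != len(file_fields):
            raise ValueError("column count mismatch")
        return list(range(len(table_fields)))

    if mode is ColumnMatchMode.CASE_INSENSITIVE:
        fold = str.lower
    else:
        fold = lambda name: name  # CASE_SENSITIVE: compare names as written

    index = {}
    for i, name in enumerate(file_fields):
        key = fold(name)
        if key in index:
            raise ValueError(f"ambiguous file field {name!r}")
        index[key] = i
    # Assumption: a table column without a matching file field raises KeyError.
    return [index[fold(column)] for column in table_fields]
```

For example, match_columns(['a', 'B'], ['b', 'A'], ColumnMatchMode.FORMAT_DEFAULT, 'parquet') resolves to case-insensitive matching and pairs 'a' with 'A' and 'B' with 'b', while the same call with CASE_SENSITIVE fails because neither name matches exactly.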

Note: not all modes are supported for all formats yet; we will implement them one by one.

select, infer_schema: add a param column_name_to_lowercase=TRUE|FALSE, default false

Error messages will remind the user to use this option when a related error occurs.
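
A minimal sketch of the intended select / infer_schema behavior, again in Python and only for illustration. The parameter name column_name_to_lowercase is from the proposal; the functions infer_field_names and resolve_column, and the wording of the hint, are hypothetical.

```python
def infer_field_names(file_fields, column_name_to_lowercase=False):
    """Return the field names a query would see for a file."""
    if column_name_to_lowercase:
        return [name.lower() for name in file_fields]
    return list(file_fields)  # default: keep the original names


def resolve_column(requested, field_names):
    """Find a column by exact name; hint at the option on a case-only miss."""
    if requested in field_names:
        return field_names.index(requested)
    near = [name for name in field_names if name.lower() == requested.lower()]
    message = f"column {requested!r} not found"
    if near:
        message += (
            f"; found {near} differing only by case, "
            "consider column_name_to_lowercase=TRUE"
        )
    raise LookupError(message)
```

Here a query for 'id' against a file whose field is 'Id' fails with a message that points at the option, and succeeds once the names are inferred with column_name_to_lowercase=True.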

youngsofun added the C-feature Category: feature label Nov 21, 2024

youngsofun mentioned this issue Nov 21, 2024

bug: Ambiguous column error when loading different case columns via COPY INTO #16473

Open

2 tasks
