
Expand translation of Snowflake expr #351

Merged: 2 commits into main on May 17, 2024

Conversation

@vil1 (Contributor) commented May 15, 2024

Some bits of the grammar are still not covered, some seem to be plain wrong (such as subquery being considered a valid expr), and others are "duplicated" from predicate.

@vil1 vil1 requested a review from a team as a code owner May 15, 2024 14:05
@vil1 vil1 requested a review from himanishk May 15, 2024 14:05

codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.26%. Comparing base (80828ca) to head (a6f7bb4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
+ Coverage   95.17%   95.26%   +0.08%     
==========================================
  Files          50       50              
  Lines        2863     2917      +54     
  Branches      407      412       +5     
==========================================
+ Hits         2725     2779      +54     
  Misses        102      102              
  Partials       36       36              


@sundarshankar89 (Contributor) left a comment

This is a very common use case; wonderful to see this coming along.

@jimidle (Contributor) commented May 16, 2024

The Snowflake grammar is almost, but not quite, completely broken in its definition of expressions and predicates. Those rules need to be reworked to be left recursive with precedence, like the TSQL grammar. I will add a new grammar option today that is a full TSQL grammar. However, I think we can add to that to make it Snowflake compatible and potentially a universal parser.

So let's approve this PR and connect on how far to take the Snowflake grammar.

@jimidle (Contributor) left a comment

I think we accept this as is and then revisit the whole expression and predicate tree in the Snowflake grammar. It can be done without throwing away the current visitor code.

Comment on lines 3682 to 3687
: object_name DOT NEXTVAL
| expr LSB expr RSB //array access
| expr COLON expr //json access
| expr COLON json_path //json access
| expr DOT (VALUE | expr)
| expr COLLATE string
| case_expression

Every time we change this rule, it will likely break. We need to rework it to be truly left recursive with corrected precedence.
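The point about left recursion with precedence can be illustrated outside ANTLR. In a left-recursive rule, the order of alternatives encodes operator precedence; the equivalent idea in a hand-written parser is precedence climbing. This is a toy sketch for illustration only, not project code:

```python
# Toy precedence-climbing parser: the PRECEDENCE table plays the role
# that alternative ordering plays in a left-recursive ANTLR rule.
import re

PRECEDENCE = {"OR": 1, "AND": 2, "+": 3, "-": 3, "*": 4, "/": 4}

def tokenize(src):
    return re.findall(r"\w+|[+\-*/()]", src)

def parse(tokens, min_prec=1):
    """Parse tokens into a nested-tuple AST, honoring precedence."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # consume ")"
    else:
        left = tok
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Higher-precedence operators bind tighter on the right.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left

print(parse(tokenize("a + b * c")))  # ('+', 'a', ('*', 'b', 'c'))
```

With a rule structured this way, adding an alternative means adding one table entry (or one alternative at the right position), instead of rippling breakage through mutually recursive `expr`/`predicate` rules.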

Comment on lines +3710 to +3726
predicate_partial
: IS null_not_null
| NOT? IN L_PAREN (subquery | expr_list) R_PAREN
| NOT? ( LIKE | ILIKE) expr (ESCAPE expr)?
| NOT? RLIKE expr
| NOT? (LIKE | ILIKE) ANY L_PAREN expr (COMMA expr)* R_PAREN (ESCAPE expr)?
| NOT? BETWEEN expr AND expr
;

json_path
: json_path_elem (DOT json_path_elem)*
;

json_path_elem
: ID | DOUBLE_QUOTE_ID
;


We should not need partial rules to avoid self recursion. The rule should look something like this:

searchCondition
    : LPAREN searchCondition RPAREN         #scPrec
    | NOT searchCondition                   #scNot
    | searchCondition AND searchCondition   #scAnd
    | searchCondition OR searchCondition    #scOr
    | predicate                             #scPred
    ;

predicate
    : EXISTS LPAREN subquery RPAREN
    | freetextPredicate
    | expression comparisonOperator expression
    | expression ME expression // SQL-82 syntax for left outer joins; PE. See https://stackoverflow.com/questions/40665/in-sybase-sql
    | expression comparisonOperator (ALL | SOME | ANY) LPAREN subquery RPAREN
    | expression NOT* BETWEEN expression AND expression
    | expression NOT* IN LPAREN (subquery | expressionList) RPAREN
    | expression NOT* LIKE expression (ESCAPE expression)?
    | expression IS nullNotnull
    ;

Which is what TSQL looks like right now (though it will likely change)

@sundarshankar89 sundarshankar89 added this pull request to the merge queue May 17, 2024
Merged via the queue into main with commit 2ec0f04 May 17, 2024
8 checks passed
@sundarshankar89 sundarshankar89 deleted the feature/snow-expr branch May 17, 2024 15:32
nfx added a commit that referenced this pull request May 29, 2024
* Capture Reconcile metadata in delta tables for dashboards ([#369](#369)). In this release, changes have been made to improve version control management, reduce repository size, and enhance build times. A new directory, "spark-warehouse/", has been added to the Git ignore file to prevent unnecessary files from being tracked and included in the project. The `WriteToTableException` class has been added to the `exception.py` file to raise an error when a runtime exception occurs while writing data to a table. A new `ReconCapture` class has been implemented in the `reconcile` package to capture and persist reconciliation metadata in delta tables. The `recon` function has been updated to initialize this new class, passing in the required parameters. Additionally, a new file, `recon_capture.py`, has been added to the reconcile package, which implements the `ReconCapture` class responsible for capturing metadata related to data reconciliation. The `recon_config.py` file has been modified to introduce a new class, `ReconcileProcessDuration`, and restructure the classes `ReconcileOutput`, `MismatchOutput`, and `ThresholdOutput`. The commit also captures reconcile metadata in delta tables for dashboards in the context of unit tests in the `test_execute.py` file and includes a new file, `test_recon_capture.py`, to test the reconcile capture functionality of the `ReconCapture` class.
* Expand translation of Snowflake `expr` ([#351](#351)). In this release, the translation of the `expr` category in the Snowflake language has been significantly expanded, addressing uncovered grammar areas, incorrect interpretations, and duplicates. The `subquery` is now excluded as a valid `expr`, and new case classes such as `NextValue`, `ArrayAccess`, `JsonAccess`, `Collate`, and `Iff` have been added to the `Expression` class. These changes improve the comprehensiveness and accuracy of the Snowflake parser, allowing for a more flexible and accurate translation of various operations. Additionally, the `SnowflakeExpressionBuilder` class has been updated to handle previously unsupported cases, enhancing the parser's ability to parse Snowflake SQL expressions.
* Fixed Oracle missing datatypes ([#333](#333)). In the latest release, the Oracle class of the Tokenizer in the open-source library has undergone a fix to address missing datatypes. Previously, the KEYWORDS mapping did not require Tokens for keys, which led to unsupported Oracle datatypes. This issue has been resolved by modifying the test_schema_compare.py file to ensure that all Oracle datatypes, including LONG, NCLOB, ROWID, UROWID, ANYTYPE, ANYDATA, ANYDATASET, XMLTYPE, SDO_GEOMETRY, SDO_TOPO_GEOMETRY, and SDO_GEORASTER, are now mapped to the TEXT TokenType. This improvement enhances the compatibility of the code with Oracle datatypes and increases the reliability of the schema comparison functionality, as demonstrated by the test function test_schema_compare, which now returns is_valid as True and a count of 0 for is_valid = `false` in the resulting dataframe.
* Fixed the recon_config functions to handle null values ([#399](#399)). In this release, the recon_config functions have been enhanced to manage null values and provide more flexible column mapping for reconciliation purposes. A `__post_init__` method has been added to certain classes to convert specified attributes to lowercase and handle null values. A new helper method, `_get_is_string`, has been introduced to determine if a column is of string type. Additionally, new functions such as `get_tgt_to_src_col_mapping_list`, `get_layer_tgt_to_src_col_mapping`, `get_src_to_tgt_col_mapping_list`, and `get_layer_src_to_tgt_col_mapping` have been added to retrieve column mappings, enhancing the overall functionality and robustness of the reconciliation process. These improvements will benefit software engineers by ensuring more accurate and reliable configuration handling, as well as providing more flexibility in mapping source and target columns during reconciliation.
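  The `__post_init__` normalization described in the recon_config change can be sketched as follows. This is a minimal illustration of the pattern, with hypothetical field names; it is not the actual remorph `recon_config` API:

  ```python
  # Sketch: a dataclass whose __post_init__ lowercases configured column
  # names while tolerating None values (field names are illustrative).
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class ColumnMapping:
      source_name: Optional[str] = None
      target_name: Optional[str] = None

      def __post_init__(self):
          # Normalize to lowercase for case-insensitive matching;
          # leave None untouched so missing mappings stay missing.
          if self.source_name is not None:
              self.source_name = self.source_name.lower()
          if self.target_name is not None:
              self.target_name = self.target_name.lower()

  m = ColumnMapping(source_name="CustomerID", target_name=None)
  print(m.source_name, m.target_name)  # customerid None
  ```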
* Improve Exception handling ([#392](#392)). The commit titled `Improve Exception Handling` enhances error handling in the project, addressing issues [#388](#388) and [#392](#392). Changes include refactoring the `create_adapter` method in the `DataSourceAdapter` class, updating method arguments in test functions, and adding new methods in the `test_execute.py` file for better test doubles. The `DataSourceAdapter` class is replaced with the `create_adapter` function, which takes the same arguments and returns an instance of the appropriate `DataSource` subclass based on the provided `engine` parameter. The diff also modifies the behavior of certain test methods to raise more specific and accurate exceptions. Overall, these changes improve exception handling, streamline the codebase, and provide clearer error messages for software engineers.
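  The `create_adapter` factory replacing the `DataSourceAdapter` class follows a common dispatch-by-engine pattern. A rough sketch of the idea, with illustrative class and engine names rather than the real remorph signatures:

  ```python
  # Sketch: dispatch to a DataSource subclass based on the engine name
  # (class names and supported engines here are assumptions).
  class DataSource: ...
  class SnowflakeDataSource(DataSource): ...
  class OracleDataSource(DataSource): ...

  _ADAPTERS = {"snowflake": SnowflakeDataSource, "oracle": OracleDataSource}

  def create_adapter(engine: str) -> DataSource:
      try:
          return _ADAPTERS[engine.lower()]()
      except KeyError:
          # A specific, actionable error instead of a generic failure.
          raise ValueError(f"Unsupported engine: {engine}") from None
  ```

  A module-level function with a lookup table is easier to extend and test than a class hierarchy whose only job is construction, which matches the "better test doubles" motivation above.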
* Introduced morph_sql and morph_column_expr functions for inline transpilation and validation ([#328](#328)). Two new classes, TranspilationResult and ValidationResult, have been added to the config module of the remorph package to store the results of transpilation and validation. The morph_sql and morph_column_exp functions have been introduced to support inline transpilation and validation of SQL code and column expressions. A new class, Validator, has been added to the validation module to handle validation, and the validate_format_result method within this class has been updated to return a ValidationResult object. The _query method has also been added to the class, which executes a given SQL query and returns a tuple containing a boolean indicating success, any exception message, and the result of the query. Unit tests for these new functions have been updated to ensure proper functionality.
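  The result-object pattern described for morph_sql can be sketched like this. Field names and the toy transpilation are illustrative assumptions, not the actual remorph config module:

  ```python
  # Sketch: dataclasses carrying transpilation/validation outcomes, and a
  # toy morph_sql that splits a script into statements (stand-in logic).
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class TranspilationResult:
      transpiled_sql: list
      parse_error_count: int = 0

  @dataclass
  class ValidationResult:
      validated_sql: str
      exception_msg: Optional[str] = None

  def morph_sql(sql: str) -> TranspilationResult:
      """Toy inline transpilation: pass each statement through unchanged."""
      statements = [s.strip() for s in sql.split(";") if s.strip()]
      return TranspilationResult(transpiled_sql=statements)

  result = morph_sql("SELECT 1; SELECT 2")
  print(result.transpiled_sql)  # ['SELECT 1', 'SELECT 2']
  ```

  Returning a structured result instead of a bare string lets callers inspect errors and metadata without parsing log output.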
* Output for the reconcile function ([#389](#389)). A new function `get_key_form_dialect` has been added to the `config.py` module, which takes a `Dialect` object and returns the corresponding key used in the `SQLGLOT_DIALECTS` dictionary. Additionally, the `MorphConfig` dataclass has been updated to include a new attribute `__file__`, which sets the filename to "config.yml". The `get_dialect` function remains unchanged. Two new exceptions, `WriteToTableException` and `InvalidInputException`, have been introduced, and the existing `DataSourceRuntimeException` has been modified in the same module to improve error handling. The `execute.py` file's reconcile function has undergone several changes, including adding imports for `InvalidInputException`, `ReconCapture`, and `generate_final_reconcile_output` from `recon_exception` and `recon_capture` modules, and modifying the `ReconcileOutput` type. The `hash_query.py` file's reconcile function has been updated to include a new `_get_with_clause` method, which returns a `Select` object for a given DataFrame, and the `build_query` method has been updated to include a new query construction step using the `with_clause` object. The `threshold_query.py` file's reconcile function's output has been updated to include query and logger statements, a new method for allowing user transformations on threshold aliases, and the dialect specified in the sql method. A new `generate_final_reconcile_output` function has been added to the `recon_capture.py` file, which generates a reconcile output given a recon_id and a SparkSession. New classes and dataclasses, including `SchemaReconcileOutput`, `ReconcileProcessDuration`, `StatusOutput`, `ReconcileTableOutput`, and `ReconcileOutput`, have been introduced in the `reconcile/recon_config.py` file. 
The `tests/unit/reconcile/test_execute.py` file has been updated to include new test cases for the `recon` function, including tests for different report types and scenarios, such as data, schema, and all report types, exceptions, and incorrect report types. A new test case, `test_initialise_data_source`, has been added to test the `initialise_data_source` function, and the `test_recon_for_wrong_report_type` test case has been updated to expect an `InvalidInputException` when an incorrect report type is passed to the `recon` function. The `test_reconcile_data_with_threshold_and_row_report_type` test case has been added to test the `reconcile_data` method of the `Reconciliation` class with a row report type and threshold options. Overall, these changes improve the functionality and robustness of the reconcile process by providing more fine-grained control over the generation of the final reconcile output and better handling of exceptions and errors.
* Threshold Source and Target query builder ([#348](#348)). In this release, we've introduced a new method, `build_threshold_query`, that constructs a customizable threshold query based on a table's partition, join, and threshold columns configuration. The method identifies necessary columns, applies specified transformations, and includes a WHERE clause based on the filter defined in the table configuration. The resulting query is then converted to a SQL string using the dialect of the source database. Additionally, we've updated the test file for the threshold query builder in the reconcile package, including refactoring of function names and updated assertions for query comparison. We've added two new test methods: `test_build_threshold_query_with_single_threshold` and `test_build_threshold_query_with_multiple_thresholds`. These changes enhance the library's functionality, providing a more robust and customizable threshold query builder, and improve test coverage for various configurations and scenarios.
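  The shape of the query builder described above can be sketched as follows. This is a simplified string-based illustration with an assumed signature; the real `build_threshold_query` works through the source dialect rather than raw strings:

  ```python
  # Sketch: compose a threshold query from join columns, threshold
  # columns, and an optional filter clause (signature is an assumption).
  def build_threshold_query(table, join_cols, threshold_cols, filter_clause=None):
      select_list = ", ".join(list(join_cols) + list(threshold_cols))
      query = f"SELECT {select_list} FROM {table}"
      if filter_clause:
          # The WHERE clause comes from the table configuration's filter.
          query += f" WHERE {filter_clause}"
      return query

  q = build_threshold_query("orders", ["order_id"], ["amount"], "status = 'open'")
  print(q)  # SELECT order_id, amount FROM orders WHERE status = 'open'
  ```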
* Unpack nested alias ([#336](#336)). This release introduces a significant update to the 'lca_utils.py' file, addressing the limitation of not handling nested aliases in window expressions and where clauses, which resolves issue [#334](#334). The `unalias_lca_in_select` method has been implemented to recursively parse nested selects and unalias lateral column aliases, thereby identifying and handling unsupported lateral column aliases. This method is utilized in the `check_for_unsupported_lca` method to handle unsupported lateral column aliases in the input SQL string. Furthermore, the 'test_lca_utils.py' file has undergone changes, impacting several test functions and introducing two new ones, `test_fix_nested_lca` and 'test_fix_nested_lca_with_no_scope', to ensure the code's reliability and accuracy by preventing unnecessary assumptions and hallucinations. These updates demonstrate our commitment to improving the library's functionality and test coverage.
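  The recursive unaliasing idea behind `unalias_lca_in_select` can be shown conceptually with a toy AST of nested dicts and lists. This is not the actual `lca_utils` code, which operates on sqlglot expression trees:

  ```python
  # Conceptual sketch: walk a nested select (toy AST), collecting lateral
  # column aliases at each level and substituting their definitions into
  # deeper references such as window expressions and where clauses.
  def unalias_lca_in_select(node, aliases=None):
      aliases = dict(aliases or {})
      if isinstance(node, dict):
          # Record aliases defined at this level before rewriting children.
          for col, expr in node.get("aliases", {}).items():
              aliases[col] = expr
          return {k: unalias_lca_in_select(v, aliases) for k, v in node.items()}
      if isinstance(node, list):
          return [unalias_lca_in_select(v, aliases) for v in node]
      # Leaf: replace a reference to a lateral alias with its definition.
      return aliases.get(node, node)

  tree = {"aliases": {"b": "a + 1"}, "where": ["b", ">", "0"]}
  print(unalias_lca_in_select(tree))
  ```

  The key property, mirrored from the description above, is that alias scopes accumulate as the walk descends, so aliases defined in an outer select are visible inside nested selects.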
@nfx nfx mentioned this pull request May 29, 2024
nfx added a commit that referenced this pull request May 29, 2024

3 participants