
Expand translation of Snowflake expr #351

Merged: 2 commits into main on May 17, 2024

Conversation

@vil1 (Contributor) commented May 15, 2024

Some bits of the grammar are still not covered, some seem to be plain wrong (such as subquery being considered a valid expr), and others are "duplicated" from predicate.

@vil1 vil1 requested a review from a team as a code owner May 15, 2024 14:05
@vil1 vil1 requested a review from himanishk May 15, 2024 14:05

codecov bot commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.26%. Comparing base (80828ca) to head (a6f7bb4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #351      +/-   ##
==========================================
+ Coverage   95.17%   95.26%   +0.08%     
==========================================
  Files          50       50              
  Lines        2863     2917      +54     
  Branches      407      412       +5     
==========================================
+ Hits         2725     2779      +54     
  Misses        102      102              
  Partials       36       36              


@sundarshankar89 (Contributor) left a comment

This is a very common use case; wonderful to see this coming along.

@jimidle (Contributor) commented May 16, 2024

The Snowflake grammar is almost, but not quite, completely broken in its definition of expressions and predicates. Those rules need to be reworked to be left recursive with precedence, like the TSQL grammar. I will add a new grammar option today that is a full TSQL grammar. However, I think we can add to that to make it Snowflake compatible and potentially a universal parser.

So let's approve this PR and connect on how far to take the Snowflake grammar.

@jimidle (Contributor) left a comment

I think we accept this as is and then revisit the whole expression and predicate tree in the Snowflake grammar. It can be done without throwing away the current visitor code.

Comment on lines 3682 to 3687
: object_name DOT NEXTVAL
| expr LSB expr RSB //array access
| expr COLON expr //json access
| expr COLON json_path //json access
| expr DOT (VALUE | expr)
| expr COLLATE string
| case_expression

Every time we change this rule, it will likely break. We need to rework it to be truly left recursive with corrected precedence.
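The point about left recursion with precedence can be illustrated outside ANTLR. In a left-recursive rule, the order of alternatives encodes operator precedence; the equivalent idea in a hand-written parser is precedence climbing. This is a toy sketch for illustration only, not project code:

```python
# Toy precedence-climbing parser: the PRECEDENCE table plays the role
# that alternative ordering plays in a left-recursive ANTLR rule.
import re

PRECEDENCE = {"OR": 1, "AND": 2, "+": 3, "-": 3, "*": 4, "/": 4}

def tokenize(src):
    return re.findall(r"\w+|[+\-*/()]", src)

def parse(tokens, min_prec=1):
    """Parse tokens into a nested-tuple AST, honoring precedence."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # consume ")"
    else:
        left = tok
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Higher-precedence operators bind tighter on the right.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left

print(parse(tokenize("a + b * c")))  # ('+', 'a', ('*', 'b', 'c'))
```

With a rule structured this way, adding an alternative means adding one table entry (or one alternative at the right position), instead of rippling breakage through mutually recursive `expr`/`predicate` rules.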

Comment on lines +3710 to +3726
predicate_partial
: IS null_not_null
| NOT? IN L_PAREN (subquery | expr_list) R_PAREN
| NOT? ( LIKE | ILIKE) expr (ESCAPE expr)?
| NOT? RLIKE expr
| NOT? (LIKE | ILIKE) ANY L_PAREN expr (COMMA expr)* R_PAREN (ESCAPE expr)?
| NOT? BETWEEN expr AND expr
;

json_path
: json_path_elem (DOT json_path_elem)*
;

json_path_elem
: ID | DOUBLE_QUOTE_ID
;


We should not need partial rules to avoid self recursion. The rule should look something like this:

searchCondition
    : LPAREN searchCondition RPAREN         #scPrec
    | NOT searchCondition                   #scNot
    | searchCondition AND searchCondition   #scAnd
    | searchCondition OR searchCondition    #scOr
    | predicate                             #scPred
    ;

predicate
    : EXISTS LPAREN subquery RPAREN
    | freetextPredicate
    | expression comparisonOperator expression
    | expression ME expression // SQL-82 syntax for left outer joins; PE. See https://stackoverflow.com/questions/40665/in-sybase-sql
    | expression comparisonOperator (ALL | SOME | ANY) LPAREN subquery RPAREN
    | expression NOT* BETWEEN expression AND expression
    | expression NOT* IN LPAREN (subquery | expressionList) RPAREN
    | expression NOT* LIKE expression (ESCAPE expression)?
    | expression IS nullNotnull
    ;

Which is what TSQL looks like right now (though it will likely change)

@sundarshankar89 sundarshankar89 added this pull request to the merge queue May 17, 2024
Merged via the queue into main with commit 2ec0f04 May 17, 2024
8 checks passed
@sundarshankar89 sundarshankar89 deleted the feature/snow-expr branch May 17, 2024 15:32
nfx added a commit that referenced this pull request May 29, 2024
* Capture Reconcile metadata in delta tables for dashboards ([#369](#369)). In this release, changes have been made to improve version control management, reduce repository size, and enhance build times. A new directory, "spark-warehouse/", has been added to the Git ignore file to prevent unnecessary files from being tracked and included in the project. The `WriteToTableException` class has been added to the `exception.py` file to raise an error when a runtime exception occurs while writing data to a table. A new `ReconCapture` class has been implemented in the `reconcile` package to capture and persist reconciliation metadata in delta tables. The `recon` function has been updated to initialize this new class, passing in the required parameters. Additionally, a new file, `recon_capture.py`, has been added to the reconcile package, which implements the `ReconCapture` class responsible for capturing metadata related to data reconciliation. The `recon_config.py` file has been modified to introduce a new class, `ReconcileProcessDuration`, and restructure the classes `ReconcileOutput`, `MismatchOutput`, and `ThresholdOutput`. The commit also captures reconcile metadata in delta tables for dashboards in the context of unit tests in the `test_execute.py` file and includes a new file, `test_recon_capture.py`, to test the reconcile capture functionality of the `ReconCapture` class.
* Expand translation of Snowflake `expr` ([#351](#351)). In this release, the translation of the `expr` category in the Snowflake language has been significantly expanded, addressing uncovered grammar areas, incorrect interpretations, and duplicates. The `subquery` is now excluded as a valid `expr`, and new case classes such as `NextValue`, `ArrayAccess`, `JsonAccess`, `Collate`, and `Iff` have been added to the `Expression` class. These changes improve the comprehensiveness and accuracy of the Snowflake parser, allowing for a more flexible and accurate translation of various operations. Additionally, the `SnowflakeExpressionBuilder` class has been updated to handle previously unsupported cases, enhancing the parser's ability to parse Snowflake SQL expressions.
* Fixed Oracle missing datatypes ([#333](#333)). In the latest release, the Oracle class of the Tokenizer in the open-source library has undergone a fix to address missing datatypes. Previously, the KEYWORDS mapping did not require Tokens for keys, which led to unsupported Oracle datatypes. This issue has been resolved by modifying the test_schema_compare.py file to ensure that all Oracle datatypes, including LONG, NCLOB, ROWID, UROWID, ANYTYPE, ANYDATA, ANYDATASET, XMLTYPE, SDO_GEOMETRY, SDO_TOPO_GEOMETRY, and SDO_GEORASTER, are now mapped to the TEXT TokenType. This improvement enhances the compatibility of the code with Oracle datatypes and increases the reliability of the schema comparison functionality, as demonstrated by the test function test_schema_compare, which now returns is_valid as True and a count of 0 for is_valid = `false` in the resulting dataframe.
* Fixed the recon_config functions to handle null values ([#399](#399)). In this release, the recon_config functions have been enhanced to manage null values and provide more flexible column mapping for reconciliation purposes. A `__post_init__` method has been added to certain classes to convert specified attributes to lowercase and handle null values. A new helper method, `_get_is_string`, has been introduced to determine if a column is of string type. Additionally, new functions such as `get_tgt_to_src_col_mapping_list`, `get_layer_tgt_to_src_col_mapping`, `get_src_to_tgt_col_mapping_list`, and `get_layer_src_to_tgt_col_mapping` have been added to retrieve column mappings, enhancing the overall functionality and robustness of the reconciliation process. These improvements will benefit software engineers by ensuring more accurate and reliable configuration handling, as well as providing more flexibility in mapping source and target columns during reconciliation.
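  The `__post_init__` normalization described in the recon_config change can be sketched as follows. This is a minimal illustration of the pattern, with hypothetical field names; it is not the actual remorph `recon_config` API:

  ```python
  # Sketch: a dataclass whose __post_init__ lowercases configured column
  # names while tolerating None values (field names are illustrative).
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class ColumnMapping:
      source_name: Optional[str] = None
      target_name: Optional[str] = None

      def __post_init__(self):
          # Normalize to lowercase for case-insensitive matching;
          # leave None untouched so missing mappings stay missing.
          if self.source_name is not None:
              self.source_name = self.source_name.lower()
          if self.target_name is not None:
              self.target_name = self.target_name.lower()

  m = ColumnMapping(source_name="CustomerID", target_name=None)
  print(m.source_name, m.target_name)  # customerid None
  ```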
* Improve Exception handling ([#392](#392)). The commit titled `Improve Exception Handling` enhances error handling in the project, addressing issues [#388](#388) and [#392](#392). Changes include refactoring the `create_adapter` method in the `DataSourceAdapter` class, updating method arguments in test functions, and adding new methods in the `test_execute.py` file for better test doubles. The `DataSourceAdapter` class is replaced with the `create_adapter` function, which takes the same arguments and returns an instance of the appropriate `DataSource` subclass based on the provided `engine` parameter. The diff also modifies the behavior of certain test methods to raise more specific and accurate exceptions. Overall, these changes improve exception handling, streamline the codebase, and provide clearer error messages for software engineers.
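  The `create_adapter` factory replacing the `DataSourceAdapter` class follows a common dispatch-by-engine pattern. A rough sketch of the idea, with illustrative class and engine names rather than the real remorph signatures:

  ```python
  # Sketch: dispatch to a DataSource subclass based on the engine name
  # (class names and supported engines here are assumptions).
  class DataSource: ...
  class SnowflakeDataSource(DataSource): ...
  class OracleDataSource(DataSource): ...

  _ADAPTERS = {"snowflake": SnowflakeDataSource, "oracle": OracleDataSource}

  def create_adapter(engine: str) -> DataSource:
      try:
          return _ADAPTERS[engine.lower()]()
      except KeyError:
          # A specific, actionable error instead of a generic failure.
          raise ValueError(f"Unsupported engine: {engine}") from None
  ```

  A module-level function with a lookup table is easier to extend and test than a class hierarchy whose only job is construction, which matches the "better test doubles" motivation above.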
* Introduced morph_sql and morph_column_expr functions for inline transpilation and validation ([#328](#328)). Two new classes, TranspilationResult and ValidationResult, have been added to the config module of the remorph package to store the results of transpilation and validation. The morph_sql and morph_column_exp functions have been introduced to support inline transpilation and validation of SQL code and column expressions. A new class, Validator, has been added to the validation module to handle validation, and the validate_format_result method within this class has been updated to return a ValidationResult object. The _query method has also been added to the class, which executes a given SQL query and returns a tuple containing a boolean indicating success, any exception message, and the result of the query. Unit tests for these new functions have been updated to ensure proper functionality.
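  The result-object pattern described for morph_sql can be sketched like this. Field names and the toy transpilation are illustrative assumptions, not the actual remorph config module:

  ```python
  # Sketch: dataclasses carrying transpilation/validation outcomes, and a
  # toy morph_sql that splits a script into statements (stand-in logic).
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class TranspilationResult:
      transpiled_sql: list
      parse_error_count: int = 0

  @dataclass
  class ValidationResult:
      validated_sql: str
      exception_msg: Optional[str] = None

  def morph_sql(sql: str) -> TranspilationResult:
      """Toy inline transpilation: pass each statement through unchanged."""
      statements = [s.strip() for s in sql.split(";") if s.strip()]
      return TranspilationResult(transpiled_sql=statements)

  result = morph_sql("SELECT 1; SELECT 2")
  print(result.transpiled_sql)  # ['SELECT 1', 'SELECT 2']
  ```

  Returning a structured result instead of a bare string lets callers inspect errors and metadata without parsing log output.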
* Output for the reconcile function ([#389](#389)). A new function `get_key_form_dialect` has been added to the `config.py` module, which takes a `Dialect` object and returns the corresponding key used in the `SQLGLOT_DIALECTS` dictionary. Additionally, the `MorphConfig` dataclass has been updated to include a new attribute `__file__`, which sets the filename to "config.yml". The `get_dialect` function remains unchanged. Two new exceptions, `WriteToTableException` and `InvalidInputException`, have been introduced, and the existing `DataSourceRuntimeException` has been modified in the same module to improve error handling. The `execute.py` file's reconcile function has undergone several changes, including adding imports for `InvalidInputException`, `ReconCapture`, and `generate_final_reconcile_output` from `recon_exception` and `recon_capture` modules, and modifying the `ReconcileOutput` type. The `hash_query.py` file's reconcile function has been updated to include a new `_get_with_clause` method, which returns a `Select` object for a given DataFrame, and the `build_query` method has been updated to include a new query construction step using the `with_clause` object. The `threshold_query.py` file's reconcile function's output has been updated to include query and logger statements, a new method for allowing user transformations on threshold aliases, and the dialect specified in the sql method. A new `generate_final_reconcile_output` function has been added to the `recon_capture.py` file, which generates a reconcile output given a recon_id and a SparkSession. New classes and dataclasses, including `SchemaReconcileOutput`, `ReconcileProcessDuration`, `StatusOutput`, `ReconcileTableOutput`, and `ReconcileOutput`, have been introduced in the `reconcile/recon_config.py` file. 
The `tests/unit/reconcile/test_execute.py` file has been updated to include new test cases for the `recon` function, including tests for different report types and scenarios, such as data, schema, and all report types, exceptions, and incorrect report types. A new test case, `test_initialise_data_source`, has been added to test the `initialise_data_source` function, and the `test_recon_for_wrong_report_type` test case has been updated to expect an `InvalidInputException` when an incorrect report type is passed to the `recon` function. The `test_reconcile_data_with_threshold_and_row_report_type` test case has been added to test the `reconcile_data` method of the `Reconciliation` class with a row report type and threshold options. Overall, these changes improve the functionality and robustness of the reconcile process by providing more fine-grained control over the generation of the final reconcile output and better handling of exceptions and errors.
* Threshold Source and Target query builder ([#348](#348)). In this release, we've introduced a new method, `build_threshold_query`, that constructs a customizable threshold query based on a table's partition, join, and threshold columns configuration. The method identifies necessary columns, applies specified transformations, and includes a WHERE clause based on the filter defined in the table configuration. The resulting query is then converted to a SQL string using the dialect of the source database. Additionally, we've updated the test file for the threshold query builder in the reconcile package, including refactoring of function names and updated assertions for query comparison. We've added two new test methods: `test_build_threshold_query_with_single_threshold` and `test_build_threshold_query_with_multiple_thresholds`. These changes enhance the library's functionality, providing a more robust and customizable threshold query builder, and improve test coverage for various configurations and scenarios.
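  The shape of the query builder described above can be sketched as follows. This is a simplified string-based illustration with an assumed signature; the real `build_threshold_query` works through the source dialect rather than raw strings:

  ```python
  # Sketch: compose a threshold query from join columns, threshold
  # columns, and an optional filter clause (signature is an assumption).
  def build_threshold_query(table, join_cols, threshold_cols, filter_clause=None):
      select_list = ", ".join(list(join_cols) + list(threshold_cols))
      query = f"SELECT {select_list} FROM {table}"
      if filter_clause:
          # The WHERE clause comes from the table configuration's filter.
          query += f" WHERE {filter_clause}"
      return query

  q = build_threshold_query("orders", ["order_id"], ["amount"], "status = 'open'")
  print(q)  # SELECT order_id, amount FROM orders WHERE status = 'open'
  ```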
* Unpack nested alias ([#336](#336)). This release introduces a significant update to the 'lca_utils.py' file, addressing the limitation of not handling nested aliases in window expressions and where clauses, which resolves issue [#334](#334). The `unalias_lca_in_select` method has been implemented to recursively parse nested selects and unalias lateral column aliases, thereby identifying and handling unsupported lateral column aliases. This method is utilized in the `check_for_unsupported_lca` method to handle unsupported lateral column aliases in the input SQL string. Furthermore, the 'test_lca_utils.py' file has undergone changes, impacting several test functions and introducing two new ones, `test_fix_nested_lca` and 'test_fix_nested_lca_with_no_scope', to ensure the code's reliability and accuracy by preventing unnecessary assumptions and hallucinations. These updates demonstrate our commitment to improving the library's functionality and test coverage.
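  The recursive unaliasing idea behind `unalias_lca_in_select` can be shown conceptually with a toy AST of nested dicts and lists. This is not the actual `lca_utils` code, which operates on sqlglot expression trees:

  ```python
  # Conceptual sketch: walk a nested select (toy AST), collecting lateral
  # column aliases at each level and substituting their definitions into
  # deeper references such as window expressions and where clauses.
  def unalias_lca_in_select(node, aliases=None):
      aliases = dict(aliases or {})
      if isinstance(node, dict):
          # Record aliases defined at this level before rewriting children.
          for col, expr in node.get("aliases", {}).items():
              aliases[col] = expr
          return {k: unalias_lca_in_select(v, aliases) for k, v in node.items()}
      if isinstance(node, list):
          return [unalias_lca_in_select(v, aliases) for v in node]
      # Leaf: replace a reference to a lateral alias with its definition.
      return aliases.get(node, node)

  tree = {"aliases": {"b": "a + 1"}, "where": ["b", ">", "0"]}
  print(unalias_lca_in_select(tree))
  ```

  The key property, mirrored from the description above, is that alias scopes accumulate as the walk descends, so aliases defined in an outer select are visible inside nested selects.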
@nfx nfx mentioned this pull request May 29, 2024
nfx added a commit that referenced this pull request May 29, 2024

3 participants