Skip to content

Latest commit

 

History

History
341 lines (330 loc) · 35.5 KB

CHANGELOG.md

File metadata and controls

341 lines (330 loc) · 35.5 KB

Change log

Generated on 2024-12-16

Release 24.12

Features

#11630 [FEA] enable from_json and json scan by default
#11709 [FEA] Add support for MonthsBetween
#11666 [FEA] support task limit profiling for specified stages
#11662 [FEA] Support Apache Spark 3.4.4
#11657 [FEA] Support format 'yyyyMMdd HH:mm:ss' for legacy mode
#11419 [FEA] Support Spark 3.5.3 release
#11505 [FEA] Support yyyymmdd format for GetTimestamp for LEGACY mode.

Performance

#8391 [FEA] Do a hash based re-partition instead of a sort based fallback for hash aggregate
#11560 [FEA] Improve GpuJsonToStructs performance
#11458 [FEA] enable prune_columns for from_json

Bugs Fixed

#10907 from_json function parses a column containing an empty array, throws an exception.
#11793 [BUG] "Time in Heuristic" should not include previous operator's compute time
#11798 [BUG] mismatch CPU and GPU result in test_months_between_first_day[DATAGEN_SEED=1733006411, TZ=Africa/Casablanca]
#11790 [BUG] test_hash_* failed "java.util.NoSuchElementException: head of empty list" or "Too many times of repartition, may hit a bug?"
#11643 [BUG] Support AQE with Broadcast Hash Join and DPP on Databricks 14.3
#10910 from_json, when input = empty object, rapids throws an exception.
#10891 Parsing a column containing invalid json into StructureType with schema throws an Exception.
#11741 [BUG] Fix spark400 build due to writeWithV1 return value change
#11533 Fix JSON Matrix tests on Databricks 14.3
#11722 [BUG] Spark 4.0.0 has moved NullIntolerant and builds are breaking because they are unable to find it.
#11726 [BUG] Databricks 14.3 nightly deploy fails due to incorrect DB_SHIM_NAME
#11293 [BUG] A user query with from_json failed with "JSON Parser encountered an invalid format at location"
#9592 [BUG][JSON] from_json to Map type should produce null for invalid entries
#11715 [BUG] parquet_testing_test.py failed on "AssertionError: GPU and CPU boolean values are different"
#11716 [BUG] delta_lake_write_test.py failed on "AssertionError: GPU and CPU boolean values are different"
#11684 [BUG] 24.12 Precommit fails with wrong number of arguments in GpuDataSource
#11168 [BUG] reserve allocation should be displayed when erroring due to lack of memory on startup
#7585 [BUG] [Regexp] Line anchor '$' incorrect matching of unicode line terminators
#11622 [BUG] GPU Parquet scan filter pushdown fails with timestamp/INT96 column
#11646 [BUG] NullPointerException in GpuRand
#10498 [BUG] Unit tests failed: [INTERVAL_ARITHMETIC_OVERFLOW] integer overflow. Use 'try_add' to tolerate overflow and return NULL instead
#11659 [BUG] parse_url throws exception if partToExtract is invalid while Spark returns null
#10894 Parsing a column containing a nested structure to json thows an exception
#10895 Converting a column containing a map into json throws an exception
#10896 Converting an column containing an array into json throws an exception
#10915 to_json when converts an array will throw an exception:
#10916 to_json function doesn't support map[string, struct] to json conversion.
#10919 to_json converting map[string, integer] to json, throws an exception
#10920 to_json converting an array with maps throws an exception.
#10921 to_json - array with single map
#10923 [BUG] Spark UT framework: to_json function to convert the array with a single empty row to a JSON string throws an exception.
#10924 [BUG] Spark UT framework: to_json when converts an empty array into json throws an exception.
#11024 Fix tests failures in parquet_write_test.py
#11174 Opcode Suite fails for Scala 2.13.8+
#10483 [BUG] JsonToStructs fails to parse all empty dicts and invalid lines
#10489 [BUG] from_json does not support input with \n in it.
#10347 [BUG] Failures in Integration Tests on Dataproc Serverless
#11021 Fix tests failures in orc_cast_test.py
#11609 [BUG] test_hash_repartition_long_overflow_ansi_exception failed on 341DB
#11600 [BUG] regex_test failed mismatched cpu and gpu values in UT and IT
#11611 [BUG] Spark 4.0 build failure - value cannotSaveIntervalIntoExternalStorageError is not a member of object org.apache.spark.sql.errors.QueryCompilationErrors
#10922 from_json cannot support line separator in the input string.
#11009 Fix tests failures in cast_test.py
#11572 [BUG] MultiFileReaderThreadPool may flood the console with log messages

PRs

#11874 Remove 350db143 shim's build [skip ci]
#11851 Update latest changelog [skip ci]
#11849 Update rapids JNI and private dependency to 24.12.0
#11841 [DOC] update doc for 24.12 release [skip ci]
#11857 Increase the pre-merge CI timeout to 6 hours
#11845 Fix leak in isTimeStamp
#11823 Fix for LEAD/LAG window function test failures.
#11832 Fix leak in GpuBroadcastNestedLoopJoinExecBase
#11763 Orc writes don't fully support Booleans with nulls
#11794 exclude previous operator's time out of firstBatchHeuristic
#11802 Fall back to CPU for non-UTC months_between
#11792 [BUG] Fix issue 11790
#11768 Fix dpp_test.py failures on 14.3
#11752 Ability to decompress snappy and zstd Parquet files via CPU
#11777 Append knoguchi22 to blossom-ci whitelist [skip ci]
#11712 repartition-based fallback for hash aggregate v3
#11771 Fix query hang when using rapids multithread shuffle manager with kudo
#11759 Avoid using StringBuffer in single-threaded methods.
#11766 Fix Kudo batch serializer to only read header in hasNext
#11730 Add support for asynchronous writing for parquet
#11750 Fix aqe_test failures on 14.3.
#11753 Enable JSON Scan and from_json by default
#11733 Print out the current attempt object when OOM inside a retry block
#11618 Execute from_json with struct schema using JSONUtils.fromJSONToStructs
#11725 host watermark metric
#11746 Remove batch size bytes limits
#11723 Add NVIDIA Copyright
#11721 Add a few more JSON tests for MAP<STRING,STRING>
#11744 Do not package the Databricks 14.3 shim into the dist jar [skip ci]
#11724 Integrate with kudo
#11739 Update to Spark 4.0 changing signature of SupportsV1Write.writeWithV1
#11737 Add in support for months_between
#11700 Fix leak with RapidsHostColumnBuilder in GpuUserDefinedFunction
#11727 Widen type promotion for decimals with larger scale in Parquet Read
#11719 Skip from_json overflow tests for 14.3
#11708 Support profiling for specific stages on a limited number of tasks
#11731 Add NullIntolerantShim to adapt to Spark 4.0 removing NullIntolerant
#11413 Support multi string contains
#11728 Change Databricks 14.3 shim name to spark350db143 [skip ci]
#11702 Improve JSON scan and from_json
#11635 Added Shims for adding Databricks 14.3 Support
#11714 Let AWS Databricks automatically choose an Availability Zone
#11703 Simplify $ transpiling and fix newline character bug
#11707 impalaFile cannot be found by UT framework.
#11697 Make delta-lake shim dependencies parametrizable
#11710 Add shim version 344 to LogicalPlanShims.scala
#11706 Add retry support in sub hash join
#11673 Fix Parquet Writer tests on 14.3
#11669 Fix string_test for 14.3
#11692 Add Spark 3.4.4 Shim
#11695 Fix spark400 build due to LogicalRelation signature changes
#11689 Update the Maven repository to download Spark JAR files [skip ci]
#11670 Fix misc_expr_test for 14.3
#11652 Fix skipping fixed_length_char ORC tests on > 13.3
#11644 Skip AQE-join-DPP tests for 14.3
#11667 Preparation for the coming Kudo support
#11685 Exclude shimplify-generated files from scalastyle
#11282 Reserve allocation should be displayed when erroring due to lack of memory on startup
#11671 Use the new host memory allocation API
#11682 Fix auto merge conflict 11679 [skip ci]
#11663 Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex
#11672 Fix race condition with Parquet filter pushdown modifying shared hadoop Configuration
#11596 Add a new NVTX range for task GPU ownership
#11664 Fix orc_write_test.py for 14.3
#11656 [DOC] update the supported OS in download page [skip ci]
#11665 Generate classes identical up to the shim package name
#11647 Fix a NPE issue in GpuRand
#11658 Support format 'yyyyMMdd HH:mm:ss' for legacy mode
#11661 Support invalid partToExtract for parse_url
#11520 UT adjust override checkScanSchemata & enabling ut of exclude_by_suffix fea.
#11634 Put DF_UDF plugin code into the main uber jar.
#11522 UT adjust test SPARK-26677: negated null-safe equality comparison
#11521 Datetime rebasing issue fixed
#11642 Update to_json to be more generic and fix some bugs
#11615 Spark 4 parquet_writer_test.py fixes
#11623 Fix collection_ops_test for 14.3
#11553 Fix udf-compiler scala2.13 internal return statements
#11640 Disable date/timestamp types by default when parsing JSON
#11570 Add support for Spark 3.5.3
#11591 Spark UT framework: Read Parquet file generated by parquet-thrift Rapids, UT case adjust.
#11631 Update JSON tests based on a closed/fixed issues
#11617 Quick fix for the build script failure of Scala 2.13 jars [skip ci]
#11614 Ensure repartition overflow test always overflows
#11612 Revert "Disable regex tests to unblock CI (#11606)"
#11597 install_deps changes for Databricks 14.3
#11608 Use mvn -f scala2.13/ in the build scripts to build the 2.13 jars
#11610 Change DataSource calendar interval error to fix spark400 build
#11549 Adopt JSONUtils.concatenateJsonStrings for concatenating JSON strings
#11595 Remove an unused config shuffle.spillThreads
#11606 Disable regex tests to unblock CI
#11605 Fix auto merge conflict 11604 [skip ci]
#11587 avoid long tail tasks due to PrioritySemaphore, remaing part
#11574 avoid long tail tasks due to PrioritySemaphore
#11559 [Spark 4.0] Address test failures in cast_test.py
#11579 Fix merge conflict with branch-24.10
#11571 Log reconfigure multi-file thread pool only once
#11564 Disk spill metric
#11561 Add in a basic plugin for dataframe UDF support in Apache Spark
#11563 Fix the latest merge conflict in integration tests
#11542 Update rapids JNI and private dependency to 24.12.0-SNAPSHOT [skip ci]
#11493 Support legacy mode for yyyymmdd format

Release 24.10

Features

#11525 [FEA] If dump always is enabled dump before decoding the file
#11461 [FEA] Support non-UTC timezone for casting from date to timestamp
#11445 [FEA] Support format 'yyyyMMdd' in GetTimestamp operator
#11442 [FEA] Add in support for setting row group sizes for parquet
#11330 [FEA] Add companion metrics for all nsTiming metrics to measure time elapsed excluding semaphore wait
#5223 [FEA] Support array_join
#10968 [FEA] support min_by function
#10437 [FEA] Add Spark 3.5.2 snapshot support

Performance

#10799 [FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce
#8301 [FEA] semaphore prioritization
#11234 Explore swapping build table for left outer joins
#11263 [FEA] Cluster/pack multi_get_json_object paths by common prefixes

Bugs Fixed

#11558 [BUG] test_sortmerge_join_ridealong fails on DB 13.3
#11573 [BUG] very long tail task is observed when many tasks are contending for PrioritySemaphore
#11367 [BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column
#11543 [BUG] test_yyyyMMdd_format_for_legacy_mode[DATAGEN_SEED=1727619674, TZ=UTC] failed GPU and CPU are not both null
#11500 [BUG] dataproc serverless Integration tests failing in json_matrix_test.py
#11384 [BUG] "rs. shuffle write time" negative values seen in app history log
#11509 [BUG] buildall no longer works
#11501 [BUG] test_yyyyMMdd_format_for_legacy_mode failed in Dataproc Serverless integration tests
#11502 [BUG] IT script failed get jars as we stop deploying intermediate jars since 24.10
#11479 [BUG] spark400 build failed do not conform to class UnaryExprMeta's type parameter
#8558 [BUG] from_json generated inconsistent result comparing with CPU for input column with nested json strings
#11485 [BUG] Integration tests failing in join_test.py
#11481 [BUG] non-utc integration tests failing in json_test.py
#10911 from_json: when input is a bad json string, rapids would throw an exception.
#10457 [BUG] ScanJson and JsonToStructs allow unquoted control chars by default
#10479 [BUG] JsonToStructs and ScanJson should return null for non-numeric, non-boolean non-quoted strings
#10534 [BUG] Need Improved JSON Validation
#11436 [BUG] Mortgage unit tests fail with RAPIDS shuffle manager
#11437 [BUG] array and map casts to string tests failed
#11463 [BUG] hash_groupby_approx_percentile failed assert is None
#11465 [BUG] java.lang.NoClassDefFoundError: org/apache/spark/BuildInfo$ in non-databricks environment
#11359 [BUG] a couple of arithmetic_ops_test.py cases failed mismatching cpu and gpu values with [DATAGEN_SEED=1723985531, TZ=UTC, INJECT_OOM]
#11392 [AUDIT] Handle IgnoreNulls Expressions for Window Expressions
#10770 [BUG] Slow/no progress with cascaded pandas udfs/mapInPandas in Databricks
#11397 [BUG] We should not be using copyWithBooleanColumnAsValidity unless we can prove it is 100% safe
#11372 [BUG] spark400 failed compiling datagen_2.13
#11364 [BUG] Missing numRows in the ColumnarBatch created in GpuBringBackToHost
#11350 [BUG] spark400 compile failed in scala213
#11346 [BUG] databrick nightly failing with not able to get spark-version-info.properties
#9604 [BUG] Delta Lake metadata query detection can trigger extra file listing jobs
#11318 [BUG] GPU query is case sensitive on Hive text table's column name
#10596 [BUG] ScanJson and JsonToStructs does not deal with escaped single quotes properly
#10351 [BUG] test_from_json_mixed_types_list_struct failed
#11294 [BUG] binary-dedupe leaves around a copy of "unshimmed" class files in spark-shared
#11183 [BUG] Failed to split an empty string with error "ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal"
#11008 Fix tests failures in ast_test.py
#11265 [BUG] segfaults seen in cuDF after prefetch calls intermittently
#11025 Fix tests failures in date_time_test.py
#11065 [BUG] Spark Connect Server (3.5.1) Can Not Running Correctly

PRs

#11683 [DOC] update download page for 2410 hot fix release [skip ci]
#11680 Update latest changelog [skip ci]
#11678 Update version to 24.10.1-SNAPSHOT [skip ci]
#11676 Fix race condition with Parquet filter pushdown modifying shared hadoop Configuration
#11626 Update latest changelog [skip ci]
#11624 Update the download link [skip ci]
#11577 Update latest changelog [skip ci]
#11576 Update rapids JNI and private dependency to 24.10.0
#11582 [DOC] update doc for 24.10 release [skip ci]
#11414 Fix collection_ops_tests for Spark 4.0
#11588 backport fixes of #11573 to branch 24.10
#11569 Have "dump always" dump input files before trying to decode them
#11544 Update test case related to LEACY datetime format to unblock nightly CI
#11567 Fix test case unix_timestamp(col, 'yyyyMMdd') failed for Africa/Casablanca timezone and LEGACY mode
#11519 Spark 4: Fix parquet_test.py
#11496 Update test now that code is fixed
#11548 Fix negative rs. shuffle write time
#11545 Update test case related to LEACY datetime format to unblock nightly CI
#11515 Propagate default DIST_PROFILE_OPT profile to Maven in buildall
#11497 Update from_json to use new cudf features
#11516 Deploy all submodules for default sparkver in nightly [skip ci]
#11484 Fix FileAlreadyExistsException in LORE dump process
#11457 GPU device watermark metrics
#11507 Replace libmamba-solver with mamba command [skip ci]
#11503 Download artifacts via wget [skip ci]
#11490 Use UnaryLike instead of UnaryExpression
#10798 Optimizing Expand+Aggregate in sqls with many count distinct
#11366 Enable parquet suites from Spark UT
#11477 Install cuDF-py against python 3.10 on Databricks
#11462 Support non-UTC timezone for casting from date type to timestamp type
#11449 Support yyyyMMdd in GetTimestamp operator for LEGACY mode
#11456 Enable tests for all JSON white space normalization
#11483 Use reusable auto-merge workflow [skip ci]
#11482 Fix a json test for non utc time zone
#11464 Use improved CUDF JSON validation
#11474 Enable tests after string_split was fixed
#11473 Revert "Skip test_hash_groupby_approx_percentile byte and double test…
#11466 Replace scala.util.Try with a try statement in the DBR buildinfo
#11469 Skip test_hash_groupby_approx_percentile byte and double tests tempor…
#11429 Fixed some of the failing parquet_tests
#11455 Log DBR BuildInfo
#11451 xfail array and map cast to string tests
#11331 Add companion metrics for all nsTiming metrics without semaphore
#11421 [DOC] remove the redundant archive link [skip ci]
#11308 Dynamic Shim Detection for build Process
#11427 Update CI scripts to work with the "Dynamic Shim Detection" change [skip ci]
#11425 Update signoff usage [skip ci]
#11420 Add in array_join support
#11418 stop using copyWithBooleanColumnAsValidity
#11411 Fix asymmetric join crash when stream side is empty
#11395 Fix a Pandas UDF slowness issue
#11371 Support MinBy and MaxBy for non-float ordering
#11399 stop using copyWithBooleanColumnAsValidity
#11389 prevent duplicate queueing in the prio semaphore
#11291 Add distinct join support for right outer joins
#11396 Drop cudf-py python 3.9 support [skip ci]
#11393 Revert work-around for empty split-string
#11334 Add support for Spark 3.5.2
#11388 JSON tests for corrected date, timestamp, and mixed types
#11375 Fix spark400 build in datagen and tests
#11376 Create a PrioritySemaphore to back the GpuSemaphore
#11383 Fix nightly snapshots being downloaded in premerge build
#11368 Move SparkRapidsBuildInfoEvent to its own file
#11329 Change reference to MapUtils into JSONUtils
#11365 Set numRows for the ColumnBatch created in GpuBringBackToHost
#11363 Fix failing test compile for Spark 4.0.0
#11362 Add tests for repeated JSON columns/keys
#11321 conform dependency list in 341db to previous versions style
#10604 Add string escaping JSON tests to the test_json_matrix
#11328 Swap build side for outer joins when natural build side is explosive
#11358 Fix download doc [skip ci]
#11357 Fix auto merge conflict 11354 [skip ci]
#11347 Revert "Fix the mismatching default configs in integration tests (#11283)"
#11323 replace inputFiles with location.rootPaths.toString
#11340 Audit script - Check commits from sql-hive directory [skip ci]
#11283 Fix the mismatching default configs in integration tests
#11327 Make hive column matches not case-sensitive
#11324 Append ustcfy to blossom-ci whitelist [skip ci]
#11325 Fix auto merge conflict 11317 [skip ci]
#11319 Update passing JSON tests after list support added in CUDF
#11307 Safely close multiple resources in RapidsBufferCatalog
#11313 Fix auto merge conflict 10845 11310 [skip ci]
#11312 Add jihoonson as an authorized user for blossom-ci [skip ci]
#11302 Fix display issue of lore.md
#11301 Skip deploying non-critical intermediate artifacts [skip ci]
#11299 Enable get_json_object by default and remove legacy version
#11289 Use the new chunked API from multi-get_json_object
#11295 Remove redundant classes from the dist jar and unshimmed list
#11284 Use distinct count to estimate join magnification factor
#11288 Move easy unshimmed classes to sql-plugin-api
#11285 Remove files under tools/generated_files/spark31* [skip ci]
#11280 Asynchronously copy table data to the host during shuffle
#11258 Explicitly disable ANSI mode for ast_test.py
#11267 Update the rapids JNI and private dependency version to 24.10.0-SNAPSHOT

Older Releases

Changelog of older releases can be found at docs/archives