Spark Tools

Description

This project contains some basic runnable tools that can help with various tasks around a Spark-based project.

The main tools available:

  • FormatConverter: converts any supported file format into a different file format, with partitioning support.
  • SimpleSqlProcessor: applies a given SQL query to the input files, which are mapped into tables.
  • StreamingFormatConverter: converts any supported data stream format into a different data stream format, with partitioning support.
  • SimpleFileStreamingSqlProcessor: applies a given SQL query to the input file streams, which are mapped into file output streams.
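
To make the first two descriptions concrete, here is a minimal sketch of what format conversion with partitioning and SQL over files mapped to tables look like in the plain Spark 2.4 API. It is not the implementation of FormatConverter or SimpleSqlProcessor, and it does not use their configuration. The object name `PlainSparkSketch` and its method and parameter names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, NOT part of spark-tools: it only illustrates the
// underlying Spark calls that a format converter and a SQL processor build on.
object PlainSparkSketch {

  // Read `inputPath` as `inputFormat` and rewrite it as `outputFormat`,
  // optionally partitioning the output by the given columns.
  def convert(
      spark: SparkSession,
      inputFormat: String,
      inputPath: String,
      outputFormat: String,
      outputPath: String,
      readOptions: Map[String, String] = Map.empty,
      partitionColumns: Seq[String] = Seq.empty): Unit = {
    val input = spark.read.format(inputFormat).options(readOptions).load(inputPath)
    val writer = input.write.format(outputFormat).mode("overwrite")
    val partitioned =
      if (partitionColumns.nonEmpty) writer.partitionBy(partitionColumns: _*)
      else writer
    partitioned.save(outputPath)
  }

  // Register each input DataFrame as a temporary view under its table name,
  // then run the SQL query against those views.
  def runSql(spark: SparkSession, tables: Map[String, DataFrame], sql: String): DataFrame = {
    tables.foreach { case (name, df) => df.createOrReplaceTempView(name) }
    spark.sql(sql)
  }
}
```

With a local `SparkSession`, a call such as `PlainSparkSketch.convert(spark, "json", "/tmp/in", "parquet", "/tmp/out", partitionColumns = Seq("country"))` would rewrite JSON files as Parquet partitioned by `country`. The paths and the column name are placeholders.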

This project also aims to create and encourage a friendly yet professional environment where developers help each other, so please do not be shy and join through gitter, twitter, issue reports or pull requests.

Prerequisites

  • Java 8 or higher
  • Scala 2.11 or 2.12
  • Apache Spark 2.4.X

Getting Spark Tools

Spark Tools is published to Maven Central and Spark Packages, where the latest artifacts can be found.

  • Group id / organization: org.tupol
  • Artifact id / name: spark-tools
  • Latest version is 0.4.1

To use Spark Tools with SBT, add a dependency on the latest version to your sbt build definition file:

libraryDependencies += "org.tupol" %% "spark-tools" % "0.4.1"
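
For context, a fuller build definition might look like the sketch below. The Scala and Spark versions are the ones named in the 0.4.1 release notes. Marking Spark as `Provided` is an assumption that suits applications launched through spark-submit, where the cluster supplies the Spark jars. Adjust it if your setup differs.

```scala
// build.sbt (illustrative sketch, not taken from the project)
scalaVersion := "2.12.12" // 2.11.12 is the other version the project compiles with

libraryDependencies ++= Seq(
  // Assumption: Spark itself is supplied by the runtime environment
  "org.apache.spark" %% "spark-sql" % "2.4.6" % Provided,
  "org.tupol" %% "spark-tools" % "0.4.1"
)
```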

Include this package in your Spark applications using spark-shell or spark-submit with Scala 2.11:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-tools_2.11:0.4.1

or with Scala 2.12:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-tools_2.12:0.4.1

What's new?

0.4.1

  • Added StreamingFormatConverter
  • Added FileStreamingSqlProcessor, SimpleFileStreamingSqlProcessor
  • Bumped spark-utils dependency to 0.4.2
  • The project compiles with both Scala 2.11.12 and 2.12.12
  • Updated Apache Spark to 2.4.6
  • Updated delta.io to 0.6.1
  • Updated the spark-xml library to 0.10.0
  • Removed the com.databricks:spark-avro dependency, as avro support is now built into Apache Spark
  • Updated the spark-utils dependency to the latest available snapshot

For previous versions please consult the release notes.

License

This code is open source software licensed under the MIT License.
