SparkETL

SparkETL is a library for performing ETL on Apache Spark using Python.
It provides tools to read data from various sources, apply transformations to the data, and then load the results to a destination.
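
As a minimal sketch of that read, transform, load flow in plain PySpark (the library's own entry points are not shown here; the paths and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sparketl-sketch").getOrCreate()

    # Read: a pipe-delimited file with a header row.
    df = (spark.read
          .option("sep", "|")
          .option("header", "true")
          .csv("/data/in/customers.csv"))

    # Transform: trim every string column.
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.trim(F.col(name)))

    # Load: write the result out as Parquet.
    df.write.mode("overwrite").parquet("/data/out/customers")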

io

It can currently read the following (a sketch of equivalent plain-PySpark reads follows the list):

  • Delimited Files
  • JSON files
  • Fixed length record files
  • ZIP files containing delimited text files
  • Avro | dependency: databricks-avro JAR
  • Parquet
  • Hive Tables
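
These are rough plain-PySpark equivalents of those readers, not the library's own API; every path and table name is illustrative, ZIP extraction is omitted, and the Avro read assumes the databricks-avro JAR is on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Delimited file
    delim_df = (spark.read.option("sep", ",")
                .option("header", "true").csv("/data/in/sales.csv"))

    # JSON file (one object per line)
    json_df = spark.read.json("/data/in/events.json")

    # Fixed-length records: read as text, then slice fields by position.
    fixed_df = (spark.read.text("/data/in/fixed.dat")
                .select(F.col("value").substr(1, 10).alias("id"),
                        F.col("value").substr(11, 30).alias("name")))

    # Avro, via the Databricks package
    avro_df = (spark.read.format("com.databricks.spark.avro")
               .load("/data/in/events.avro"))

    # Parquet and Hive
    parquet_df = spark.read.parquet("/data/in/events.parquet")
    hive_df = spark.table("warehouse.events")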

It can currently write to:

  • Delimited Files
  • JSON files
  • Avro | dependency: databricks-avro JAR
  • ORC
  • Parquet
  • Hive Tables

Schema mismatches with existing Hive tables are handled out of the box when writing; the sketch below shows the kind of alignment involved.
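
This is a sketch of how such alignment could look, assuming the target table already exists; it mirrors the described behavior rather than quoting the library's code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.read.parquet("/data/out/customers")  # illustrative input

    target = spark.table("warehouse.customers")
    target_types = dict(target.dtypes)

    # Match the table's column order and types, filling missing columns
    # with NULL and dropping columns the table does not have.
    aligned = df.select([
        F.col(c).cast(target_types[c]) if c in df.columns
        else F.lit(None).cast(target_types[c]).alias(c)
        for c in target.columns
    ])
    aligned.write.insertInto("warehouse.customers")  # appends by default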

Transformation

CDC to capture changes in dimension tables (sketched after this list).

  • Performs upserts
  • Does not support deleting records
  • Behaves like SCD Type 1
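
A minimal SCD Type 1 upsert sketch with DataFrames, assuming the dimension and the incoming batch share a schema; the table, path, and key names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    dim = spark.table("warehouse.dim_customer")
    incoming = spark.read.parquet("/data/in/customer_updates")

    key = "customer_id"
    # Keep dimension rows with no incoming replacement, then add the
    # incoming rows (both updates and brand-new keys). Nothing is deleted.
    unchanged = dim.join(incoming.select(key), on=key, how="left_anti")
    upserted = unchanged.unionByName(incoming)

    # Write to a staging table; Spark cannot overwrite a table it is reading.
    upserted.write.mode("overwrite").saveAsTable("warehouse.dim_customer_stage")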

Transformer - applies transformations to columns (sketched after this list)

  • Apply a UDF to all string columns
  • Single-column transformations
  • Drop multiple columns
  • Keep only selected columns
  • Outlier detection and handling
  • Missing-value imputation with mean, median, mode, or a constant
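
Plain-PySpark sketches of a few of these operations; the Transformer's actual method names are not shown and the columns are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/data/in/customers")

    # Apply a function to every string column (upper-casing here).
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.upper(F.col(name)))

    # Drop multiple columns, or keep only selected ones.
    df = df.drop("scratch_col", "tmp_col")
    df = df.select("customer_id", "name", "age")

    # Missing-value imputation with the mean of a numeric column.
    mean_age = df.select(F.avg("age")).first()[0]
    df = df.fillna({"age": mean_age})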

Quality Assurance

  • Report on all columns
  • Comparison of data between two DataFrames (sketched below)
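
A sketch of a per-column report and a DataFrame comparison in plain PySpark; the DataFrame names and paths are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    expected = spark.read.parquet("/data/out/customers_v1")
    actual = spark.read.parquet("/data/out/customers_v2")

    # Per-column summary (count, mean, stddev, min, max) for the report.
    actual.describe().show()

    # Rows present on one side but not the other count as differences.
    print("missing rows:", expected.subtract(actual).count())
    print("extra rows:", actual.subtract(expected).count())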

This library is a work in progress.