SparkETL

SparkETL is a library for performing ETL on Apache Spark using Python.
It provides tools to read data from various sources, apply transformations to the data, and then load the results to a destination.
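
As a minimal sketch of that read, transform, load flow in plain PySpark (the library's own entry points are not shown here; the paths and column names are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sparketl-sketch").getOrCreate()

    # Read: a pipe-delimited file with a header row.
    df = (spark.read
          .option("sep", "|")
          .option("header", "true")
          .csv("/data/in/customers.csv"))

    # Transform: trim every string column.
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.trim(F.col(name)))

    # Load: write the result out as Parquet.
    df.write.mode("overwrite").parquet("/data/out/customers")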

io

It can currently read the following (a sketch of equivalent plain-PySpark reads follows the list):

  • Delimited Files
  • JSON files
  • Fixed length record files
  • ZIP files containing delimited text files
  • Avro | dependency: databricks-avro JAR
  • Parquet
  • Hive Tables
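
These are rough plain-PySpark equivalents of those readers, not the library's own API; every path and table name is illustrative, ZIP extraction is omitted, and the Avro read assumes the databricks-avro JAR is on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Delimited file
    delim_df = (spark.read.option("sep", ",")
                .option("header", "true").csv("/data/in/sales.csv"))

    # JSON file (one object per line)
    json_df = spark.read.json("/data/in/events.json")

    # Fixed-length records: read as text, then slice fields by position.
    fixed_df = (spark.read.text("/data/in/fixed.dat")
                .select(F.col("value").substr(1, 10).alias("id"),
                        F.col("value").substr(11, 30).alias("name")))

    # Avro, via the Databricks package
    avro_df = (spark.read.format("com.databricks.spark.avro")
               .load("/data/in/events.avro"))

    # Parquet and Hive
    parquet_df = spark.read.parquet("/data/in/events.parquet")
    hive_df = spark.table("warehouse.events")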

It can currently write to:

  • Delimited Files
  • JSON files
  • Avro | dependency: databricks-avro JAR
  • ORC
  • Parquet
  • Hive Tables

Schema mismatches with existing Hive tables are handled out of the box when writing; the sketch below shows the kind of alignment involved.
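
This is a sketch of how such alignment could look, assuming the target table already exists; it mirrors the described behavior rather than quoting the library's code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.read.parquet("/data/out/customers")  # illustrative input

    target = spark.table("warehouse.customers")
    target_types = dict(target.dtypes)

    # Match the table's column order and types, filling missing columns
    # with NULL and dropping columns the table does not have.
    aligned = df.select([
        F.col(c).cast(target_types[c]) if c in df.columns
        else F.lit(None).cast(target_types[c]).alias(c)
        for c in target.columns
    ])
    aligned.write.insertInto("warehouse.customers")  # appends by default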

Transformation

CDC to capture changes in dimension tables (sketched after this list).

  • Performs upserts
  • Does not support deleting records
  • Behaves like SCD Type 1
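
A minimal SCD Type 1 upsert sketch with DataFrames, assuming the dimension and the incoming batch share a schema; the table, path, and key names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    dim = spark.table("warehouse.dim_customer")
    incoming = spark.read.parquet("/data/in/customer_updates")

    key = "customer_id"
    # Keep dimension rows with no incoming replacement, then add the
    # incoming rows (both updates and brand-new keys). Nothing is deleted.
    unchanged = dim.join(incoming.select(key), on=key, how="left_anti")
    upserted = unchanged.unionByName(incoming)

    # Write to a staging table; Spark cannot overwrite a table it is reading.
    upserted.write.mode("overwrite").saveAsTable("warehouse.dim_customer_stage")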

Transformer - applies transformations to columns (sketched after this list)

  • Apply a UDF to all string columns
  • Single-column transformations
  • Drop multiple columns
  • Keep only selected columns
  • Outlier detection and handling
  • Missing-value imputation with mean, median, mode, or a constant
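
Plain-PySpark sketches of a few of these operations; the Transformer's actual method names are not shown and the columns are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/data/in/customers")

    # Apply a function to every string column (upper-casing here).
    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.upper(F.col(name)))

    # Drop multiple columns, or keep only selected ones.
    df = df.drop("scratch_col", "tmp_col")
    df = df.select("customer_id", "name", "age")

    # Missing-value imputation with the mean of a numeric column.
    mean_age = df.select(F.avg("age")).first()[0]
    df = df.fillna({"age": mean_age})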

Quality Assurance

  • Report on all columns
  • Comparison of data between two DataFrames (sketched below)
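
A sketch of a per-column report and a DataFrame comparison in plain PySpark; the DataFrame names and paths are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    expected = spark.read.parquet("/data/out/customers_v1")
    actual = spark.read.parquet("/data/out/customers_v2")

    # Per-column summary (count, mean, stddev, min, max) for the report.
    actual.describe().show()

    # Rows present on one side but not the other count as differences.
    print("missing rows:", expected.subtract(actual).count())
    print("extra rows:", actual.subtract(expected).count())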

This library is a work in progress.