Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Here are 8,269 public repositories matching this topic...

broadinstitute / gatk

Official code repository for GATK versions 4 and up

science bioinformatics spark genomics genome ngs sequencing dna gatk

Updated May 17, 2024
Java

NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

big-data spark gpu rapids

Updated May 17, 2024
Scala

delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

big-data spark analytics acid delta-lake

Updated May 17, 2024
Scala

MauricioVazquezM / Spark_BigData_Architecture_Project

Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM

python spark time-series pyspark data-streaming data-stream-processing

Updated May 17, 2024

HsiehShuJeng / cdk-emrserverless-with-delta-lake

This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to check the outcome from the EMR Serverless application.

python java golang aws spark serverless dotnet javacript aws-cloudformation emr-notebooks delta-lake aws-service-catalog cdk-constructs projen emr-studio emr-serverless

Updated May 17, 2024
TypeScript

apache / spark

Apache Spark - A unified analytics engine for large-scale data processing

python java r scala sql big-data spark jdbc

Updated May 17, 2024
Scala

apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator

rust spark arrow datafusion

Updated May 17, 2024
Rust

kamu-data / kamu-cli

New generation decentralized data lake and a streaming data pipeline

data-science sql spark jupyter blockchain open-data data-management flink data-as-code datafusion kamu open-data-fabric

Updated May 16, 2024
Rust

ev2900 / Glue_Spark_History_Server

Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs

aws spark glue spark-history-server spark-ui

Updated May 16, 2024
Dockerfile

rxmi-bkd / olist-end-to-end-data-engineering-project

In this project, we focus on generating an orders fact table from the dataset provided by Olist in order to analyze the sales performance of the company.

python docker airflow spark

Updated May 16, 2024
Python

ytsaurus / ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

sql big-data spark clickhouse distributed-database lakehouse olap-database ytsaurus

Updated May 16, 2024
C++

flyteorg / flytekit

Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started and learn and highly extensible.

python data-science data automation sdk spark pypi extensible workflows hacktoberfest flyte mlops flyte-tasks

Updated May 16, 2024
Python

mage-ai / mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

python data-science data machine-learning sql spark pipeline etl pipelines orchestration artificial-intelligence data-engineering data-integration dbt elt transformation data-pipelines reverse-etl

Updated May 17, 2024
Python

apache / doris

Apache Doris is an easy-to-use, high-performance, unified analytics database.

bigquery real-time sql database spark hive hadoop etl snowflake olap query-engine redshift dbt elt iceberg hudi delta-lake lakehouse

Updated May 17, 2024
Java

LKochan123 / Eksploracja-danych

Academic course

data-mining spark

Updated May 16, 2024
Java

apache / zeppelin

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

javascript java scala database big-data spark nosql flink zeppelin

Updated May 16, 2024
Java

FranzDiebold / docker-datascience-ultimate

Customized Jupyter Spark Docker images with everything you need

python docker spark jupyter pyspark jupyterlab polars

Updated May 16, 2024
Dockerfile

awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

kubernetes spark terraform ml jupyterhub ray kubeflow aws-eks eks mlflow

Updated May 17, 2024
HCL

microsoft / SynapseML

Simple and Distributed Machine Learning

Updated May 16, 2024
Scala

big-data-team / big-data-course

Practice course on Big Data

big-data spark cassandra yarn hive nosql pyspark hdfs mapreduce

Updated May 16, 2024
Jupyter Notebook

Created by Matei Zaharia

Released May 26, 2014

Followers: 414
Repository: apache/spark
Website: spark.apache.org
