Official code repository for GATK versions 4 and up
Updated May 17, 2024 - Java
Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Final project for the course 'Architecture for Large Data Volumes', taught in the Bachelor's program in Data Science at ITAM
This construct builds some elements for you to quickly launch an EMR Serverless application. After submitting the EMR Serverless job, you can also launch an EMR notebook via a cluster template to check the outcome of the EMR Serverless application.
New generation decentralized data lake and a streaming data pipeline
Host a Docker container for the Spark history server / Spark UI of AWS Glue jobs
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Extensible Python SDK for developing Flyte tasks and workflows. Simple to learn and get started with, and highly extensible.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
Simple and Distributed Machine Learning
Created by Matei Zaharia
Released May 26, 2014