Core Concept
The goal of Rakam is to make it easy to create analytics services based on your needs. There are analytics SaaS providers that do a great job, but you often need to use several of them, and that has disadvantages. You have to share your data with several third-party applications and pay each of them separately, even though they usually have similar infrastructure despite using completely different technologies. Another common problem is that these services often specialize in one area (web analytics, mobile analytics, real-time analytics, customer analytics, etc.), so they may not solve your problem. We want to develop a modular and extensible analytics platform that you can use to create custom analytics solutions easily.
We provide various ways to collect your events with the Collection API. Currently you can use client libraries for various platforms, send events in JSON format, or write a module that consumes events from other data sources. If you wish, Rakam takes care of schema evolution: it automatically alters the schema at runtime when it encounters new fields. Depending on the deployment type, the event dataset is stored in one database (or several, if you wish), and the Analysis API uses SQL to analyze it. Almost all common analytics features, such as funnel and retention analyses, can be built this way. We also provide materialized query tables for caching and event flows and continuous query tables for real-time event processing.
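To make the schema evolution idea concrete, here is a minimal sketch of altering a table at runtime when an event carries a field that has not been seen before. It is not Rakam's implementation: it uses an in-memory SQLite database from the Python standard library as a stand-in, and the `store_event` function, the `SQL_TYPES` mapping, the `pageview` collection, and the field names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical mapping from JSON value types to SQL column types.
SQL_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "TEXT"}


def existing_columns(conn, table):
    """Return the set of column names currently defined on `table`."""
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}


def to_sql_value(value):
    """Pass scalars through; serialize nested values (lists, objects) as JSON text."""
    if value is None or type(value) in SQL_TYPES:
        return value
    return json.dumps(value)


def store_event(conn, collection, properties):
    """Insert one event, adding a column for every field not seen before.

    Identifiers are interpolated directly, so this sketch is only suitable
    for trusted field names.
    """
    if not properties:
        return
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{collection}" (_time TEXT)')
    known = existing_columns(conn, collection)
    for name, value in properties.items():
        if name not in known:
            # Unknown or nested types fall back to TEXT.
            sql_type = SQL_TYPES.get(type(value), "TEXT")
            conn.execute(f'ALTER TABLE "{collection}" ADD COLUMN "{name}" {sql_type}')
    columns = ", ".join(f'"{name}"' for name in properties)
    placeholders = ", ".join("?" for _ in properties)
    conn.execute(
        f'INSERT INTO "{collection}" ({columns}) VALUES ({placeholders})',
        [to_sql_value(value) for value in properties.values()],
    )


conn = sqlite3.connect(":memory:")
store_event(conn, "pageview", json.loads(
    '{"_time": "2015-01-01T00:00:00Z", "url": "/home"}'))
# The second event introduces a new field, so the table gains a column.
store_event(conn, "pageview", json.loads(
    '{"_time": "2015-01-01T00:00:05Z", "url": "/pricing", "duration_ms": 1200}'))
columns = existing_columns(conn, "pageview")
```

Events stored before the new field appeared simply hold NULL in the added column, so SQL queries over the whole collection keep working after the schema changes.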
Currently, we provide three solutions for different use cases: PostgreSQL; Kafka & PrestoDB & (distributed file system or S3); and Amazon Kinesis & S3 & Redshift. You can use one of these deployment types or extend Rakam by developing deployment modules for your specific needs. We're also evaluating other solutions such as Elasticsearch, InfluxDB and Pinot.
If your data volume is small enough to fit on a single node, we suggest the PostgreSQL deployment type because it is the easiest to set up and is feature-complete. You can also tune your PostgreSQL database by creating indexes or installing a columnar storage FDW extension such as cstore_fdw, so think twice before choosing one of the more complex deployment types. Setup Guide
The second deployment type is Kafka & PrestoDB & (distributed file system or S3). It consists of open-source components that you install in your own cluster and maintain yourself. If you work with big data and maintain your own cluster, this is the deployment type for you. Events are sent directly to Kafka, which acts as a distributed commit log. We then process the data in Kafka in small batches using PrestoDB, a distributed query executor. PrestoDB fetches the data from Kafka and saves it in columnar format in the distributed file system of your choice (Hadoop, etc.). You can then execute SQL queries on that dataset. Setup Guide
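The flow described above (append events to a commit log, read them back in small batches, store each batch column by column) can be illustrated with a toy, in-memory sketch. It does not use Kafka or PrestoDB; `CommitLog`, `to_columnar`, and `drain` are hypothetical names chosen for illustration only.

```python
class CommitLog:
    """A tiny in-memory stand-in for one append-only log partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records):
        """Return up to `max_records` records starting at `offset`."""
        return self._records[offset:offset + max_records]


def to_columnar(rows):
    """Pivot row-oriented events into one list of values per column."""
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}


def drain(log, start_offset, batch_size):
    """Read the log in small batches; return columnar batches and the next offset."""
    batches = []
    offset = start_offset
    while True:
        rows = log.read(offset, batch_size)
        if not rows:
            break
        batches.append(to_columnar(rows))
        offset += len(rows)
    return batches, offset


log = CommitLog()
for i in range(5):
    log.append({"user": f"u{i}", "value": i})

# Five events with a batch size of two yield three columnar batches.
batches, next_offset = drain(log, start_offset=0, batch_size=2)
```

Because the consumer remembers `next_offset`, a later call to `drain` picks up only the events appended since the previous run, which is what lets a log be processed incrementally in batches.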
The last deployment type we maintain is Amazon Kinesis & S3 & Redshift. All the components in this deployment type are scalable AWS products. Kinesis is much like Kafka: we push event data serialized with Avro, and worker nodes consume it in parallel. Currently, the worker nodes process the event data and push it to Amazon S3, and a data pipeline task periodically COPYs the data from S3 into Amazon Redshift. Redshift consumes that data in small batches, saves it in columnar format in its AWS cluster, and allows us to analyze it. Setup Guide
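As a rough illustration of parallel consumption, the sketch below routes events to shards and lets one worker per shard stage a batch. Kinesis assigns records to shards by the MD5 hash of their partition key; the modulo step here is a simplification of its hash-key ranges. No AWS API is called: the `user` partition key, the `staging/...` object names, and the `process_shard` worker are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def shard_for(partition_key, shard_count):
    """Map a partition key to a shard index via its MD5 hash (simplified)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


def route(events, shard_count):
    """Group events by shard, using the hypothetical `user` field as the key."""
    shards = [[] for _ in range(shard_count)]
    for event in events:
        shards[shard_for(event["user"], shard_count)].append(event)
    return shards


def process_shard(shard_id, records):
    """Hypothetical worker: stage one shard's records under a single object name."""
    return f"staging/shard-{shard_id}.batch", list(records)


shard_count = 4
events = [{"user": f"user-{i % 7}", "seq": i} for i in range(20)]
shards = route(events, shard_count)

# One worker per shard; each returns (object name, staged records).
with ThreadPoolExecutor(max_workers=shard_count) as pool:
    staged = dict(pool.map(process_shard, range(shard_count), shards))
```

Since a given key always hashes to the same shard, all events for one user are handled by the same worker and keep their relative order within that shard.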