ppatrzyk/wiki-events

wiki-events

A system that consumes data from the Wikipedia EventStreams service and exposes an analytics dashboard with insights into activity on different Wikipedia instances. See:

Project submitted for Data Engineering Zoomcamp 2024.

Architecture

Guiding principles of tech choices:

  • avoiding vendor lock-in,
  • preferring lightweight tools without redundant features,
  • configuration kept in repo.

This repo contains an example Terraform config for deployment on Hetzner Cloud, but in principle everything can run on any Linux server, hosted anywhere. No proprietary tools are used.

Components

| Component | Description |
|---|---|
| wiki_sse_reader | Python service that reads server-sent events from the Wikipedia source |
| wiki_dbt | Models (i.e. SQL code) for transforming data within the database |
| wiki_dash | BI dashboard app defined in Python (Dash) |
| RabbitMQ | Message queue that handles events |
| ClickHouse | Main OLAP database that stores data and runs analytics queries |
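
To illustrate the first step of the pipeline, here is a minimal sketch of how a server-sent events stream can be split into JSON payloads before they are handed to the message queue. It is not the actual wiki_sse_reader code: the function name `parse_sse` and the sample payload fields (`wiki`, `timestamp`) are assumptions made for illustration, and the network and RabbitMQ parts are left out so the example runs on its own.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per server-sent event.

    Hypothetical helper: it only handles the `data:` field and
    assumes every event carries a JSON document.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith(":"):
            # Comment line (often used as a keep-alive); ignore it.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are not needed in this sketch.


# Hand-written sample input; the field names are assumptions.
sample = [
    "event: message\n",
    'data: {"wiki": "enwiki", "timestamp": 0}\n',
    "\n",
    ": keep-alive\n",
    'data: {"wiki": "dewiki", "timestamp": 60}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In a real reader, `lines` would come from a streaming HTTP response rather than a list, and each yielded payload would be published to the queue instead of being collected in memory.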

Diagram

[Image: architecture diagram]

DB tables

| Table | Level (medallion architecture) | Description |
|---|---|---|
| wiki_raw | bronze | Raw ingested wiki data |
| wiki | silver | Parsed and filtered wiki data |
| wiki_minutely_summary | gold | Event count by minute (total) |
| wiki_hourly_summary | gold | Event count by hour (total) |
| wiki_minutely_bywiki_summary | gold | Event count by minute (by wiki) |
| wiki_hourly_bywiki_summary | gold | Event count by hour (by wiki) |
| wiki_weekdays_summary | gold | Average event count for specific times of the week |
| wiki_bywiki_summary | gold | Total event counts by wiki |
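
The gold tables are grouped counts over the silver data. In the project these are SQL models in wiki_dbt that run inside the database; the Python sketch below only illustrates the grouping logic behind the minutely summaries. The function names and the event fields `timestamp` (Unix seconds) and `wiki` are assumptions, not the project's actual schema.

```python
from collections import Counter
from datetime import datetime, timezone


def minute_bucket(ts: int) -> str:
    """Truncate a Unix timestamp (seconds) to its UTC minute."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M")


def summarize(events):
    """Return (total per minute, per minute and wiki) event counts."""
    total = Counter()
    by_wiki = Counter()
    for event in events:
        bucket = minute_bucket(event["timestamp"])
        total[bucket] += 1                     # like wiki_minutely_summary
        by_wiki[(bucket, event["wiki"])] += 1  # like wiki_minutely_bywiki_summary
    return total, by_wiki


# Hand-written sample rows standing in for the silver table.
rows = [
    {"timestamp": 0, "wiki": "enwiki"},
    {"timestamp": 30, "wiki": "dewiki"},
    {"timestamp": 60, "wiki": "enwiki"},
    {"timestamp": 90, "wiki": "enwiki"},
]
total, by_wiki = summarize(rows)
```

The hourly tables follow the same pattern with the bucket truncated to the hour instead of the minute.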
