The group works on the data engineering team at HealthGen, a leading company in genomics and personalized medicine research. Genomics, the in-depth study of an organism's complete genome, is a cornerstone of personalized medicine and advanced biomedical research. It makes it possible to analyze DNA and identify genetic variants and mutations that may be linked to various diseases, paving the way for treatments tailored to the patient's genetic profile.
With this in mind, this project uses Apache Kafka for real-time data streaming, the Databricks platform for collaborative analysis, and Apache Spark for distributed processing of big data. The system collects, processes, and analyzes the latest news about genomics and personalized medicine.
API: https://newsapi.org/v2/everything
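As a rough sketch of how a request to this endpoint might be assembled, the snippet below builds the query URL with the standard library only. The helper name `build_news_url`, the search terms, the default values, and the `YOUR_API_KEY` placeholder are assumptions for illustration; the parameters the project actually sends are not shown in this README.

```python
from urllib.parse import urlencode

NEWS_API_URL = "https://newsapi.org/v2/everything"


def build_news_url(query, api_key, language="en", page_size=20):
    """Return a NewsAPI 'everything' URL for the given search query.

    Hypothetical helper: the defaults here are illustrative only.
    """
    params = {
        "q": query,             # search terms
        "language": language,   # two-letter language code
        "pageSize": page_size,  # number of articles per page
        "apiKey": api_key,      # personal key issued by NewsAPI
    }
    return f"{NEWS_API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder, not a real key.
    print(build_news_url('genomics OR "personalized medicine"', "YOUR_API_KEY"))
```

The resulting URL can then be fetched with any HTTP client; keeping URL construction separate makes it easy to inspect the request before spending API quota.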
Before running the producer and consumer, a few setup steps must be completed in the project files.
1 - To install Kafka, run the first cell of the notebook:
%sh
sudo wget https://downloads.apache.org/kafka/3.5.1/kafka_2.12-3.5.1.tgz
2 - Extract the downloaded file
%sh
tar -xvf kafka_2.12-3.5.1.tgz
3 - Initialize Kafka
%sh
./kafka_2.12-3.5.1/bin/kafka-server-start.sh ./kafka_2.12-3.5.1/config/server.properties
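1 - Run zookeeper-server-start. Kafka must be installed before performing this step. ZooKeeper also has to be running for the Kafka server to start successfully, because the default server.properties connects to ZooKeeper on localhost:2181.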
%sh
./kafka_2.12-3.5.1/bin/zookeeper-server-start.sh ./kafka_2.12-3.5.1/config/zookeeper.properties
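Because both services run as long-lived processes, it can help to confirm that they are accepting connections before moving on. The following is a minimal sketch, assuming the default ports from the bundled configuration files (2181 for ZooKeeper, 9092 for Kafka); the helper name `is_port_open` is hypothetical and not part of the project.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Assumed defaults from zookeeper.properties and server.properties.
    for name, port in [("ZooKeeper", 2181), ("Kafka", 9092)]:
        state = "up" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that the process is listening; it does not guarantee the broker has finished starting up.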
1 - After the Kafka server and ZooKeeper are running, install kafka-python.
%pip install kafka-python
2 - Run the topic creation cell.
%sh
./kafka_2.12-3.5.1/bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic topic_news --partitions 1 --replication-factor 1
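Once the topic exists, a producer can start publishing to it. The sketch below shows one way this might look with kafka-python; it is not the project's actual producer. The helper names `encode_article` and `publish_articles` and the sample record are assumptions for illustration, while `topic_news` and `localhost:9092` come from the command above.

```python
import json


def encode_article(article):
    """Serialize an article dict to UTF-8 JSON bytes for Kafka."""
    return json.dumps(article, ensure_ascii=False).encode("utf-8")


def publish_articles(producer, articles, topic="topic_news"):
    """Send each article to the topic and return how many were sent."""
    count = 0
    for article in articles:
        producer.send(topic, value=article)
        count += 1
    producer.flush()  # block until buffered messages are delivered
    return count


if __name__ == "__main__":
    # Imported here so the helpers above can be used without a broker.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=encode_article,
    )
    # Illustrative record; real articles would come from the news API.
    sample = [{"title": "Example headline", "url": "https://example.com"}]
    print(publish_articles(producer, sample), "message(s) sent")
```

Passing the producer in as an argument keeps the publishing logic independent of a running broker, so it can be exercised with a stand-in object during development.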