This project extracts data from Twitter using the Twitter API and loads it into an AWS S3 bucket in CSV format. The pipeline is orchestrated with Apache Airflow.
🌟 For those who crave more, discover the Advanced Twitter Data Pipeline right here. 🌟
Data is collected from Twitter through the Twitter API and saved locally before being transferred to the landing bucket on AWS S3. The ETL jobs are written in Python and scheduled in Apache Airflow.
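To show how these pieces fit together, here is a minimal, hedged sketch of such an extract-and-load job using tweepy, pandas, and boto3. The function and variable names (`run_twitter_etl`, `bucket_name`) mirror the ones referenced later in this README, but the exact fields, client setup, and file paths are assumptions rather than a copy of `dags/twitter_etl.py`:

```python
# Hedged sketch of the extract-and-load job; run_twitter_etl and bucket_name
# mirror names used later in this README, everything else is illustrative.
import boto3
import pandas as pd
import tweepy

import config  # the config.py created in the setup steps below


def run_twitter_etl(handle: str = "elonmusk", bucket_name: str = "your-landing-bucket"):
    # Extract: pull recent tweets for one handle via the Twitter API v2.
    client = tweepy.Client(bearer_token=config.BearerToken)
    user = client.get_user(username=handle)
    tweets = client.get_users_tweets(
        user.data.id,
        max_results=100,
        tweet_fields=["created_at", "public_metrics"],
    )

    # Transform: flatten tweets into the columns shown in the example output.
    rows = [
        {
            "user": handle,
            "username": user.data.name,
            "text": t.text,
            "like_count": t.public_metrics["like_count"],
            "reply_count": t.public_metrics["reply_count"],
            "retweet_count": t.public_metrics["retweet_count"],
            "created_at": t.created_at,
        }
        for t in (tweets.data or [])
    ]
    df = pd.DataFrame(rows)

    # Load: save the CSV locally, then copy it to the S3 landing bucket.
    local_path = f"{handle}_tweets.csv"
    df.to_csv(local_path, index=False)
    s3 = boto3.client("s3", aws_access_key_id=config.KEY, aws_secret_access_key=config.SECRET)
    s3.upload_file(local_path, bucket_name, local_path)
```

Writing the CSV locally before uploading matches the flow described above and keeps the extraction step decoupled from S3 availability.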
I used a local machine with the following specifications:
- Ubuntu OS
- 4 vCores, 4 GiB memory
- 32 GiB storage
- A Twitter developer account and API key
- An AWS account and S3 bucket set up
- Python 3
- Airflow
- tweepy
- pandas
- boto3
- Clone the repository: `git clone https://github.com/kishlayjeet/Twitter-Data-Pipeline-using-Airflow-and-AWS-S3.git`
- Install the required packages: `pip install -r requirements.txt`
- Create a `config.py` file in the root directory of the project and add the following variables:

  ```python
  AccessKey = "[Your Twitter API key]"
  AccessSecret = "[Your Twitter API secret key]"
  BearerToken = "[Your Twitter API bearer token]"
  ConsumerKey = "[Your Twitter access token]"
  ConsumerSecret = "[Your Twitter access token secret]"
  KEY = "[Your AWS access key]"
  SECRET = "[Your AWS secret key]"
  ```
- Update the `airflow.cfg` file to point to your `config.py` file and set the appropriate S3 bucket name.
- Start the Airflow webserver: `airflow standalone`
- Access Airflow through your browser on the default `8080` port.
- In the Airflow UI, enable the `twitter_data_pipeline` DAG.
- The pipeline will run on the schedule defined in the DAG and load the data into the specified S3 bucket in CSV format (a hedged sketch of such a DAG definition follows these steps).
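For reference, a DAG definition along these lines would produce the behaviour described in these steps. The DAG id, task id, and seven-day schedule come from this README; the imports, default arguments, and start date are hedged assumptions rather than the repository's exact DAG file:

```python
# Hedged sketch of the DAG wiring; the repository's actual DAG file may differ.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from twitter_etl import run_twitter_etl  # the ETL function in dags/twitter_etl.py

default_args = {
    "owner": "airflow",
    "retries": 1,                          # retry once before marking the task failed
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="twitter_data_pipeline",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(days=7),   # the seven-day schedule mentioned below
    catchup=False,                         # assumption: don't backfill past intervals
) as dag:
    extract_and_load = PythonOperator(
        task_id="run_twitter_etl",
        python_callable=run_twitter_etl,
    )
```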
Example output:
| user | username | text | like_count | reply_count | retweet_count | created_at |
| --- | --- | --- | --- | --- | --- | --- |
| elonmusk | Elon Musk | Je me souviens Pluto 🥹 | 23794 | 4036 | 1802 | 2023-01-21 09:07:40+00:00 |
| elonmusk | Elon Musk | Twitter is the source of truth | 35278 | 2525 | 3756 | 2023-01-21 08:25:50+00:00 |
| elonmusk | Elon Musk | Notes Extremely important | 1116 | 74 | 75 | 2023-01-21 07:19:35+00:00 |
| elonmusk | Elon Musk | I’m reviewing it next week for possible release | 2134 | 166 | 165 | 2023-01-21 03:13:31+00:00 |
| elonmusk | Elon Musk | The debt trend is 🤯🤯 | 2255 | 314 | 193 | 2023-01-21 00:27:49+00:00 |
- The pipeline is currently set to extract data from a specific Twitter handle. You can update the `run_twitter_etl` function in `dags/twitter_etl.py` to extract data from other sources.
- The pipeline is currently set to load the data into a specific S3 bucket. You can update the `bucket_name` variable in `dags/twitter_etl.py` to load the data into other S3 buckets or in a different format.
- The pipeline is currently scheduled to run every seven days. You can update the `schedule_interval` in the DAG to run more or less frequently.
- You can also check the output dataframe for a specific task by using `xcom_pull` from the `task_instance` (see the example after this list).
- You can also check the S3 CSV file using the S3 bucket URL.
- You can also set up Airflow on other cloud platforms like GCP and Azure.
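Building on the `xcom_pull` tip above, a downstream task added inside the same DAG context could inspect whatever the ETL task returned. The `inspect_output` task is a hypothetical addition (only `run_twitter_etl` appears in this README), and pulling a full dataframe through XCom would require the return value to be serializable or XCom pickling to be enabled:

```python
# Hypothetical downstream task (added inside the DAG context sketched earlier)
# that pulls the ETL task's return value from XCom for inspection.
from airflow.operators.python import PythonOperator


def inspect_output(task_instance, **_):
    # Pull whatever run_twitter_etl returned, e.g. a row count or the CSV path;
    # returning a full dataframe would require a serializable/picklable XCom.
    result = task_instance.xcom_pull(task_ids="run_twitter_etl")
    print(f"run_twitter_etl returned: {result}")


inspect = PythonOperator(
    task_id="inspect_output",
    python_callable=inspect_output,
)
```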
- Airflow provides a UI for monitoring the status of tasks and DAG runs. In case of task failure, the UI displays the error message and the traceback, which can be used to troubleshoot the issue.
- Airflow also provides the option to retry failed tasks a certain number of times before marking them as failures. This can be configured in the DAG definition.
- In the `run_twitter_etl` function in `dags/twitter_etl.py`, you can include try-except blocks to catch and handle any errors that may occur while extracting data from the Twitter API (see the sketch after this list).
- In case of issues with the S3 bucket, such as access denied or invalid credentials, check that the `config.py` file contains the correct AWS access key and secret key, and that the S3 bucket name is correctly configured in the `twitter_etl.py` file.
- Logs for the pipeline can also be found in the `logs/` directory. These logs can be useful in troubleshooting issues with the pipeline.
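To illustrate the two points above (try/except around the Twitter API calls, and task-level retries), a hedged sketch might look like the following. The exception class assumes tweepy 4.x, and the retry values are illustrative rather than the repository's settings:

```python
# Hedged sketch: try/except around the Twitter extraction plus Airflow-level
# retries; exception class assumes tweepy 4.x, retry values are illustrative.
import logging
from datetime import timedelta

import tweepy

log = logging.getLogger(__name__)


def run_twitter_etl(**kwargs):
    try:
        # ... the extraction and upload logic sketched earlier ...
        pass
    except tweepy.TweepyException as exc:
        # Log the API error for the Airflow task log, then re-raise so the
        # task is marked failed and retried according to default_args.
        log.error("Twitter API call failed: %s", exc)
        raise


# Task-level retries are configured through the DAG's default_args:
default_args = {
    "retries": 2,                         # re-run a failed task up to two more times
    "retry_delay": timedelta(minutes=10),
}
```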
In conclusion, this project provides a detailed guide on how to build a data pipeline using Airflow to extract data from the Twitter API, process it using pandas, and load it into an AWS S3 bucket in CSV format. By following the step-by-step instructions in this README, you should be able to set up and run the pipeline successfully.
It is important to note that this project is intended for educational or testing purposes. Before using the data in production, it is crucial to comply with Twitter's terms of service and any other relevant laws and regulations. It is also important to consider data privacy, security, and compliance in any production implementation.
This project provides a solid foundation that can be further enhanced by adding more functionalities such as data cleaning, transformation, and analysis. It can also be integrated with other systems and services, such as data warehousing, machine learning, and visualization tools, to enable more advanced use cases.
Thank you for using this data pipeline. If you have any questions, suggestions, or feedback, please don't hesitate to contact me at [email protected].