Technology Trends Statistician is your go-to tool for real-time insights into the ever-changing technology landscape, combining web scraping and data analysis to track and analyze the latest trends in development job descriptions.
- Scraping jobs from Djinni by several specialization categories (e.g. Python, Java, DevOps, etc.).
- Mongo client singleton.
- Ability to work with local and cloud MongoDB, as well as with regular CSV files.
- Using Pydantic models instead of standard items for better data validation.
- Database templates to simplify connection to MongoDB.
- Two pipelines (Mongo and CSV).
- CSV pipeline that covers the entire ETL process.
- Data Wrangling. Clean up text and extract technology statistics.
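To picture what the "Mongo client singleton" item in the list above means, here is a minimal sketch of the pattern using only the standard library. The names `SingletonMeta` and `MongoClientHolder` are hypothetical and are not taken from the project; a real holder would presumably create a `pymongo.MongoClient` where this sketch only stores the URI.

```python
import threading


class SingletonMeta(type):
    """Metaclass that hands back one shared instance per class."""

    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        # Guard creation so concurrent callers cannot build two instances.
        with cls._lock:
            if cls not in cls._instances:
                cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class MongoClientHolder(metaclass=SingletonMeta):
    """Hypothetical wrapper; the URI stands in for a real connection."""

    def __init__(self, uri="mongodb://localhost:27017"):
        self.uri = uri


first = MongoClientHolder()
# Arguments of later calls are ignored because the instance already exists.
second = MongoClientHolder("mongodb://ignored:27017")
print(first is second)
print(second.uri)
```

The point of the pattern is that every pipeline or script asking for the client shares one connection instead of opening its own.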
NOTE: Python version >3.8 is required.
Clone the repository:
git clone https://github.com/AndriyKy/tech-trend-stat.git
cd tech-trend-stat
Create a virtual environment, install dependencies and set the PYTHONPATH environment variable:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export PYTHONPATH="$(pwd):$(pwd)/techtrendanalysis"
Copy the file .env.copy to .env and set the appropriate variables (needed only when working with MongoDB).
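As a rough illustration of how settings from .env might be consumed, the sketch below reads connection details from environment variables and falls back to a local instance. The variable names `MONGODB_URI` and `MONGODB_DATABASE`, the default database name and the `mongo_settings` helper are all assumptions made for this example; the real names are the ones listed in .env.copy.

```python
import os


def mongo_settings(env=os.environ):
    """Pick connection settings from environment variables.

    MONGODB_URI and MONGODB_DATABASE are hypothetical names used for
    illustration; use the names that actually appear in .env.copy.
    """
    return {
        "uri": env.get("MONGODB_URI", "mongodb://localhost:27017"),
        "database": env.get("MONGODB_DATABASE", "techtrend"),
    }


# With nothing set, the sketch falls back to a local MongoDB instance.
local = mongo_settings({})
# A cloud cluster would typically be addressed with a mongodb+srv:// URI.
cloud = mongo_settings({"MONGODB_URI": "mongodb+srv://user:pass@example.mongodb.net"})
print(local["uri"])
print(cloud["uri"])
```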
If you decide to work with MongoDB, here is a tutorial on how to install it locally in a Docker container.
Here are also the instructions on how to create a cluster in the cloud.
Once the database is set up, run the following command to scrape the vacancies using the Scrapy spider along with the Mongo pipeline:
scrapy crawl djinni -a categories="Python"
You can replace "Python" with any other category, or with several categories separated by " | ". See the available specializations (categories) on the Djinni website.
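For example, categories="Python | DevOps" would request two categories at once. How the spider splits this argument internally is not shown here; the sketch below is one plausible way to do it, with `parse_categories` being a hypothetical helper rather than a function from the project.

```python
def parse_categories(raw):
    """Split a 'Python | Java' style argument into a clean list of names."""
    # Strip whitespace around each name and drop empty fragments.
    return [part.strip() for part in raw.split("|") if part.strip()]


categories = parse_categories("Python | Java | DevOps")
print(categories)  # ['Python', 'Java', 'DevOps']
```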
To extract statistics from job descriptions, run the wrangler file, passing the desired category name.
If you can't install MongoDB, just run the crawler script. It will scrape jobs in the category you passed and save them to the appropriate CSV file. After that, it will pull job descriptions from the generated file, extract the technology stack and write it to another CSV file.
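The second half of that flow (read descriptions from one CSV, extract technologies, write statistics to another CSV) can be sketched with the standard library alone. This is a simplified illustration, not the project's code: the `TECHNOLOGIES` list, the `description` column name, the output header and both function names are assumptions, and in-memory files stand in for the real CSV files.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical keyword list; the real project may detect technologies differently.
TECHNOLOGIES = ["python", "django", "docker", "postgresql"]


def extract_stats(jobs_csv, column="description"):
    """Count in how many job descriptions each technology appears."""
    counts = Counter()
    for row in csv.DictReader(jobs_csv):
        # Lowercase the text and split it into word-like tokens.
        tokens = set(re.findall(r"[a-z0-9+#]+", row[column].lower()))
        counts.update(tech for tech in TECHNOLOGIES if tech in tokens)
    return counts


def write_stats(counts, out_file):
    """Write the counts to a CSV file, most frequent technology first."""
    writer = csv.writer(out_file)
    writer.writerow(["technology", "vacancies"])
    for tech, n in counts.most_common():
        writer.writerow([tech, n])


# In-memory stand-ins for the scraped jobs file and the statistics file.
jobs = io.StringIO(
    "title,description\n"
    "Backend dev,Python and Django with Docker\n"
    'Data engineer,"Python, PostgreSQL, Docker."\n'
)
stats = extract_stats(jobs)
out = io.StringIO()
write_stats(stats, out)
print(out.getvalue())
```

Counting each technology once per description, rather than once per mention, keeps a single verbose vacancy from skewing the totals.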
To see the visualization of the extracted statistics, head over to the analysis file and follow the instructions given there.