Program to scrape and store jobs posted in the Netherlands on www.indeed.nl.
It collects the following information from the website:
- original id generated by Indeed;
- job title (`job_title`);
- posting date (`job_date`);
- location (`job_loc`);
- short description (`job_summary`);
- salary (or salary range) in a list format (`job_salary`);
- url of the job (`job_url`);
- company name (`company_name`);
- company type: recruiter or direct employer (`company_type`).
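As a rough illustration, one scraped job record could look like the dictionary below. The field names follow the list above; the values and the exact container type are hypothetical, not taken from the project's code:

```python
# Hypothetical example of a single scraped job record.
# Field names match the README; all values are made up.
job = {
    "job_id": "a1b2c3d4e5f6",        # original id generated by Indeed
    "job_title": "Financial Auditor",
    "job_date": "2022-06-15",
    "job_loc": "Amsterdam",
    "job_summary": "Audit financial statements for clients...",
    "job_salary": [3500, 4500],      # salary or salary range as a list
    "job_url": "https://www.indeed.nl/viewjob?jk=a1b2c3d4e5f6",
    "company_name": "Example B.V.",
    "company_type": "employer",      # 'employer' or 'recruiter'
}
```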
- Install all required packages from `requirements.txt`:

  ```
  $ pip install -r requirements.txt
  ```
- Save the credentials for your PostgreSQL database in the `.env` file:

  ```
  POSTGRESQL_USER = '<username>'
  POSTGRESQL_PASSWORD = '<password>'
  POSTGRESQL_HOST = '<host>'
  POSTGRESQL_PORT = '<port>'
  POSTGRESQL_NAME = '<db name>'
  ```
- Set up the tables in the database.
  - First option: run `db_scheme.py`:

    ```
    $ python3 db_scheme.py
    ```
  - Second option: create the tables manually in the PostgreSQL admin tool using the SQL scripts from **`db_scheme.sql`**. Then add the data to the *`cities`* table from **`/data/cities_nl.csv`**.
- Assign the search parameters in `parameters.py`:
  - `positions` should be a list of strings with all position names or keywords to search for. Even if there is only one word, keep it in a list: `positions = ["auditor"]`.
  - `company_types` is `["employer", "recruiter"]` by default. It differentiates the companies that posted the vacancies; you can also choose only one of the two types.
  - `education_level` has two options: `'master'` for positions requiring a master's degree, or `'any'` for all positions.
  - `red_flags` is a list of keywords. It does not affect the scraping itself, but adds an extra 'qualified' / 'not qualified' attribute to each found job. The principle: if a keyword appears in the job title or job description, the job is marked as 'not qualified'. It can also be an empty list: `red_flags = []`.
- Run `app.py`:

  ```
  $ python3 app.py
  ```
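A minimal sketch of how the `.env` credentials above might be turned into a SQLAlchemy connection URL. The project reads the `.env` file with python-dotenv; the `build_db_url` helper here is an illustrative assumption, not the project's actual code:

```python
import os

def build_db_url() -> str:
    """Assemble a PostgreSQL URL (usable with SQLAlchemy's create_engine)
    from the environment variables defined in the .env file."""
    return "postgresql://{user}:{pw}@{host}:{port}/{name}".format(
        user=os.environ["POSTGRESQL_USER"],
        pw=os.environ["POSTGRESQL_PASSWORD"],
        host=os.environ["POSTGRESQL_HOST"],
        port=os.environ["POSTGRESQL_PORT"],
        name=os.environ["POSTGRESQL_NAME"],
    )
```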
- Scrapes jobs by the key parameters: search keywords, company type (direct employer or agency) and education level.
- Cleans and formats the data.
- Checks the quality of the search results against the `red_flags` keywords: each found position is marked 'qualified' / 'not qualified' depending on whether a red flag appears in the job title or job summary.
- Matches each found job with a city in the Netherlands, which gives a precise geolocation (city name, province, longitude and latitude).
- Each scraping session saves the results as a CSV data dump in the `data_dumps/` folder.
- Each data dump is saved into the PostgreSQL database, first excluding records that already exist in the database.
- Each step of the scraping is logged into `log.txt`, with the outcomes also printed to the console.
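The red-flag qualification rule described above could be implemented roughly as follows; the function name and signature are assumptions for illustration, not the project's actual API:

```python
def qualify(job_title: str, job_summary: str, red_flags: list) -> str:
    """Return 'not qualified' if any red-flag keyword occurs in the
    job title or summary (case-insensitive), else 'qualified'."""
    text = f"{job_title} {job_summary}".lower()
    if any(flag.lower() in text for flag in red_flags):
        return "not qualified"
    return "qualified"
```

With `red_flags = []`, every job is marked 'qualified', which matches the empty-list option described in the parameters section.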
- `app.py` - entry point
- `main.py` - the main workflow of the program
- `indeed_nl_scraper.py` - scraping functionality module
- `dumping.py` - data cleaning / formatting module + saving data dumps
- `logger.py` - logging functionality
- `database` - connection and communication with the PostgreSQL database
- `.env` - PostgreSQL database credentials
- `parameters.py` - keeps the scraping parameters in a separate module for easy access

Additional:
- `db_scheme.py` or `db_scheme.sql` - for the initial database setup
- `requirements.txt` - required Python packages
- Python 3
- PostgreSQL engine

Packages:
- pandas 1.4.2
- requests 2.28.0
- beautifulsoup4 4.11.1
- python-dotenv 0.20.0
- SQLAlchemy 1.4.37