Program to scrape and store jobs posted in the Netherlands on www.indeed.nl.
It collects the following information from the website:
- original id generated by Indeed;
- job title (`job_title`);
- posting date (`job_date`);
- location (`job_loc`);
- short description (`job_summary`);
- salary (or salary range) in a list format (`job_salary`);
- url of the job (`job_url`);
- company name (`company_name`);
- company type: recruiter or direct employer (`company_type`).
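As a rough illustration, one scraped job record could look like the dictionary below. The field names follow the list above; the values and the exact container type are hypothetical, not taken from the project's code:

```python
# Hypothetical example of a single scraped job record.
# Field names match the README; all values are made up.
job = {
    "job_id": "a1b2c3d4e5f6",        # original id generated by Indeed
    "job_title": "Financial Auditor",
    "job_date": "2022-06-15",
    "job_loc": "Amsterdam",
    "job_summary": "Audit financial statements for clients...",
    "job_salary": [3500, 4500],      # salary or salary range as a list
    "job_url": "https://www.indeed.nl/viewjob?jk=a1b2c3d4e5f6",
    "company_name": "Example B.V.",
    "company_type": "employer",      # 'employer' or 'recruiter'
}
```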
- Install all required packages from `requirements.txt`:

  ```
  $ pip install -r requirements.txt
  ```
- Save the credentials for your PostgreSQL database in the `.env` file:

  ```
  POSTGRESQL_USER = '<username>'
  POSTGRESQL_PASSWORD = '<password>'
  POSTGRESQL_HOST = '<host>'
  POSTGRESQL_PORT = '<port>'
  POSTGRESQL_NAME = '<db name>'
  ```
- Set up the tables in the database.
  - First option: run `db_scheme.py`:

    ```
    $ python3 db_scheme.py
    ```
  - Second option: create the tables manually in the PostgreSQL admin tool using the SQL scripts from **`db_scheme.sql`**. Then add the data to the *`cities`* table from **`/data/cities_nl.csv`**.
- Assign the search parameters in `parameters.py`:
  - `positions` should be a list of strings with all position names or keywords to search for. Even if there is only one word, keep it in a list: `positions = ["auditor"]`.
  - `company_types` is `["employer", "recruiter"]` by default. It differentiates the companies that posted the vacancies; you can also choose only one of the two types.
  - `education_level` has two options: `'master'` for positions requiring a master's degree, or `'any'` for all positions.
  - `red_flags` is a list of keywords. It does not affect the scraping itself, but adds an extra 'qualified' / 'not qualified' attribute to each found job. The principle: if a keyword appears in the job title or job description, the job is marked as 'not qualified'. It can also be an empty list: `red_flags = []`.
- Run `app.py`:

  ```
  $ python3 app.py
  ```
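A minimal sketch of how the `.env` credentials above might be turned into a SQLAlchemy connection URL. The project reads the `.env` file with python-dotenv; the `build_db_url` helper here is an illustrative assumption, not the project's actual code:

```python
import os

def build_db_url() -> str:
    """Assemble a PostgreSQL URL (usable with SQLAlchemy's create_engine)
    from the environment variables defined in the .env file."""
    return "postgresql://{user}:{pw}@{host}:{port}/{name}".format(
        user=os.environ["POSTGRESQL_USER"],
        pw=os.environ["POSTGRESQL_PASSWORD"],
        host=os.environ["POSTGRESQL_HOST"],
        port=os.environ["POSTGRESQL_PORT"],
        name=os.environ["POSTGRESQL_NAME"],
    )
```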
- Scrapes jobs by the key parameters: search keywords, company type (direct employer or agency) and education level.
- Cleans and formats the data.
- Checks the quality of the search results against the `red_flags` keywords: each found position is marked 'qualified' / 'not qualified' depending on whether a red flag appears in the job title or job summary.
- Matches each found job with a city in the Netherlands, which gives a precise geolocation (city name, province, longitude and latitude).
- Each scraping session saves the results as a CSV data dump in the `data_dumps/` folder.
- Each data dump is saved into the PostgreSQL database, first excluding records that already exist in the database.
- Each step of the scraping is logged into `log.txt`, with the outcomes also printed to the console.
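The red-flag qualification rule described above could be implemented roughly as follows; the function name and signature are assumptions for illustration, not the project's actual API:

```python
def qualify(job_title: str, job_summary: str, red_flags: list) -> str:
    """Return 'not qualified' if any red-flag keyword occurs in the
    job title or summary (case-insensitive), else 'qualified'."""
    text = f"{job_title} {job_summary}".lower()
    if any(flag.lower() in text for flag in red_flags):
        return "not qualified"
    return "qualified"
```

With `red_flags = []`, every job is marked 'qualified', which matches the empty-list option described in the parameters section.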
- `app.py` - entry point
- `main.py` - the main workflow of the program
- `indeed_nl_scraper.py` - scraping functionality module
- `dumping.py` - data cleaning / formatting module + saving data dumps
- `logger.py` - logging functionality
- `database` - connection and communication with the PostgreSQL database
- `.env` - PostgreSQL database credentials
- `parameters.py` - keeps the scraping parameters in a separate module for easy access

Additional:
- `db_scheme.py` or `db_scheme.sql` - for the initial database setup
- `requirements.txt` - required Python packages
- Python 3
- PostgreSQL engine

Packages:
- pandas 1.4.2
- requests 2.28.0
- beautifulsoup4 4.11.1
- python-dotenv 0.20.0
- SQLAlchemy 1.4.37