Automated Python script leveraging Pyppeteer to scrape Indeed for salary data, including aggregated salary statistics and details on top companies for specified job titles and locations.

noahminds/indeed_salary_scraper

Indeed.com Salary Scraper

Overview

This project scrapes salary data from Indeed, including average, low, and high base salaries, along with detailed company information for specified job titles and locations. It is written in Python and uses asyncio with Pyppeteer to drive the browser. Queries are currently processed one at a time (see the Concurrency note below).

Demo

Indeed.Salary.Scraper.Demo.mov

Features

  • Dynamic Search Query Input: Automatically extracts job titles and locations to search for from a provided CSV file (searches.csv), allowing for easy customization and batch processing of multiple queries.
  • Aggregated Base Salary Scraping: Retrieves aggregated salary information including the low, average, and high base salaries for specified job titles and locations from Indeed, offering insights into salary expectations across different roles and areas.
  • Top Company Information Extraction: Gathers detailed information on top companies for each job title and location, including company name, aggregate rating, average salary, number of reviews, and salaries reported, helping identify leading employers in that role and area.
  • CSV Data Export: Outputs scraped data into CSV files for each search query, facilitating easy access, analysis, and storage of the results. Two CSV files are generated: one for aggregated base salaries (base_salary.csv) and another for top companies information (top_companies.csv).
  • Error Handling and Logging: Incorporates robust error handling to manage and log invalid searches and unexpected issues, ensuring the scraper continues to process valid queries without crashing.
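
The log-and-continue behaviour described in the last feature can be sketched as follows. This is a minimal illustration, not the code in indeed_scraper.py: the names `InvalidSearchError`, `run_all`, and `scrape_one` are hypothetical, and `scrape_one` stands in for whatever coroutine performs a single Indeed lookup.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, List, Tuple

logger = logging.getLogger("indeed_scraper")


class InvalidSearchError(Exception):
    """Hypothetical error raised when a query yields no salary page."""


async def run_all(
    searches: List[Tuple[str, str]],
    scrape_one: Callable[[str, str], Awaitable[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Scrape each (job title, location) pair in order, skipping failures."""
    results = []
    for title, location in searches:
        try:
            results.append(await scrape_one(title, location))
        except InvalidSearchError as exc:
            # Expected failure: report it and move on to the next query.
            logger.warning("Invalid search %r in %r: %s", title, location, exc)
        except Exception:
            # Unexpected failure: log the traceback, but keep the run alive.
            logger.exception("Unexpected error for %r in %r", title, location)
    return results
```

Catching the expected error separately keeps the console output short for ordinary invalid searches while still recording a full traceback for anything unforeseen.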

Requirements

  • Python 3.7+
  • asyncio (built-in)
  • Pyppeteer
  • csv module (built-in)

Setup

  1. Install Dependencies: Make sure Python 3.7 or later is installed. Pyppeteer is the only third-party dependency, and you can install it with pip:
pip install pyppeteer
  2. Prepare Input File: Modify or create a CSV file named searches.csv with the job titles and locations you want to search for, formatted as "Job Title,Location". Ensure the CSV file is saved in the "Comma Separated Values (.csv)" format (not CSV UTF-8) to avoid encoding issues.
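
As an illustration of the input format, here is one way a "Job Title,Location" file could be read. It is a sketch under stated assumptions, not necessarily how indeed_scraper.py parses the file: the function name `read_searches` is hypothetical, the split on the first comma is an assumption made so that a location such as "New York, NY" stays intact, and the file content is assumed to be ASCII or UTF-8.

```python
from typing import List, Tuple


def read_searches(path: str = "searches.csv") -> List[Tuple[str, str]]:
    """Return (job title, location) pairs, one per non-empty line."""
    searches = []
    # "utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if
    # one is present, which "CSV UTF-8" exports can prepend.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Assumption: everything after the first comma is the location.
            title, sep, location = line.partition(",")
            title, location = title.strip(), location.strip()
            if not sep or not title or not location:
                continue  # skip lines without both parts
            searches.append((title, location))
    return searches
```

With this approach a line such as `Software Engineer,New York, NY` yields the pair `("Software Engineer", "New York, NY")`.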

Usage

  1. Run the Scraper: Execute the script from the command line.
python indeed_scraper.py
  2. Output: The script will create or append to two CSV files:
  • base_salary.csv: Contains aggregated salary information for each job title and location specified by the user in searches.csv.
  • top_companies.csv: Contains information on the top companies for each job title and location specified by the user in searches.csv.
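
The "create or append" behaviour can be sketched with the built-in csv module. This is an illustrative example only: the helper name `append_rows` and the column names in the usage note are hypothetical, and the real script may structure its output differently.

```python
import csv
import os
from typing import Dict, List


def append_rows(path: str, fieldnames: List[str], rows: List[Dict[str, str]]) -> None:
    """Append rows to a CSV file, writing the header only for a new file."""
    # A missing or empty file still needs its header row.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

For example, `append_rows("base_salary.csv", ["job_title", "location", "low", "average", "high"], rows)` could be called once per search. Repeated runs then extend the same file without duplicating the header.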

Notes

  • Headless Mode: The script is designed to support headless operation, which would reduce resource usage, but it has not yet run successfully in headless mode. For now, run the scraper with the headless option set to False. This opens a visible browser window, which uses more resources but lets you monitor and debug the scraping process in real time. I am working on headless mode compatibility and hope to resolve this limitation in a future update.

  • CSV File Formats: It's crucial to ensure that the input CSV file (searches.csv) is saved in the correct format. For optimal compatibility, especially on macOS, the file should be saved as "Comma Separated Values (.csv)" and not "CSV UTF-8". Using the correct format ensures the scraper correctly interprets job titles and locations without encountering encoding issues or unexpected characters.

    Example Input Format

    Your searches.csv should list job titles and locations, separated by a comma, with each query on a new line like so:

    Software Engineer,New York, NY
    Data Scientist,San Francisco, CA
    Product Manager,Boston, MA
    

    This format allows the scraper to accurately process each job title and location pair.

  • Invalid Searches: The scraper logs invalid searches to the console and continues processing valid queries. If you encounter an invalid search, check the console output for details and try reformulating the search query in searches.csv to ensure it adheres to the expected format. Highly specific job titles may result in invalid searches, so consider reformulating to broader terms or alternative job titles if issues arise.

  • Location Limitation: The default configuration of the scraper is set to scrape salary data from the US version of Indeed. Therefore, it can only return results for job titles and locations within the United States. If you wish to scrape salary data from other countries, please refer to the note on "Alternative Geographies" for instructions on how to modify the scraper to work with country-specific versions of Indeed.

  • Alternative Geographies: The scraper can also be used to perform searches on versions of Indeed for other countries. However, caution should be exercised when doing so, as this feature is still in testing and may be prone to errors. To perform searches outside of US geographies, replace the page URL in indeed_scraper.py with the country-specific version of the Indeed 'Find Salary [or equivalent]' webpage. Please note that data availability may vary by region, and certain information such as low and high base salaries may not be available in all areas.

  • Concurrency: The current version of the scraper processes search queries one at a time. Future updates may run several queries concurrently to improve speed when handling many searches.
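
For readers curious about what concurrent processing might look like, here is a minimal sketch using only asyncio. It is not part of the current scraper: the names `scrape_concurrently`, `scrape_one`, and `max_concurrent` are hypothetical, and `scrape_one` stands in for a coroutine that performs a single lookup.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def scrape_concurrently(
    searches: List[Tuple[str, str]],
    scrape_one: Callable[[str, str], Awaitable[object]],
    max_concurrent: int = 3,
) -> List[object]:
    """Run scrape_one for every search, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(title: str, location: str) -> object:
        # The semaphore caps how many lookups are in flight at once.
        async with semaphore:
            return await scrape_one(title, location)

    # return_exceptions=True puts a failed search's exception in the
    # result list instead of aborting the whole batch.
    return await asyncio.gather(
        *(guarded(title, location) for title, location in searches),
        return_exceptions=True,
    )
```

Results come back in the same order as the input searches. Keeping `max_concurrent` small is a deliberate choice here, since many simultaneous browser pages use more memory and put more load on the site.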

Legal and Ethical Considerations

Understanding the legal and ethical implications of web scraping is paramount. This script is a powerful tool for quickly gathering salary information across different job titles and locations. It's ideal for job seekers, researchers, or anyone interested in labor market trends. Remember to use it responsibly and consider Indeed's terms of service regarding automated access and data usage.
