This repository scrapes exam questions and their answers from www.praktycznyegzamin.pl using gocolly.
- Go programming language installed on your machine.
- Access to the internet to download dependencies.
- Clone or download the Go web scraper repository.
- Navigate to the root directory of the project.
Run the following command to download the necessary dependencies:
make init
or
go mod download
You can run the web scraper with the following command:
make run
- Description: Defines the output directory where the scraped data will be stored.
- Example:
OUTPUTFILE=_out
- Description: Specifies the URL of the website that will be scraped.
- Example:
URL=https://www.praktycznyegzamin.pl/inf04/teoria/wszystko/
- Description: Determines whether the scraper should remove title prefixes during scraping.
- true: Title prefixes will be removed.
- false: Title prefixes will not be removed.
- Example:
REMOVETITLEPREFIX=true
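As a rough illustration of what removing a title prefix could look like, the sketch below strips a leading question number such as "12. " from a title. The prefix format, the regular expression, and the function name stripTitlePrefix are assumptions made for this example; the scraper's actual logic may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// titlePrefix matches an assumed leading question number followed by a dot
// and whitespace, for example "12. ". The real prefix format may differ.
var titlePrefix = regexp.MustCompile(`^\s*\d+\.\s+`)

// stripTitlePrefix (hypothetical helper) returns the title without its
// numeric prefix when remove is true, and the title unchanged otherwise.
func stripTitlePrefix(title string, remove bool) string {
	if !remove {
		return title
	}
	return titlePrefix.ReplaceAllString(title, "")
}

func main() {
	fmt.Println(stripTitlePrefix("12. What does HTTP stand for?", true))
	fmt.Println(stripTitlePrefix("12. What does HTTP stand for?", false))
}
```

Requiring whitespace after the dot keeps titles that merely start with a decimal number (such as "3.14 ...") intact.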
- Description: Controls whether the scraper should remove answer prefixes during scraping.
- true: Answer prefixes will be removed.
- false: Answer prefixes will not be removed.
- Example:
REMOVEANSWERPREFIX=true
The scraper will generate the following structure in the specified output directory:
_out
├── images
│   ├── <image1>.jpg
│   ├── <image2>.jpg
│   └── ...
├── questions.json
└── videos
    ├── <video1>.mp4
    ├── <video2>.mp4
    └── ...
The questions.json file contains a list of Question entries, each with a title, its answers, and sometimes an image or video for additional context.
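A consumer of questions.json might decode it roughly as follows. The exact schema is not documented here, so the JSON field names (title, answers, image, video), the Question struct layout, and the parseQuestions helper are all assumptions; adjust them to match the real file.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Question is an assumed shape for one entry in questions.json.
// The field names and JSON keys are hypothetical.
type Question struct {
	Title   string   `json:"title"`
	Answers []string `json:"answers"`
	Image   string   `json:"image,omitempty"` // optional, assumed relative path
	Video   string   `json:"video,omitempty"` // optional, assumed relative path
}

// parseQuestions decodes a JSON array of questions.
func parseQuestions(data []byte) ([]Question, error) {
	var questions []Question
	if err := json.Unmarshal(data, &questions); err != nil {
		return nil, fmt.Errorf("decode questions: %w", err)
	}
	return questions, nil
}

func main() {
	// Made-up sample data, not real scraper output.
	sample := []byte(`[{"title":"Sample question?","answers":["A","B","C","D"],"image":"images/sample.jpg"}]`)
	questions, err := parseQuestions(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, q := range questions {
		fmt.Printf("%s (%d answers)\n", q.Title, len(q.Answers))
	}
}
```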