Web Crawler for Tunisian Government Websites

This repository contains a web crawler that extracts sub-URLs from a list of 20 Tunisian government websites and then scrapes their contents to build a small dataset of web pages that could be used for natural language processing tasks.

Run Locally

If you want to re-run the crawling process, follow these steps:

Install the required Python packages using pip:

  pip install -r requirements.txt

Update the urls.txt file with a list of Tunisian government website URLs you wish to crawl.
Run crawler.py to start the web crawler.

  python crawler.py

The URLs that were successfully extracted from the websites can be found in the "/data/urls.json" file. The contents of each URL are stored in the "/data/content.json" file.
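As an illustration of the sub-URL extraction step, here is a minimal sketch that uses only the Python standard library. It is not the code in crawler.py: the names `LinkCollector` and `extract_sub_urls` are hypothetical, and the rule of keeping only links on the same host as the starting page is an assumption about what counts as a "sub URL".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_sub_urls(base_url, html):
    """Return sorted, de-duplicated same-host URLs linked from `html`.

    Assumption: a "sub URL" is any http(s) link on the same host as
    `base_url`. Fragments (#...) are dropped so that anchors on the
    same page do not produce duplicate entries.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL, then strip the fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            found.add(absolute)
    return sorted(found)
```

The function takes the page HTML as a string, so it can be tried on a saved page without any network access; fetching the page is left to whichever HTTP client the crawler uses.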

Process the raw contents using spaCy to extract and filter sentences.

  python clean.py

The filtered contents of each URL are stored in the "/data/filtered_content.json" file.
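To give a rough idea of what sentence extraction and filtering can look like, the sketch below splits text into sentences and drops short or repeated ones. It is not the logic in clean.py: a regular-expression splitter stands in for spaCy's sentence segmentation so the example runs without installing a model, and the function name `filter_sentences`, the `min_words` threshold, and the de-duplication rule are all assumptions made for illustration.

```python
import re

# Stand-in for spaCy sentence segmentation: split after ., ! or ?
# when followed by whitespace. Real segmenters handle many more cases.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def filter_sentences(text, min_words=4):
    """Split `text` into sentences and keep the informative ones.

    Assumed filtering rules (hypothetical, not taken from clean.py):
    - drop sentences with fewer than `min_words` words, which removes
      menu items and button labels such as "Home." or "Contact us!";
    - drop case-insensitive duplicates, keeping the first occurrence.
    """
    seen = set()
    kept = []
    for raw in _SENTENCE_END.split(text):
        # Collapse runs of whitespace left over from HTML extraction.
        sentence = " ".join(raw.split())
        if len(sentence.split()) < min_words:
            continue
        key = sentence.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return kept
```

With spaCy, the splitting step would typically be replaced by iterating over `doc.sents` on a processed document, while the filtering loop stays the same.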
