This project is a tool that generates web scraping code using ChatGPT. The idea is to use the power of GPT models to generate code for web scraping projects. The tech stack includes langchain, streamlit, and openai.
Try it: Space 🤗
(Using a virtual environment is recommended; see Venv for more information.)
pip install -r requirements.txt
Create a config.ini file with the following content in the project's root directory
Visit OpenAI to get your API Key
[DEFAULT]
API-KEY = {fill the value with your OPENAI API Key}
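As a rough illustration of how an app could read this file, here is a minimal sketch using Python's standard configparser module. The helper name load_api_key is hypothetical, and the project's own loading code may differ.

```python
import configparser


def load_api_key(path="config.ini"):
    """Return the API-KEY value from the [DEFAULT] section of a config file."""
    config = configparser.ConfigParser()
    # read() returns the list of files it parsed; an empty list means the
    # file was missing or unreadable.
    if not config.read(path):
        raise FileNotFoundError(f"Could not read {path}")
    # Option names are case-insensitive in configparser.
    return config["DEFAULT"]["API-KEY"]
```

The returned string can then be handed to whichever OpenAI client the app uses.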
Run the app
streamlit run app.py
The idea of the project is to use GPT to automate code generation for web scraping.
- The tool returns a method to be used in web scraping projects.
- The first bot (GPT chain) returns a JSON with the information of the fields to be extracted.
- The second bot returns a function called extract_info.
- The function receives the HTML of the page and returns the information extracted from it.
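To make the shape of that output concrete, here is a hand-written sketch of what an extract_info function could look like. It is not the code the tool generates: the FIELDS mapping (field name to CSS class) is a hypothetical stand-in for the first bot's JSON, and the sketch uses only the standard library html.parser so it runs without extra installs. Generated code may well rely on a dedicated parsing library instead.

```python
from html.parser import HTMLParser

# Hypothetical field description (field name -> CSS class), standing in
# for the JSON produced by the first bot.
FIELDS = {"title": "title", "price": "price"}


class _FieldCollector(HTMLParser):
    """Collects the text of elements whose class matches a known field.

    Assumes field elements contain only text (no nested tags).
    """

    def __init__(self, fields):
        super().__init__()
        self._class_to_field = {css: name for name, css in fields.items()}
        self._current = None  # field currently being read, if any
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css in classes:
            if css in self._class_to_field:
                self._current = self._class_to_field[css]
                # A repeated field means a new record has started.
                if not self.rows or self._current in self.rows[-1]:
                    self.rows.append({})
                self.rows[-1][self._current] = ""
                break

    def handle_data(self, data):
        if self._current is not None:
            self.rows[-1][self._current] += data

    def handle_endtag(self, tag):
        self._current = None


def extract_info(html):
    """Receive the HTML of a page and return one dict per extracted record."""
    collector = _FieldCollector(FIELDS)
    collector.feed(html)
    collector.close()
    return [{k: v.strip() for k, v in row.items()} for row in collector.rows]
```

Each dict in the returned list corresponds to one row of extracted information.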
Watch the full video on YouTube.
For now, the workflow has two manual steps, but the idea is to automate the process in the future.
- Inspect the element from which you want to get the information
- Copy the HTML element and paste it into the input of the app
- Click on generate code
Here the first chain generates a JSON with the information of the fields to be extracted. That JSON is then used in the second chain as the expected output format to generate the code.
- Copy the HTML of the entire page
- Paste it in the second input of the app to test it
- Click on test code
If the test succeeds, you will see a table with the information extracted from the page.
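The two steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: llm is a hypothetical callable that takes a prompt string and returns the model's reply, the prompts are invented, and fake_llm is a canned stand-in so the sketch runs without an API key. The real app builds its chains with langchain, and its prompts and wiring may differ.

```python
import json


def run_pipeline(llm, element_html, page_html):
    """Chain two prompts, then test the generated extract_info on a page."""
    # Chain 1: describe the fields of the pasted element as JSON.
    fields_json = llm(f"List the fields in this HTML element as JSON:\n{element_html}")
    fields = json.loads(fields_json)  # fails early if the reply is not valid JSON

    # Chain 2: request code whose output follows that JSON format.
    code = llm(
        "Write a Python function extract_info(html) that returns a list of "
        f"dicts with these keys: {sorted(fields)}\nExample element:\n{element_html}"
    )

    # Load the generated code into its own namespace and run it on the page.
    namespace = {}
    exec(code, namespace)  # only execute generated code you have reviewed
    return namespace["extract_info"](page_html)


def fake_llm(prompt):
    """Canned replies standing in for a real model (illustration only)."""
    if prompt.startswith("List the fields"):
        return '{"title": "text of the h2 element"}'
    return (
        "import re\n"
        "def extract_info(html):\n"
        "    return [{'title': t} for t in re.findall(r'<h2>(.*?)</h2>', html)]\n"
    )


rows = run_pipeline(fake_llm, "<h2>A</h2>", "<h2>A</h2><h2>B</h2>")
```

Here rows is a list of dicts, one per extracted record, which maps naturally onto the table shown after clicking test code.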