The following Bash script (with embedded Python) fills the table InputFilteredWeb
with product descriptions fetched from the web. The line defining the variable "content" is key, as it needs to be adapted to wherever in the HTML the actual description can be found: the example assumes it to be in a div with class "👉product-description".
It would be simpler to take the whole page content, but that would dilute the description with navigational elements and the like, could structurally confuse the prompt, and may ultimately cause the maximum number of input tokens to be exceeded.
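To see why targeting the div matters, here is a minimal, self-contained sketch; the miniature page, its navigation text, and the class name (shown without the 👉 adaptation marker) are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: navigation, a product description, and a footer.
html = """
<html><body>
  <nav>Home | Products | Contact</nav>
  <div class="product-description">A sturdy oak table, seats six.</div>
  <footer>Example Corp</footer>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Whole-page text mixes in navigation and footer noise ...
full_text = soup.get_text(" ", strip=True)

# ... while targeting the div yields only the description.
description = " ".join(
    div.get_text() for div in soup.find_all("div", class_="product-description")
)
print(description)  # A sturdy oak table, seats six.
```

On a real product page the noise dwarfs the description, so the targeted extraction is what keeps the prompt within the token limit.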
dataset="[👉DATASET]"
source_table="InputFiltered"
dest_table="InputFilteredWeb"
query="SELECT id, \`👉web_url\` FROM $dataset.$source_table"
bq query --use_legacy_sql=false --format=csv "$query" | tail -n +2 > ids_urls.csv
> ids_contents.csv  # start with an empty results file
while IFS=, read -r id url; do
echo "Fetching: $url (ID: $id)"
content=$(python3 - << EOF
import requests
import time
from bs4 import BeautifulSoup
# Brief delay to be polite to the server.
time.sleep(0.5)
response = requests.get("$url", timeout=30)
soup = BeautifulSoup(response.content, 'html.parser')
# Concatenate all matching divs into one description string.
content = " ".join(div.get_text()
                   for div in soup.find_all('div', class_='👉product-description'))
# Double embedded quotes and strip newlines so the value fits a CSV field.
print(content.replace('"', '""').replace('\n', ' ').replace('\r', ''))
EOF
)
echo "$id,\"$content\"" >> ids_contents.csv
done < ids_urls.csv
bq load --replace --source_format=CSV --schema='id:STRING,content:STRING' $dataset.$dest_table ids_contents.csv
rm ids_urls.csv ids_contents.csv
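If the description does not live in a div with a known class, only the line defining "content" needs to change. A hypothetical sketch of two common variants; the element id and the meta tag shown here are assumptions about how your pages might be structured:

```python
from bs4 import BeautifulSoup

# Hypothetical page: description both in a section with an id and in a meta tag.
html = ('<html><head><meta name="description" content="Oak table."></head>'
        '<body><section id="desc">A sturdy oak table.</section></body></html>')
soup = BeautifulSoup(html, "html.parser")

# Variant 1: description sits in an element with a known id.
by_id = soup.find(id="desc").get_text()

# Variant 2: fall back to the page's meta description tag.
meta = soup.find("meta", attrs={"name": "description"})
by_meta = meta["content"] if meta else ""

print(by_id, "|", by_meta)
```

Whichever variant applies, the rest of the script stays unchanged, since it only sees the final printed string.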
The script requires the Requests and BeautifulSoup libraries; the latter can be installed with "pip3 install bs4". The script is then run as usual with "bash [scriptname]".
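The print line at the end of the embedded Python follows the standard CSV convention (RFC 4180) of doubling embedded double quotes and wrapping the field in quotes; a quick stdlib-only sketch, with a made-up description string:

```python
import csv
import io

# Hypothetical raw description containing quotes and a line break.
raw = 'Solid "oak" table,\nseats six'

# Same transformation as in the script: double quotes, drop newlines.
escaped = raw.replace('"', '""').replace('\n', ' ').replace('\r', '')
field = f'"{escaped}"'
print(field)  # "Solid ""oak"" table, seats six"

# Verify that a standard CSV reader recovers the cleaned text.
row = next(csv.reader(io.StringIO(field)))
```

This is what lets the comma inside the description survive the later "bq load" step intact.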