docs: Add question answering over a website to web scraping (#6)

Co-authored-by: davidjohnbarton <[email protected]>
apify · Sep 15, 2023 · a8464df
1 parent f9f1340
Showing 1 changed file with 61 additions and 3 deletions.
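
Note on the added code cell: the `dataset_mapping_function` in the diff below turns each crawled dataset item into a document, substituting an empty string when `text` is `None`. The following is a minimal, dependency-free sketch of that mapping. `Doc` and `map_item` are hypothetical names used here in place of LangChain's `Document` so the snippet runs on its own, and the use of `.get` (so that a missing `text` key is also treated as empty) is an assumption that goes slightly beyond the original lambda.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document (not the real class)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def map_item(item: Dict[str, Any]) -> Doc:
    """Map one crawler dataset item to a document-like object.

    Mirrors the lambda in the notebook cell: an absent, None, or empty
    "text" value becomes "", and the page URL is stored as the source.
    """
    return Doc(
        page_content=item.get("text") or "",
        metadata={"source": item["url"]},
    )


# Example items shaped like the fields the notebook cell reads ("text", "url").
items = [
    {"url": "https://example.com/a", "text": "Some page text"},
    {"url": "https://example.com/b", "text": None},
]
docs = [map_item(item) for item in items]
```

Falling back to an empty string keeps pages without extractable text from raising an error during indexing, at the cost of storing some empty documents; filtering those out before indexing would be a reasonable alternative.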
diff --git a/docs/extras/use_cases/web_scraping.ipynb b/docs/extras/use_cases/web_scraping.ipynb
@@ -453,11 +453,11 @@
     "\n",
     "Related to scraping, we may want to answer specific questions using searched content.\n",
     "\n",
-    "We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriver, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
+    "We can automate the process of [web research](https://blog.langchain.dev/automating-web-research/) using a retriever, such as the `WebResearchRetriever` ([docs](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)).\n",
     "\n",
     "![Image description](/img/web_research.png)\n",
     "\n",
-    "Copy requirments [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
+    "Copy requirements [from here](https://github.com/langchain-ai/web-explorer/blob/main/requirements.txt):\n",
     "\n",
     "`pip install -r requirements.txt`\n",
     " \n",
@@ -571,6 +571,64 @@
     "result"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "312c399e",
+   "metadata": {},
+   "source": [
+    "## Question answering over a website\n",
+    "\n",
+    "To answer questions over a specific website, you can use Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs,\n",
+    "and extract text content from the web pages.\n",
+    "\n",
+    "In the example below, we will deeply crawl the Python documentation of LangChain's chat models and answer a question over it.\n",
+    "\n",
+    "First, install the requirements:\n",
+    "`pip install apify-client openai langchain chromadb tiktoken`\n",
+    " \n",
+    "Next, set `OPENAI_API_KEY` and `APIFY_API_TOKEN` as environment variables.\n",
+    "\n",
+    "The full code follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "9b08da5e",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      " Yes, LangChain offers integration with OpenAI chat models. You can use the ChatOpenAI class to interact with OpenAI models.\n"
+     ]
+    }
+   ],
+   "source": [
+    "from langchain.docstore.document import Document\n",
+    "from langchain.indexes import VectorstoreIndexCreator\n",
+    "from langchain.utilities import ApifyWrapper\n",
+    "\n",
+    "apify = ApifyWrapper()\n",
+    "# Call the Actor to obtain text from the crawled webpages\n",
+    "loader = apify.call_actor(\n",
+    "    actor_id=\"apify/website-content-crawler\",\n",
+    "    run_input={\"startUrls\": [{\"url\": \"https://python.langchain.com/docs/integrations/chat/\"}]},\n",
+    "    dataset_mapping_function=lambda item: Document(\n",
+    "        page_content=item[\"text\"] or \"\", metadata={\"source\": item[\"url\"]}\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "# Create a vector store based on the crawled data\n",
+    "index = VectorstoreIndexCreator().from_loaders([loader])\n",
+    "\n",
+    "# Query the vector store\n",
+    "query = \"Are any OpenAI chat models integrated in LangChain?\"\n",
+    "result = index.query(query)\n",
+    "print(result)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "ff62e5f5",
@@ -598,7 +656,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.1"
+   "version": "3.9.16"
   }
  },
  "nbformat": 4,