DiscovAI Crawl API 🕷️🔍

One API to scrape everything you need from URLs for your AI tool and vector database.

🚧 Work in Progress 🚧

🌟 Features

The API extracts and processes the following from each URL:

  • 🧼 Clean HTML (JavaScript and CSS removed)
  • 📝 LLM-friendly Markdown conversion
  • 🚫 Ad-free, cookie banner-free, and dialog-free content
  • 📸 Website screenshots (auto-saved to AWS S3 or Cloudflare R2)
  • 🤖 LLM-generated SEO-friendly content
  • 🔑 LLM-extracted key information (summary, features, FAQs, etc.)
  • 🧠 Ready-to-use embeddings for vector database integration (auto-saved to db)

🔧 Installation

pnpm i
cd apps/api && pnpm exec playwright install

🚀 Usage

pnpm dev
open http://localhost:3000

📦 API Response Structure

{
  "clean_html": "...",
  "LLM_friendly_markdown": "...",
  "clean_text": "...",
  "screenshot_url": "...",
  "llm_extracts_key_info": {
    "what": "...",
    "summary": "...",
    "features": ["...", "..."],
    "faqs": [{"q": "...", "a": "..."}]
  },
  "llm_summarized_detail": "...",
  "embeddings": [...]
}
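
For TypeScript clients, the response shape above could be described roughly as follows. This is a minimal sketch, not the project's own typings: the field names come from the sample response, while the value types (for example `embeddings` as a flat `number[]`) and the helper names `Faq`, `KeyInfo`, `CrawlResponse`, `isFaq`, and `isCrawlResponse` are assumptions made for illustration.

```typescript
// Field names mirror the sample response; all value types are assumptions.
interface Faq {
  q: string;
  a: string;
}

interface KeyInfo {
  what: string;
  summary: string;
  features: string[];
  faqs: Faq[];
}

interface CrawlResponse {
  clean_html: string;
  LLM_friendly_markdown: string;
  clean_text: string;
  screenshot_url: string;
  llm_extracts_key_info: KeyInfo;
  llm_summarized_detail: string;
  embeddings: number[]; // assumption: a single flat vector of numbers
}

// Hypothetical helper: checks that a value looks like one FAQ entry.
function isFaq(value: unknown): value is Faq {
  if (typeof value !== "object" || value === null) return false;
  const f = value as Record<string, unknown>;
  return typeof f.q === "string" && typeof f.a === "string";
}

// Hypothetical helper: narrows parsed JSON to CrawlResponse at runtime.
function isCrawlResponse(value: unknown): value is CrawlResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;

  // Top-level fields that the sample shows as strings.
  const stringKeys = [
    "clean_html",
    "LLM_friendly_markdown",
    "clean_text",
    "screenshot_url",
    "llm_summarized_detail",
  ];
  if (!stringKeys.every((k) => typeof v[k] === "string")) return false;

  // Nested key-info object.
  const info = v.llm_extracts_key_info;
  if (typeof info !== "object" || info === null) return false;
  const i = info as Record<string, unknown>;
  if (typeof i.what !== "string" || typeof i.summary !== "string") return false;
  if (!Array.isArray(i.features) || !i.features.every((f) => typeof f === "string")) return false;
  if (!Array.isArray(i.faqs) || !i.faqs.every(isFaq)) return false;

  // Embedding vector (assumed numeric).
  return Array.isArray(v.embeddings) && v.embeddings.every((n) => typeof n === "number");
}
```

A caller would typically run `JSON.parse` on the response body, pass the result through `isCrawlResponse`, and only then read fields such as `llm_extracts_key_info.summary`. Since the API is still a work in progress, the guard would need updating if the response shape changes.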

📚 Documentation

TODO

🤝 Contributing

TODO