feat: Enhanced Web Crawler with Markdown Output and Customization #169

Open. Wants to merge 4 commits into base: main

robin-collins

Added @mozilla/readability, jsdom, and turndown dependencies to enable markdown conversion and improve content extraction.
Updated versions for various @crawlee, @apify, and other dependencies to leverage the latest features and improvements.
Removed old versions of chalk, cli-width, figures, and other dependencies to streamline the project.
Added command-line options (-f, --outputFileFormat) and interactive prompts for output file format (JSON, markdown, human-readable markdown) and name, providing flexibility and customization.
Implemented enhanced markdown conversion with better handling of HTML elements (lists, links) for cleaner and more accurate output.
Added a "human-readable markdown" option (-f human_readable_markdown) with table of contents and "Back to Top" links for easier navigation in larger documents.
Refined URL exclusion logic in core.ts to handle query parameters containing &do= more effectively, allowing for more precise control over which pages are crawled.
Updated server (server.ts) to set content type to text/markdown for markdown output, ensuring correct rendering in browsers and other applications.
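
As a rough illustration of the "human-readable markdown" idea described above (a table of contents plus "Back to Top" links), here is a minimal sketch. The `CrawledPage` shape and the `slugify` and `toHumanReadableMarkdown` helpers are hypothetical names for illustration only; the PR's actual implementation may be structured differently.

```typescript
// Hypothetical page shape; the PR's real data structures may differ.
interface CrawledPage {
  title: string;
  markdown: string;
}

// Turn a title into a lowercase, hyphen-separated anchor id.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build one document: a linked table of contents, then one section per page,
// each ending with a link back to the table of contents.
function toHumanReadableMarkdown(pages: CrawledPage[]): string {
  const toc = pages.map((p) => `- [${p.title}](#${slugify(p.title)})`);
  const sections = pages.map((p) =>
    [
      `<a id="${slugify(p.title)}"></a>`,
      `## ${p.title}`,
      "",
      p.markdown.trim(),
      "",
      "[Back to Top](#table-of-contents)",
    ].join("\n"),
  );
  return (
    [
      '<a id="table-of-contents"></a>\n# Table of Contents',
      toc.join("\n"),
      ...sections,
    ].join("\n\n") + "\n"
  );
}
```

Note that this sketch does not deduplicate anchors, so two pages with the same title would share one id.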

Below is an example ./config.ts file that takes advantage of these enhancements.

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://docs.dopus.com/doku.php?id=scripting",
  match: "https://docs.dopus.com/doku.php?id=scripting**",
  exclude: [
    "**&do=resendpwd**",
    "**&do=register**",
    "**&do=login**",
    "**&do=logout**",
    "**&do=profile**",
    "**&do=edit**",
    "**&do=diff**",
    "**&do=revisions**",
  ],
  maxPagesToCrawl: 75,
  selector: ".page",
  outputFileName: "directory_opus.md",
  outputFileFormat: "markdown",
  maxTokens: 2000000,
};
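
The `outputFileFormat` value in a config like the one above is also what the description says drives the server's response content type. A minimal sketch of that mapping follows; `contentTypeFor` is a hypothetical helper, and treating `human_readable_markdown` as `text/markdown` is an assumption, since the description only mentions "markdown output".

```typescript
type OutputFileFormat = "json" | "markdown" | "human_readable_markdown";

// Assumed mapping: JSON output is served as application/json,
// both markdown variants as text/markdown.
const CONTENT_TYPES: Record<OutputFileFormat, string> = {
  json: "application/json",
  markdown: "text/markdown",
  human_readable_markdown: "text/markdown",
};

// Hypothetical helper; not taken from server.ts.
function contentTypeFor(format: OutputFileFormat): string {
  return CONTENT_TYPES[format];
}
```

Using a `Record` keyed by the format type means the compiler flags any format added later without a matching content type.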

- Updated .gitignore to include .history and crawled directories.
- Modified config.ts:
  - Changed URL and match pattern
  - Added exclusion patterns for various actions (resendpwd, register, login, logout, profile, edit, diff, revisions)
  - Increased maxPagesToCrawl to 75
  - Updated selector and output format
- Added @mozilla/readability, jsdom, and turndown dependencies to package-lock.json
- Updated versions for various @crawlee, @apify, and other dependencies in package-lock.json
- Removed old versions of chalk, cli-width, figures, and other dependencies from package-lock.json
type: "list",
name: "outputFileFormat",
message: messages.outputFileFormat,
choices: ["json", "markdown", "human_readable_markdown"],

Nitpick: It occurs to me that we could store this array as a constant somewhere, so we can use it on this line and on src/config.ts.
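
One way the suggestion above could look, as a sketch only: the names `OUTPUT_FILE_FORMATS`, `OutputFileFormat`, and `isOutputFileFormat` are hypothetical, and where the constant would live in the project is left open.

```typescript
// Single source of truth for the supported formats.
// In the project these declarations would be exported from a shared module.
const OUTPUT_FILE_FORMATS = [
  "json",
  "markdown",
  "human_readable_markdown",
] as const;

// Union type derived from the constant, so the two cannot drift apart.
type OutputFileFormat = (typeof OUTPUT_FILE_FORMATS)[number];

// Runtime guard, e.g. for validating a value passed on the command line.
function isOutputFileFormat(value: string): value is OutputFileFormat {
  return (OUTPUT_FILE_FORMATS as readonly string[]).includes(value);
}

// The prompt definition could then reuse the same list:
const outputFileFormatChoices: string[] = [...OUTPUT_FILE_FORMATS];
```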

if (config.exclude && Array.isArray(config.exclude)) {
const url = new URL(req.url);
for (const pattern of config.exclude) {
if (typeof pattern === "string" && pattern.includes("&do=")) {

marcelovicentegc (Contributor) commented Jul 5, 2024

I suppose &do= is specific to some test case? Shouldn't the exclude config option take care of excluding declared patterns?
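
To make the question above concrete, here is a simplified sketch of how declared `exclude` patterns could be applied to the full URL, query string included, without special-casing `&do=`. The `globToRegExp` and `isExcluded` helpers are hypothetical, and the matching rules (every run of `*` matches any characters) are a deliberate simplification; this is not the glob implementation the crawler actually uses.

```typescript
// Escape regex metacharacters (all except "*"), then turn each run of "*" into ".*".
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*+/g, ".*") + "$");
}

// A URL is excluded when any declared pattern matches the whole URL string.
function isExcluded(url: string, exclude: string[]): boolean {
  return exclude.some((pattern) => globToRegExp(pattern).test(url));
}

// Usage with patterns from the example config in this PR:
const excludePatterns = ["**&do=login**", "**&do=edit**"];
const loginUrl = "https://docs.dopus.com/doku.php?id=scripting&do=login";
const pageUrl = "https://docs.dopus.com/doku.php?id=scripting";
console.log(isExcluded(loginUrl, excludePatterns));
console.log(isExcluded(pageUrl, excludePatterns));
```

Under these simplified rules, any pattern in `exclude` is handled the same way, whatever query parameter it targets.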

marcelovicentegc (Contributor) left a comment

Hey @robin-collins 👋 ! Thanks for opening this PR. I've left a couple comments. Simple stuff! Out of pure curiosity: what is the use case for markdown output?

robin-collins (Author)

The markdown output has two uses for me: a) in Anthropic Claude 'projects', and b) with my self-hosted LibreChat, where the AI has it as part of the RAG system. The human-readable format gives me a neat markdown document for my own reference; I prefer that to hitting an online documentation site regularly.
The exclude patterns were me trying to get a clean grab of the Directory Opus documentation. Perhaps a little specific, but the extended exclude options make it easy to keep the crawl tightly scoped.


frei-x commented Aug 9, 2024

👋 Hey! Thanks for opening this PR. I've left a couple of comments. Simple stuff! Out of pure curiosity: what is the use case for markdown output?

I need to get the image src

Labels: enhancement (New feature or request)
3 participants