Crawl a site to generate knowledge files to create your own custom GPT from one or multiple URLs
Here is a custom GPT that I quickly made to help answer questions about how to use and integrate Builder.io by simply providing the URL to the Builder docs.
This project crawled the docs and generated the file that I uploaded as the basis for the custom GPT.
Try it out yourself by asking questions about how to integrate Builder.io into a site.
Note that you may need a paid ChatGPT plan to access this feature.
Be sure you have Node.js >= 16 installed.
```sh
git clone https://github.com/builderio/gpt-crawler
npm i
```
Open `config.ts` and edit the `url`, `match`, and `selector` properties to match your needs.
E.g., to crawl the Builder.io docs to make our custom GPT, you can use:
```ts
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
```
See `config.ts` for all available options. Here is a sample of the common config options:
````ts
type Config = {
  /**
   * URL to start the crawl; if the URL is a sitemap, it will crawl all pages in the sitemap
   * @example "https://www.builder.io/c/docs/developers"
   * @example "https://www.builder.io/sitemap.xml"
   * @default ""
   * @required
   */
  url: string;
  /**
   * Pattern to match against for links on a page to subsequently crawl
   * @example "https://www.builder.io/c/docs/**"
   * @default ""
   */
  match: string;
  /**
   * Selector to grab the inner text from
   * @example ".docs-builder-container"
   * @default ""
   * @required
   */
  selector: string;
  /**
   * Don't crawl more than this many pages
   * @default 50
   */
  maxPagesToCrawl: number;
  /**
   * File name for the finished data
   * @example "output.json"
   */
  outputFileName: string;
  /**
   * Cookie to be set, e.g. for cookie consent
   */
  cookie?: {
    name: string;
    value: string;
    url: string;
  };
  /**
   * Function to run for each page found
   */
  onVisitPage?: (page: object, data: string) => void;
  /**
   * Timeout in milliseconds to wait for a selector to appear
   */
  waitForSelectorTimeout?: number;
  /**
   * Resource file extensions to exclude from the crawl
   * @example
   * ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
   */
  resourceExclusions?: string[];
  /**
   * Maximum file size in megabytes to include in the output file
   * @example 1
   */
  maxFileSize?: number;
  /**
   * Maximum number of tokens to include in the output file
   * @example 5000
   */
  maxTokens?: number;
  /**
   * Maximum number of concurrent requests at a time
   * @example
   * A specific number of parallel requests:
   * ```ts
   * maxConcurrency: 2;
   * ```
   * @example
   * 0 = unlimited; doesn't stop until cancelled:
   * ```ts
   * maxConcurrency: 0;
   * ```
   * @example
   * undefined = as many parallel requests as possible:
   * ```ts
   * maxConcurrency: undefined;
   * ```
   * @default 1
   */
  maxConcurrency?: number;
  /**
   * Range of random milliseconds between **min** and **max** to wait after each page crawl
   * @default {min:1000,max:1000}
   * @example {min:1000, max:2000}
   */
  waitPerPageCrawlTimeoutRange?: {
    min: number;
    max: number;
  };
  /**
   * Whether to run the Playwright browser headless or with a visible window
   * @default true
   */
  headless?: boolean;
};
````
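As an illustration, here is a sketch of a config that uses several of the optional fields above. The values are examples only, not recommendations; adjust them for the site you are crawling:

```ts
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip binary assets that add nothing to the knowledge file
  resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "mp4", "pdf", "zip"],
  // Crawl two pages at a time, pausing 1-2 seconds after each page
  maxConcurrency: 2,
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 2000 },
  // Show the browser window instead of running headless
  headless: false,
};
```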
Then run your crawler:

```sh
npm start
```
Alternatively, to obtain the `output.json` with a containerized execution, go into the `containerapp` directory and modify `config.ts` as above; the `output.json` file will be generated inside the `data` folder. Note: the `outputFileName` property in the `containerapp` copy of `config.ts` is configured to work with the container.
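Purely as an illustration, driving that containerized run with Docker might look like the sketch below. The image name, the mount path, and even whether Docker is invoked directly are assumptions, so check the `containerapp` directory for the actual entry point:

```sh
# Hypothetical commands; the image name and mount path are assumptions
docker build -t gpt-crawler ./containerapp
docker run -v "$(pwd)/containerapp/data:/app/data" gpt-crawler
```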
To run the `./dist/cli.ts` command-line interface, follow these instructions:
- Open a terminal.
- Navigate to the root directory of the project.
- Run the following command:
  ```sh
  ./dist/cli.ts [arguments]
  ```
  Replace `[arguments]` with the appropriate command-line arguments for your use case.
- The CLI will execute the specified command and display the output in the terminal.
Note: Make sure you have the necessary dependencies installed and the project has been built before running the CLI.
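For instance, an invocation might look like the sketch below. The flag names here are assumptions that mirror the config fields, not confirmed options, so consult the CLI's built-in help for the real interface:

```sh
# Hypothetical flags mirroring the config fields; verify against the CLI's help output
./dist/cli.ts \
  --url "https://www.builder.io/c/docs/developers" \
  --match "https://www.builder.io/c/docs/**" \
  --selector ".docs-builder-container" \
  --maxPagesToCrawl 50 \
  --outputFileName output.json
```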
Instructions for Development will go here...
The crawl will generate a file called `output.json` at the root of this project. Upload that to OpenAI to create your custom assistant or custom GPT.
Use this option for UI access to your generated knowledge that you can easily share with others.
Note: you may need a paid ChatGPT plan to create and use custom GPTs right now.
- Go to https://chat.openai.com/
- Click your name in the bottom left corner
- Choose "My GPTs" in the menu
- Choose "Create a GPT"
- Choose "Configure"
- Under "Knowledge" choose "Upload a file" and upload the file you generated
- If you get an error about the file being too large, you can split the output into multiple files and upload them separately by setting the `maxFileSize` option in `config.ts`, or shrink it via tokenization with the `maxTokens` option (see the sketch after this list).
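As an illustration, a minimal sketch of those two options in `config.ts`, using the example values from the options reference above:

```ts
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Split the output into files of at most ~1 MB each
  maxFileSize: 1,
  // Cap each output file at roughly this many tokens
  maxTokens: 5000,
};
```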
Use this option for API access to your generated knowledge that you can integrate into your product.
- Go to https://platform.openai.com/assistants
- Click "+ Create"
- Choose "upload" and upload the file you generated
Know how to make this project better? Send a PR!