Redislabs-Solution-Architects/doc-crawler

Example of a Web Crawler with Redis Indexing

Summary

An API server that implements a web crawler. Apache Tika extracts text from the crawled documents (HTML, PDF, etc.). The extracted text is then stored in Redis as JSON and indexed with RediSearch.
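The extraction step can be sketched as follows. This is a minimal illustration, not this repo's code: it assumes a Tika server reachable at http://localhost:9998 (Tika server's default port; the address in this repo's Docker setup may differ) and Node.js 18+ for the global fetch. PUT /detect/stream and PUT /tika are standard Tika server endpoints, while the names detectMimeType, extractText, and the injectable fetchImpl option are hypothetical.

```javascript
// Hypothetical helpers: names and the default URL are assumptions, not the repo's code.
const TIKA_URL = 'http://localhost:9998'; // Tika server's default port

// Ask Tika to detect the MIME type of a raw document body.
async function detectMimeType(body, { tikaUrl = TIKA_URL, fetchImpl = fetch } = {}) {
  const res = await fetchImpl(`${tikaUrl}/detect/stream`, { method: 'PUT', body });
  if (!res.ok) throw new Error(`Tika detect failed: ${res.status}`);
  return (await res.text()).trim();
}

// Ask Tika for the plain-text content of a raw document body (HTML, PDF, ...).
async function extractText(body, { tikaUrl = TIKA_URL, fetchImpl = fetch } = {}) {
  const res = await fetchImpl(`${tikaUrl}/tika`, {
    method: 'PUT',
    headers: { Accept: 'text/plain' }, // request plain text rather than XHTML
    body,
  });
  if (!res.ok) throw new Error(`Tika extract failed: ${res.status}`);
  return (await res.text()).trim();
}
```

Passing fetchImpl makes both helpers testable without a running Tika server; in normal use the option can be omitted.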

Architecture

High Level

[Diagram: high-level architecture]

Detailed

[Diagram: detailed architecture]

Application Flow

[Diagram: application flow]

Features

  • Implements a simple web crawler (cheerio-based)
  • Uses an Apache Tika server for MIME-type detection and text extraction
  • Uses RedisJSON for document storage and RediSearch for indexing
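The storage and indexing features map onto three Redis commands: JSON.SET, FT.CREATE, and FT.SEARCH. The sketch below only builds the argument arrays for those commands so the shape is visible. The index name docIdx, the key prefix doc:, and the JSON field text are assumptions chosen for illustration and may not match the schema this repo actually creates.

```javascript
// Hypothetical names: the index, key prefix, and JSON field are assumptions,
// not this repo's actual schema.
const INDEX = 'docIdx';
const PREFIX = 'doc:';

// FT.CREATE: index every JSON document whose key starts with PREFIX on its $.text field.
function createIndexCommand() {
  return ['FT.CREATE', INDEX, 'ON', 'JSON', 'PREFIX', '1', PREFIX,
    'SCHEMA', '$.text', 'AS', 'text', 'TEXT'];
}

// JSON.SET: store one crawled page, keyed by its URL, as a JSON document.
function storeDocCommand(url, text) {
  return ['JSON.SET', `${PREFIX}${url}`, '$', JSON.stringify({ url, text })];
}

// FT.SEARCH: full-text query; NOCONTENT returns only the matching keys.
// Terms containing punctuation may need escaping before being used as a query.
function searchCommand(term, limit = 10) {
  return ['FT.SEARCH', INDEX, term, 'NOCONTENT', 'LIMIT', '0', String(limit)];
}

console.log(createIndexCommand().join(' '));
console.log(storeDocCommand('developer.redis.com/develop/node', 'Redis with Node'));
console.log(searchCommand('streams'));
```

Each array can be handed to a Redis client's raw-command method (for example, sendCommand in node-redis v4), or rewritten with the client's typed JSON and search helpers.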

Prerequisites

  • Docker
  • Node.js
  • npm
  • Apache Tika
  • Redis with the RediSearch and RedisJSON modules

Installation

  1. Clone this repo.

  2. Go to the doc-crawler folder.

cd doc-crawler

  3. Install the Node.js dependencies.

npm install

  4. Build and start the Docker containers.

docker compose up

Usage

Test Client

npm run test

CURL

#app status
curl -X GET http://localhost:8000

{"status":"app running"}

#start a crawl task
curl -X POST http://localhost:8000/crawl \
-H 'Content-Type: application/json' \
-d '{"fqdn":"developer.redis.com"}'

{"taskID":"ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273"}

#check status on a crawl task
curl -X GET http://localhost:8000/status/tasks/ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273

{"status":"active"}

curl -X GET http://localhost:8000/status/tasks/ec135f4c-d3f5-4a9f-bafb-4eb90bfd8273

{"indexed":159,"errors":0,"time":56.81,"status":"complete"}

#document search
curl -X PUT http://localhost:8000/search \
-H 'Content-Type: application/json' \
-d '{"term":"Node.js"}'

{"docs":["developer.redis.com/develop/node","developer.redis.com/develop/node/node-crash-course","developer.redis.com/develop/java/redis-and-spring-course/lesson_8","developer.redis.com/develop/node/nodecrashcourse/runningtheapplication","developer.redis.com/develop/node/nodecrashcourse/welcome","developer.redis.com/develop/node/nodecrashcourse/coursewrapup","developer.redis.com/develop/node/nodecrashcourse/redisbloom","developer.redis.com/develop/node/nodecrashcourse/sessionstorage","developer.redis.com/develop/node/nodecrashcourse/checkinswithstreams","developer.redis.com/develop/node/nodecrashcourse/redisearch"]}
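
The same sequence of calls can be scripted from Node.js. The following is a minimal sketch built only from the endpoints and response fields shown in the curl examples above; it assumes Node.js 18+ for the global fetch. The function names, the polling interval, and the injectable fetchImpl option are hypothetical, and the sketch treats any status other than "active" as final, which is an assumption since only "active" and "complete" appear above.

```javascript
// Hypothetical client: function names and the fetchImpl option are assumptions.
const BASE_URL = 'http://localhost:8000';

// Send a request with an optional JSON payload and parse the JSON response.
async function requestJson(method, path, payload, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE_URL}${path}`, {
    method,
    headers: payload ? { 'Content-Type': 'application/json' } : undefined,
    body: payload ? JSON.stringify(payload) : undefined,
  });
  if (!res.ok) throw new Error(`${method} ${path} failed: ${res.status}`);
  return res.json();
}

// Start a crawl, then poll the task status until it is no longer "active".
async function crawlAndWait(fqdn, { intervalMs = 2000, fetchImpl = fetch } = {}) {
  const { taskID } = await requestJson('POST', '/crawl', { fqdn }, fetchImpl);
  for (;;) {
    const status = await requestJson('GET', `/status/tasks/${taskID}`, null, fetchImpl);
    if (status.status !== 'active') return status;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Search the indexed documents; the curl example above uses PUT for this endpoint.
async function search(term, { fetchImpl = fetch } = {}) {
  const { docs } = await requestJson('PUT', '/search', { term }, fetchImpl);
  return docs;
}
```

With the containers running, crawlAndWait('developer.redis.com') resolves to the final status object, after which search('Node.js') resolves to the array of matching document URLs.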