nitingupta910/wikiparser

wikiparser

Convert wikitext to HTML (using parsoid)

There are two components:

  • A frontend (wikipedia.go) that parses a Wikipedia dump and extracts articles in wikitext format. Each article is sent to the backend (server.js), and the HTML document received in response is written to stdout (TODO: store in some KV-store).
  • A backend service (server.js) that uses Wikimedia's Parsoid parser to convert wikitext to HTML.

Installation and Usage

  • In the project source root, run: npm install
    • This installs the parsoid Node.js module
  • Run the backend service: node server
  • Start parsing the Wikipedia dump: go run wikipedia.go enwiki-latest-pages-articles.xml.bz2
    • Here enwiki...bz2 is the Wikipedia dump as downloaded from http://meta.wikimedia.org/wiki/Data_dump_torrents#enwiki
    • The dump reader can read the compressed dump directly
    • Articles are parsed one at a time, and each article's HTML version is written to stdout
