Command line tool to extract plain text from Wikipedia database dumps

afuschetto/wiki-extractor

Wiki Extractor

wiki-extractor.py is a command line tool that extracts plain text from a given Wikipedia database dump.

It processes the original Wikipedia documents contained in the database dump and produces a series of text files containing the same documents with the wiki markup removed. These files can be used by any subsequent processing step that requires a large number of good-quality documents in plain text format.
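To give a feel for what "cleaned of wiki markup" means, here is a minimal sketch of the idea. It is not the code in wiki-extractor.py: the function name `strip_wiki_markup` and the regular expressions are illustrative assumptions, and they cover only a small subset of the syntax (non-nested templates, links, bold/italic quotes, headings, HTML-like tags).

```python
import re


def strip_wiki_markup(text: str) -> str:
    """Hypothetical helper: remove a small subset of wiki markup.

    An illustration only, not the logic used by wiki-extractor.py.
    """
    # Drop non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Internal links: [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # External links: [http://example.org label] -> label.
    text = re.sub(r"\[https?://\S+[ \t]+([^\]]+)\]", r"\1", text)
    # Bold and italic markers (runs of two to five apostrophes).
    text = re.sub(r"'{2,5}", "", text)
    # Headings: == Title == -> Title.
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    # Any remaining HTML-like tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of spaces or tabs left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "'''Pisa''' is a city in [[Tuscany|central Italy]].{{citation needed}}"
    print(strip_wiki_markup(sample))
```

Real dumps contain nested templates, tables, references and many other constructs that a handful of regular expressions will not handle, so treat this as a reading aid rather than a replacement for the tool.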

License

This code is licensed under the GNU General Public License v3.0.

Credits

This tool was implemented in 2007 to meet a need that arose during research at the University of Pisa (in collaboration with Yahoo! Research) on innovative techniques for building a question answering system based on semantic relationships.

Many other versions of the tool have been developed over the years, starting from this implementation. It would be great to merge their significant improvements into this repository. For my part, I will do my best in my free time, with the help of your contributions, to resume development of this useful tool.
