Some Faroese language statistics taken from fo.wikipedia.org content dump
Russian Wikipedia movie parser
A complete search engine built on top of a 75 GB Wikipedia corpus with sub-second search latency. Results contain wiki pages ranked by TF-IDF relevance for the given search terms. From optimized code to a k-way mergesort algorithm, this project addresses latency, indexing, and big-data challenges.
Index and Search wikiDump
Command line tool to extract plain text from Wikipedia database dumps
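A minimal sketch of the core of such a tool, using only the standard library: stream a Wikipedia-style XML dump with `iterparse` and collect `(title, text)` pairs. The inline sample document and the `extract_pages` helper are illustrative assumptions, not the listed tool's actual code; real MediaWiki dumps wrap tags in an export namespace, which the namespace-stripping line accounts for.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in for a real dump (hypothetical sample, no namespace).
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a page.</text></revision>
  </page>
</mediawiki>"""

def extract_pages(stream):
    """Return (title, wikitext) pairs, clearing each <page> to keep memory flat."""
    title, pages = None, []
    for event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip "{namespace}" prefix if present
        if tag == "title":
            title = elem.text
        elif tag == "text":
            pages.append((title, elem.text or ""))
        elif tag == "page":
            elem.clear()  # essential on multi-GB dumps
    return pages

pages = extract_pages(BytesIO(SAMPLE))
```

Streaming with `iterparse` plus `elem.clear()` is what makes this viable on dumps far larger than RAM; a full tool would additionally strip wiki markup from the raw text.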
Python | Pandas | Wikipedia | Analysis | Contribution | Gini-Coefficient | Lorenz curve
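The Gini coefficient named in the tags can be computed from the Lorenz curve by the trapezoid rule; a small dependency-free sketch (the formula is standard, but treating the values as per-editor contribution counts is an assumption about this project):

```python
def gini(values):
    """Gini coefficient via trapezoidal area under the Lorenz curve."""
    xs = sorted(values)          # Lorenz curve requires ascending order
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum, lorenz_area = 0, 0.0
    for x in xs:
        prev = cum
        cum += x
        # Each step contributes a trapezoid of width 1/n under the curve.
        lorenz_area += (prev + cum) / (2 * total * n)
    return 1.0 - 2.0 * lorenz_area  # G = 1 - 2 * area under Lorenz curve

print(gini([10, 10, 10, 10]))  # perfect equality -> 0.0
```

With this formula, perfect equality gives 0 and maximal inequality among n values approaches (n-1)/n, matching the usual Lorenz-curve picture.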
A search engine built on a 75 GB Wikipedia dump. Creates an index file and returns search results in real time.
Generates a JSON file with F1 driver stats for a given year, taken from the corresponding Wikipedia page
Wikipedia dataset creator
wikititle - script that prints a list of all Wikipedia titles in several languages
Implemented a search engine on a 73.4 GB Wikipedia dump. To retrieve relevant results quickly, indexing and ranking are implemented; the relevance ranking uses TF-IDF scores to order documents. Building the index takes around 14 hours on the given dump, and results are retrieved in under one second.
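The TF-IDF ranking this project describes can be sketched in a few lines. The toy corpus, whitespace tokenizer, and log-smoothed idf below are illustrative assumptions, not the repository's actual implementation (which would score against a prebuilt on-disk index rather than raw text):

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the 73.4 GB dump.
docs = {
    "d1": "faroe islands faroese language",
    "d2": "wikipedia dump parser language",
    "d3": "search engine index wikipedia wikipedia wikipedia",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}       # term frequency
df = Counter(t for counts in tf.values() for t in counts)          # document frequency

def score(doc, query):
    """Sum of tf * idf over query terms, skipping terms absent from the corpus."""
    return sum(
        tf[doc][t] * math.log(N / df[t])
        for t in query.split() if df[t]
    )

ranked = sorted(docs, key=lambda d: score(d, "wikipedia language"), reverse=True)
```

A production engine precomputes these scores per posting at indexing time, so a query only merges posting lists instead of touching document text.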
Imports a Wikipedia xml dump into a Postgres database
An example of spark-wikipedia-dump-loader
Python implementation for inverted index creation and a search engine designed for a wikipedia dump
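A minimal sketch of the inverted-index idea behind projects like this one: map each token to the documents containing it, then answer AND queries by intersecting posting lists. The document set and function names are hypothetical examples, not taken from any listed repository:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def search(index, *terms):
    """AND query: intersect the posting lists of all query terms."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

idx = build_inverted_index({
    1: "Wikipedia dump",
    2: "dump parser",
    3: "Wikipedia parser",
})
```

For dump-scale corpora the postings would be written to disk in sorted runs and merged (e.g. with a k-way mergesort, as another project above mentions) rather than held in one dict.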
Map/Reduce jobs for extracting data from the English language Wikipedia dump
Uses the Word2Vec method proposed by Google to train word vectors for use in any word2vec application.
Generates tags cloud using MediaWiki XML content dump