This is an experiment to calculate PageRank over a wide range of real websites crawled by BDCS.
You will need:

- `pyssdb`
- A database of websites generated by BDCS
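
If you want to poke at the underlying database by hand, `pyssdb` is a thin Python client for SSDB. The host, port, and the idea of just listing a few keys below are assumptions for illustration; check how your BDCS spider is actually configured.

```python
# Minimal pyssdb connection check -- host/port are the SSDB defaults and
# may not match your BDCS setup.
import pyssdb

client = pyssdb.Client(host="127.0.0.1", port=8888)
print(client.info())            # server status; confirms the connection works
print(client.keys("", "", 10))  # peek at the first few keys the spider wrote
```
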
- After you have stopped the spider, run `./pagerank.py` to calculate the PageRank for all the pages in the collection.
- By default it does 14 iterations, but you can change this at the top of the script if you want (see the sketch of the standard iteration further down).
- After you have calculated PageRanks, you can search the database using `./search.py "<search query>"`.
- Search queries consist of any number of globs in the form `<element>:<term>`.
- For example, a search of the form `h1:channing h1:tatum` will find websites with `<h1>` elements containing the words `channing` and `tatum`.
- You can also combine elements to formulate a search like `t:channing h1:tatum w:news`, which will return all the pages in the collection with page titles containing the word `channing`, `<h1>` elements containing the word `tatum`, and `<p>` elements containing the word `news` (see the parsing sketch after this list).
- The list of valid elements is as follows:
  - `h(n)` where `n` is a number 1-6 - For all header tags `<h1>` through `<h6>`
  - `t` - For the title of a page
  - `w` - For all `<p>` tags in a page
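
As a rough illustration of the glob format above (not the actual logic in `search.py`, which also has the PageRank scores to work with), a query could be split into `(element, term)` pairs and matched against a page like this; the shape of the `page` dictionary is an assumption:

```python
import re

# Accepts the elements listed above: h1-h6, t (title), w (<p> text).
VALID_ELEMENT = re.compile(r"^(h[1-6]|t|w)$")

def parse_query(query):
    """Split a query like 't:channing h1:tatum w:news' into (element, term) globs."""
    globs = []
    for token in query.split():
        element, _, term = token.partition(":")
        if not term or not VALID_ELEMENT.match(element):
            raise ValueError(f"invalid glob: {token!r}")
        globs.append((element, term.lower()))
    return globs

def matches(page, globs):
    """page is assumed to map element names ('t', 'h1', 'w', ...) to their text."""
    return all(term in page.get(element, "").lower() for element, term in globs)

if __name__ == "__main__":
    page = {"t": "Channing news", "h1": "Channing Tatum", "w": "some news story"}
    globs = parse_query("t:channing h1:tatum w:news")
    print(globs)                 # [('t', 'channing'), ('h1', 'tatum'), ('w', 'news')]
    print(matches(page, globs))  # True
```

Whether matching in `search.py` is substring-based or word-based is not specified here; the case-insensitive substring test above is just one reasonable choice.
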
I don't even know if I implemented PageRank right.
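
For reference, here is a minimal sketch of the textbook power-iteration form of PageRank to compare `pagerank.py` against. It is not the code in `pagerank.py`: the graph representation and the 0.85 damping factor are assumptions, only the 14-iteration default comes from this readme, and dangling pages (no outlinks) are simply skipped for brevity.

```python
def pagerank(links, iterations=14, damping=0.85):
    """links maps each page URL to the list of URLs it links to."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "teleport" share of rank.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling page: its rank is not redistributed here
            share = damping * ranks[page] / len(outlinks)
            for dst in outlinks:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks

if __name__ == "__main__":
    demo = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example"],
        "c.example": ["a.example"],
    }
    for url, rank in sorted(pagerank(demo).items(), key=lambda kv: -kv[1]):
        print(f"{rank:.4f}  {url}")
```

If the numbers coming out of `pagerank.py` look wildly different from this on a small test graph, that is a good place to start debugging.
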
See the License section of the BDCS Readme for more info on the AGPL license.