This is an experiment to calculate PageRank over a wide range of real websites crawled by BDCS.
You will need:

- `pyssdb`
- A database of websites generated by BDCS
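
If you want to poke at the underlying database by hand, `pyssdb` is a thin Python client for SSDB. The host, port, and the idea of just listing a few keys below are assumptions for illustration; check how your BDCS spider is actually configured.

```python
# Minimal pyssdb connection check -- host/port are the SSDB defaults and
# may not match your BDCS setup.
import pyssdb

client = pyssdb.Client(host="127.0.0.1", port=8888)
print(client.info())            # server status; confirms the connection works
print(client.keys("", "", 10))  # peek at the first few keys the spider wrote
```
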
- After you have stopped the spider, run `./pagerank.py` to calculate the PageRank for all the pages in the collection.
- By default it does 14 iterations, but you can change this at the top of the script if you want (see the sketch of the standard iteration further down).
- After you have calculated PageRanks, you can search the database using `./search.py "<search query>"`.
- Search queries consist of any number of globs in the form `<element>:<term>`.
- For example, a search of the form `h1:channing h1:tatum` will find websites with `<h1>` elements containing the words `channing` and `tatum`.
- You can also combine elements to formulate a search like `t:channing h1:tatum w:news`, which will return all the pages in the collection with page titles containing the word `channing`, `<h1>` elements containing the word `tatum`, and `<p>` elements containing the word `news` (see the parsing sketch after this list).
- The list of valid elements is as follows:
  - `h(n)` where `n` is a number 1-6 - For all header tags `<h1>` through `<h6>`
  - `t` - For the title of a page
  - `w` - For all `<p>` tags in a page
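
As a rough illustration of the glob format above (not the actual logic in `search.py`, which also has the PageRank scores to work with), a query could be split into `(element, term)` pairs and matched against a page like this; the shape of the `page` dictionary is an assumption:

```python
import re

# Accepts the elements listed above: h1-h6, t (title), w (<p> text).
VALID_ELEMENT = re.compile(r"^(h[1-6]|t|w)$")

def parse_query(query):
    """Split a query like 't:channing h1:tatum w:news' into (element, term) globs."""
    globs = []
    for token in query.split():
        element, _, term = token.partition(":")
        if not term or not VALID_ELEMENT.match(element):
            raise ValueError(f"invalid glob: {token!r}")
        globs.append((element, term.lower()))
    return globs

def matches(page, globs):
    """page is assumed to map element names ('t', 'h1', 'w', ...) to their text."""
    return all(term in page.get(element, "").lower() for element, term in globs)

if __name__ == "__main__":
    page = {"t": "Channing news", "h1": "Channing Tatum", "w": "some news story"}
    globs = parse_query("t:channing h1:tatum w:news")
    print(globs)                 # [('t', 'channing'), ('h1', 'tatum'), ('w', 'news')]
    print(matches(page, globs))  # True
```

Whether matching in `search.py` is substring-based or word-based is not specified here; the case-insensitive substring test above is just one reasonable choice.
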
I don't even know if I implemented PageRank right.
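
For reference, here is a minimal sketch of the textbook power-iteration form of PageRank to compare `pagerank.py` against. It is not the code in `pagerank.py`: the graph representation and the 0.85 damping factor are assumptions, only the 14-iteration default comes from this readme, and dangling pages (no outlinks) are simply skipped for brevity.

```python
def pagerank(links, iterations=14, damping=0.85):
    """links maps each page URL to the list of URLs it links to."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "teleport" share of rank.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling page: its rank is not redistributed here
            share = damping * ranks[page] / len(outlinks)
            for dst in outlinks:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks

if __name__ == "__main__":
    demo = {
        "a.example": ["b.example", "c.example"],
        "b.example": ["c.example"],
        "c.example": ["a.example"],
    }
    for url, rank in sorted(pagerank(demo).items(), key=lambda kv: -kv[1]):
        print(f"{rank:.4f}  {url}")
```

If the numbers coming out of `pagerank.py` look wildly different from this on a small test graph, that is a good place to start debugging.
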
See the License section of the BDCS Readme for more info on the AGPL license.