araknast/prtest

PRtest

This is an experiment to calculate PageRank over a wide range of real websites crawled by BDCS.

Requirements

  • pyssdb
  • A database of websites generated by BDCS

Using

  • After you have stopped the spider, run ./pagerank.py to calculate the PageRank of every page in the collection.
    • By default it runs 14 iterations; you can change this at the top of the script.
  • After you have calculated the PageRanks, you can search the database using ./search.py "<search query>".
  • Search queries consist of any number of globs in the form <element>:<term>.
  • For example, the search h1:channing h1:tatum will find websites with <h1> elements containing the words channing and tatum.
  • You can also combine elements to formulate a search like t:channing h1:tatum w:news, which returns all the pages in the collection with page titles containing the word channing, <h1> elements containing the word tatum, and <p> elements containing the word news.
  • The list of valid elements is as follows:
    • h(n), where n is a number from 1 to 6 - For the heading tags <h1> through <h6>
    • t - For the title of a page
    • w - For all <p> tags in a page
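
The query format above can be illustrated with a small sketch. This is not the code in search.py; the function name parse_query and the choice to raise an error on an unknown element are assumptions made for illustration only.

```python
import re

# Valid element names, taken from the list above: h1 through h6, t, and w.
VALID_ELEMENT = re.compile(r"^(h[1-6]|t|w)$")


def parse_query(query):
    """Split a query string into (element, term) pairs.

    Hypothetical helper: search.py may parse queries differently.
    """
    pairs = []
    for token in query.split():
        # Split on the first colon only, so terms may contain colons.
        element, sep, term = token.partition(":")
        if not sep or not term or not VALID_ELEMENT.match(element):
            raise ValueError(f"invalid search term: {token!r}")
        pairs.append((element, term))
    return pairs


print(parse_query("t:channing h1:tatum w:news"))
```

Each pair names the element to look in and the word to look for, so a page matches the example query only if all three pairs match.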

Gotchas

  • I don't even know if I implemented PageRank right.
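
For comparison, here is a minimal sketch of the textbook iterative PageRank on a toy link graph. It is not the code in pagerank.py and does not read from the database; the function name, the dict-of-lists input, the damping factor of 0.85, and the handling of pages with no outgoing links are all assumptions.

```python
def pagerank(links, iterations=14, damping=0.85):
    """Textbook iterative PageRank (a sketch, not pagerank.py itself).

    links maps each page to the list of pages it links to.
    14 iterations matches the README's default; 0.85 is the
    conventional damping factor and is an assumption here.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages with no outgoing links is shared by everyone.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {
            page: (1 - damping) / n + damping * dangling / n for page in pages
        }
        # Each page passes its damped rank equally to the pages it links to.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, rank in sorted(pagerank(toy_graph).items()):
    print(page, round(rank, 4))
```

With this formulation the ranks always sum to 1, which is a quick sanity check to run against the output of any PageRank implementation that uses the same normalization.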

License

See the License section of the BDCS Readme for more info on the AGPL license.

About

A very basic Python search engine using PageRank
