Skip to content

This Java app crawls the website knowyourmeme.com's confirmed memes and stores them into a knowledge graph (populated ontology); the ontology is called MEMO- the meme ontology.

Notifications You must be signed in to change notification settings

PR0CK0/knowyourmeme.com-Crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

knowyourmeme.com Meme Crawler and CSV/RDF converter

A sanavesa and Psychobagger project. This app allows the user to crawl the website knowyourmeme.com's confirmed memes and stores the meme: title, content origin year, content origin, meme origin year, meme origin, tags, category(ies), text body, image links and links to related memes. Additionally, the user can crawl any specific meme (confirmed or not) by simply entering the meme's URL. It uses the jsoup API. Written in Java. A lot of HTML parsing; the code in these classes is as decoupled and cohesive as possible, but some of it is inherently hardcoded.

Basic Info

If you just want the finished meme ontology, go to the /resources folder and grab the most updated version.

The initial meme information is taken from a table on the right-hand side of each meme on a specific knowyourmeme.com page. A second pass for more specific information is done by parsing the user-written text for the meme. This allows the app to collect content/meme origin and year, if these are all available.

In addition, the app converts the plaintext results of each meme parse into CSV and RDF (for ontologies) form. All of these can be exported, and multiple memes can be crawled at once.

Make sure to include the jsoup.jar to your buildpath.

everyConfirmedMeme.html is the html file containing links to every confirmed meme (as of February 2019) on knowyourmeme.com. This file was attained by navigating to the knowyourmeme.com page where all confirmed memes were listed and scrolling down until all of them were loaded, then copying the html. A bit brutish, but it sufficed.

Github Navigation

The /cluster folder contains code pertaining to the clustering of vectorized memes. Used in tandem with: https://github.com/Sanavesa/Memester to produce the vectors from the walks. The /model folder contains code for the crawler functionality. The /view folder contains code for the crawler interface. The /resources folder contains the memo ontology versions (1.0 being the final one).

A Word of Caution

It should be noted that if this project is intended to be used some time well beyond January 2020, the authors cannot guarantee that the HTML-parsing code is correct. The website structure may have changed in that time. Refer to classes MemeCrawler, MemeURLGrabber and MemeBODYCrawlerUpdated to see the HTML-parsing code.

About

This Java app crawls the website knowyourmeme.com's confirmed memes and stores them into a knowledge graph (populated ontology); the ontology is called MEMO- the meme ontology.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published