A project by Michael McRoskey (mmcrosk1) and Maggie Thomann (mthomann)

Operating Systems Project 4 is an SEO-optimization tool that tracks how often specified keywords appear on certain websites by repeatedly fetching and parsing those websites via libcurl. The project uses multi-threading, mutex locks, and condition variables to ensure thread safety for global resources such as boolean flags and queues. A user edits configuration files to specify the terms to search and the URLs to access, then runs `site-tester`, which periodically writes processed word-count data to CSV files.
- `site-tester.cpp`: Given a configuration file argument, uses producer/consumer threads to fetch websites and parse the fetched content, counting occurrences of the defined queries in real time.
- `Makefile`: Running the command `make` in this directory will properly compile `site-tester.cpp`.
- `ConfigFile.h`: Loads configuration parameters and sets defaults if necessary.
- `LibCurl.h`: Grabs a webpage via libcurl and stores it in a C++ string.
- `Config.txt`: Example configuration plain-text file. Lists arguments for `site-tester` in the form `PARAMETER=VALUE`.
- `Search.txt`: Example search-terms plain-text file with each term on its own line.
- `Sites.txt`: Example sites plain-text file with each `http://`-prefixed URL on its own line.
- `README.md`: Describes how to build, run, and configure the code.
- `EC1.txt`: Extra credit description.
- `html/`
  - `1.html`: Webpage to view the results of `1.csv`. Similarly, `n.html` (where n is an integer) will show the results of `n.csv`.
  - `append.txt`: File with minified HTML that gets appended to the output to form `1.html`, `2.html`, etc.
  - `styles.css`: CSS styles for the HTML output.
- `csv/`
  - `1.csv`: First CSV file; all generated CSV files will go here.
- `images/`
  - `ec_screenshot.png`: Screenshot of the extra credit at http://localhost:8000/html/1.html
The system should have a `g++` compiler installed and be able to compile with the following flags:

- `-g` for debugging symbols
- `-Wall` for warnings
- `-std=c++11` for C++11
- `-lpthread` for threading
- `-lcurl` for the libcurl library
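A minimal Makefile consistent with these flags might look like the following (a sketch only; the project's actual Makefile may differ):

```make
CXX      = g++
CXXFLAGS = -g -Wall -std=c++11
LDLIBS   = -lpthread -lcurl

site-tester: site-tester.cpp
	$(CXX) $(CXXFLAGS) -o site-tester site-tester.cpp $(LDLIBS)

clean:
	rm -f site-tester *.csv
```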
1. Edit `Config.txt`, `Search.txt`, and/or `Sites.txt` accordingly to configure the number of threads, the fetch period, the URLs to parse, and the search terms. See File Requirements below.
2. Run `$ make` to build the executables.
3. Run `$ site-tester Config.txt` to begin fetching/parsing URLs. The program will display each thread's actions on stdout.
4. After the first period, `site-tester` will output `1.csv`, `2.csv`, and so on in the current directory. You can view the word counts for the various sites.
5. To end the run, press `CTRL-C` in the command line.
6. Run `$ make clean` to delete `*.csv` files and executables.
1. Run steps 1-5 above.
2. `$ cd root_project_directory`
3. `$ python -m SimpleHTTPServer` (on Python 3, use `python3 -m http.server`)
4. Navigate to http://localhost:8000/html/1.html in a web browser, replacing `1.html` with any HTML file that has been generated.
Config.txt

See `Config.txt` for a working example.

```
PERIOD_FETCH=<int::period>
NUM_FETCH=<int::number_of_fetch_threads>
NUM_PARSE=<int::number_of_parse_threads>
SEARCH_FILE=<string::search_terms_filename>
SITE_FILE=<string::sites_filename>
```
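As a concrete illustration, a configuration that fetches every 60 seconds with 3 fetch threads and 2 parse threads might look like this (the values here are hypothetical, not the repository's shipped `Config.txt`):

```
PERIOD_FETCH=60
NUM_FETCH=3
NUM_PARSE=2
SEARCH_FILE=Search.txt
SITE_FILE=Sites.txt
```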
Search.txt

See `Search.txt` for a working example.

```
<string::search_term_1>
<string::search_term_2>
...
<string::search_term_n>
```
Sites.txt

See `Sites.txt` for a working example.

```
<string::url_to_fetch_1>
<string::url_to_fetch_2>
...
<string::url_to_fetch_n>
```
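For example, a two-site file could be (URLs hypothetical, each with the required `http://` prefix):

```
http://www.example.com
http://www.cnn.com
```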
Below is a pseudocode version of the `main()` function in `site-tester.cpp`:

```
int main():
    catch_signals()
    initialize_config_parameters()
    config.display()
    num = 1
    # Every period
    while(1):
        fetch_queue.lock()
        fetch_queue.push(urls[])
        fetch_queue.unlock()
        # Fetch Threads
        for thread in num_fetch_threads:
            create_thread(fetch())
        stop_fetching()   # use condition variable to stop fetching
        for thread in num_fetch_threads:
            join_thread()
        delete fetch_threads
        search_terms[] = load_search_terms()
        # Parse Threads
        for thread in num_parse_threads:
            create_thread(parse())
        stop_parsing()    # use condition variable to stop parsing
        for thread in num_parse_threads:
            join_thread()
        delete parse_threads
        output_to_file(to_string(num++) + ".csv")
        sleep_this_thread(config.period)
```
The `fetch()` and `parse()` functions each use a `unique_lock<mutex>` to lock and unlock the queues, along with two `condition_variable`s to alert the other threads when they can attempt to acquire the mutex again.
Rubric Qualifications

| Feature | Description |
|---|---|
| Coding style/Formatting | Good! (at least this README is well documented!) |
| Correct config parsing | Yes (gives warning for unknown params, sets defaults) |
| Single-site testing | Passes |
| Multiple-site testing | Passes |
| Error-prone site testing | Yes (`FOLLOWLOCATION` and `TIMEOUT_MS` curl flags specified) |
| Graceful exit (SIGHUP/CTRL-C) | Yes (will not exit while writing to file) |
| Config error protection | Yes (exceptions for I/O errors, default values, thread limits) |
| Single output file per fetch | Yes (`1.csv`, `2.csv`, ...) |
| Thread variations work | Yes (default # of threads; works with 1-8 fetch/parse threads) |
| Multi-thread, multi-site, multi-search works | Yes (tested with 5 sites, 5 search terms, 3 fetch threads, 2 parse threads) |
| Use of condition variables | Yes (`fetch_cv` and `parse_cv`) |
| Use of threading | Yes (used `std::thread`) |