Project 01 | DCC205 - Data Structures | March 2022, Federal University of Minas Gerais

Url Scheduler

Overview

In this project for the Data Structures course, students were asked to implement a URL scheduler that receives, sorts and provides URLs for a web crawler. It had to be written in C/C++, feature data structures implemented from scratch, include performance metrics and analysis, and be fully documented. Despite the seemingly simple data structure, putting it all together was quite a challenge.

Implementation objectives and specifications

  • Implement from scratch a data structure that organizes received URLs using a depth-first policy (prioritizing hosts that were found first).
  • Write code to recognize and decompose URL addresses by their hosts and subdomain levels, and to detect malformed ones. I used regular expressions here.
  • Implement support for reading input data from files indicated by command-line arguments and flags (with getopt, mirror); a minimal sketch follows this list.
  • Generate multiple performance metrics for analysis, such as execution time, storage use, spatial locality and more. Time and space complexity were also considered.
  • Write full documentation for the project.
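
The project's actual argument handling lives in src/, but as a rough illustration of what getopt-based parsing of a call like bin/a.out input.txt -b -m could look like, here is a minimal, self-contained sketch. The flags -b, -m and -t are taken from the usage examples further down; what each one actually enables is documented in the Appendix of Documentation.pdf, not assumed here.

  #include <unistd.h>
  #include <cstdio>

  int main(int argc, char* argv[]) {
      bool flag_b = false, flag_m = false, flag_t = false;
      int opt;
      // Accept the -b, -m and -t flags seen in the usage examples;
      // their meanings are described in Documentation.pdf.
      while ((opt = getopt(argc, argv, "bmt")) != -1) {
          switch (opt) {
          case 'b': flag_b = true; break;
          case 'm': flag_m = true; break;
          case 't': flag_t = true; break;
          default:
              std::fprintf(stderr, "usage: %s input_file.txt [-b] [-m] [-t]\n", argv[0]);
              return 1;
          }
      }
      // GNU getopt permutes argv, so the input file may come before or after the flags;
      // after the loop it is the first remaining non-option argument.
      if (optind >= argc) {
          std::fprintf(stderr, "error: missing input file\n");
          return 1;
      }
      const char* input_path = argv[optind];
      std::printf("input: %s (b=%d, m=%d, t=%d)\n", input_path, flag_b, flag_m, flag_t);
      return 0;
  }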

A url-what?

A web crawler is an essential part of a search engine that visits URLs and downloads their content, which is subsequently processed, indexed, and used by the search engine to decide which results to show the users. As the Web is becoming increasingly dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine’s index, in order to keep the index fresh by picking up the pages’ changed content. The problem of crawl scheduling consists in determining when to (re-)download which pages’ content so as to ensure that the search engine has a complete and up-to-date view of the Web. (From Microsoft, mirror).

This project implements a scheduler that receives, sorts and provides URLs for a crawler to use (the crawler itself is not implemented).

How will this scheduler sort URLs?

URLs can be decomposed into a few parts:

<protocol>://<subdomain>.<host> <path>?<query>#<fragment>

For example:

  • https://www.google.com /search?q=search+terms+here#<fragment>
  • https://accounts.google.com /v3/signin/identifier?continue=gibberish
  • https://website.com /level1/level2/level3/level4

The scheduler sorts new URLs by their <host> (in order of appearance), and addresses that belong to the same host are sorted by their directory hierarchy (or path levels). Check out Section 2 and the Appendix in the Documentation.pdf file for more info.
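
To make the decomposition above concrete, here is a small, self-contained C++ sketch (not the project's actual code) that uses std::regex to split a URL into protocol, host and path, and counts the path levels that can be used to order addresses within the same host. The names UrlParts, decompose and path_levels are illustrative only.

  #include <iostream>
  #include <regex>
  #include <string>

  // Hypothetical decomposition helper; the project's real regex and
  // malformed-URL rules are the ones described in Documentation.pdf.
  struct UrlParts {
      std::string protocol;  // e.g. "https"
      std::string host;      // subdomains included, e.g. "accounts.google.com"
      std::string path;      // e.g. "/v3/signin/identifier"
  };

  bool decompose(const std::string& url, UrlParts& out) {
      // <protocol>://<host><path>[?<query>][#<fragment>]
      static const std::regex re(R"(^(https?)://([^/?#]+)([^?#]*)(\?[^#]*)?(#.*)?$)");
      std::smatch m;
      if (!std::regex_match(url, m, re)) return false;  // treated as malformed
      out.protocol = m[1];
      out.host     = m[2];
      out.path     = m[3].str().empty() ? "/" : m[3].str();
      return true;
  }

  // Number of directory levels in the path, used to compare URLs of the same host.
  int path_levels(const std::string& path) {
      int levels = 0;
      for (char c : path)
          if (c == '/') ++levels;
      return levels;
  }

  int main() {
      UrlParts p;
      if (decompose("https://website.com/level1/level2/level3/level4", p))
          std::cout << p.host << " levels=" << path_levels(p.path) << "\n";
      // prints: website.com levels=4
  }

A scheduler following the policy above would then keep hosts in their order of first appearance and use the extracted host and path-level information to decide where each newly received URL is placed.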

Important files and execution instructions

  • Documentation.pdf✨ is a complete report of the problem description, implementation choices and performance evaluation (in Portuguese).
  • src/✨ and include/✨ are where the scheduler code and classes can be found.
  • bin/ is where the compiled program will be stored.

The tests/ and performance/ folders simply store tests and performance results; you may or may not use them. bin/ and obj/ are used during the compilation process and are there for convenience and organization. Keep in mind that the Makefile describes the automatic instructions for compiling the project, so avoid tinkering with it.

How to run

  • Requirements: g++ (from the GNU Compiler Collection) and GNU make (most Linux distros already include them).
  • From a terminal in the project's root folder, compile the project with the make command. The resulting a.out program will be located in the bin folder.
  • To run, use a command following the structure program input_file.txt -flags. Check out some examples:
bin/a.out input.txt -b -m
bin/a.out input.txt
bin/a.out input.txt -t -m

For instructions on what each flag does, what functionality the scheduler provides, or how the input files should be formatted, check out the Appendix on page 17 of the Documentation.pdf file.

TODO

  • change output name from a.out to something more descriptive (and update documentation)
  • test execution and metrics generation (and update documentation)
