crawly crawls the web from a set of seed urls. It sends a request to each url, parses new urls out of the response, stores them in a repository, and prints them to STDOUT as they are fetched. If a number of urls to fetch is specified, crawling stops after that many urls have been successfully fetched.
- Python 2.7.10
- Python Module - Beautiful Soup 4.6.0
- Python Module - Requests 2.14.2
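The crawl loop described above can be sketched roughly as follows using the listed modules. This is an illustrative sketch rather than crawly's actual code; the `crawl` function, its in-memory `visited` set, and the breadth-first queue are invented for the example.

```python
# Illustrative sketch of the crawl loop, not crawly's actual implementation.
import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, count=None):
    """Fetch pages starting from seed_urls, printing each url as it is fetched."""
    visited = set()          # simple in-memory stand-in for the url repository
    queue = list(seed_urls)  # urls waiting to be fetched, breadth-first
    fetched = 0
    while queue and (count is None or fetched < count):
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip urls that fail to fetch; they do not count toward the total
        fetched += 1
        print(url)
        # Parse anchor tags out of the response and enqueue any new urls.
        soup = BeautifulSoup(response.text, 'html.parser')
        for anchor in soup.find_all('a', href=True):
            href = anchor['href']
            if href.startswith('http') and href not in visited:
                queue.append(href)


if __name__ == '__main__':
    crawl(['https://www.python.org'], count=10)
```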
$ python main.py [-h] [-s SEED_URLS [SEED_URLS ...]] [-c COUNT]
optional arguments:
-h, --help show this help message and exit
-s SEED_URLS [SEED_URLS ...], --seed_urls SEED_URLS [SEED_URLS ...]
Set of seed urls
-c COUNT, --count COUNT
Number of links to be fetched
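A sketch of an argparse setup that would produce an interface like the one above; the project's actual option wiring may differ, and the defaults shown simply mirror the behaviour described below.

```python
import argparse


def parse_args():
    # Hypothetical argument parsing matching the help text above.
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--seed_urls', nargs='+',
                        default=['https://www.python.org'],
                        help='Set of seed urls')
    parser.add_argument('-c', '--count', type=int, default=None,
                        help='Number of links to be fetched')
    return parser.parse_args()
```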
The following command starts crawling from the seed urls https://www.python.org and https://docs.python.org and stops after 10 urls have been successfully fetched. If no seed url is specified, https://www.python.org is used as the default. If no count is specified, the crawler runs indefinitely until it receives a keyboard interrupt.
$ python main.py --seed_urls 'https://www.python.org' 'https://docs.python.org' --count 10
All logs (debug, error, info) generated during the execution of the program are stored in logs/crawly.log.
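A minimal sketch of the kind of logging setup this implies, assuming the standard-library logging module; crawly's actual format and handler configuration may differ.

```python
import logging
import os

# Sketch only: route debug, info and error records to logs/crawly.log.
if not os.path.isdir('logs'):
    os.makedirs('logs')
logging.basicConfig(
    filename='logs/crawly.log',
    level=logging.DEBUG,
    format='%(asctime)s %(levelname)s %(name)s: %(message)s',
)
```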
- `make clean`: Clears all the `.pyc` and `.log` files generated during execution of the program.
- `make clean-logs`: Clears only the `.log` files generated during execution of the program.
- `make clean-pyc`: Clears only the `.pyc` files generated during execution of the program.
- `make run`: Executes the program taking `https://www.python.org` as the default seed url and crawls until it receives a keyboard interrupt.
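One plausible Makefile providing these targets is sketched below; the repository's actual Makefile may differ.

```make
# Sketch of a Makefile with the targets described above.
clean: clean-logs clean-pyc

clean-logs:
	find . -name '*.log' -delete

clean-pyc:
	find . -name '*.pyc' -delete

run:
	python main.py
```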
- Multithreaded or distributed crawler that issues many HTTP requests in parallel
- Obey `robots.txt` before crawling a website (sketched below)
- Skip fetching image, video and document urls
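For the `robots.txt` item, one way compliance could be checked before fetching a url is sketched below, using the robotparser module from the Python 2.7 standard library; the function name and user agent string are invented for the example.

```python
# Sketch of a robots.txt check that could run before fetching a url.
import robotparser
from urlparse import urljoin, urlparse


def allowed_by_robots(url, user_agent='crawly'):
    """Return True if the url's host permits fetching it according to robots.txt."""
    parts = urlparse(url)
    root = '{0}://{1}'.format(parts.scheme, parts.netloc)
    parser = robotparser.RobotFileParser()
    parser.set_url(urljoin(root, '/robots.txt'))
    try:
        parser.read()
    except IOError:
        return True  # robots.txt could not be retrieved; assume fetching is allowed
    return parser.can_fetch(user_agent, url)
```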