Skip to content

blueflash2o/webcrawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

web crawler

A web crawler for a code challenge.

To start run node server.

Lauch a web browser to http://localhost:8081/crawl.

Will crawl the urls in the crawl.json and fill out and submit the forms that are set.

The crawl.json

The configs are done in the crawl.json.

"sites": [
        {
            "url": "http://testing-ground.scraping.pro/login",
            "origin": "http://testing-ground.scraping.pro",
            "forms": [
                {
                    "inputs": {
                        "usr": "admin",
                        "pwd": "12345"
                    },
                    "result": {
                        "element": "h3",
                        "attr": "class"
                    }
                }
            ]
        }
    ]

First node is the sites level. It is a list of urls to crawl.

It has a url child with a value of the url.
Another child is the origin. It would be the main url that will be added to all of the forms actions.
The forms child holds a list of forms to be filled out. Well it would if it would be improved.
The forms has a child inputs which is a list of key: value of the html elements to submit with the form.
The forms has a child result which is a object. This object has a element property with the value being the element to parse from the post response. It also has a attr property which is what attribute to log.

Currently the log is "{attr value} response {element value} of the post response.

About

A web crawler for a code challenge

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published