
Web Scraper Using Colly Framework

This Go program crawls a specified list of web pages and scrapes their text content.

Table of Contents

  • Introduction
  • Features
  • Installation
  • Usage
  • Code Explanation
  • Analysis
  • Observations

Introduction

This program was created as part of the MSDS-431 class. It crawls through 10 designated URLs and scrapes the body text from the HTML markup, using the Colly scraping framework. You can find more information about Colly, including examples and installation instructions, in the Colly documentation (https://go-colly.org/).

Features

  • Crawls Wikipedia pages related to intelligent systems and robotics.
  • Scrapes the text from the HTML markup.
  • Generates a JSON lines file that stores the scraped text and the associated URL.
  • Reports execution time.

Installation

  1. Make sure you have Go installed.
  2. Clone this repo to your local machine:
    git clone https://github.com/hamodikk/Colly-Scraper.git
  3. Navigate to the project directory:
    cd <project-directory>
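
If Colly is not already in your module cache, you may need to fetch it first. The v2 import path below is an assumption; the repository may instead use the original github.com/gocolly/colly path:

    go get github.com/gocolly/colly/v2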

Usage

Use the following command in your terminal or PowerShell to run the program:

go run .\main.go

You can also run the program by double-clicking scrawl.exe.
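
To rebuild the executable yourself, a standard Go build from the project root should work; the output name scrawl.exe simply matches the binary mentioned above:

go build -o scrawl.exe .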

Performance is evaluated internally by measuring processing time; no unit tests were written.
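
Although no tests ship with the repository, a minimal unit test for the JSON encoding of the data struct could look like the following sketch (the file name main_test.go and the sample strings are illustrative, not part of the repo):

package main

import (
	"encoding/json"
	"testing"
)

// TestMarshalIntSysRobData checks that the struct serializes with the
// lowercase field names expected in the JSON lines output.
func TestMarshalIntSysRobData(t *testing.T) {
	data := IntSysRobData{
		URL:  "https://en.wikipedia.org/wiki/Robotics",
		Text: "Robotics is a branch of engineering.",
	}
	got, err := json.Marshal(data)
	if err != nil {
		t.Fatalf("marshal failed: %v", err)
	}
	want := `{"url":"https://en.wikipedia.org/wiki/Robotics","text":"Robotics is a branch of engineering."}`
	if string(got) != want {
		t.Errorf("got %s, want %s", got, want)
	}
}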

Code Explanation

  • Create a struct to hold the URL and the scraped text:
type IntSysRobData struct {
	URL  string `json:"url"`
	Text string `json:"text"`
}
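
Each line of the output file scraped_data.jl is one marshaled instance of this struct, for example (text shortened for illustration):

{"url":"https://en.wikipedia.org/wiki/Robotics","text":"Robotics is the interdisciplinary study ..."}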
  • Set up the main function (explanations for parts of the function are given in the comments below).
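
The snippet assumes the following package clause and imports; the Colly v2 import path is an assumption (older versions use github.com/gocolly/colly):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/gocolly/colly/v2"
)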
func main() {
	// Start timer
	start := time.Now()
	// Wikipedia URLs for the topics of interest
	urls := []string{
		"https://en.wikipedia.org/wiki/Robotics",
		"https://en.wikipedia.org/wiki/Robot",
		"https://en.wikipedia.org/wiki/Reinforcement_learning",
		"https://en.wikipedia.org/wiki/Robot_Operating_System",
		"https://en.wikipedia.org/wiki/Intelligent_agent",
		"https://en.wikipedia.org/wiki/Software_agent",
		"https://en.wikipedia.org/wiki/Robotic_process_automation",
		"https://en.wikipedia.org/wiki/Chatbot",
		"https://en.wikipedia.org/wiki/Applications_of_artificial_intelligence",
		"https://en.wikipedia.org/wiki/Android_(robot)",
	}

	// Create the collector
	c := colly.NewCollector(
		// Restrict crawling to the English Wikipedia domain
		colly.AllowedDomains("en.wikipedia.org"),
	)

	// Create a file to store the scraped data in JSON lines format
	file, err := os.Create("scraped_data.jl")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	c.OnRequest(func(r *colly.Request) {
		log.Printf("Visiting : %s", r.URL.String())
	})

	// This is where we retrieve the text and URL from the html
	// This is a callback, which means it will only be executed
	// when colly visits a URL
	c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
		// Retrieve the text and strip the html elements from it
		text := e.Text
		// Get the URL from the Request struct
		url := e.Request.URL.String()

		// The Robot URL produced duplicate entries, one of them
		// with an empty text field, so skip any element whose
		// text is empty (see Observations below)
		if text != "" {
			// Store the text and url in the data struct we created before
			data := IntSysRobData{
				URL:  url,
				Text: text,
			}

			// Marshal the data and append it to the JSON lines file
			jsonData, err := json.Marshal(data)
			if err != nil {
				log.Printf("Failed to marshal data for %s: %v", url, err)
				return
			}
			file.WriteString(string(jsonData) + "\n")

			fmt.Printf("Scraped URL: %s\n", url)
		} else {
			fmt.Printf("Skipping, empty text for URL: %s\n", url)
		}
	})

	c.OnResponse(func(r *colly.Response) {
		log.Printf("DONE Visiting : %s", r.Request.URL.String())
	})

	// Start scraping
	for _, url := range urls {
		if err := c.Visit(url); err != nil {
			log.Printf("Failed to visit %s: %v", url, err)
		}
	}

	// Calculate elapsed time since start
	elapsed := time.Since(start)
	fmt.Printf("Scraping completed in %s\n", elapsed)
}

Analysis

For this program, I implemented a simple timer using Go's built-in time package.

I also ran the Python/Scrapy example solution (the Python code can be found here) and timed both the Python and the Go programs. The results are as follows:

Language   Execution time
Go         637.43 ms
Python     16.65 s

We can see that Go performed significantly better than Python on the same crawl/scrape task.

Observations

There were some things to note as I was troubleshooting the code.

First of all, I had issues populating the JSON lines file: running the code would create the file, but the program would not crawl the URLs. I realized I had passed an incorrect domain to colly.AllowedDomains; supplying the correct domain resolved the issue.

Once I was able to populate the JSON lines file, I ran into a different issue: the text content included all of the menu elements from the pages. This turned out to be because the initial target of my c.OnHTML callback was "body". After searching for solutions, I found this article, which discusses web scraping in Go with Colly and mentions that the text content of Wikipedia pages sits under ".mw-parser-output". Switching my target to that selector removed all of the menu elements from my JSON file.

Lastly, I had problems with one of the URLs (https://en.wikipedia.org/wiki/Robot): for reasons I could not explain, the scraper was creating duplicate entries for it. After troubleshooting for a while with conditionals and with logging from OnRequest and OnResponse, I was unable to pinpoint the cause; I suspect a redirect may be involved. I did notice that one of the duplicate entries had no text content, so I implemented a conditional that checks whether the text acquired in OnHTML is empty and, if so, skips writing the entry. This is not a sure fix, since unique URLs with no text content would also be skipped, but it was the only solution I found that resolved the issue in this specific situation.
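
A more direct guard would be to remember which URLs have already produced a record and skip repeats. The sketch below illustrates the idea; the seen map is hypothetical and not part of the current code:

	seen := make(map[string]bool)

	c.OnHTML(".mw-parser-output", func(e *colly.HTMLElement) {
		url := e.Request.URL.String()
		// Skip any URL we have already written a record for,
		// regardless of whether its text is empty
		if seen[url] {
			return
		}
		seen[url] = true
		// ... marshal and write the record as before ...
	})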
