
Commit ea26cfe

Merge pull request #1 from ADSA-UIUC/develop
Develop
2 parents e917dac + 37a806e · commit ea26cfe

14 files changed: +247 −140 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+.vagrant
 *.box
 *.pyc
 .*.sw*

BeautifulSoup.py

Lines changed: 0 additions & 29 deletions
This file was deleted.

README.md

Lines changed: 29 additions & 12 deletions
@@ -5,24 +5,41 @@ code as well as a vagrant box development environment for you to use.
 
 
 
+
+
 ## High Level Instructions
 1. Fork this repository on github to track the changes you make
-2. Set up [vagrant](https://github.com/ADSA-UIUC/Resources/blob/develop/dev-environment/vagrant/setup.md) (Link may be broken)
+2. Set up vagrant ([for Mac](https://github.com/ADSA-UIUC/Resources/blob/master/dev-environment/vagrant/mac-setup.md), [for Windows](https://github.com/ADSA-UIUC/Resources/blob/master/dev-environment/vagrant/windows-setup.md))
 3. Gather data from any source on the web (Be careful of rate limits!)
 4. Store that data into the MySQL database installed on the vagrant box
 5. Push the code back to github!
 
-## Step descriptions
-1. Download and install [git](https://git-scm.com/downloads)
-   - [Follow the instructions to fork the repo](https://help.github.com/articles/fork-a-repo/)
-   - Next, clone the repository from your command line via ```git clone [url]```
-2. Follow the linked instructions from the "Resources" repository to use vagrant
-3. Take a look through the python examples. They include using a json api and using beautiful soup
-4. Read through the example ews.py store() function to get an idea of how to use MySQL
 
 
-## Possible data sources
-- [Riot Games' API](https://developer.riotgames.com/api/methods)
-- [Twitter API](https://dev.twitter.com/overview/api)
-- Any website's html (via [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup))
+## Detailed Instructions
+
+The two samples in this repository do the same thing in two different ways.
+Both print a list of the titles and links on the front page of reddit. The
+first, ```html_scraping_sample.py```, uses the BeautifulSoup4 library to read
+the html that reddit serves to web browsers. If you take a look at the
+code, the important part is the usage of BeautifulSoup to extract all the
+links in the page that have a class of "title". This is the html scraping
+method. The other file, ```json_scraping_sample.py```, shows how to do it
+using the json that reddit is nice enough to provide for you. By adding
+```.json``` to any reddit link, you can get it as a json document, which
+makes extracting all the links a bit easier (a sketch of this json
+approach follows the diff below).
+
+
+We would like you to write code that works with any data source on the web, be it
+html, json, or another format. Get the data into python first, then work on
+getting it into MySQL.
+
+
+In order to use MySQL, the first thing you need to do is create a table that
+will store your data. For a tutorial on creating a table, see
+[here](http://www.tutorialspoint.com/mysql/mysql-create-tables.htm). Make sure that
+the table you design has columns for every piece of information that you want.
+Once that is done, you can use the MySQLdb library to store the data you
+gathered in the first part. The code samples show how to do this, and a
+sketch of a matching table schema follows html_scraping_sample.py below.
 
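The ```json_scraping_sample.py``` file mentioned in the README does not appear in this diff view. A minimal sketch of the json approach the README describes might look like the following; the reddit JSON field names (```data``` → ```children``` → ```data```) and the custom User-Agent header are assumptions here, not code taken from the commit:

```python
# Sketch of the json approach described in the README above.
# The field names and User-Agent are assumptions; the actual
# json_scraping_sample.py is not shown in this diff.
import json
import urllib2


def fetch_json():
    # Appending .json to a reddit link returns the page as a json document
    link = "https://www.reddit.com/.json"
    # Reddit tends to throttle the default urllib2 user agent,
    # so send a custom one
    request = urllib2.Request(link, headers={"User-Agent": "adsa-sample/0.1"})
    return json.loads(urllib2.urlopen(request).read())


def extract_links(doc):
    # Each front-page entry lives under data -> children -> data
    output = []
    for child in doc["data"]["children"]:
        post = child["data"]
        print(post["title"] + " " + post["url"])
        output.append((post["url"], post["title"]))
    return output


if __name__ == "__main__":
    extract_links(fetch_json())
```

Compared with the html version, there is no class-based selection step: the titles and urls arrive as plain json fields.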

config.json

Lines changed: 0 additions & 6 deletions
This file was deleted.

ews.py

Lines changed: 0 additions & 93 deletions
This file was deleted.

html_scraping_sample.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+import urllib2
+import MySQLdb
+
+# To use bs4, run this command in the vagrant box:
+# sudo pip install beautifulsoup4
+from bs4 import BeautifulSoup
+
+
+# A simple function to grab a webpage
+def fetch():
+    link = "https://www.reddit.com/"
+    text = urllib2.urlopen(link).read()
+    return text
+
+
+# Function used to scrape links from reddit.
+# Returns a list of tuples containing the link
+# and its title.
+def extract_links(text):
+    soup = BeautifulSoup(text, "html.parser")
+
+    # Find all of the links with a class of "title"
+    links = soup.find_all('a', class_="title")
+
+    output = []
+    for link in links:
+        # Print each link
+        print(link.get_text() + " " + link['href'])
+        data = (link['href'], link.get_text())
+        output.append(data)
+
+    return output
+
+
+# Puts the data into the MySQL database defined in tables.sql
+def store(data):
+    host = "localhost"
+    user = "root"
+    passwd = "adsauiuc"
+    db_name = "adsa"
+    db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db_name)
+    # Create a cursor that can execute SQL commands
+    cursor = db.cursor()
+
+    # Parameterized query; the driver escapes the values
+    sql = "INSERT INTO reddit (link, title) VALUES (%s, %s)"
+    for link in data:
+        cursor.execute(sql, (link[0].encode("latin-1", "replace"),
+                             link[1].encode("latin-1", "replace")))
+
+    # Commit the changes only after all inserts have succeeded without errors
+    db.commit()
+
+    # Always close the connection
+    db.close()
+
+
+if __name__ == "__main__":
+    text = fetch()
+    links = extract_links(text)
+    store(links)
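The ```store()``` function above writes to a ```reddit``` table "defined in tables.sql", but ```tables.sql``` is not part of this diff. A plausible schema matching the INSERT statement, with assumed column types and sizes, could be created like this:

```python
# Sketch of a table matching store(); tables.sql is not shown in this
# diff, so the column types and sizes here are assumptions.
import MySQLdb

db = MySQLdb.connect(host="localhost", user="root",
                     passwd="adsauiuc", db="adsa")
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS reddit (
        id INT NOT NULL AUTO_INCREMENT,
        link VARCHAR(2048) NOT NULL,
        title VARCHAR(512) NOT NULL,
        PRIMARY KEY (id)
    )
""")
db.close()
```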

images/Twitter API.png

172 KB

images/Twitter API_annotated.png

172 KB

images/Twitter_retweet.png

163 KB

images/http1-url-structure.png

21.3 KB
