
Commit ea26cfe

Merge pull request #1 from ADSA-UIUC/develop
Develop
2 parents e917dac + 37a806e · commit ea26cfe

14 files changed: +247 −140 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+.vagrant
 *.box
 *.pyc
 .*.sw*

BeautifulSoup.py

Lines changed: 0 additions & 29 deletions
This file was deleted.

README.md

Lines changed: 29 additions & 12 deletions
@@ -5,24 +5,41 @@ code as well as a vagrant box development environment for you to use.
 
 
 
+
+
 ## High Level Instructions
 1. Fork this repository on github to track the changes you make
-2. Set up [vagrant](https://github.com/ADSA-UIUC/Resources/blob/develop/dev-environment/vagrant/setup.md) (Link may be broken)
+2. Set up vagrant ([for Mac](https://github.com/ADSA-UIUC/Resources/blob/master/dev-environment/vagrant/mac-setup.md), [for Windows](https://github.com/ADSA-UIUC/Resources/blob/master/dev-environment/vagrant/windows-setup.md))
 3. Gather data from any source on the web (Be careful of rate limits!)
 4. Store that data into the MySQL database installed on the vagrant box
 5. Push the code back to github!
 
-## Step descriptions
-1. Download and install [git](https://git-scm.com/downloads)
-   - [Follow the instructions to fork the repo](https://help.github.com/articles/fork-a-repo/)
-   - Next, clone the repository from your command line via ```git clone [url]```
-2. Follow the linked instructions from the "Resources" repository to use vagrant
-3. Take a look through the python examples. They include using a json api and using beautiful soup
-4. Read through the example ews.py store() function to get an idea of how to use MySQL
 
 
-## Possible data sources
-- [Riot Games' API](https://developer.riotgames.com/api/methods)
-- [Twitter API](https://dev.twitter.com/overview/api)
-- Any website's html (via [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup))
+## Detailed Instructions
+
+The two samples in this repository do the same thing in two different ways.
+Both print a list of the titles and links on the front page of reddit. The
+first, ```html_scraping_sample.py```, uses the BeautifulSoup4 library to read
+the html that reddit serves to web browsers. If you take a look at the
+code, the important part is the usage of BeautifulSoup to extract all the
+links in the page that have a class of "title". This is the html scraping
+method. The other file, ```json_scraping_sample.py```, shows how to do it
+using the json that reddit is nice enough to provide for you. By adding
+```.json``` to any reddit link, you can get it as a json document, which
+makes extracting all the links a bit easier (a sketch of this json
+approach follows the diff below).
+
+
+We would like you to write code that works with any data source on the web, be it
+html, json, or another format. Get the data into python first, then work on
+getting it into MySQL.
+
+
+In order to use MySQL, the first thing you need to do is create a table that
+will store your data. For a tutorial on creating a table, see
+[here](http://www.tutorialspoint.com/mysql/mysql-create-tables.htm). Make sure that
+the table you design has columns for every piece of information that you want.
+Once that is done, you can use the MySQLdb library to store the data you
+gathered in the first part. The code samples show how to do this, and a
+sketch of a matching table schema follows html_scraping_sample.py below.
 
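The ```json_scraping_sample.py``` file mentioned in the README does not appear in this diff view. A minimal sketch of the json approach the README describes might look like the following; the reddit JSON field names (```data``` → ```children``` → ```data```) and the custom User-Agent header are assumptions here, not code taken from the commit:

```python
# Sketch of the json approach described in the README above.
# The field names and User-Agent are assumptions; the actual
# json_scraping_sample.py is not shown in this diff.
import json
import urllib2


def fetch_json():
    # Appending .json to a reddit link returns the page as a json document
    link = "https://www.reddit.com/.json"
    # Reddit tends to throttle the default urllib2 user agent,
    # so send a custom one
    request = urllib2.Request(link, headers={"User-Agent": "adsa-sample/0.1"})
    return json.loads(urllib2.urlopen(request).read())


def extract_links(doc):
    # Each front-page entry lives under data -> children -> data
    output = []
    for child in doc["data"]["children"]:
        post = child["data"]
        print(post["title"] + " " + post["url"])
        output.append((post["url"], post["title"]))
    return output


if __name__ == "__main__":
    extract_links(fetch_json())
```

Compared with the html version, there is no class-based selection step: the titles and urls arrive as plain json fields.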

config.json

Lines changed: 0 additions & 6 deletions
This file was deleted.

ews.py

Lines changed: 0 additions & 93 deletions
This file was deleted.

html_scraping_sample.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
+import urllib2
+import MySQLdb
+
+# To use bs4, run this command in the vagrant box:
+# sudo pip install beautifulsoup4
+from bs4 import BeautifulSoup
+
+
+# A simple function to grab a webpage
+def fetch():
+    link = "https://www.reddit.com/"
+    text = urllib2.urlopen(link).read()
+    return text
+
+
+# Function used to scrape links from reddit.
+# Returns a list of tuples containing the link
+# and its title.
+def extract_links(text):
+    soup = BeautifulSoup(text, "html.parser")
+
+    # Find all of the links with a class of "title"
+    links = soup.find_all('a', class_="title")
+
+    output = []
+    for link in links:
+        # Print each link
+        print(link.get_text() + " " + link['href'])
+        data = (link['href'], link.get_text())
+        output.append(data)
+
+    return output
+
+
+# Puts the data into the MySQL database defined in tables.sql
+def store(data):
+    host = "localhost"
+    user = "root"
+    passwd = "adsauiuc"
+    db_name = "adsa"
+    db = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db_name)
+    # Create a cursor that can execute SQL commands
+    cursor = db.cursor()
+
+    # Parameterized query; the driver escapes the values
+    sql = "INSERT INTO reddit (link, title) VALUES (%s, %s)"
+    for link in data:
+        cursor.execute(sql, (link[0].encode("latin-1", "replace"),
+                             link[1].encode("latin-1", "replace")))
+
+    # Commit the changes only after all inserts have succeeded without errors
+    db.commit()
+
+    # Always close the connection
+    db.close()
+
+
+if __name__ == "__main__":
+    text = fetch()
+    links = extract_links(text)
+    store(links)
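The ```store()``` function above writes to a ```reddit``` table "defined in tables.sql", but ```tables.sql``` is not part of this diff. A plausible schema matching the INSERT statement, with assumed column types and sizes, could be created like this:

```python
# Sketch of a table matching store(); tables.sql is not shown in this
# diff, so the column types and sizes here are assumptions.
import MySQLdb

db = MySQLdb.connect(host="localhost", user="root",
                     passwd="adsauiuc", db="adsa")
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS reddit (
        id INT NOT NULL AUTO_INCREMENT,
        link VARCHAR(2048) NOT NULL,
        title VARCHAR(512) NOT NULL,
        PRIMARY KEY (id)
    )
""")
db.close()
```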

images/Twitter API.png

172 KB

images/Twitter API_annotated.png

172 KB

images/Twitter_retweet.png

163 KB

images/http1-url-structure.png

21.3 KB
