KashyapAndy/Reddit_Project

Project 3: Web Scraping & Classification

Description

In week four we learned about a few different classifiers. In week five we'll learn about web scraping, Natural Language Processing, and some additional classification methods. Now we're going to put those skills to the test.

Scenario

You're fresh out of your Data Science bootcamp and looking to break through in the world of freelance data journalism. Nate Silver and co. at FiveThirtyEight have agreed to hear your pitch for a story in two weeks!

Your piece is going to be on how to create a Reddit post that will get the most engagement from Reddit users. Because this is FiveThirtyEight, you're going to have to get data and analyze it in order to make a compelling narrative.

Project Summary

In this project, we will practice two major skills: collecting data by scraping a website, and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: What characteristics of a post on Reddit are most predictive of the overall interaction on a thread (as measured by number of comments)?

Your method for acquiring the data will be scraping the 'hot' threads as listed on the Reddit homepage. You'll acquire AT LEAST FOUR pieces of information about each thread:

  1. The title of the thread
  2. The subreddit that the thread corresponds to
  3. The length of time it has been up on Reddit
  4. The number of comments on the thread
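As a minimal sketch of extracting the four fields above with BeautifulSoup, the snippet below parses a small inline HTML sample rather than a live page. The markup, the `div.thing` and `a.title` selectors, and the `data-subreddit` / `data-comments-count` attributes are assumptions made for illustration; inspect the real page source and adjust the selectors before relying on any of them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only. Real Reddit HTML differs and
# changes over time, so every selector below must be checked against the page.
SAMPLE_HTML = """
<div class="thing" data-subreddit="dataisbeautiful" data-comments-count="412">
  <a class="title" href="/r/dataisbeautiful/comments/abc">Chart of the week</a>
  <time datetime="2017-10-30T14:05:00+00:00">3 hours ago</time>
</div>
<div class="thing" data-subreddit="python" data-comments-count="57">
  <a class="title" href="/r/python/comments/def">What is new in 3.7</a>
  <time datetime="2017-10-30T09:30:00+00:00">8 hours ago</time>
</div>
"""


def parse_threads(html):
    """Return one dict per thread with title, subreddit, post time, comments."""
    soup = BeautifulSoup(html, "html.parser")
    threads = []
    for thing in soup.select("div.thing"):
        title_tag = thing.select_one("a.title")
        time_tag = thing.find("time")
        threads.append({
            "title": title_tag.get_text(strip=True) if title_tag else None,
            "subreddit": thing.get("data-subreddit"),
            # Store the raw timestamp; the age of the post can be derived later.
            "posted_at": time_tag.get("datetime") if time_tag else None,
            "num_comments": int(thing.get("data-comments-count", 0)),
        })
    return threads


threads = parse_threads(SAMPLE_HTML)
for thread in threads:
    print(thread)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test on saved pages before pointing it at live requests.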

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts whether or not a given Reddit post will have above or below the median number of comments.
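One way to build the binary target is to compare each thread's comment count with the median of the whole sample. The sketch below uses made-up counts, and it assumes that threads exactly at the median are labeled 0; that tie-breaking rule is a choice, not something the brief specifies.

```python
from statistics import median

# Toy comment counts for illustration, not real scraped data.
comment_counts = [12, 340, 57, 8, 1500, 95]

threshold = median(comment_counts)

# 1 = above the median, 0 = at or below the median (assumed tie rule).
labels = [1 if count > threshold else 0 for count in comment_counts]

print(threshold)  # 76.0
print(labels)     # [0, 1, 0, 0, 1, 1]
```

Because the split is at the median, the two classes come out roughly balanced, so the baseline accuracy to beat is close to 50%.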

BONUS PROBLEMS

  1. If creating a logistic regression, GridSearch Ridge and Lasso for this model and report the best hyperparameter values.
  2. Scrape the full text of the threads using Selenium (you'll learn about this in Webscraping II).
  3. Write the actual article that you're pitching and turn it into a blog post that you host on your personal blog.
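For the first bonus problem, a possible starting point is to grid search over the penalty type and regularization strength of a scikit-learn logistic regression. In this sketch, `"l2"` corresponds to Ridge and `"l1"` to Lasso; the synthetic data from `make_classification`, the `C` values, and the `liblinear` solver are assumptions chosen only to keep the example small and runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and median-based target.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# "l2" is the Ridge penalty, "l1" the Lasso penalty.
# C is the inverse of the regularization strength (smaller C = stronger penalty).
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both penalties
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print(round(search.best_score_, 3))
```

The best hyperparameters to report are in `search.best_params_`, and `search.best_score_` is the mean cross-validated accuracy for that combination.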

Requirements

  • Scrape and prepare your data using BeautifulSoup.
  • Create and compare two models. One of these must be a random forest; the other can be a classifier of your choosing, such as logistic regression, KNN, or SVM.
  • A Jupyter Notebook with your analysis for a peer audience of data scientists.
  • An executive summary of the results you found.
  • A 10-12 minute presentation outlining your process and findings for a semi-technical audience. The reason we say 'semi-technical' is that FiveThirtyEight wants to see how you plan to explain your findings in your article, and their audience is likely readers who are familiar with and interested in data / statistics, but are not experts. This means that if you'd like to talk about how your model works you can, but explain what exactly your model does at a high level.
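The model comparison requirement could be approached along these lines: put each classifier behind the same text vectorizer and score both with the same cross-validation folds. The titles and labels below are invented toy data, and the choice of `CountVectorizer`, logistic regression as the second model, and 3-fold cross-validation are all assumptions for the sake of a short example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy titles; 1 = above-median comments, 0 = at or below the median.
titles = [
    "What is the best advice you ever received",
    "What is something everyone should try once",
    "What is your most controversial opinion",
    "What movie do you wish you could forget",
    "What is a question you hate being asked",
    "What small habit changed your life",
    "Photo of my cat sleeping",
    "Photo of the sunset from my window",
    "Picture of my new desk setup",
    "Photo of a bird in my garden",
    "Picture of the lake this morning",
    "Photo of my dog at the park",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    # The vectorizer lives inside the pipeline so it is refit on each
    # training fold and never sees the held-out titles.
    pipeline = make_pipeline(CountVectorizer(), model)
    scores[name] = cross_val_score(pipeline, titles, labels, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

With real data, other features such as subreddit and post age can be added alongside the text features before comparing the two models.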

Pro Tip: You can find a good example executive summary here.

Pro Tip 2: When building your web scraper, use the sleep function to pause between your individual requests.
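A rough sketch of pausing between requests follows. The helper name `polite_fetch_all`, its parameters, and the 1 to 3 second delay range are hypothetical; the `fetch` argument is whatever function downloads a page, for example a small wrapper around `requests.get`.

```python
import random
import time


def polite_fetch_all(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping a random interval in between.

    `sleep` is injectable so the function can be tested without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        results.append(fetch(url))
        # No need to wait after the final request.
        if index < len(urls) - 1:
            sleep(random.uniform(min_delay, max_delay))
    return results


# Example with a stand-in fetch function and very short delays.
pages = polite_fetch_all(
    ["page-1", "page-2"],
    fetch=lambda url: f"<html>{url}</html>",
    min_delay=0.01,
    max_delay=0.02,
)
print(pages)
```

Randomizing the delay slightly is a design choice, not a requirement; a fixed `time.sleep` between requests serves the same purpose.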

Pro Tip 3: Build your scraper, and rigorously test it on a few pages to make sure it works before setting it loose on all of Reddit.

Pro Tip 4: Scrape early, scrape often. Unlike earlier projects, you're collecting your own data, and you won't be able to even start modeling until you've collected all of it.

Pro Tip 5: Save your results to a .csv or .txt file whenever you scrape. If you just keep your results in memory and your computer crashes or shuts off, or you accidentally close your Jupyter notebook, you'll lose your data.
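One possible way to follow this tip is to append each scraped batch to a CSV file as soon as it is collected. The function name `append_rows` and the column names are assumptions for illustration; match them to whatever fields your scraper actually produces.

```python
import csv
import os

# Hypothetical column names; adjust to the fields your scraper collects.
FIELDNAMES = ["title", "subreddit", "posted_at", "num_comments"]


def append_rows(path, rows):
    """Append a batch of thread dicts to a CSV file, writing the header once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Calling `append_rows("threads.csv", batch)` after every scraped page means a crash loses at most the page in progress. Repeated scrapes may capture the same thread more than once, so deduplicate before modeling.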


Necessary Deliverables / Submission

  • Code and executive summary must be in a clearly commented Jupyter Notebook.
  • You must submit your slide deck.
  • You must, at minimum, have a link to your slides and a link to your Jupyter notebook on your personal static site.
  • Materials must be submitted by 9 a.m. Friday, November 3rd EST. You will submit a link to your slides, a link to your Jupyter notebook, and a link to your static site via Google form. If your slides are not hosted online, you may Slack your slides to a DSIR.

Dataset

  1. We'll be utilizing a dataset derived from live web data: Reddit.com

  2. To get the data, we will use the requests library (or urllib) and BeautifulSoup to scrape the webpage.


Suggested Ways to Get Started

  • Read the docs for whatever technologies you use. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success!
  • Document everything.
  • Look up sample executive summaries online.

Additional Resources


Project Feedback + Evaluation

Data science is a field in which we apply data to solve real-world problems. Therefore, projects and presentations are means by which we can assess your ability to solve real-world problems in a data-driven manner.

When evaluating projects, there are four areas on which your instructors focus.

  1. Project Requirements: Did you meet all project requirements? In answering this question, your instructors want to assess how well you met the project requirements as established. These will generally be laid out in the project readme.

  2. Audience: Is your presentation appropriate for the stakeholder? In answering this question, your instructors want to assess how well you present your results to stakeholders. For example:

  • Did you frame the problem appropriately for the audience?
  • Did you use the appropriate level of technical language for your audience?
  • Did you effectively use your time, or did you encounter an issue such as going significantly beyond or under the allotted time or rushing to conclude the presentation in the allotted time?
  • Did you present effectively, or were there things that detract from the overall presentation such as not speaking loudly enough for the audience or repeating oneself?
  3. Methods: Are your methods appropriate for solving the problem? In answering this question, your instructors want to assess how well you have applied data science methodology to the problem at hand. For example:
  • Did you make well-reasoned modeling choices, or is there clear evidence that the model is inadequate or improper?
  • Are you able to clearly defend your methodological decisions and results?
  • Did you generalize your results properly, or were your conclusions/inferences improper or fallacious?
  4. Value: Have you provided value to the stakeholder through clear, data-driven recommendations? In answering this question, your instructors want to assess the value you provide to the stakeholder as a data scientist. For example:
  • Did you answer the problem posed to you?
  • Did you make your recommendations clear, or were the recommendations unclear?
  • Were your recommendations data-driven and based on the results of your work?

You will earn a score for each of the four areas mentioned above.

  1. Project Requirements: You may earn a score of 0 or 1. You will earn a score of 1 if all project requirements are met. Otherwise, you will earn a score of 0.
  2. Audience: You may earn a score between 0 and 3. A score of 0 indicates that your presentation is inappropriate for the stakeholder. A score of 1 indicates that at least part of your presentation should be non-trivially reworked to be more appropriate for the stakeholder. A score of 2 indicates that there are few to no areas of your presentation that should be reworked. A score of 3 indicates that your presentation is consistently appropriate for the stakeholder and serves as a model for future presentations.
  3. Methods: You may earn a score between 0 and 3. A score of 0 indicates that your methods are inappropriate. A score of 1 indicates that your methods are somewhat inappropriate, that justification for methodological decisions is lacking, and/or that your conclusions do not follow from the methods. A score of 2 indicates that your methods are appropriate, justification is sufficient/strong, and your conclusions follow well from the methods. A score of 3 indicates that your methods are excellent, strongly defended, and serve as a model for future presentations.
  4. Value: You may earn a score between 0 and 3. A score of 0 indicates that you provide little to no value to the stakeholder. A score of 1 indicates that the value you provide to the stakeholder is substantially less than expected by not answering the problem, not providing clear recommendations to the stakeholder, and/or providing recommendations that were not data-driven. A score of 2 indicates that the value you provide to the stakeholder is on par with the expectation of providing clear, data-driven recommendations that directly answer the problem posed. A score of 3 indicates that the value you provide to the stakeholder is beyond what is expected and serves as a model for future presentations.

Your final grade will be calculated as follows:

  • If any project requirement is not met, the final grade is 'Fail' with a score of 0.
  • If all project requirements are met, then the final grade is 'Pass' with a score calculated by summing the above scores. Therefore, if all project requirements are met, the final score will be between 1 and 10.
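The grading rule above can be written out as a small function to make the arithmetic explicit. The function name and argument names are hypothetical; the logic follows the two bullets, with the requirements score contributing 1 point on a pass.

```python
def final_grade(requirements_met, audience, methods, value):
    """Return (grade, score) following the pass/fail rule described above."""
    for name, score in (("audience", audience), ("methods", methods), ("value", value)):
        if score not in range(4):
            raise ValueError(f"{name} score must be 0, 1, 2, or 3")
    if not requirements_met:
        return ("Fail", 0)
    # Requirements contribute 1 point; the other three areas contribute 0-3 each.
    return ("Pass", 1 + audience + methods + value)


print(final_grade(True, 2, 3, 2))   # ('Pass', 8)
print(final_grade(False, 3, 3, 3))  # ('Fail', 0)
```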
