cs651 Project - Stack Overflow dataset

Abstract

This project investigates the Stack Overflow dataset, obtained from the public dataset repository on Google Cloud. It examines whether there is a relationship between 10 features provided in the dataset and whether a posted question was answered. For each question post, these 10 features are: title, body, creation date, answer count, favourite count, score, tags, view count, reputation of the user who posted the question, and vote type.

Three Main Steps

Data cleaning and staging

We used DataFrames and Spark SQL functions to perform most of our data processing.

Data Normalization and Standardization

We used MinMaxScaler and CountVectorizer to normalize and vectorize the data. We also applied a naive NLP approach to extract keywords from the 'body' and 'title' features, using RegexTokenizer and StopWordsRemover.
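To make the steps above concrete, here is a minimal plain-Python sketch of what each stage computes. It is not the project's Spark code: the stop-word list, the `\W+` split pattern, the alphabetical vocabulary order, and all function names are assumptions chosen for illustration (Spark's CountVectorizer, for example, orders its vocabulary by term frequency).

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "how", "do", "i"}


def tokenize(text):
    """Lowercase the text and split on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def count_vectorize(docs):
    """Build a vocabulary over all documents and return per-document term counts."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors


def min_max_scale(values):
    """Rescale a numeric column linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to the midpoint of the range.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


titles = ["How do I parse JSON in Scala?", "Parse the JSON"]
docs = [remove_stop_words(tokenize(t)) for t in titles]
vocab, vectors = count_vectorize(docs)
scaled_views = min_max_scale([10, 20, 30])
```

Here `vocab` holds the distinct keywords across both titles and `vectors` holds one count vector per title, while `scaled_views` shows a numeric feature such as view count mapped onto [0, 1].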

Data Mining

We built a simple pipeline using the Spark MLlib library. The model we chose is logistic regression, trained with L2 regularization and a regularization parameter (lambda) of 0.01.
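As a rough illustration of the objective that an L2-regularized logistic regression minimizes, the sketch below computes the mean logistic loss plus an L2 penalty with lambda = 0.01. It is a plain-Python approximation, not the MLlib implementation: the 1/2 factor on the penalty and leaving the bias unpenalized are common conventions assumed here, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def regularized_log_loss(weights, bias, rows, labels, lam=0.01):
    """Mean logistic loss over the rows plus an L2 penalty on the weights."""
    total = 0.0
    for x, y in zip(rows, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = sigmoid(score)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # Assumed convention: penalty = (lambda / 2) * ||w||^2, bias not penalized.
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return total / len(rows) + penalty


# With all-zero features the prediction is 0.5, so only the penalty
# distinguishes the two weight vectors below.
loss_zero_weights = regularized_log_loss([0.0, 0.0], 0.0, [[0.0, 0.0]], [1])
loss_unit_weights = regularized_log_loss([1.0, 1.0], 0.0, [[0.0, 0.0]], [1])
```

The penalty term grows with the squared size of the weights, which discourages the model from relying too heavily on any single feature.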

Result

![alt text](http://images/Figure_1.png)

The model predicted the labels with an accuracy of approximately 97% and a mean absolute error of around 0.1195. Interestingly, the true positive rate for label 0 (unanswered questions) was close to 100%, while for label 1 (answered questions) it was around 88%. This gap of roughly 12 percentage points may be due to label 0 having 4 times as many rows of data as label 1. Such an imbalance can bias the model toward predicting label 0, which inflates the overall accuracy. The 88% rate on label 1 shows that the model does not predict label 0 for every row, but the headline accuracy should be read with this bias in mind.
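The effect of the imbalance on the headline number can be checked with simple arithmetic. The sketch below assumes the 4:1 label ratio and the approximate per-class rates quoted above; the function names are hypothetical, and the inputs are rounded figures rather than the project's actual confusion matrix.

```python
def accuracy_from_rates(n_label0, n_label1, rate_label0, rate_label1):
    """Overall accuracy implied by per-class true positive rates and class sizes."""
    correct = n_label0 * rate_label0 + n_label1 * rate_label1
    return correct / (n_label0 + n_label1)


def balanced_accuracy(rate_label0, rate_label1):
    """Mean of the per-class rates, which ignores class sizes."""
    return (rate_label0 + rate_label1) / 2


# A model that always predicts label 0 on a 4:1 split.
always_zero = accuracy_from_rates(4, 1, 1.0, 0.0)

# The approximate per-class rates reported above.
reported = accuracy_from_rates(4, 1, 1.0, 0.88)
balanced = balanced_accuracy(1.0, 0.88)
```

Under these assumptions, always predicting label 0 would already reach about 80% accuracy, and the reported per-class rates imply an overall accuracy of about 97.6%, which is consistent with the figure above. The balanced accuracy of about 94% weights both classes equally and is a less flattering but fairer summary on imbalanced data.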
