GitHub - darcy3000/tf-idf_vectorizer: Information retrieval search engine based on tf-idf. Corpus used is movie dataset from nltk in python




*********************INFORMATION RETRIEVAL*****************
VECTOR SPACE MODEL

MODULES AND PACKAGES REQUIRED:
	NLTK
	KIVY
	PICKLE
	HEAPQ_MAX
	TIME
	STRING
	COLLECTIONS

Download the movie_reviews corpus from NLTK. We use movie_reviews as our corpus, although it is more commonly used for sentiment analysis. It contains 2000 documents. Run the Python file with: python <filename>.py
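For readers new to the vector space model, the following is a minimal sketch of tf-idf ranking over tokenized documents. It is not taken from this repository: the function names (build_index, cosine, search, load_movie_reviews), the log-scaled term frequency and the cosine scoring are all assumptions chosen for illustration, and the standard library's heapq.nlargest is used here instead of the heapq_max package listed above.

```python
import heapq
import math
from collections import Counter


def build_index(docs):
    """docs maps a document id to a list of lowercase tokens (hypothetical layout)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # One sparse tf-idf vector per document, with log-scaled term frequency.
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[doc_id] = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_tokens, idf, vectors, k=10):
    """Return up to k (score, doc_id) pairs, best match first."""
    # Terms that never occur in the corpus carry no weight and are dropped.
    tf = Counter(t for t in query_tokens if t in idf)
    query_vec = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


def load_movie_reviews():
    """Load the NLTK corpus as {file id: tokens}.

    Assumes nltk is installed and nltk.download("movie_reviews") has been run.
    """
    from nltk.corpus import movie_reviews
    return {
        fid: [w.lower() for w in movie_reviews.words(fid)]
        for fid in movie_reviews.fileids()
    }
```

With the corpus downloaded, a query could then be run as idf, vectors = build_index(load_movie_reviews()) followed by search(["alien", "invasion"], idf, vectors, k=5).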

The trained classifier has been saved to pickle files, so you can run the program directly: it reads the classifier from the pickle files instead of training it again.
To train the classifier again, uncomment the appropriate lines.
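The caching idea described above can be sketched as follows. This is an illustration only and differs from the repository's approach of uncommenting lines: the helper name load_or_build, its rebuild flag and the file path are hypothetical, and build_fn stands in for whatever code trains the classifier.

```python
import os
import pickle


def load_or_build(path, build_fn, rebuild=False):
    """Return the object pickled at `path`, or build and save it first.

    build_fn is a zero-argument callable that does the expensive training.
    Pass rebuild=True to ignore an existing pickle file and train again.
    """
    if not rebuild and os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    obj = build_fn()
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return obj
```

Only load pickle files that you created yourself or that come from a source you trust, because unpickling can execute arbitrary code.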

Keep the .kv files in the same folder as the Python file; the GUI program needs them to run.

A GUI window will open. Follow the on-screen instructions. Click on a link to view the entire document.